--- sd:sd-8516_ppu [2026/04/25 04:47] – appledog
+++ sd:sd-8516_ppu [2026/04/25 14:36] (current) – appledog
@@ Line 1: / Line 1: @@
 = SD-8516 PPU
 This is a short reference to the XY-2000 PPU.
+== Introduction
+The following shows the speed tests and design decisions that went into the XY-2000 PPU.
 == PPIXEL (plot_pixel)
 ^ Method ^ Pixels per Second ^
-| BASIC PIXEL command | 2100 |
+| BASIC PIXEL command | 2,100 |
-| INT 18h Pixel Plot (Assembly) | 27,000 |
+| INT 18h plot pixel | 26,000 |
-| PPU via INT 0x18 | 74,000 |
+| INT 18h plot pixel (no bounds check) | 30,000 |
+| PPU via INT 0x18 | 90,000 |
 | PPU via INT 0x03 (direct) | 220,000 |
 | PPIXEL (PPU via opcode) | 460,000 |
-These are practical numbers which show plot pixel being used to clear the screen in an x/y loop. 460,000 pixels/second is 7,600 pixels per frame at 60fps. That is almost exactly 30 16x16 sprites. Suggested use is to MEMCOPY a pre-drawn background into the frame-buffer and then draw sprites with PPIXEL (if you want to use PPIXEL for sprites). This way you don't have to draw the whole screen, and you don't have to draw the sprite background. It is hard to say how many sprites you will actually be able to draw in a frame because each sprite has more or less empty background space.
+These numbers show the pixels per second achieved when clearing the screen in a tight loop. This is essentially the fastest practical use of plot pixel: read X, Y and C data in a loop and plot it. If we unroll the loop 20 times, we lose the loop overhead, but the OP.PPIXEL version remains about 2x faster than the INT 3 version. This makes sense, since every call to INT 0x03 requires a LDA $0101 to access plot_pixel; the PPIXEL command gives this for free. Therefore PPIXEL is always at least twice as fast as INT 0x03 plot_pixel. It is actually a bit faster still, since it does not cross the host-guest bridge inside the opcode call.
+460,000 pixels/second is about 7,600 pixels per frame at 60fps. That is almost exactly 30 16x16 sprites. Suggested use is to MEMCOPY a pre-drawn background into the frame-buffer and then draw sprites with PPIXEL (if you want to use PPIXEL for sprites). This way you don't have to draw the whole screen, and you don't have to draw the sprite background. It is hard to say how many sprites you will actually be able to draw in a frame because each sprite has more or less empty background space.
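+The per-frame budget above is simple arithmetic, and a small sketch makes it easy to redo for other rows of the table. The constant names and the helper functions below are illustrative only; the figures (460,000 pixels/second, 60fps, 16x16 sprites) are the ones quoted in the text.

```python
# Frame-budget arithmetic for the PPIXEL figure quoted in the text.
PIXELS_PER_SECOND = 460_000    # PPIXEL (PPU via opcode), from the table
FPS = 60                       # frame rate used in the text
SPRITE_W, SPRITE_H = 16, 16    # sprite size used in the text


def pixels_per_frame(pps, fps):
    """Whole pixels that can be plotted in one frame."""
    return pps // fps


def sprites_per_frame(pps, fps, width, height):
    """Fully opaque sprites of the given size that fit in one frame."""
    return pixels_per_frame(pps, fps) / (width * height)


budget = pixels_per_frame(PIXELS_PER_SECOND, FPS)
sprites = sprites_per_frame(PIXELS_PER_SECOND, FPS, SPRITE_W, SPRITE_H)
print(budget, round(sprites, 1))
```

+This treats every sprite as fully opaque (256 plotted pixels), so it is a lower bound; sprites with empty background space cost fewer PPIXEL calls.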
+This initial test shows that a PPU construct has value, that it can serve as a drop-in boost to the code in INT 18h, that it should initially be implemented as a subcall of AH=$01, INT $03, and that the practical value of moving to a dedicated opcode is about 95% code density and 5% improved speed.
 === Code Replacement
@@ Line 18: / Line 26: @@
     INT  $03         ; plot_pixel(X, Y, C)
-It replaces the old plot pixel function in the graphics library (AH $01 INT $18):
+PPU plot_pixel replaces the old plot pixel function in the graphics library (function #1 in INT 18h). This is some of the oldest code in the entire system, likely carried over from the SD-8510/VC-2:
+* [I:J] addressing mode
+* Manual bank 2 register load (LDI #2)
+* Register pressure PUSH X, PUSH C
+* Overuse of LDA at start, then shows LDB before end
+* Only uses A,B,X,Y,I,J,K,T
+This was one of the first routines ever written for the original VC-2, as a test to draw characters to a terminal screen. At the time I was still constructing the CPU emulator itself; while writing this function I added the LDB, PAB and UAB opcodes. That is a good reason to keep this code around: it serves as the design document for the framebuffer and the renderer. In other words, "it works." It is the source of truth for what it takes to plot a pixel in the framebuffer. The accelerated versions should get their own entries inside INT 0x18 rather than serve as drop-in replacements. That is probably the best way forward.
 <codify armasm>
@@ Line 94: / Line 110: @@
     RET
 </codify>
+== PLINE
+These figures are for screen clears using LINE in a Y=0 to 199 loop.
+^ Method ^ Pixels per Second ^
+| BASIC LINE command | 22,000 |
+| INT 18h draw line | 26,000 |
+| PPU via INT 0x18 | 15,000,000 |
+| PPU via INT 0x03 (direct) | 55,000,000 |
+| PLINE (PPU via opcode) | 77,000,000 |
+As it turns out, the line drawing algorithm is the limiting factor, even though it is written in assembly. The BASIC interpreter executes 50 to 100 lines of assembly to get to INT 18h, but INT 18h executes over 13,000 lines of assembly to draw a single horizontal line from 0,0 to 319,0. Thus 15 million PPS for the PPU draw_line is not that surprising. The load instructions (for X1, Y1, X2, Y2, etc.) run as part of the loop overhead and could be considered alongside a LOAD instruction, but the draw instruction itself operates 13,000 times faster than INT 18h draw_line (one INT call versus 13,000 lines of assembly). Minus a little for the loop overhead, it's roughly 26,000 * 13,000 / 2.
+The big shock comes from a direct call. First, we skip one INT overhead; second, we do not need to repeat LDAH $21, since INT $3 preserves AH; third, we do not need a RET to match the INT 18h. The speedup is about what you would expect from dropping 3 instructions: nearly 4x.
+Now, what is the cost of an INT overhead? Apparently 21 million pixels per second. Of course, at these speeds this might not matter as much. There is, however, one benefit to the INT call: it handles multimode. The PPU itself works independently of the graphics mode, but to speed up the algorithm the opcode version hardcodes everything. We would need a different opcode for every graphics mode, or we will slow down. So, it turns out, opcodes are a luxury! How many would we really need, realistically?
+* PPIXEL becomes PPIXEL3, PPIXEL3b, PPIXEL4, 4b, 8 and 8b. Six opcodes for full coverage.
+* We could do PPIXEL (mode) as an 8 bit immediate. This will cut the instruction speed approximately in half; oddly enough we would lose the advantage of having it as an opcode.
+* We could ignore opcodes for mode 8 since it only has 128x128 pixels.
+Let's do some tests and add to the above chart:
+| PLINE hardcoded | 77,000,000 |
+| PLINE option A | 71,000,000 |
+| PLINE option B | 75,000,000 |
+| PLINE option C | 76,000,000 |
+Option A includes an extra read for every pixel (a worst-case test) and shows only a small degradation in speed, which is an excellent sign.
+Options B and C include an extra read outside the loop. B uses softcoded values and C uses an IF to choose between two render routines. Branch prediction is probably making the IF nearly free, which is a good sign because that is the most versatile option. Then again, the difference is very small and could just be measurement noise.
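+The three options can be pictured with a small model. This is only a sketch of where the extra mode read sits in each variant, not the PPU implementation: the flat one-byte-per-pixel framebuffer, the `state` dictionary and every function name are assumptions made for illustration, and Python timings say nothing about SD-8516 speeds. The 320 and 128 widths come from the 0,0 to 319,0 line and the 128x128 mode mentioned above.

```python
# Hypothetical model: a flat bytearray framebuffer, one byte per pixel,
# with the current mode's row width kept in a small state dict.

def hline_option_a(fb, state, x1, x2, y, color):
    """Option A: re-read the mode width for every pixel (worst case)."""
    for x in range(x1, x2 + 1):
        width = state["width"]        # extra read inside the loop
        fb[y * width + x] = color


def hline_option_b(fb, state, x1, x2, y, color):
    """Option B: read the softcoded width once, outside the loop."""
    base = y * state["width"]         # single read, hoisted out of the loop
    for x in range(x1, x2 + 1):
        fb[base + x] = color


def _hline_320(fb, x1, x2, y, color):
    base = y * 320                    # width hardcoded for the 320-wide mode
    for x in range(x1, x2 + 1):
        fb[base + x] = color


def _hline_128(fb, x1, x2, y, color):
    base = y * 128                    # width hardcoded for the 128-wide mode
    for x in range(x1, x2 + 1):
        fb[base + x] = color


def hline_option_c(fb, state, x1, x2, y, color):
    """Option C: one IF selects between two hardcoded routines."""
    if state["width"] == 320:
        _hline_320(fb, x1, x2, y, color)
    else:
        _hline_128(fb, x1, x2, y, color)


def render_with(fn, width, height):
    """Draw one test line with the given routine and return the framebuffer."""
    fb = bytearray(width * height)
    fn(fb, {"width": width}, 2, 9, 3, 7)
    return fb
```

+All three produce the same framebuffer; they differ only in how often the mode is consulted, which is what the A, B and C rows of the table measure.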
+So, as it turns out, the opcode version of the PPU call is the optimal way to call the PPU. This is much like the old separate FPU chips for the 80x86: FMUL, FADD, FSQRT and FDIV were all opcodes, but were run by the FPU. Our PPU should be implemented the same way.
+=== Fast tiles and sprites with PLINE
+If you have a smart drawing routine that reads contiguous pixels, then clusters of pixels can be drawn in ~1 cycle. So instead of calling PPIXEL four times for the top row of a sprite, pixels that are next to each other can be detected as such and drawn via PLINE. While this might not be practical on-the-fly, it is interesting to consider a packed image format that would allow this to operate quickly; i.e. store the sprite data itself as a series of line draws. In a 16x16 sprite it is conceivable that you could use about half as many PLINE calls as you would PPIXEL calls. The difference is that you would need a special format for storing the image data.
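+A minimal sketch of such a packer, run offline on the host rather than on the SD-8516, might look like the following. It assumes that colour index 0 means "transparent" and that a line draw takes (X1, Y1, X2, Y2, C); both are assumptions for illustration, as are the function names, and none of this is an existing tool or file format.

```python
def row_to_runs(row, transparent=0):
    """Split one sprite row into (x_start, x_end, color) runs of equal color.

    Runs of the (assumed) transparent index are skipped entirely.
    """
    runs = []
    x = 0
    n = len(row)
    while x < n:
        color = row[x]
        start = x
        while x < n and row[x] == color:
            x += 1                      # extend the run while the color repeats
        if color != transparent:
            runs.append((start, x - 1, color))
    return runs


def pack_sprite(rows, transparent=0):
    """Convert a sprite (list of rows) into a list of horizontal line draws.

    Each entry is (x1, y, x2, y, color), i.e. one hypothetical PLINE call.
    """
    packed = []
    for y, row in enumerate(rows):
        for x1, x2, color in row_to_runs(row, transparent):
            packed.append((x1, y, x2, y, color))
    return packed


example_row = [0, 0, 5, 5, 5, 5, 0, 7]
print(row_to_runs(example_row))   # two runs instead of five plotted pixels
```

+How much this saves depends entirely on the artwork: a row of alternating colours packs into one line draw per pixel, so the "about half" estimate only holds for sprites with reasonably long same-colour runs.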