--- sd:sd-8516_ppu [2026/04/25 05:52] – appledog
+++ sd:sd-8516_ppu [2026/04/25 14:36] (current) – appledog
@@ Line 2: / Line 2: @@
 This is a short reference to the XY-2000 PPU.
-== Introduction: PPIXEL (plot_pixel)
+== Introduction
+The following shows the speed tests and design decisions that went into the XY-2000 PPU.
+== PPIXEL (plot_pixel)
 ^ Method ^ Pixels per Second ^
 | BASIC PIXEL command | 2,100 |
@@ Line 12: / Line 15: @@
 The number shows the pixels per second of the screen being cleared in a tight loop. This essentially represents the fastest practical use of plot pixel: read X, Y and C data in a loop and plot it. If we unroll the loop 20 times, you lose the loop overhead, but the OP.PPIXEL version remains about 2x faster than the INT 3 version. This makes sense, since every call to INT 0x03 requires a LDA $0101 to access plot_pixel; the PPIXEL command gives this for free. Therefore PPIXEL is always at least twice as fast as INT 0x03 plot_pixel. It's actually a bit faster since it does not cross the host-guest bridge inside the opcode call.
-* //However, if all you are doing is loading XYC data and calling INT $03, then theoretically their performance will be within 5%-10% of each other; INT touches memory more than PPIXEL does directly, but this amounts to an edge case (31 vs 30 active sprites) and is not worth the engineering headache.//
 ,000 pixels/second is 7,600 pixels per frame at 60fps. That is almost exactly 30 16x16 sprites. Suggested use is to MEMCOPY a pre-drawn background into the frame-buffer and then draw sprites with PPIXEL (if you want to use PPIXEL for sprites). This way you don't have to draw the whole screen, and you don't have to draw the sprite background. It is hard to say how many sprites you will actually be able to draw in a frame because each sprite has more or less empty background space.
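As a rough check of the sprite budget above, the sketch below divides the quoted 7,600 pixels per frame by the 256 pixels of a fully opaque 16x16 sprite. The function name and the `opaque_fraction` parameter are illustrative assumptions, not part of the XY-2000 documentation.

```python
def sprites_per_frame(pixels_per_frame, width=16, height=16, opaque_fraction=1.0):
    """Estimate how many sprites fit in a per-frame pixel budget.

    opaque_fraction is a hypothetical knob: the share of each sprite's
    pixels that actually get plotted (transparent pixels are skipped).
    """
    plotted = width * height * opaque_fraction
    return pixels_per_frame / plotted


# Every pixel of every sprite is drawn.
full = sprites_per_frame(7600)
# Assumed case: half of each sprite is empty background.
half = sprites_per_frame(7600, opaque_fraction=0.5)
print(round(full, 1), round(half, 1))
```

With fully opaque sprites the budget comes out just under 30; the second call only illustrates how empty background space raises that count.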
@@ Line 33: / Line 34: @@
 * Only uses A,B,X,Y,I,J,K,T
-This was one of the first routines ever written for the original VC-2, as a test to draw characters to a terminal screen. This was at the time I was constructing the cpu emulator itself; during the writing of this function I added the LDB, PAB and UAB opcodes. I also added PXY and UXY which were later deprecated (as this is almost the only time they are ever used). I have to be honest, I am having trouble letting go of this old code, it doesn't feel right somehow. But the PPU version is 20-30x faster on plot_pixel alone. Perhaps I will leave this function in INT 0x18 as a relic. Never to be used, but, there for educational and historical purposes. If I ever made a next-generation emulator, I would need code like this at the start.. to get things started, before writing a PPU.
+This was one of the first routines ever written for the original VC-2, as a test to draw characters to a terminal screen. This was at the time I was constructing the CPU emulator itself; while writing this function I added the LDB, PAB and UAB opcodes. That's a good reason to keep this code around: it serves as the design document for the framebuffer and renderer itself. In other words, "it works." It's the source of truth for what it takes to plot a pixel in the framebuffer. The accelerated versions should get their own entries inside INT 0x18 rather than serve as drop-in replacements. That is probably the best way forward.
 <codify armasm>
@@ Line 109: / Line 110: @@
     RET
 </codify>
+== PLINE
+These figures are for screen clears using LINE in a Y=0 to 199 loop.
+^ Method ^ Pixels per Second ^
+| BASIC LINE command | 22,000 |
+| INT 18h draw line | 26,000 |
+| PPU via INT 0x18 | 15,000,000 |
+| PPU via INT 0x03 (direct) | 55,000,000 |
+| PLINE (PPU via opcode) | 77,000,000 |
+As it turns out, the line drawing algorithm, even though it is in assembly, is the limiting factor. The BASIC interpreter executes 50 to 100 lines of assembly to get to INT 18h, but INT 18h executes over 13,000 lines of assembly to draw a single horizontal line from 0,0 to 319,0. Thus 15 million PPS for the PPU draw_line is not that surprising. The load instructions (for X1, Y1, X2, Y2, etc.) run alongside the loop overhead and could be counted as ordinary LOAD instructions, but the draw instruction itself operates 13,000 times faster than INT 18h draw_line (one INT call versus 13,000 lines of assembly). Minus a little for the loop overhead it's roughly 26000 * 13000 / 2.
+The big shock comes from a direct call. First, we're skipping one INT overhead; second, we do not need to repeat LDAH $21 since INT $3 preserves AH; third, we do not need a RET to match the INT 18h. The speedup is just about the same as dropping 3 instructions: a near 4x speedup.
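The ratios between the PPU rows of the table can be checked with a few lines of arithmetic. This is only a sketch over the rounded figures quoted in the table; the dictionary and the `speedup` helper are names made up for illustration.

```python
# Figures copied from the PLINE table (pixels per second, rounded).
rates = {
    "PPU via INT 0x18": 15_000_000,
    "PPU via INT 0x03 (direct)": 55_000_000,
    "PLINE (PPU via opcode)": 77_000_000,
}


def speedup(slow, fast):
    """Return how many times faster `fast` is than `slow`."""
    return rates[fast] / rates[slow]


direct_vs_int18 = speedup("PPU via INT 0x18", "PPU via INT 0x03 (direct)")
opcode_vs_direct = speedup("PPU via INT 0x03 (direct)", "PLINE (PPU via opcode)")
print(f"{direct_vs_int18:.2f}x, {opcode_vs_direct:.2f}x")
```

The first ratio is the "near 4x" step from INT 0x18 to the direct INT 0x03 call; the second is the further gain from moving to the opcode.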
+Now, what is the cost of an INT overhead? Apparently about 22 million pixels per second (77 million versus 55 million in the table above). Of course, at these speeds this might not matter as much. There is, however, one benefit to the INT call: it handles multimode. The way the PPU works is independent of the graphics mode, but to help speed up the algorithm, the opcode version hardcodes everything. We would need a different opcode for every graphics mode or we will slow down. So, it turns out, opcodes are a luxury! How many would we really need, realistically?
+* PPIXEL becomes PPIXEL3, PPIXEL3b, PPIXEL4, 4b, 8 and 8b. Six opcodes for full coverage.
+* We could do PPIXEL (mode) as an 8 bit immediate. This will cut the instruction speed approximately in half; oddly enough we would lose the advantage of having it as an opcode.
+* We could ignore opcodes for mode 8 since it only has 128x128 pixels.
+Let's do some tests and add to the above chart:
+^ Method ^ Pixels per Second ^
+| PLINE hardcoded | 77,000,000 |
+| PLINE option A | 71,000,000 |
+| PLINE option B | 75,000,000 |
+| PLINE option C | 76,000,000 |
+Option A includes an extra read for every pixel (a worst case test) and it shows only a small degradation in speed; an excellent sign.
+Option B and C include an extra read outside the loop. B uses softcoded values and C uses an IF to choose between two render routines. Branch prediction is probably making the IF nearly free, which is a good sign because that is the most versatile option. Then again, the difference is very small and could simply be measurement noise.
+So as it turns out, the opcode version of the PPU call is the optimal way to call the PPU. This is much like the old separate FPU chips on the 80x86: FMUL, FADD, FSQRT and FDIV were all opcodes, but they were executed by the FPU. Our PPU should be implemented the same way.
+=== Fast tiles and sprites with PLINE
+If you have a smart drawing routine that reads contiguous pixels, then clusters of pixels can be drawn in ~1 cycle. So instead of calling PPIXEL four times for the top row of a sprite, pixels that are next to each other can be detected as such and drawn via PLINE. While this might not be practical on-the-fly, it is interesting to consider a packed image format that would allow this to operate quickly, i.e. store the sprite data itself as a series of line draws. In a 16x16 sprite it is conceivable you could use about half as many PLINE calls as you would PPIXEL calls. The difference is that you would need a special format for storing the image data.
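One way such a packed format could be produced is to run-length encode each sprite row into horizontal spans ahead of time. The sketch below is an assumption about how that might look, not an existing XY-2000 format: the function names are hypothetical, colour index 0 is assumed to mean "transparent", and a span is assumed to require a single colour.

```python
TRANSPARENT = 0  # assumed colour index meaning "do not draw"


def row_to_spans(row):
    """Collapse one sprite row into (x_start, x_end, colour) runs."""
    spans = []
    x = 0
    while x < len(row):
        c = row[x]
        start = x
        # Advance to the end of the run of identical colour indices.
        while x < len(row) and row[x] == c:
            x += 1
        if c != TRANSPARENT:
            spans.append((start, x - 1, c))
    return spans


def pack_sprite(rows):
    """Return a list of (y, x_start, x_end, colour) line draws."""
    return [(y, x1, x2, c)
            for y, row in enumerate(rows)
            for (x1, x2, c) in row_to_spans(row)]


def draw_calls(rows):
    """Compare per-pixel plots against per-span line draws."""
    pixels = sum(1 for row in rows for c in row if c != TRANSPARENT)
    return pixels, len(pack_sprite(rows))
```

For a row such as `[0, 5, 5, 5, 5, 0, 3, 3]` this yields two spans in place of six pixel plots. How much a real sprite saves depends on how long its same-colour runs are.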