sd:sd-8516_ppu
Differences
This shows you the differences between two versions of the page.
| sd:sd-8516_ppu [2026/04/25 05:52] – appledog | sd:sd-8516_ppu [2026/04/25 14:36] (current) – appledog | ||
|---|---|---|---|
| Line 2: | Line 2: | ||
| This is a short reference to the XY-2000 PPU. | This is a short reference to the XY-2000 PPU. | ||
| - | == Introduction: PPIXEL (plot_pixel) | + | == Introduction |
| + | The following shows the speed tests and design decisions that went into the XY-2000 PPU. | ||
| + | |||
| + | == PPIXEL (plot_pixel) | ||
| ^ Method ^ Pixels per Second ^ | ^ Method ^ Pixels per Second ^ | ||
| | BASIC PIXEL command | 2,100 | | | BASIC PIXEL command | 2,100 | | ||
| Line 12: | Line 15: | ||
| The number shows the pixels per second of the screen being cleared in a tight loop. This essentially represents the fastest practical use for plot pixel; read X, Y and C data in a loop and plot it. If we unroll the loop by 20 times, the difference in speed between the INT 3 version and the OP.PPIXEL version remains about 2x, but you lose the loop overhead. The opcode version remains relatively 2x faster. This makes sense, since every call to INT 0x03 requires a LDA $0101 to access plot_pixel; the PPIXEL command gives this for free. Therefore PPIXEL is always at least twice as fast as INT 0x03 plot_pixel. It's actually a bit faster since it does not cross the host-guest bridge inside the opcode call. | The number shows the pixels per second of the screen being cleared in a tight loop. This essentially represents the fastest practical use for plot pixel; read X, Y and C data in a loop and plot it. If we unroll the loop by 20 times, the difference in speed between the INT 3 version and the OP.PPIXEL version remains about 2x, but you lose the loop overhead. The opcode version remains relatively 2x faster. This makes sense, since every call to INT 0x03 requires a LDA $0101 to access plot_pixel; the PPIXEL command gives this for free. Therefore PPIXEL is always at least twice as fast as INT 0x03 plot_pixel. It's actually a bit faster since it does not cross the host-guest bridge inside the opcode call. | ||
| - | |||
| - | * //However, if all you are doing is loading XYC data and calling INT $03, then theoretically their performance will be within 5%-10% of each other; INT touches memory moreso than PPIXEL directly, but this amounts to an edge case; 31 vs 30 active sprites; not worth the enginering headache.// | ||
| 460,000 pixels/ | 460,000 pixels/ | ||
| Line 33: | Line 34: | ||
| * Only uses A, | * Only uses A, | ||
| - | This was one of the first routines ever written for the original VC-2, as a test to draw characters to a terminal screen. This was at the time I was constructing the cpu emulator itself; during the writing of this function I added the LDB, PAB and UAB opcodes. | + | This was one of the first routines ever written for the original VC-2, as a test to draw characters to a terminal screen. This was at the time I was constructing the cpu emulator itself; during the writing of this function I added the LDB, PAB and UAB opcodes. |
| <codify armasm> | <codify armasm> | ||
| Line 109: | Line 110: | ||
| RET | RET | ||
| </ | </ | ||
| + | |||
| + | == PLINE | ||
| + | These figures are for screen clears using LINE in a Y=0 to 199 loop. | ||
| + | |||
| + | ^ Method ^ Pixels per Second ^ | ||
| + | | BASIC LINE command | 22,000 | | ||
| + | | INT 18h draw line | 26,000 | | ||
| + | | PPU via INT 0x18 | 15,000,000 | | ||
| + | | PPU via INT 0x03 (direct) | 55,000,000 | | ||
| + | | PLINE (PPU via opcode) | 77,000,000 | | ||
| + | |||
| + | As it turns out, the line drawing algorithm, even though it is in assembly, is the limiting factor. The BASIC interpreter is executing 50 to 100 lines of assembly to get to INT 18h, but INT 18h is executing over 13,000 lines of assembly to draw a single horizontal line from 0,0 to 319,0. Thus 15 million PPS for the PPU draw_line is not that surprising; the load instructions (for X1, Y1, X2, Y2, etc.) operate with the loop overhead and could be considered alongside a LOAD instruction; | ||
| + | |||
| + | The big shock comes from a direct call. First, we skip one INT overhead; second, we do not need to repeat LDAH $21 since INT $3 preserves AH; third, we do not need a RET to match the INT 18h. The speedup is just about the same as dropping 3 instructions; | ||
| + | |||
| + | Now, what is the cost of an INT overhead? Apparently 21 million pixels per second. Of course, at these speeds this might not matter as much. There is, however, one benefit to the INT call: it handles multimode. The way the PPU works is independent of the graphics mode, but to help speed the algorithm, the opcode version hardcodes everything. We would need a different opcode for every graphics mode or we will slow down. So, it turns out, opcodes are a luxury! How many would we really need, realistically? | ||
| + | |||
| + | * PPIXEL becomes PPIXEL3, PPIXEL3b, PPIXEL4, 4b, 8 and 8b. Six opcodes for full coverage. | ||
| + | * We could do PPIXEL (mode) as an 8-bit immediate. This will cut the instruction speed approximately in half; oddly enough, we would lose the advantage of having it as an opcode. | ||
| + | * We could ignore opcodes for mode 8 since it only has 128x128 pixels. | ||
| + | |||
| + | Let's do some tests and add to the above chart: | ||
| + | |||
| + | | PLINE hardcoded | 77,000,000 | | ||
| + | | PLINE option A | 71,000,000 | | ||
| + | | PLINE option B | 75,000,000 | | ||
| + | | PLINE option C | 76,000,000 | | ||
| + | |||
| + | Option A includes an extra read for every pixel (a worst case test) and it shows only a small degradation in speed; an excellent sign. | ||
| + | |||
| + | Options B and C include an extra read outside the loop. B uses softcoded values and C uses an IF to choose between two render routines. Branch prediction is probably making the IF nearly free, which is a good sign because that is the most versatile option. Then again, the difference is very small; it could be statistical error. | ||
| + | |||
| + | So, as it turns out, the opcode version of the PPU call is the optimal way to call the PPU. This is much like the old separate FPU chips on an 80x86: FMUL, FADD, FSQRT and FDIV were all opcodes, but run by the FPU. Our PPU should be implemented the same way. | ||
| + | |||
| + | === Fast tiles and sprites with PLINE | ||
| + | If you have a smart drawing routine that reads contiguous pixels, then clusters of pixels can draw in ~1 cycle. So instead of having to call PPIXEL four times for the top row of a sprite, if the pixels are next to each other they can be detected as such and drawn via PLINE. While this might not be practical on-the-fly, it is interesting to consider a packed image format that would allow this to operate quickly; i.e. store the sprite data itself as a series of line draws. In a 16x16 sprite it is conceivable you could use about half the number of PLINE calls as you would use PPIXEL. The difference is that you would need a special format for storing the image data. | ||
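
The packed "series of line draws" idea can be sketched offline in Python. This is only an illustration under stated assumptions, not the PPU's actual format: the span tuple `(x1, y, x2, color)`, the helper names, and the use of color 0 as a transparent (skipped) pixel are all hypothetical. Each span stands for one PLINE-style horizontal draw; each opaque pixel stands for one PPIXEL call.

```python
from typing import List, Tuple

# Hypothetical packed record: one horizontal run, x2 inclusive.
Span = Tuple[int, int, int, int]  # (x1, y, x2, color)

def row_to_spans(row: List[int], y: int, transparent: int = 0) -> List[Span]:
    """Split one sprite row into runs of identical, non-transparent color."""
    spans: List[Span] = []
    x = 0
    width = len(row)
    while x < width:
        color = row[x]
        start = x
        # Advance to the end of the current run of equal color values.
        while x < width and row[x] == color:
            x += 1
        if color != transparent:  # assumption: color 0 is never drawn
            spans.append((start, y, x - 1, color))
    return spans

def sprite_to_spans(sprite: List[List[int]], transparent: int = 0) -> List[Span]:
    """Pack a whole sprite as a flat list of horizontal line draws."""
    spans: List[Span] = []
    for y, row in enumerate(sprite):
        spans.extend(row_to_spans(row, y, transparent))
    return spans

def count_opaque_pixels(sprite: List[List[int]], transparent: int = 0) -> int:
    """Number of per-pixel plots the same sprite would need."""
    return sum(1 for row in sprite for c in row if c != transparent)

# Tiny 4x4 example sprite (made-up data, for illustration only).
SPRITE = [
    [0, 3, 3, 0],
    [3, 3, 3, 3],
    [3, 5, 5, 3],
    [0, 3, 3, 0],
]

if __name__ == "__main__":
    packed = sprite_to_spans(SPRITE)
    print(len(packed), "line draws vs", count_opaque_pixels(SPRITE), "pixel plots")
```

For this toy sprite the packer emits 6 spans against 12 opaque pixels. Real sprites with more color variation per row would compress less, so the ratio depends entirely on the artwork.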
sd/sd-8516_ppu.1777096336.txt.gz · Last modified: by appledog
