sd:sd-8516_ppu
Differences
This shows you the differences between two versions of the page.
| sd:sd-8516_ppu [2026/04/25 04:47] – appledog | sd:sd-8516_ppu [2026/04/25 14:36] (current) – appledog | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| = SD-8516 PPU | = SD-8516 PPU | ||
| This is a short reference to the XY-2000 PPU. | This is a short reference to the XY-2000 PPU. | ||
| + | |||
| + | == Introduction | ||
| + | The following shows the speed tests and design decisions that went into the XY-2000 PPU. | ||
| == PPIXEL (plot_pixel) | == PPIXEL (plot_pixel) | ||
| ^ Method ^ Pixels per Second ^ | ^ Method ^ Pixels per Second ^ | ||
| - | | BASIC PIXEL command | 2100 | | + | | BASIC PIXEL command | 2,100 | |
| - | | INT 18h Pixel Plot (Assembly) | 27,000 | | + | | INT 18h plot pixel | 26, |
| - | | PPU via INT 0x18 | 74,000 | | + | | INT 18h plot pixel (no bounds check) | 30,000 | |
| + | | PPU via INT 0x18 | 90,000 | | ||
| | PPU via INT 0x03 (direct) | 220,000 | | | PPU via INT 0x03 (direct) | 220,000 | | ||
| | PPIXEL (PPU via opcode) | 460,000 | | | PPIXEL (PPU via opcode) | 460,000 | | ||
| - | These are practical | + | The numbers show the pixels per second achieved when clearing the screen in a tight loop. This essentially represents the fastest |
| + | |||
| + | 460,000 pixels/ | ||
| + | |||
| + | This initial test proves that a PPU construct has value, that it can serve as a drop-in boost to the code in INT 18h, that it should be initially implemented as a subcall of AH=$01, INT $03, and that the move to a dedicated opcode' | ||
| === Code Replacement | === Code Replacement | ||
| Line 18: | Line 26: | ||
| INT $03 ; plot_pixel(X, | INT $03 ; plot_pixel(X, | ||
| - | It replaces the old plot pixel function in the graphics library (AH $01 INT $18): | + | PPU plot_pixel |
| + | |||
| + | * [I:J] addressing mode | ||
| + | * Manual bank 2 register load (LDI #2) | ||
| + | * Register pressure PUSH X, PUSH C | ||
| + | * Overuse of LDA at start, then shows LDB before end | ||
| + | * Only uses A, | ||
| + | |||
| + | This was one of the first routines ever written for the original VC-2, as a test to draw characters to a terminal screen. This was at the time I was constructing the CPU emulator itself; during the writing of this function I added the LDB, PAB and UAB opcodes. That's a good reason to keep this code around: it serves as the design document for the framebuffer and renderer itself. In other words, "it works." | ||
| <codify armasm> | <codify armasm> | ||
| Line 94: | Line 110: | ||
| RET | RET | ||
| </ | </ | ||
| + | |||
| + | == PLINE | ||
| + | These figures are for screen clears using LINE in a Y=0 to 199 loop. | ||
| + | |||
| + | ^ Method ^ Pixels per Second ^ | ||
| + | | BASIC LINE command | 22,000 | | ||
| + | | INT 18h draw line | 26,000 | | ||
| + | | PPU via INT 0x18 | 15,000,000 | | ||
| + | | PPU via INT 0x03 (direct) | 55,000,000 | | ||
| + | | PLINE (PPU via opcode) | 77,000,000 | | ||
| + | |||
| + | As it turns out, the line drawing algorithm, even though it is in assembly, is the limiting factor. The BASIC interpreter executes 50 to 100 lines of assembly to get to INT 18h, but INT 18h executes over 13,000 lines of assembly to draw a single horizontal line from 0,0 to 319,0. Thus 15 million PPS for the PPU draw_line is not that surprising; the load instructions (for X1, Y1, X2, Y2, etc.) operate with the loop overhead and could be considered alongside a LOAD instruction; | ||
| + | |||
| + | The big shock comes from a direct call. First, we're skipping one INT overhead; second, we do not need to repeat LDAH $21 since INT $3 preserves AH; third, we do not need a RET to match the INT 18h. The speedup is just about the same as dropping 3 instructions; | ||
| + | |||
| + | Now, what is the cost of an INT overhead? Apparently 21 million pixels per second. Of course, at these speeds this might not matter as much. There is, however, one benefit to the INT call: it handles multimode. The way the PPU works is independent of the graphics mode, but to help speed up the algorithm, the opcode version hardcodes everything. We would need a different opcode for every graphics mode, or we will slow down. So, it turns out, opcodes are a luxury! How many would we really need, realistically? | ||
| + | |||
| + | * PPIXEL becomes PPIXEL3, PPIXEL3b, PPIXEL4, 4b, 8 and 8b. Six opcodes for full coverage. | ||
| + | * We could do PPIXEL (mode) as an 8-bit immediate. This will cut the instruction speed approximately in half; oddly enough, we would lose the advantage of having it as an opcode. | ||
| + | * We could ignore opcodes for mode 8 since it only has 128x128 pixels. | ||
| + | |||
| + | Let's do some tests and add to the above chart: | ||
| + | |||
| + | | PLINE hardcoded | 77,000,000 | | ||
| + | | PLINE option A | 71,000,000 | | ||
| + | | PLINE option B | 75,000,000 | | ||
| + | | PLINE option C | 76,000,000 | | ||
| + | |||
| + | Option A includes an extra read for every pixel (a worst-case test) and it shows only a small degradation in speed; an excellent sign. | ||
| + | |||
| + | Options B and C include an extra read outside the loop. B uses softcoded values and C uses an IF to choose between two render routines. Branch prediction is probably making the IF nearly free, which is a good sign because that is the most versatile option. Then again, the difference is very small; it could be statistical error. | ||
| + | |||
| + | So as it turns out, the opcode version of the PPU call is the optimal way to call the PPU. This is much like the old separate FPU chips on an 80x86: FMUL, FADD, FSQRT and FDIV were all opcodes, but were run by the FPU. Our PPU should be implemented the same way. | ||
| + | |||
| + | === Fast tiles and sprites with PLINE | ||
| + | If you have a smart drawing routine that reads contiguous pixels, then clusters of pixels can draw in ~1 cycle. So instead of having to call PPIXEL four times for the top row of a sprite, if the pixels are next to each other they can be detected as such and drawn via PLINE. While this might not be practical on-the-fly, it is interesting to consider a packed image format that would allow this to operate quickly; i.e. store the sprite data itself as a series of line draws. In a 16x16 sprite it is conceivable you could use about half the number of PLINE calls as you would use PPIXEL. The difference is that you would need a special format for storing the image data. | ||
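
The packed sprite format described above could be prototyped offline before committing to anything on the SD-8516 side. The following is a minimal Python sketch, not the author's implementation: it run-length encodes each sprite row into horizontal runs of one color, labelling runs of two or more pixels as `PLINE` and single pixels as `PPIXEL`. The tuple layout `(op, x1, y, x2, color)`, the use of color index 0 as transparent, and the function names are all assumptions made for illustration; the real PPU call parameters are not specified here.

```python
from typing import List, Tuple

# A draw command: (op, x1, y, x2, color). "PLINE" and "PPIXEL" are only labels
# here; this tuple layout is hypothetical, not the real PPU call format.
Command = Tuple[str, int, int, int, int]

TRANSPARENT = 0  # assumption: color index 0 means "do not draw"


def encode_sprite(rows: List[List[int]]) -> List[Command]:
    """Encode a sprite as horizontal runs of identical, non-transparent color."""
    commands: List[Command] = []
    for y, row in enumerate(rows):
        x = 0
        while x < len(row):
            color = row[x]
            start = x
            # Extend the run while the next pixel has the same color.
            while x + 1 < len(row) and row[x + 1] == color:
                x += 1
            if color != TRANSPARENT:
                op = "PLINE" if x > start else "PPIXEL"
                commands.append((op, start, y, x, color))
            x += 1
    return commands


def decode_sprite(commands: List[Command], width: int, height: int) -> List[List[int]]:
    """Replay the commands into a pixel grid to check the encoding is lossless."""
    rows = [[TRANSPARENT] * width for _ in range(height)]
    for _op, x1, y, x2, color in commands:
        for x in range(x1, x2 + 1):
            rows[y][x] = color
    return rows


def count_opaque(rows: List[List[int]]) -> int:
    """Number of PPIXEL calls a naive per-pixel renderer would need."""
    return sum(1 for row in rows for c in row if c != TRANSPARENT)


# A small made-up 6x4 test sprite.
sprite = [
    [0, 1, 1, 1, 1, 0],
    [1, 2, 2, 2, 2, 1],
    [1, 2, 3, 3, 2, 1],
    [0, 1, 1, 1, 1, 0],
]

commands = encode_sprite(sprite)
print(len(commands), "draw calls instead of", count_opaque(sprite), "PPIXEL calls")
```

On this made-up sprite the encoder emits 10 draw calls for 20 opaque pixels. How close a real 16x16 sprite gets to that ratio depends entirely on how many same-colored neighbors its rows contain, so the figure is illustrative rather than a measurement.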
sd/sd-8516_ppu.1777092466.txt.gz · Last modified: by appledog
