sd:sd-8516_ppu

Differences

This shows you the differences between two versions of the page.

sd:sd-8516_ppu [2026/05/02 23:08] appledog sd:sd-8516_ppu [2026/05/18 16:08] (current) appledog
Line 9: Line 9:
 == Primitives == Primitives
 === PPIXEL (plot_pixel) === PPIXEL (plot_pixel)
-^ Method ^ Pixels per Second ^ +^ Method ^ Pixels per second (WASM) ^
 | BASIC PIXEL command | 2,100 | | BASIC PIXEL command | 2,100 |
 | INT 18h plot pixel | 26,000 | | INT 18h plot pixel | 26,000 |
 | INT 18h plot pixel (no bounds check) | 30,000 | | INT 18h plot pixel (no bounds check) | 30,000 |
 | PPU via INT 0x18 | 90,000 | | PPU via INT 0x18 | 90,000 |
-| PPU via INT 0x03 (direct) | 220,000 | +| PPU via INT 0x03 (direct) | 660,000 | 
-| PPIXEL (PPU via opcode) | 460,000 | +| PPIXEL (unrolled 1,000 times) | 1,300,000 |
  
-The number shows the pixels per second of the screen being cleared in a tight loop. This essentially represents the fastest practical use for plot pixel; read XY and C data in a loop and plot it. If we unroll the loop by 20 times, the difference in speed between the INT 3 version and the OP.PPIXEL version remains about 2x, but you lose the loop overhead. The opcode version remains relatively 2x faster. This makes sense, since every call to INT 0x03 requires LDA $0101 to access plot_pixel; the PPIXEL command gives this for free. Therefore PPIXEL is always at least twice as fast as INT 0x03 plot_pixel. It's actually a bit faster since it does not cross the host-guest bridge inside the opcode call. +The numbers above only show that the plot rate is bound by the CPU. At one instruction per cycle, 1.28 MIPS gets us 1.28 million PIPS (pixels per second). However, once you start adding in commands that load data, process color and so on, the number can drop. A more realistic number might be closer to 300,000 pixels per second at 1 MIPS. Good enough for a bitmapped font, good enough for a couple of sprites, good enough for the early console era (Intellivision, Atari 2600).
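As a rough check on what these rates mean per frame, the arithmetic can be sketched in Python. The function names are hypothetical, and the 60fps target and 16x16 sprite size are assumptions for illustration, not measurements; the sketch also assumes every pixel of every sprite is plotted.

```python
def pixels_per_frame(pixels_per_second: int, fps: int = 60) -> int:
    """Whole pixels that can be plotted in one frame at the given rate."""
    return pixels_per_second // fps


def full_sprites_per_frame(pixels_per_second: int, fps: int = 60,
                           width: int = 16, height: int = 16) -> int:
    """Sprites per frame if every pixel of each sprite is plotted (worst case)."""
    return pixels_per_frame(pixels_per_second, fps) // (width * height)


if __name__ == "__main__":
    # Rates taken from the table and text above.
    for label, rate in [("realistic estimate at 1 MIPS", 300_000),
                        ("PPIXEL unrolled", 1_300_000)]:
        print(f"{label}: {pixels_per_frame(rate)} px/frame, "
              f"{full_sprites_per_frame(rate)} full 16x16 sprites/frame")
```

At the 300,000 pixels per second estimate this works out to 5,000 pixels, or 19 fully drawn 16x16 sprites, per frame. A sprite routine that skips empty background pixels would spend less per sprite.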
  
-460,000 pixels/second is 7,600 pixels per frame at 60fps. That is almost exactly 30 16x16 sprites. Suggested use is to MEMCOPY a pre-drawn background into the frame-buffer and then draw sprites with PPIXEL (if you want to use PPIXEL for sprites). This way you don't have to draw the whole screen, and you don't have to draw the sprite background. It is hard to say how many sprites you will actually be able to draw in a frame because each sprite has more or less empty background space. +How useful is PPIXEL if we want to push the limits? It would not be your go-to choice for anything past the late 70s era. The Asteroids (1979) arcade machine is a great example. It ran on a 6502 at 1.5 MHz but required a special 'digital vector generator' chip to do the heavy lifting. The 6502 simply couldn't keep up with the refresh rate, even with a simple game like Asteroids.
  
-This initial test proves that a PPU construct has value, that it can serve as a drop-in boost to the code in INT 18h, that it should be initially implemented as a subcall of AH=$01, INT $03, and that the move to a dedicated opcode's practical value is 95% code density and 5% improved speed. +Notably, the Apple ][ and VIC-20 had no PPU and relied on the 6502 for all graphics. Therefore, to approximate that era, you can use PPIXEL, or just write directly to the framebuffer. Remember, if you're targeting an era that didn't use a PPU, try to avoid using the PPU to maintain the correct look and feel.
  
-==== Code Replacement +=== PLINE 
-PPIXEL can be called via INT 0x03: +The case for PLINE is the case for the Atari Digital Vector Generator chip used in Asteroids (1979) and many other games.
  
-    LDA  $0100       ; AH = 1 (PPU dispatch), AL = 00 (plot_pixel) 
-    INT  $03         ; plot_pixel(X, Y, C) 
- 
-PPU plot_pixel replaces the old plot pixel function in the graphics library (function #1 in INT 18h). This is some of the oldest code in the entire system, likely carried over from the SD-8510/VC-2: 
- 
-* [I:J] addressing mode 
-* Manual bank 2 register load (LDI #2) 
-* Register pressure PUSH X, PUSH C 
-* Overuse of LDA at start, then shows LDB before end 
-* Only uses A,B,X,Y,I,J,K,T 
- 
-This was one of the first routines ever written for the original VC-2, as a test to draw characters to a terminal screen. This was at the time I was constructing the cpu emulator itself; during the writing of this function I added the LDB, PAB and UAB opcodes. That's a good reason to keep this code around, it serves as the design document for the framebuffer and renderer itself. In other words, "it works." It's the source of truth over what it takes to plot a pixel in the framebuffer. The accelerated versions should get a unique entry inside INT 0x18, but not serve as drop-in replacements. That is probably the best way going forward. 
- 
-<codify armasm> 
-; ============================================================================ 
-; AH=01h - Plot MODE 3 Pixel (4bpp packed nibbles) 
-; ============================================================================ 
-; Input:  X = x coordinate (mode dependant) 
-;         Y = y coordinate (mode dependant) 
-;         C = color (0-15) 
-; Output: CF = 0 on success, 1 if out of bounds 
-; ============================================================================ 
-int18_plot_pixel_4bpp: 
-    PUSHA 
- 
-    ; Bounds check using SCREEN_WIDTH / SCREEN_HEIGHT 
-    LDA [@SCREEN_WIDTH] 
-    CMP X, A 
-    JC @plot4_error 
-    LDA [@SCREEN_HEIGHT] 
-    CMP Y, A 
-    JC @plot4_error 
- 
-    PUSH X                      ; save x 
-    PUSH C                      ; save color 
- 
-    LDA [@SCREEN_WIDTH] 
-    SHR A                       ; A = stride (width / 2) 
-    MOV B, A                    ; B = stride 
-    MOV A, Y                    ; A = y 
-    MUL A, B                    ; A = y * stride 
- 
-    POP C                       ; restore color 
-    POP B                       ; restore x into B 
-    PUSH B                      ; save x for nibble check 
-    MOV T, B 
-    SHR T 
-    ADD A, T                    ; A = byte offset 
- 
-    MOV J, A 
-    LDI #2 
- 
-    POP A                       ; A = x 
-    LDTL $01 
-    AND AL, TL 
-    CMP AL, $00 
-    JNZ @plot4_odd 
- 
-plot4_even: 
-    LDBL [I:J] 
-    MOV AL, BL 
-    UAB                         ; AL = odd pixel, BL = even pixel 
-    MOV BL, CL                  ; BL = new color 
-    PAB                         ; AL = (new color << 4) | odd pixel 
-    MOV BL, AL 
-    STBL [I:J] 
-    POPA 
-    CLC 
-    RET 
- 
-plot4_odd: 
-    LDBL [I:J] 
-    MOV AL, BL 
-    UAB                         ; AL = odd pixel, BL = even pixel 
-    MOV AL, CL                  ; AL = new color 
-    PAB                         ; AL = (even pixel << 4) | new color 
-    MOV BL, AL 
-    STBL [I:J] 
-    POPA 
-    CLC 
-    RET 
- 
-plot4_error: 
-    POPA 
-    SEC 
-    RET 
-</codify> 
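The addressing and nibble packing in the routine above can be sketched in Python. This is an illustration, not the system's code: `plot_pixel_4bpp`, the `bytearray` framebuffer, and the 320x200 dimensions are assumptions, and out of bounds is taken to mean x >= width or y >= height.

```python
SCREEN_WIDTH = 320   # assumed; the text draws lines from x=0 to x=319
SCREEN_HEIGHT = 200  # assumed; the text loops y from 0 to 199


def plot_pixel_4bpp(fb: bytearray, x: int, y: int, color: int) -> bool:
    """Plot one 4bpp pixel. Returns False if out of bounds (the CF=1 case)."""
    if not (0 <= x < SCREEN_WIDTH and 0 <= y < SCREEN_HEIGHT):
        return False
    stride = SCREEN_WIDTH // 2          # two pixels per byte
    offset = y * stride + x // 2        # byte holding this pixel
    color &= 0x0F
    if x & 1 == 0:
        # even x: new color goes in the high nibble, odd pixel is kept
        fb[offset] = (color << 4) | (fb[offset] & 0x0F)
    else:
        # odd x: even pixel is kept in the high nibble, color goes low
        fb[offset] = (fb[offset] & 0xF0) | color
    return True


if __name__ == "__main__":
    fb = bytearray(SCREEN_WIDTH // 2 * SCREEN_HEIGHT)
    plot_pixel_4bpp(fb, 0, 0, 0xA)
    plot_pixel_4bpp(fb, 1, 0, 0x5)
    print(hex(fb[0]))  # 0xa5
```

The mask-and-or in each branch plays the role of the UAB/PAB pair in the assembly: one nibble of the existing byte is preserved while the other is replaced.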
- 
-=== PLINE 
 These figures are for screen clears using LINE in a Y=0 to 199 loop. These figures are for screen clears using LINE in a Y=0 to 199 loop.
  
-^ Method ^ Pixels per Second ^ +^ Method ^ Lines per Second ^ 
-| BASIC LINE command | 22,000 | +| BASIC LINE command | 1,000 | 
-| INT 18h draw line | 26,000 | +| INT 18h draw line | 2,500 |
-| BASIC using PPU via INT 0x18 | 156,000 | +| BASIC using PPU via INT 0x18 | 15,000 | 
-| BASIC using PPU direct call | 165,000 | +| BASIC using PPU direct call | 16,000 | 
-| PPU via INT 0x18 | 15,000,000 | +| PPU via INT 0x18 | 150,000 | 
-| PPU via INT 0x03 (direct) | 55,000,000 | +| PPU via INT 0x03 (direct) | 500,000 | 
-| PLINE (PPU via opcode) | 77,000,000 | +| PLINE (PPU via opcode) | 700,000 |
  
-As it turns out the line drawing algorithm, even though it is in assembly, is the limiting factor. The BASIC interpreter might execute 50 to 100 lines of assembly to get to INT 18h, but INT 18h is executing over 13,000 lines of assembly to draw a single horizontal line from 0,0 to 319,0. Thus replacing INT $18 with a PLINE opcode increased speed 7.5x to 165,000 pps fill rate. This implies 5-10 16x16 sprites with a smart draw algorithm. Of course, for BASIC, you have DRAWCHAR, and I suspect PBLIT and a series of SPRITE commands for BASIC will be the real breakthrough we need for our BASIC. But it's nice to know PLINE works well in BASIC. The numbers above represent 510 lines per second. With this much speed it is possible to write an ELITE clone in BASIC. That is going to be the benchmark. +As it turns out the line drawing algorithm is also 1 cycle bound. Meaning, in the tightest loop possible, at 1.28 MIPS, 700,000 lines per second is our limit. This is a phenomenal number for the era, a much stronger result than a DVG chip of the era. However, if you move down to 0.3 MIPS, the numbers start making more sense.
  
-The big win comes from the direct call in assembly. The speedup is equivalent to dropping five instructions: two INT calls, two RTI, and a LDA for the inner INT $3 call. +An example would be writing an ELITE clone. Clocking down to 0.35 MIPS and using the INT 0x03 interface, you would still be comfortably north of 30,000 lines per second for a game that requires at most 5,000 lines for a busy scene. This is the power of the PPU; it unlocks worlds. You could make a 60fps flicker-free clone, or you could slow it down to 20fps for extra juicy retro appeal. Why not? Ocarina of Time did it. It's your choice.
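The clock-down estimate can be checked with a short Python sketch. It assumes throughput scales linearly with clock rate, which is an assumption rather than something measured here; `scaled_rate` is a hypothetical helper, and the 500,000 figure is the INT 0x03 row from the table above.

```python
def scaled_rate(rate_at_ref: float, ref_mips: float, target_mips: float) -> float:
    """Scale a measured rate to another clock, assuming linear scaling."""
    return rate_at_ref * target_mips / ref_mips


if __name__ == "__main__":
    lines_per_second = scaled_rate(500_000, ref_mips=1.28, target_mips=0.35)
    print(f"{lines_per_second:,.0f} lines per second at 0.35 MIPS")
    for fps in (60, 20):
        print(f"{fps} fps: {lines_per_second / fps:,.0f} lines per frame")
```

Under that assumption the INT 0x03 path gives roughly 136,000 lines per second at 0.35 MIPS, which is about 2,300 lines per frame at 60fps or about 6,800 at 20fps.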
- +
-==== Fast tiles and sprites in BASIC with PLINE +
-If you have a smart drawing routine that reads contiguous pixels, then clusters of pixels can draw in ~1 cycle. So instead of having to call PPIXEL four times for the top row of a sprite, if the pixels are next to each other they can be detected as such and drawn via PLINE. While this might not be practical on-the-fly, it is interesting to consider a packed image format that would allow this to operate quickly; i.e. store the sprite data itself as a series of line draws. In a 16x16 sprite it is conceivable you could use about half the number of PLINE calls as you would use PPIXEL. The difference is that you would need a special format for storing the image data. This idea fits BASIC's "DATA" statements very well as it would be in a compressed format. The idea is you would store images in a kind of bytecode, and then the BASIC ''BLIT'' routine would read that format to quickly draw a sprite to screen. Of course, there is (will be) a PPU BLIT soon, so no one would use PLINE for this -- but it's possible to do so. +
  
 === PRECT and PFRECT === PRECT and PFRECT
 ''AH=$03 draw_rect''\\ ''AH=$04 fill_rect''\\ Tested and working. 160,000 rps and up in a normal loop. ''AH=$03 draw_rect''\\ ''AH=$04 fill_rect''\\ Tested and working. 160,000 rps and up in a normal loop.
 +
 +I will not comment much on this or on the next commands (PCIRCLE), except to say that they are much less frequently used but, as far as draw primitives go, they are lightning fast.
  
 === PCIRCLE and PFCIRCLE === PCIRCLE and PFCIRCLE
Line 144: Line 54:
 | large | 4500/sec | Art | | large | 4500/sec | Art |
  
-For any meaningful use case, this is probably good enough. At 100,000+ circles per second at radius 30 and under, you could write any kind of game with this circle. +Again, not much to say; there is a use for this in some games, but I expect this (and RECT) will mainly be used for UI work.
  
 === PCLEAR === PCLEAR
Line 150: Line 60:
  
 == Sprites == Sprites
-Sprites are arguably the reason for a PPU, and why it's called a PPU and not a 2d accelerator. +Sprites are arguably the reason for a PPU, and why it's called a PPU and not a 2d accelerator. Early consoles all had sprites, in contrast to microcomputers which didn't. The C64 changed that with 8 hardware sprites in 1982, and went on to dominate the era. After that, sprite count became a defining factor in console hardware.
  
-INT 18h LOAD and DRAW for 4bpp and 8bpp: Tested and working +* Atari 2600 5 sprites (everything is a scanline)
 +* Intellivision 8 sprites (no scanline limits) 
 +* ColecoVision 32 sprites (4 per scanline)
 +* NES 64 sprites (8 per scanline) 
 +* SNES 128 sprites (32 per scanline)
  
-The INT 18h DRAW function can draw a 16x16 sprite in 1.3ms. This is very slow. +Our INT 18h sprite functions are mode-aware (4bpp and 8bpp) but slow. INT 18h DRAW can draw a 16x16 sprite in 1.3ms. This is very slow.
  
 <codify> <codify>
Line 182: Line 96:
     JMP @spritebench_loop     JMP @spritebench_loop
 </codify> </codify>
- 
-If you take out the spin-wait loop, the clear screen and the VSTEP instruction, this code executes 303 draws in 1.3ms per draw. That's painfully slow. Why is it so slow? The draw_sprite assembly routine goes through about 1,800 assembly instructions to draw a single sprite! We cannot MEMCOPY if we want pixel transparency in the framebuffer. And, even if we could, it would at best double our throughput on the fast path (for x & 1 == 0). 
- 
-The conclusion is, it works, but 1.3ms per draw is essentially unusable. Even a ColecoVision or a NES had more sprite power than the SD-8516 -- //but not without a PPU!// The PPU provides hardware accelerated sprites, and that's just what we need here. 
  
 === Wiring up the PPU === Wiring up the PPU
Line 259: Line 169:
 </codify> </codify>
  
-This test attempts to draw a sprite to every valid screen location; from 0,0 to 303,183 -- a total of 55,936 times. Can you guess how fast it was? The routine completed in 0.45. That's right, calibre 0.45 -- 0.45 microseconds that is. Yes, that's right, it's so fast I had to repeat the test so many times because one row (~303 blits) was completing in less than a millisecond. This is very fast, and very good! At this speed, we could draw 3600 sprites per frame. But this doesn't take into account clearing the screen and syncing the frame. Let's try a test; can we draw 32 tiles per frame, at 60fps? +This test attempts to draw a sprite to every valid screen location; from 0,0 to 303,183 -- a total of 55,936 times. Can you guess how fast it was? The routine completed in 0.45 microseconds. This is very fast, and very good! At this speed, we could draw 3600 sprites per frame. But this doesn't take into account clearing the screen and syncing the frame. Let's try a test; can we draw 32 tiles per frame, at 60fps?
  
 === Syncing the Frames === Syncing the Frames
Line 273: Line 183:
 * Up to 5600 sprites //per frame// at 30fps * Up to 5600 sprites //per frame// at 30fps
  
-Now, if you reserve 50% of your game for logic, that's 1024 sprites per frame at 60fps. The SNES could do 128 (but only 32 per scanline). This result places us firmly in the super-high end early 90s arcade board territory; Sega Y board (1988) was the first board to crack the 2000 barrier, while even later boards like the SNK Neo Geo (1990) were hardware limited to 381 sprites. It was boards like this that enabled the superscalar arcade games of the 90s. Having a microcomputer with this kind of graphics powerhouse would have been the dream of every microcomputer aficionado in the 80s/90s. +//Note: The C version trebles these numbers, reaching over 7,000 sprites per frame without trying to optimize the benchmark loop.//
 + 
 +Now, if you reserve 50% of your game for logic, that's a minimum of 1024 sprites per frame at 60fps. The SNES could do 128 (but only 32 per scanline). This result places us firmly in the super-high end early 90s arcade board territory; Sega Y board (1988) was the first board to crack the 2000 barrier, while even later boards like the SNK Neo Geo (1990) were hardware limited to 381 sprites. It was boards like this that enabled the superscalar arcade games of the 90s. Having a microcomputer with this kind of graphics powerhouse would have been the dream of every microcomputer aficionado in the 80s/90s.
  
 //And don't knock 30fps. There is also some value to 30fps. Games like Teenage Mutant Ninja Turtles (NES, 1989), Ghosts 'n Goblins (NES, 1986) and Contra Force (NES, 1992) ran internally at 30fps. Even SNES games like Return of Double Dragon, or N64/PS1 games (ex. Super Mario 64) would run at 30fps. Other famous 30fps games include Ocarina of Time (N64, 1998), Soul Reaver (PS1, 1999), Resident Evil, Tomb Raider and GoldenEye 007. These were great, legendary games; Sometimes 30fps is ok. 45fps is definitely OK.// //And don't knock 30fps. There is also some value to 30fps. Games like Teenage Mutant Ninja Turtles (NES, 1989), Ghosts 'n Goblins (NES, 1986) and Contra Force (NES, 1992) ran internally at 30fps. Even SNES games like Return of Double Dragon, or N64/PS1 games (ex. Super Mario 64) would run at 30fps. Other famous 30fps games include Ocarina of Time (N64, 1998), Soul Reaver (PS1, 1999), Resident Evil, Tomb Raider and GoldenEye 007. These were great, legendary games; Sometimes 30fps is ok. 45fps is definitely OK.//
Line 287: Line 199:
 == Conclusion == Conclusion
 A decent tile and sprite engine is the foundation of an 80s/90s console PPU and of a high end arcade board. Extensive testing must be done before I can come back and add anything to this. But, there are a few little things that may prove useful, i.e. flip transforms. As it stands, the PPU is considered ready for testing in the main system. A decent tile and sprite engine is the foundation of an 80s/90s console PPU and of a high end arcade board. Extensive testing must be done before I can come back and add anything to this. But, there are a few little things that may prove useful, i.e. flip transforms. As it stands, the PPU is considered ready for testing in the main system.
 +
 +^ VC-4 ^ 2026 ^ 50,000+ ^
 +| Sega Saturn (2D) | 1994 | ~5,000–8,000 |
 +| PSX (2D mode) | 1994 | ~4,000 |
 +| Sega Y Board (Galaxy Force II) | 1988 | ~2,000 |
 +| 3DO | 1993 | ~1,500–2,000 (no formal limit) |
 +| Sega System 32 | 1990 | ~1,000 |
 +| Neo Geo MVS | 1990 | 381 |
 +| Capcom CPS-1 (Street Fighter II) | 1988 | 256 |
 +| Sega System 16 (Outrun) | 1986 | ~128 |
 +| NES | 1983 | 64 |
 +
 +Nothing touches the VC-4.
 +
 +
sd/sd-8516_ppu.1777763281.txt.gz · Last modified: by appledog
