SD-8516 PPU

This is a short reference to the XY-2000 PPU.

Introduction

The following shows the speed tests and design decisions that went into the XY-2000 PPU.

Why a PPU? I'm not making one “just because a Nintendo has one” or something like that. I'm making one because, as it stands, the computer doesn't run fast enough to allow top-end game programming in BASIC.

PPIXEL (plot_pixel)

Method	Pixels per Second
BASIC PIXEL command	2,100
INT 18h plot pixel	26,000
INT 18h plot pixel (no bounds check)	30,000
PPU via INT 0x18	90,000
PPU via INT 0x03 (direct)	220,000
PPIXEL (PPU via opcode)	460,000

The figures are pixels per second, measured while clearing the screen in a tight loop. This is essentially the fastest practical use of plot pixel: read X, Y and C data in a loop and plot it. Unrolling the loop 20 times removes the loop overhead, but the OP.PPIXEL version still runs about 2x faster than the INT 3 version. This makes sense, since every call to INT 0x03 requires a LDA $0101 to access plot_pixel; the PPIXEL command gives this for free. PPIXEL is therefore always at least twice as fast as INT 0x03 plot_pixel. It is actually a bit faster still, since it does not cross the host-guest bridge inside the opcode call.

460,000 pixels/second is about 7,667 pixels per frame at 60fps. That is almost exactly 30 fully drawn 16×16 sprites (256 pixels each). Suggested use is to MEMCOPY a pre-drawn background into the frame-buffer and then draw sprites with PPIXEL (if you want to use PPIXEL for sprites). This way you don't have to draw the whole screen, and you don't have to draw the sprite background. It is hard to say how many sprites you will actually be able to draw in a frame, because the amount of empty background space varies from sprite to sprite.
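The frame budget above is simple arithmetic on the measured fill rate. A minimal sketch in Python, assuming the 460,000 pixels/second PPIXEL figure from the table and treating every sprite as fully opaque (the function names are made up for illustration):

```python
# Frame budget from a measured fill rate (figure taken from the PPIXEL table).
PPIXEL_RATE = 460_000      # pixels per second, PPIXEL opcode
FPS = 60                   # target frame rate
SPRITE_W = SPRITE_H = 16   # sprite size in pixels


def pixels_per_frame(rate: int, fps: int = FPS) -> float:
    """Pixels that can be plotted in one frame at the given fill rate."""
    return rate / fps


def sprites_per_frame(rate: int, fps: int = FPS,
                      w: int = SPRITE_W, h: int = SPRITE_H) -> float:
    """Upper bound on fully opaque w x h sprites per frame."""
    return pixels_per_frame(rate, fps) / (w * h)


if __name__ == "__main__":
    print(round(pixels_per_frame(PPIXEL_RATE)))      # 7667
    print(round(sprites_per_frame(PPIXEL_RATE), 1))  # 29.9
```

Sprites with transparent pixels cost fewer plots, so the real count per frame can be higher than this bound suggests.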

This initial test shows that a PPU construct has value, that it can serve as a drop-in boost to the code in INT 18h, that it should initially be implemented as a subcall of INT $03 with AH=$01, and that the practical value of moving to a dedicated opcode is 95% code density and 5% improved speed.

Code Replacement

PPIXEL can be called via INT 0x03:

  LDA  $0100       ; AH = 1 (PPU dispatch), AL = 00 (plot_pixel)
  INT  $03         ; plot_pixel(X, Y, C)

PPU plot_pixel replaces the old plot pixel function in the graphics library (function #1 in INT 18h). This is some of the oldest code in the entire system, likely carried over from the SD-8510/VC-2:

[I:J] addressing mode
Manual bank 2 register load (LDI #2)
Register pressure PUSH X, PUSH C
Overuse of LDA at start, then shows LDB before end
Only uses A,B,X,Y,I,J,K,T

This was one of the first routines ever written for the original VC-2, as a test to draw characters to a terminal screen. At the time I was still constructing the CPU emulator itself; while writing this function I added the LDB, PAB and UAB opcodes. That is a good reason to keep this code around: it serves as the design document for the framebuffer and the renderer. In other words, “it works.” It is the source of truth for what it takes to plot a pixel in the framebuffer. The accelerated versions should get their own entry inside INT 0x18 rather than serve as drop-in replacements. That is probably the best way forward.

; ============================================================================
; AH=01h - Plot MODE 3 Pixel (4bpp packed nibbles)
; ============================================================================
; Input:  X = x coordinate (mode dependent)
;         Y = y coordinate (mode dependent)
;         C = color (0-15)
; Output: CF = 0 on success, 1 if out of bounds
; ============================================================================
int18_plot_pixel_4bpp:
    PUSHA

    ; Bounds check using SCREEN_WIDTH / SCREEN_HEIGHT
    LDA [@SCREEN_WIDTH]
    CMP X, A
    JC @plot4_error
    LDA [@SCREEN_HEIGHT]
    CMP Y, A
    JC @plot4_error

    PUSH X                      ; save x
    PUSH C                      ; save color

    LDA [@SCREEN_WIDTH]
    SHR A                       ; A = stride (width / 2)
    MOV B, A                    ; B = stride
    MOV A, Y                    ; A = y
    MUL A, B                    ; A = y * stride

    POP C                       ; restore color
    POP B                       ; restore x into B
    PUSH B                      ; save x for nibble check
    MOV T, B
    SHR T
    ADD A, T                    ; A = byte offset

    MOV J, A
    LDI #2

    POP A                       ; A = x
    LDTL $01
    AND AL, TL
    CMP AL, $00
    JNZ @plot4_odd

plot4_even:
    LDBL [I:J]
    MOV AL, BL
    UAB                         ; AL = odd pixel, BL = even pixel
    MOV BL, CL                  ; BL = new color
    PAB                         ; AL = (new color << 4) | odd pixel
    MOV BL, AL
    STBL [I:J]
    POPA
    CLC
    RET

plot4_odd:
    LDBL [I:J]
    MOV AL, BL
    UAB                         ; AL = odd pixel, BL = even pixel
    MOV AL, CL                  ; AL = new color
    PAB                         ; AL = (even pixel << 4) | new color
    MOV BL, AL
    STBL [I:J]
    POPA
    CLC
    RET

plot4_error:
    POPA
    SEC
    RET
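
For readers who do not know the assembly dialect, the addressing and nibble logic of the routine can be modelled in a few lines of Python. This is a sketch, not the emulator's code. It assumes a 320×200 screen (the routine reads the real dimensions from SCREEN_WIDTH and SCREEN_HEIGHT), takes the nibble order from the comments in the listing (even x in the high nibble, odd x in the low nibble), and masks the color to 4 bits, which the original may or may not do.

```python
SCREEN_WIDTH = 320    # assumed; the routine loads this from memory
SCREEN_HEIGHT = 200   # assumed


def plot_pixel_4bpp(fb: bytearray, x: int, y: int, color: int,
                    width: int = SCREEN_WIDTH,
                    height: int = SCREEN_HEIGHT) -> bool:
    """Plot one 4bpp pixel. Returns False when out of bounds (CF=1 in the listing)."""
    if not (0 <= x < width and 0 <= y < height):
        return False
    stride = width // 2              # two pixels per byte
    offset = y * stride + x // 2     # byte that holds this pixel
    color &= 0x0F                    # assumption: keep only the low 4 bits
    if x & 1:
        # Odd x: keep the even pixel (high nibble), replace the low nibble.
        fb[offset] = (fb[offset] & 0xF0) | color
    else:
        # Even x: keep the odd pixel (low nibble), replace the high nibble.
        fb[offset] = (color << 4) | (fb[offset] & 0x0F)
    return True


if __name__ == "__main__":
    fb = bytearray((SCREEN_WIDTH // 2) * SCREEN_HEIGHT)
    plot_pixel_4bpp(fb, 0, 0, 0xA)   # even pixel goes to the high nibble
    plot_pixel_4bpp(fb, 1, 0, 0x5)   # odd pixel goes to the low nibble
    print(hex(fb[0]))                # 0xa5
```

The read-modify-write on every pixel is the part the listing spends most of its instructions on, which is why a packed 4bpp plot is a natural candidate for acceleration.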

PLINE

These figures are for screen clears using LINE in a Y=0 to 199 loop.

Method	Pixels per Second
BASIC LINE command	22,000
INT 18h draw line	26,000
BASIC using PPU via INT 0x18	156,000
BASIC using PPU direct call	165,000
PPU via INT 0x18	15,000,000
PPU via INT 0x03 (direct)	55,000,000
PLINE (PPU via opcode)	77,000,000

As it turns out, the line drawing algorithm is the limiting factor, even though it is written in assembly. The BASIC interpreter might execute 50 to 100 instructions to get to INT 18h, but INT 18h executes over 13,000 instructions to draw a single horizontal line from 0,0 to 319,0. Replacing INT $18 with a PLINE opcode therefore raised the BASIC fill rate 7.5x, from 22,000 to 165,000 pps. This implies 5-10 16×16 sprites per frame with a smart draw algorithm. Of course, for BASIC, you have DRAWCHAR, and I suspect PBLIT and a series of SPRITE commands will be the real breakthrough BASIC needs. But it's nice to know PLINE works well in BASIC. The numbers above represent roughly 510 full-width lines per second. With this much speed it is possible to write an ELITE clone in BASIC. That is going to be the benchmark.
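The BASIC figures can be checked with a little arithmetic. A rough sketch in Python, assuming the two BASIC rows of the table and the 320-pixel line width used in the test (helper names are hypothetical). By this arithmetic the line rate comes out near 515, close to the figure of about 510 quoted above.

```python
# Fill rates in pixels per second, copied from the PLINE table.
BASIC_LINE = 22_000          # BASIC LINE command
BASIC_PPU_DIRECT = 165_000   # BASIC using PPU direct call

LINE_WIDTH = 320             # pixels per line in the test (0,0 to 319,0)
FPS = 60
SPRITE_PIXELS = 16 * 16


def speedup(old_rate: int, new_rate: int) -> float:
    """How many times faster the new rate is than the old one."""
    return new_rate / old_rate


def lines_per_second(rate: int, width: int = LINE_WIDTH) -> float:
    """Full-width horizontal lines drawn per second at a given fill rate."""
    return rate / width


def sprites_per_frame(rate: int, fps: int = FPS) -> float:
    """Upper bound on fully opaque 16x16 sprites per frame."""
    return rate / fps / SPRITE_PIXELS


if __name__ == "__main__":
    print(speedup(BASIC_LINE, BASIC_PPU_DIRECT))          # 7.5
    print(round(lines_per_second(BASIC_PPU_DIRECT)))      # 516
    print(round(sprites_per_frame(BASIC_PPU_DIRECT), 1))  # 10.7
```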

The big win comes from the direct call in assembly. The speedup is equivalent to dropping five instructions: two INT calls, two RTIs, and the LDA for the inner INT $03 call.

Fast tiles and sprites in BASIC with PLINE

If you have a smart drawing routine that reads contiguous pixels, then clusters of pixels can draw in ~1 cycle. So instead of calling PPIXEL four times for the top row of a sprite, pixels that sit next to each other can be detected as such and drawn via PLINE. While this might not be practical on the fly, it is interesting to consider a packed image format that would let this run quickly, i.e. store the sprite data itself as a series of line draws. In a 16×16 sprite it is conceivable you could use about half as many PLINE calls as you would PPIXEL calls. The catch is that you would need a special format for storing the image data. This idea fits BASIC's “DATA” statements very well, since the data would be in a compressed format. You would store images in a kind of bytecode, and the BASIC BLIT routine would read that format to quickly draw a sprite to the screen. Of course, there is (will be) a PPU BLIT soon, so no one would use PLINE for this, but it is possible to do so.
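One way to picture the packed format is as run-length spans per row. The sketch below, in Python, is only an illustration of the idea. The span layout (x_start, x_end, y, color), the choice of color 0 as transparent, and the `pline` callback signature are all assumptions, not an existing format or the PPU's API.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int, int, int]   # (x_start, x_end, y, color), hypothetical layout
TRANSPARENT = 0                    # assumed: color 0 means "do not draw"


def encode_row(row: Sequence[int], y: int) -> List[Span]:
    """Turn one row of pixel colors into horizontal same-color runs."""
    spans: List[Span] = []
    x = 0
    while x < len(row):
        color = row[x]
        start = x
        while x < len(row) and row[x] == color:
            x += 1                 # extend the run while the color repeats
        if color != TRANSPARENT:
            spans.append((start, x - 1, y, color))
    return spans


def encode_sprite(pixels: Sequence[Sequence[int]]) -> List[Span]:
    """Encode a whole sprite, row by row."""
    spans: List[Span] = []
    for y, row in enumerate(pixels):
        spans.extend(encode_row(row, y))
    return spans


def draw_sprite(spans: Sequence[Span], ox: int, oy: int,
                pline: Callable[[int, int, int, int, int], None]) -> None:
    """Replay the spans at screen offset (ox, oy), one line call per span."""
    for x0, x1, y, color in spans:
        pline(ox + x0, oy + y, ox + x1, oy + y, color)


if __name__ == "__main__":
    row = [0, 3, 3, 3, 3, 0, 5, 5]
    print(encode_row(row, 0))      # [(1, 4, 0, 3), (6, 7, 0, 5)]
```

In this example six visible pixels become two line calls. How much a real sprite saves depends on how many same-color runs its rows contain; a heavily dithered sprite could end up with nearly one span per pixel.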

Appledog
