= TinyC for the SD-8516 == Introduction * February 2026. A self-hosted C compiler has been my goal for quite some time. I did BASIC and Forth first. BASIC was useful, Forth -- not so much (from a practical standpoint). Enter SD/CC -- Stellar Dynamics C compiler. I just got sick of not having one so I wrote one. The key that helped me start was hearing something about the accumulator being used to return values -- like the TOS in Forth. This was something I had been thinking about for months. And so, I wrote a parser that required int main(void) { return 0; } and emitted LDA #0 and RET. The debugger was simply printing the return value. == March 2026 At the end of March I was able to get something set up where it could compile: int main(void) { return 0; } The problem was that isn't very much. Weeks went by and I couldn't figure it out. It was supposed to be just like the BASIC expression parser. A recursive descent parser. So the first thing I added was additive and multiplicative expressions. Then suddenly it started working. I honestly don't know what changed because it wasn't working then suddenly it was and I couldn't remember changing anything in the code. The key takeaway here is the concept of TOKENS and an EMITTER. The parser can consume TOKENS. This is a more abstract notion than keyword string matching in Stellar BASIC. More automated. The emitter is a bit like in forth but more granular. I have 'emitter' functions and they act like lego blocks. I have to call them in order and they will dump bytes to the stream. This part was taken from the javascript assembler. cc_parse_expr: ; Parse left operand CALL @cc_parse_term ; Check error LDAL [@CC_ERROR] CMP AL, #0 JNZ @cc_pe_done cc_pe_loop: ; Check for '+' or '-' LDAL [@CC_TOKEN_TYPE] CMP AL, TOK_PLUS JZ @cc_pe_add CMP AL, TOK_MINUS JZ @cc_pe_sub This works for + and -, we can return 3+5 now. Next I added negative numbers. I had to add a check for TOK_MINUS first and then emit a SUB A, B to get it to negate. == Integers * April 14th, 2026. I've been trying to get a symbol table up and to define int; but it isn't working. I think there is an off by one bug on the stack. Storing variables on the stack is something I don't fully understand or trust yet. What is really biting me in the ass is the system I came up with to pair 16-bit registers into 24-bit pointers. It's a charmingly retro, period-authentic thing. But it makes coding a bit hard when you need multiple pointers. I have a 'secret weapon' -- the ability to add more registers any time I want, but, I want to avoid going down that route if possible. Having so many registers already is supposed to be good enough. === So what's a symbol table? in BASIC we have 26 letters for integer variables. It's like that. It's more like that than the NVAR variable system. But the point is, it's for ints. So it's a lot like Stellar BASIC's "int letters". It maps names to ints. The C source says ''a'', and the compiler needs to know: what is ''a''? Where does it live? What type is it? This is the problem LVAR/DVAR tried to solve and mainly failed. But, they led to this. This is like "S-VAR". But for C not BASIC. In an interpreter like BASIC the mapping exists at runtime. Every time execution hits ''LET A = 5'', the interpreter searches NVAR for "A" and stores the value 5 there. Every time it encounters ''PRINT A'', it searches NVAR again and reads 5 back. The name-to-value lookup happens over and over, on every execution of every line. A compiler eliminates that. During compilation, it resolves "a" to a stack location once, emits a machine instruction with the //location// hardcoded, and never thinks about the name "a" again. Like a label in the assembler. The compiled program doesn't contain the string "a" anywhere — just ''LDA [GLZ-2]''. The symbol table is the structure that makes this one-time resolution possible. Each entry is a 12-byte record laid out like this: * Offset 0-7: Name (null-terminated, padded to 8 bytes) * Offset 8: Type (0=int) * Offset 9: Kind (0=local) * Offset 10-11: Value (16-bit, the stack frame offset) The table lives at ''CC_SYMTAB'' and grows upward. ''CC_SYM_COUNT'' tracks how many entries exist. === Compiling ''int a = 5;'' # ''CC_FRAME_OFFSET'' goes from 0 to 2 (each int is 2 bytes) # ''cc_sym_add'' is called with name="a", type=0, kind=0, value=2 # It computes the slot address: ''CC_SYMTAB + (count * 12) = $00F040'' # Writes "a\0\0\0\0\0\0\0" (name padded to 8), then 0x00 (type), 0x00 (kind), 0x02 0x00 (value = 2, little-endian) # Increments ''CC_SYM_COUNT'' to 1 Next for ''int b = 3;'' # ''CC_FRAME_OFFSET'' goes from 2 to 4 # Slot address: $00F040 + (1 * 12) = $00F04C # Writes "b\0..." with value=4 When ''return a;'' needs the variable:** #. ''cc_sym_find'' does a linear scan from $00F040 #. Compares each record's name against ''CC_TOK_BUF'' using strcmp #. Finds "a" at record 0, returns a pointer (usu. BLY or LLZ) pointing to that record #. The caller reads the value field at BLY+10, gets 2 #. Emits ''LDA [GLZ-2]'' — load from frame pointer minus 2 When a new ''.compile'' starts: ''cc_sym_clear'' sets ''CC_SYM_COUNT'' to 0. Like calling NVAR_CLEAR on RUN in BASIC. The old data is still in memory, but invisible since nothing scans past count. The table is purely compile-time. The compiled program has no idea it exists. Variable names are resolved to numeric stack offsets during compilation and baked into the emitted instructions. === Using the Stack for Ints What I have been mulling over for the last couple months is using the stack to hold local variables. It's supposed to be something called a ''stack frame''. I hear about it, again, while I was implementing Forth. It was some kind of 'crazy thing C did,' or something. But it stuck in my mind. This was the key breakthrough. There were a lot of disparate concepets floating around but I finally nailed it down. it was a symbol table that stored handles (like a handle table in D-VAR system) but it was stored on the stack. Frankly I don't know who the marketing genius is who thought all this up. I think if we could go back in time things would have worked differently. This is some real alien technology stuff IMO. SO let it revolve around in your head. Why on the stack? === Why Stack Frames? The stack is used because local variables need to appear when a function is entered and disappear when it returns, and the stack naturally provides that lifetime. When ''int main(void) { int a = 5; int b = 3; return a + b; }'' is compiled, the emitted code does this: PUSH GLZ ; save old frame pointer (3 bytes on stack) MOV GLZ, SP ; GLZ = current stack top — this is our reference point SUB SP, #2 ; reserve 2 bytes for 'a' — SP moves down SUB SP, #2 ; reserve 2 bytes for 'b' — SP moves down again The stack now looks like this (growing downward): ┌───────────────────────┐ GLZ-2 │ return info │ ├───────────────────────┤ GLZ points │ saved GLZ (3 bytes) │ ├───────────────────────┤ GLZ-2 │ a (2 bytes) │ ├───────────────────────┤ GLZ-4 │ b (2 bytes) │ ├───────────────────────┤ SP points │ (free space) │ └───────────────────────┘ ''a'' is always at ''GLZ-2''. ''b'' is always at ''GLZ-4''. These offsets are fixed at compile time and baked into the instructions. ''STA [GLZ-2]'' stores into ''a''. ''LDA [GLZ-4]'' reads from ''b''. But why male models? (But why on the stack?) You could assign ''a'' to address $1000 and ''b'' to $1002. For a single function that's called once, it would work. The problem is recursion. Consider: int factorial(int n) { if (n == 0) return 1; return n * factorial(n - 1); } If ''n'' lives at a fixed address, the second call to ''factorial'' overwrites the first call's ''n''. When the inner call returns and the outer call tries to read ''n'', the value is gone. Every call needs its own private copy of ''n''. So you see, the real problem that is being solved here is memory management. The stack is a single source of 'pre-allocated memory' that we can use without bumping into other things, without a lot of complex mechanics. We could, in fact, define a heap area (ex. NVAR) but isn't that just... a stack? Boom, big reveal. They're all stacks, deep down. Even a linked list (INSQUE). It's just an unsorted stack with pointers to traverse the nodes in order. The ''stack'' (the PUSH/POP stack) solves this automatically. Each call to ''factorial'' executes ''PUSH GLZ / MOV GLZ, SP / SUB SP, #2'', which carves out a new region of stack below the previous call's region. Five nested calls means five separate copies of ''n'' at five different stack addresses, each accessible through its own saved GLZ value. When a call returns, ''MOV SP, GLZ / POP GLZ'' releases that region and restores the previous frame pointer. The caller's variables reappear at their original locations. == Working through if-else To get a new language feature like ''if-else'', we have to follow the same pattern as in BASIC. # Add it to the lexer (match keyword). # Parse the statement (done in the executor in BASIC, but a separate step here). # Emitter -- This is the big diff, since we don't do it 'now' but we have to do it 'later' (at runtime). The plan is the exact same as assembly (labels); back-patching, technically a two-pass, but done in the moment so we say it's one pass. # Emit the condition (cc_parse_expr — result in A) # Emit CMP A, #0 # Emit JZ with a placeholder address — save CC_HERE before writing it # Parse the then-block (emits its code) # Go back and patch the JZ address to point here (past the block) for if-else: # Emit condition (same) # Emit CMP A, #0 (same) # JZ placeholder (same) # Parse then-block (same) # Emit JMP placeholder (skip over else) # Patch placeholder to here # Parse else-block # Patch placeholder to here === Documenting the process Just like adding a keyword in BASIC. 1. Add the tokens. .equ TOK_IF 17 .equ TOK_ELSE 18 2. Add the strings. cc_kw_if: .bytes "if", 0 cc_kw_else: .bytes "else", 0 3. Add the keyword checks in the keyword checks section. ; Check "if" LDELM CC_TOK_BUF LDFLD @cc_kw_if LDAH $02 INT $12 JZ @cc_nt_kw_if ; Check "else" LDELM CC_TOK_BUF LDFLD @cc_kw_else LDAH $02 INT $12 JZ @cc_nt_kw_else And the handlers, cc_nt_kw_if: LDAL TOK_IF STAL [@CC_TOKEN_TYPE] POP I RET cc_nt_kw_else: LDAL TOK_ELSE STAL [@CC_TOKEN_TYPE] POP I RET We have to add it to the parser: cc_parse_stmt: LDAL [@CC_TOKEN_TYPE] CMP AL, TOK_RETURN JZ @cc_parse_return CMP AL, TOK_INT JZ @cc_parse_vardecl CMP AL, TOK_IDENT JZ @cc_parse_assign CMP AL, TOK_IF ; Added here at the end JZ @cc_parse_if ; Unknown statement LDAL #1 STAL [@CC_ERROR] LDELM @cc_str_expect_stmt LDAH $18 INT $10 RET Finally here is the if parser. Note three important changes from BASIC. # Instead of manually checking for things like ) the system uses a TOKEN table (TOK_LPAREN, etc). This generalized approach, I think, adds to thinkability and expandability. # You don't see int05_skip_space everywhere. Instead we have a cc_skip_ws which is included in cc_next_token. This isn't as 'inlined' as basic, but it doesn't need to be since it's a compiler. Keeping things like skip_space out of the way aids code understandability. # It's an emitter not an interpreter. Instead of doing it "now" we need to have a plan to do it 'later'. ; --- cc_parse_if --- ; 'if' '(' expr ')' block ('else' block)? ; ; Emitted code shape (if only): ; ← result in A ; CMP_IMM A, #0 ← test condition ; JZ patch1 ← skip then-block if false ; ; patch1: ← JZ lands here ; ; Emitted code shape (if/else): ; ; CMP_IMM A, #0 ; JZ patch1 ← skip to else if false ; ; JMP patch2 ← skip over else ; patch1: ← else starts here ; ; patch2: ← both paths rejoin here ; cc_parse_if: ; Consume 'if' CALL @cc_next_token ; Expect '(' LDAL TOK_LPAREN CALL @cc_expect LDAL [@CC_ERROR] CMP AL, #0 JNZ @cc_pi_done ; Parse condition expression (result in A) CALL @cc_parse_expr LDAL [@CC_ERROR] CMP AL, #0 JNZ @cc_pi_done ; Expect ')' LDAL TOK_RPAREN CALL @cc_expect LDAL [@CC_ERROR] CMP AL, #0 JNZ @cc_pi_done ; Emit: CMP_IMM A, #0 (opcode=91, reg=0, imm16=0,0) LDAL #91 CALL @cc_emit_byte LDAL #0 ; REG_A CALL @cc_emit_byte LDAL #0 ; imm16 lo = 0 CALL @cc_emit_byte LDAL #0 ; imm16 hi = 0 CALL @cc_emit_byte ; Emit: JZ — save address of the operand for patching LDAL #101 ; JZ opcode CALL @cc_emit_byte ; Save CC_HERE — this is where the 3-byte address will go LDBLX [@CC_HERE] PUSH BLX ; *** patch1 address saved on stack *** ; Emit 3 placeholder bytes LDAL #0 CALL @cc_emit_byte CALL @cc_emit_byte CALL @cc_emit_byte ; Parse then-block CALL @cc_parse_block LDAL [@CC_ERROR] CMP AL, #0 JNZ @cc_pi_bail1 ; Check for 'else' LDAL [@CC_TOKEN_TYPE] CMP AL, TOK_ELSE JZ @cc_pi_else ; No else = patch1 = CC_HERE (jump past then-block) POP BLX ; BLX = patch1 location CALL @cc_backpatch JMP @cc_pi_done cc_pi_else: ; Consume 'else' CALL @cc_next_token ; Emit: JMP at end of then-block LDAL #100 ; JMP opcode CALL @cc_emit_byte LDBLY [@CC_HERE] PUSH BLY ; *** patch2 address saved on stack *** LDAL #0 CALL @cc_emit_byte CALL @cc_emit_byte CALL @cc_emit_byte ; We need patch1 now. Swap: POP BLY ; BLY = patch2 POP BLX ; BLX = patch1 PUSH BLY ; re-push patch2 CALL @cc_backpatch ; patch1 = current CC_HERE ; Parse else-block CALL @cc_parse_block LDAL [@CC_ERROR] CMP AL, #0 JNZ @cc_pi_bail1 ; Patch2 = CC_HERE (JMP lands past else-block) POP BLX ; BLX = patch2 location CALL @cc_backpatch cc_pi_done: RET cc_pi_bail1: POP BLX ; discard patch address RET For a backpatch helper (in case we need this elsewhere) we just copy HERE: ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; backpatch helper cc_backpatch: PUSH ELM LDELM [@CC_HERE] STELM [FLD] POP ELM RET === Why not inline the backpatch helper? Point of order. This is small enough to inline. And just like in the CPU, maybe I will end up inlining all the fetch_byte() calls. But this is a compiler -- we are not shooting for speed here, we are skating on the thin ice of a codebase which is bound to balloon in terms of complexity. Documentation and small, clean functions are king here. So, the point is being made by separating this out into a function. == while is easy Once we have backpatching and the expression machinery, every new control flow construct is just a new arrangement of jumps. ''for'' would be the same. The mantra is "All loops are possible with goto" (See: [[https://www.appledog.ca/wiki/doku.php?id=sd:sd-8516_user_s_guide#more_loops||more loops]] from the SD-8516 User's Guide). After while, the critical path to self-hosting is: * Functions and function calls * Pointers / char* * Array indexing == Functions Ahh the old y = x. F(x) = x^2. Opening up the discussion of //functions// opens up the discussion of forward referencing. But, I'm not going to worry about it because, remember, this C is only so we can build a self-hosting compiler and iterate on it. This version will be thrown away as soon as it can compile something better -- in C. //more to come// === Storing functions on the symbol table The symbol table is 16 bits and pointers here are 16 bits. We can hack this because we can assume everything will fit in the same bank. Later in the next version this won't matter, we will just make it 32 bits. //more to come// == Pointers, char * My take on pointers is that they will be ints on the symbol table. Like functions and ints. All locations on the table. Same with CHAR types? but char needs to be 1 byte, I think. But char* is just a pointer we look at like a char. //more to come// Since we're on a 16 bit symbol table, a pointer is a 16-bit address (same size as ''int''). Same storage, same stack slot. The difference is how we treat it. We treat this as a pointer to within the same bank. So for this TinyC all pointers are within the same bank as the code. Three things needed: * ''&x'' address-of operator * ''*p'' dereference read The pointer value is in A. Build a 24-bit pointer in a scratch register to load a value. Ex. TLK. * ''*p = expr'' dereference write Parse the expression (value in A), get the pointer into B, store through (ex. TLK). For char * the difference is it's width at load. char * is 8 bits, int * is 16 here. == Strings We should be able to do strings with char*. Char * was the next thing after int *. It was the usual. add TOK_CHAR. cc_kw_char. etc. The usual pointer clobber bugs. == Globals Can fake these with ints I think. //more to come// == Array indexing Decomposes into pointers IIRC. //more to come//