sd:tinyc_for_the_sd-8516
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| sd:tinyc_for_the_sd-8516 [2026/04/17 06:21] – appledog | sd:tinyc_for_the_sd-8516 [2026/06/05 16:53] (current) – appledog | ||
|---|---|---|---|
| Line 2: | Line 2: | ||
| == Introduction | == Introduction | ||
| - | This TinyC is more like the " | + | < |
| - | Current Feature Set: | + | TinyC refers to V1 (tinyc.asm) and V2 (tinyc.c), which is the same as V1 but written in C. V1 is intended to compile V2, and then we have self-hosting C. Thus, this TinyC should be more like the " |
| + | |||
| + | V1 Current Feature Set: | ||
| * Expressions with precedence | * Expressions with precedence | ||
| * Local variables | * Local variables | ||
| Line 24: | Line 26: | ||
| This is enough that I was able to start writing functions in C like abs, min, max, strlen, strcmp, atoi/itoa, and so on. | This is enough that I was able to start writing functions in C like abs, min, max, strlen, strcmp, atoi/itoa, and so on. | ||
| + | |||
| + | == Function Library | ||
| + | * [[List of TinyC built-in functions]] | ||
| == History | == History | ||
| - | A self-hosting C compiler had been my goal for quite some time. But I didn't know how to write one. I hacked out a BASIC first, then followed some guides and source code to get a Forth running. But C eluded me for several months. Eventually I was able to leverage what I had learned doing BASIC to get a parser that could compile int main(void) { return 0; } and emitted LDA #0 and RET. The debugger was simply printing the return value. | + | A self-hosting C compiler had been my goal for quite some time. But I didn't know how to write one. I hacked out a BASIC first, then followed some guides and source code to get a Forth running. But C eluded me for several months. Eventually I was able to leverage what I had learned doing BASIC to get a parser that could compile int main(void) { return 0; } and emitted LDA #0 and RET. The debugger was simply printing the return value. The rest of the story is found in the [[TinyC Developer Diary]]. Mainly so I don't forget what I learned, as this project has gotten quite big. |
| - | === Expression Parser | + | == Issues |
| - | I did expressions first because | + | The big issue is that I don't //really// know what I am doing. Meaning, it' |
| - | === Three-Level Descent | + | What is even better is if I could write relocatable code. You know, simply writing this out all gives me ideas. Relocatable code! Could be interesting. Ok let's write down some ideas-- [[Relocatable Code]] |
| - | 2 + 3 * 4 | + | |
| - | Should this write 20 or 14? | + | === Compile to disk |
| + | The other solution I am mulling is to rewrite the emitter. There' | ||
| - | The expression parser has to produce code that can handle BEDMAS at run-time. It does this by passing precedence down into three main functions, each responsible for one precedence level. Higher precedence operators bind their operands | + | == What's Missing? |
| + | * Preprocessor: | ||
| + | Structs and unions: cannot define composite types. No member access (., ->). | ||
| + | * Typedef: no type aliases. | ||
| + | * Multi-dimensional arrays and array declarations: | ||
| + | * Initializers: | ||
| + | * String literals as data: strings work only as immediate | ||
| + | * Function pointers: no syntax, no calling convention. | ||
| + | * Variable arguments: no ..., no va_list. | ||
| + | * enum: no constant enumeration. | ||
| + | * static, extern, register, volatile, const: storage class and qualifier keywords missing. | ||
| + | * Multiple return types: only int (well, 24-bit). No void, char, short, long, float, double returns. | ||
| + | * void type: parsed in (void) parameter list only; can't declare void variables, return void, or use void | ||
| - | ^ Level ^ Function ^ Operators ^ Calls Down To ^ | + | === Operator gaps |
| - | | Comparison | '' | + | * Bitwise ops are critical for systems programming. Without them, no masking, no flag manipulation. |
| - | | Additive | + | * Logical && and || with short-circuit semantics matter for correctness, |
| - | | Multiplicative | ''' | + | * ++/-- are syntactic sugar but their absence is glaring. |
| - | | Primary | '' | + | * % (modulo) for any non-trivial arithmetic. |
| + | * Compound assignment is just sugar but expected. | ||
| - | Each level parses its left operand by calling the next level down, then loops checking for operators | + | === Library |
| + | * No standard library | ||
| + | * No file I/O from compiled programs (peek/poke don't qualify). | ||
| + | * No dynamic memory. | ||
| - | === Example; 2 + 3 * 4 | + | === Compiler quality |
| - | * 1. '' | + | * No optimization passes. Every operation goes through BLA via the stack. |
| - | * 2. '' | + | * No type checking. Assigning char * to int is silent. |
| - | * 3. '' | + | * No warnings |
| - | * 4. '' | + | * Error messages |
| - | * 5. '' | + | * No symbol scoping beyond |
| - | * 6. '' | + | |
| - | * 7. '' | + | |
| - | * 8. '' | + | |
| - | * 9. '' | + | |
| - | * 10. '' | + | |
| - | * 11. '' | + | |
| - | * 12. '' | + | |
| - | * 13. '' | + | |
| - | + | ||
| - | The key insight is at step 4: '' | + | |
| - | + | ||
| - | === Emits | + | |
| - | Every operator follows the same code generation template. The left operand is already in register A (from the recursive call). The compiler saves it, evaluates the right operand (which also lands in A), then combines them: | + | |
| - | + | ||
| - | PUSH A ; save left operand on stack | + | |
| - | <code here> | + | |
| - | MOV B, A ; move right operand to B | + | |
| - | POP A ; restore left operand to A | + | |
| - | ADD A, B ; A = left + right (or SUB, MUL, DIV) | + | |
| - | + | ||
| - | This looks inefficient but remember each individual lego block puts it's result into A. So we need to move A around a bit. Actually we can do this: | + | |
| - | + | ||
| - | MOV B, A ; move left operand to B | + | |
| - | <code here> | + | |
| - | ADD A, B ; A = right plus left (or SUB, MUL, DIV) | + | |
| - | + | ||
| - | With a little extra code we can emit into B. A later optimization perhaps. | + | |
| - | + | ||
| - | <code here> | + | |
| - | ADD A, B ; A = left + right (or SUB, MUL, DIV) | + | |
| - | + | ||
| - | This pattern is uniform across all arithmetic operators. The only difference is the final instruction: | + | |
| - | + | ||
| - | === From the C version | + | |
| - | The additive level in the C version illustrates the pattern: | + | |
| - | + | ||
| - | <codify C> | + | |
| - | int parse_additive() { | + | |
| - | parse_term(); | + | |
| - | if (g_error) { return 0; } | + | |
| - | while (1) { | + | |
| - | if (g_tok_type == TOK_PLUS) { | + | |
| - | next_token(); | + | |
| - | emit_push_a(); | + | |
| - | parse_term(); | + | |
| - | emit_mov_b_a(); | + | |
| - | emit_pop_a(); | + | |
| - | emit_add_a_b(); | + | |
| - | } else if (g_tok_type == TOK_MINUS) { | + | |
| - | next_token(); | + | |
| - | emit_push_a(); | + | |
| - | parse_term(); | + | |
| - | emit_mov_b_a(); | + | |
| - | emit_pop_a(); | + | |
| - | emit_sub_a_b(); | + | |
| - | } else { | + | |
| - | break; | + | |
| - | } | + | |
| - | } | + | |
| - | return 0; | + | |
| - | } | + | |
| - | </ | + | |
| - | + | ||
| - | The multiplicative level is structurally identical, with '' | + | |
| - | + | ||
| - | + | ||
| - | === The B in BEDMAS | + | |
| - | Brackets in expressions (parentheses) override precedence by recursing back to the top level. When '' | + | |
| - | + | ||
| - | <codify C> | + | |
| - | // Inside parse_primary: | + | |
| - | if (g_tok_type == TOK_LPAREN) { | + | |
| - | next_token(); | + | |
| - | parse_expr(); | + | |
| - | expect(TOK_RPAREN); | + | |
| - | return 0; | + | |
| - | } | + | |
| - | </ | + | |
| - | + | ||
| - | === Comparisons (<, ==, >) | + | |
| - | Comparisons sit at the lowest precedence level, handled by `parse_expr` itself. Unlike arithmetic operators, comparisons can not loop. '' | + | |
| - | + | ||
| - | The emitted code for `a == 5` is simple. It kind of reminds me of forth; imagine if ' | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | (load a value into A) | + | |
| - | LDB #5 ; load value to check against | + | |
| - | CMP A, B ; sets CPU flags | + | |
| - | JZ true_case | + | |
| - | LDA #0 ; false | + | |
| - | JMP done | + | |
| - | + | ||
| - | true_case: | + | |
| - | LDA #1 ; true | + | |
| - | done: | + | |
| - | ; A = 0 or 1 | + | |
| - | </ | + | |
| - | + | ||
| - | === Symbol table | + | |
| - | Arguably the symbol table is the core milestone. I had been working on various variable storage systems until this, and this is, I think, the best one so far. | + | |
| - | + | ||
| - | All I wanted from a variable system was the ability to map a string to a string. That was the core idea. If I could do that I could have a variable system. So when the programmer writes `int x = 5;` on one line and `return x;` three lines later, the compiler needs to set the string " | + | |
| - | + | ||
| - | But in BASIC, every time a line executes, the interpreter searches for the variable by name. From the beginning. This was very slow. A compiler is different: it resolves names to locations once during compilation and bakes the result into the machine code. The compiled program never sees the symbol table. That's the C way. Compiled, not interpreted. The idea comes directly from the javascript assembler. It inserts dummy values for pointers until the labels can be resolved. AKA back-patching. | + | |
| - | + | ||
| - | === Symbol table record structure | + | |
| - | The symbol table is a flat array of fixed-size records. I was worried | + | |
| - | + | ||
| - | ^ Offset ^ Size ^ Field ^ Description | | + | |
| - | | 0–15 | 16 | Name | Null-terminated, | + | |
| - | | 16 | 1 | Type | 0=int, 1=int*, 2=char, 3=char* | | + | |
| - | | 17 | 1 | Kind | 0=local, 1=param, 2=function, 3=global | | + | |
| - | | 18–19 | 2 | Value | 16-bit: stack offset (locals/ | + | |
| - | + | ||
| - | Note: name size was increased to 15 from 7 because some function names like get_thing() were larger than 7 characters. If we run out of space change this first. | + | |
| - | + | ||
| - | === How it works | + | |
| - | The big reveal is that it doesn' | + | |
| - | + | ||
| - | Three operations manage the table, like NVAR_SET, NVAR_GET and NVAR_CLEAR, here we use SVAR_SET (sym_add), SVAR_GET (sym_find) and SVAR_CLEAR (sym_clear). | + | |
| - | + | ||
| - | * SVAR_SET/ | + | |
| - | ** Appends a new entry. Needs name, and value like NVAR but additionally needs type (ex int, char) and kind (ex function, global, param). | + | |
| - | + | ||
| - | * SVAR_GET/ | + | |
| - | ** Performs a linear scan from the first entry, comparing each stored name against the search string using '' | + | |
| - | + | ||
| - | * SVAR_CLEAR/ | + | |
| - | ** resets the count to zero. Just write a zero at the start to show no variables. | + | |
| - | + | ||
| - | === Scoping using the stack | + | |
| - | The compiler implements function-level scoping by saving the values on the stack and only keeping a handle to them in the symbol table. When parsing a function definition, the current symbol count is saved into '' | + | |
| - | + | ||
| - | === Type vs Kind | + | |
| - | * Kind 0 = Local Variable | + | |
| - | ** A positive integer representing the distance below the frame pointer (GLZ). | + | |
| - | + | ||
| - | * Kind 1 = Params | + | |
| - | ** A positive integer representing the distance above the frame pointer. Accessed as '' | + | |
| - | + | ||
| - | * Kind 2 = Functions | + | |
| - | ** The 16-bit code address where the function's compiled body begins. Used by '' | + | |
| - | + | ||
| - | * Kind 3 = Globals | + | |
| - | ** A fixed memory address in bank 0, allocated from a bump pointer starting at '' | + | |
| - | + | ||
| - | === Compiling '' | + | |
| - | # '' | + | |
| - | # '' | + | |
| - | # It computes the slot address: '' | + | |
| - | # Writes " | + | |
| - | # Increments '' | + | |
| - | + | ||
| - | Next for '' | + | |
| - | + | ||
| - | # '' | + | |
| - | # Slot address: $00F040 + (1 * 12) = $00F04C | + | |
| - | # Writes " | + | |
| - | + | ||
| - | When '' | + | |
| - | + | ||
| - | #. '' | + | |
| - | #. Compares each record' | + | |
| - | #. Finds " | + | |
| - | #. The caller reads the value field at BLY+10, gets 2 | + | |
| - | #. Emits '' | + | |
| - | + | ||
| - | When a new '' | + | |
| - | + | ||
| - | '' | + | |
| - | + | ||
| - | The table is purely compile-time. The compiled program has no idea it exists. Variable names are resolved to numeric stack offsets during compilation and baked into the emitted instructions. | + | |
| - | + | ||
| - | === But why stack frames? | + | |
| - | The stack is used because | + | |
| - | + | ||
| - | When '' | + | |
| - | + | ||
| - | PUSH GLZ ; save old frame pointer (3 bytes on stack) | + | |
| - | MOV GLZ, SP ; GLZ = current stack top. This is our reference point | + | |
| - | SUB SP, #2 ; reserve 2 bytes for ' | + | |
| - | SUB SP, #2 ; reserve 2 bytes for ' | + | |
| - | + | ||
| - | The stack now looks like this (growing downward): | + | |
| - | + | ||
| - | ┌───────────────────────┐ | + | |
| - | | + | |
| - | ├───────────────────────┤ | + | |
| - | GLZ points | + | |
| - | ├───────────────────────┤ | + | |
| - | | + | |
| - | ├───────────────────────┤ | + | |
| - | | + | |
| - | ├───────────────────────┤ | + | |
| - | SP points | + | |
| - | └───────────────────────┘ | + | |
| - | + | ||
| - | '' | + | |
| - | + | ||
| - | But why male models? | + | |
| - | + | ||
| - | (But why on the stack?) It boils down to a memory management issue. You could assign '' | + | |
| - | + | ||
| - | int factorial(int n) { | + | |
| - | if (n == 0) return 1; | + | |
| - | return n * factorial(n - 1); | + | |
| - | } | + | |
| - | + | ||
| - | If '' | + | |
| - | + | ||
| - | The solution is, every call needs its own private copy of '' | + | |
| - | + | ||
| - | == Working through if-else | + | |
| - | After the symbol table, everything else started to fall into place quickly; if was next, and I already had the concept from the expression parser in BASIC. | + | |
| - | + | ||
| - | By this time the language structure was starting to become established so I will write down how I added if as a guide for other keywords. | + | |
| - | + | ||
| - | To get a new language feature like '' | + | |
| - | + | ||
| - | # Add it to the lexer (match keyword). | + | |
| - | # Parse the statement (done in the executor in BASIC, but a separate step here). | + | |
| - | # Emitter -- This is the big diff, since we don't do it ' | + | |
| - | + | ||
| - | The plan is the exact same as assembly (labels); back-patching, | + | |
| - | + | ||
| - | # Emit the condition (cc_parse_expr, | + | |
| - | # Emit CMP A, #0 | + | |
| - | # Emit JZ with a placeholder address. Save CC_HERE before writing it | + | |
| - | # Parse the then-block (emits its code) | + | |
| - | # Go back and patch the JZ address to point here (past the block) | + | |
| - | + | ||
| - | for if-else: | + | |
| - | + | ||
| - | # Emit condition (same) | + | |
| - | # Emit CMP A, #0 (same) | + | |
| - | # JZ placeholder (same) | + | |
| - | # Parse then-block (same) | + | |
| - | # Emit JMP placeholder (skip over else) | + | |
| - | # Patch placeholder to here | + | |
| - | # Parse else-block | + | |
| - | # Patch placeholder to here | + | |
| - | + | ||
| - | ==== Documenting the process | + | |
| - | Just like adding a keyword in BASIC. | + | |
| - | + | ||
| - | 1. Add the tokens. | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | .equ TOK_IF | + | |
| - | .equ TOK_ELSE | + | |
| - | </ | + | |
| - | + | ||
| - | 2. Add the strings. | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | cc_kw_if: | + | |
| - | .bytes " | + | |
| - | + | ||
| - | cc_kw_else: | + | |
| - | .bytes " | + | |
| - | </ | + | |
| - | + | ||
| - | 3. Add the keyword checks in the keyword checks section. | + | |
| - | <codify armasm> | + | |
| - | ; Check " | + | |
| - | LDELM CC_TOK_BUF | + | |
| - | LDFLD @cc_kw_if | + | |
| - | LDAH $02 | + | |
| - | INT $12 | + | |
| - | JZ @cc_nt_kw_if | + | |
| - | + | ||
| - | ; Check " | + | |
| - | LDELM CC_TOK_BUF | + | |
| - | LDFLD @cc_kw_else | + | |
| - | LDAH $02 | + | |
| - | INT $12 | + | |
| - | JZ @cc_nt_kw_else | + | |
| - | </ | + | |
| - | + | ||
| - | And the handlers, | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | cc_nt_kw_if: | + | |
| - | LDAL TOK_IF | + | |
| - | STAL [@CC_TOKEN_TYPE] | + | |
| - | POP I | + | |
| - | RET | + | |
| - | + | ||
| - | cc_nt_kw_else: | + | |
| - | LDAL TOK_ELSE | + | |
| - | STAL [@CC_TOKEN_TYPE] | + | |
| - | POP I | + | |
| - | RET | + | |
| - | </ | + | |
| - | + | ||
| - | We have to add it to the parser: | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | cc_parse_stmt: | + | |
| - | LDAL [@CC_TOKEN_TYPE] | + | |
| - | + | ||
| - | CMP AL, TOK_RETURN | + | |
| - | JZ @cc_parse_return | + | |
| - | + | ||
| - | CMP AL, TOK_INT | + | |
| - | JZ @cc_parse_vardecl | + | |
| - | + | ||
| - | CMP AL, TOK_IDENT | + | |
| - | JZ @cc_parse_assign | + | |
| - | + | ||
| - | CMP AL, TOK_IF | + | |
| - | JZ @cc_parse_if | + | |
| - | + | ||
| - | ; Unknown statement | + | |
| - | LDAL #1 | + | |
| - | STAL [@CC_ERROR] | + | |
| - | LDELM @cc_str_expect_stmt | + | |
| - | LDAH $18 | + | |
| - | INT $10 | + | |
| - | RET | + | |
| - | </ | + | |
| - | + | ||
| - | Finally here is the if parser. Note three important changes from BASIC. | + | |
| - | + | ||
| - | # Instead of manually checking for things like ) the system uses a TOKEN table (TOK_LPAREN, | + | |
| - | # You don't see int05_skip_space everywhere. Instead we have a cc_skip_ws which is included in cc_next_token. This isn't as ' | + | |
| - | # It's an emitter not an interpreter. Instead of doing it " | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | ; cc_parse_if | + | |
| - | ; ' | + | |
| - | ; | + | |
| - | ; Emitted code shape (if only): | + | |
| - | ; < | + | |
| - | ; | + | |
| - | ; JZ patch1 | + | |
| - | ; < | + | |
| - | ; | + | |
| - | ; | + | |
| - | ; Emitted code shape (if/ | + | |
| - | ; < | + | |
| - | ; | + | |
| - | ; JZ patch1 | + | |
| - | ; < | + | |
| - | ; JMP patch2 | + | |
| - | ; | + | |
| - | ; < | + | |
| - | ; | + | |
| - | ; | + | |
| - | cc_parse_if: | + | |
| - | ; Consume ' | + | |
| - | CALL @cc_next_token | + | |
| - | + | ||
| - | ; Expect ' | + | |
| - | LDAL TOK_LPAREN | + | |
| - | CALL @cc_expect | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_done | + | |
| - | + | ||
| - | ; Parse condition expression (result in A) | + | |
| - | CALL @cc_parse_expr | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_done | + | |
| - | + | ||
| - | ; Expect ' | + | |
| - | LDAL TOK_RPAREN | + | |
| - | CALL @cc_expect | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_done | + | |
| - | + | ||
| - | ; Emit: CMP_IMM A, #0 (opcode=91, reg=0, imm16=0, | + | |
| - | LDAL #91 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDAL #0 ; REG_A | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDAL #0 ; imm16 lo = 0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDAL #0 ; imm16 hi = 0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Emit: JZ < | + | |
| - | LDAL #101 ; JZ opcode | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Save CC_HERE — this is where the 3-byte address will go | + | |
| - | LDBLX [@CC_HERE] | + | |
| - | PUSH BLX ; patch1 address saved on stack | + | |
| - | + | ||
| - | ; Emit 3 placeholder bytes | + | |
| - | LDAL #0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | CALL @cc_emit_byte | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Parse then-block | + | |
| - | CALL @cc_parse_block | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_bail1 | + | |
| - | + | ||
| - | ; Check for ' | + | |
| - | LDAL [@CC_TOKEN_TYPE] | + | |
| - | CMP AL, TOK_ELSE | + | |
| - | JZ @cc_pi_else | + | |
| - | + | ||
| - | ; No else = patch1 = CC_HERE (jump past then-block) | + | |
| - | POP BLX ; BLX = patch1 location | + | |
| - | CALL @cc_backpatch | + | |
| - | JMP @cc_pi_done | + | |
| - | + | ||
| - | cc_pi_else: | + | |
| - | ; Consume ' | + | |
| - | CALL @cc_next_token | + | |
| - | + | ||
| - | ; Emit: JMP < | + | |
| - | LDAL #100 ; JMP opcode | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDBLY [@CC_HERE] | + | |
| - | PUSH BLY ; patch2 address saved on stack | + | |
| - | LDAL #0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | CALL @cc_emit_byte | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; We need patch1 now. Swap: | + | |
| - | POP BLY ; BLY = patch2 | + | |
| - | POP BLX ; BLX = patch1 | + | |
| - | PUSH BLY ; re-push patch2 | + | |
| - | + | ||
| - | CALL @cc_backpatch | + | |
| - | + | ||
| - | ; Parse else-block | + | |
| - | CALL @cc_parse_block | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_bail1 | + | |
| - | + | ||
| - | ; Patch2 = CC_HERE (JMP lands past else-block) | + | |
| - | POP BLX ; BLX = patch2 location | + | |
| - | CALL @cc_backpatch | + | |
| - | + | ||
| - | cc_pi_done: | + | |
| - | RET | + | |
| - | + | ||
| - | cc_pi_bail1: | + | |
| - | POP BLX ; discard patch address | + | |
| - | RET | + | |
| - | </ | + | |
| - | + | ||
| - | For a backpatch helper (in case we need this elsewhere) we just copy HERE: | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | + | |
| - | ;; backpatch helper | + | |
| - | cc_backpatch: | + | |
| - | PUSH ELM | + | |
| - | LDELM [@CC_HERE] | + | |
| - | STELM [FLD] | + | |
| - | POP ELM | + | |
| - | RET | + | |
| - | </ | + | |
| - | + | ||
| - | == while is easy | + | |
| - | Once we have backpatching and the expression machinery, every new control flow construct is just a new arrangement of jumps. '' | + | |
| - | + | ||
| - | == Functions and Pointers and Etc | + | |
| - | To make a long story short I started to feel the pain of a 16 bit symbol table as I had already done a significant amount of work and I was in no mood to rewrite | + | |
| - | + | ||
| - | So I came up with the brilliant idea of just making everything in bank 0. | + | |
| - | + | ||
| - | == Final Boss: Strings | + | |
| - | Strings were the final boss. I am still working on them, kind of. Just fixed some errors this morning. | + | |
| - | + | ||
| - | Adding strings was easy after char *. Same process: add TOK_CHAR. cc_kw_char. etc. It was the usual suspects for bugs again too. The usual pointer clobber bugs. | + | |
| - | + | ||
| - | === Jumping past strings | + | |
| - | The key is to jump past the string. The data lives inside the code. | + | |
| - | + | ||
| - | char *s = " | + | |
| - | + | ||
| - | When we emit the code here we will add a jmp, insert the string, then set s to the address of the string. This will allow us to reference the string like an array (s[6]) and so forth. Buffer overflows! Ahh so that's where this stuff comes from. | + | |
| - | + | ||
| - | s[6] = 133; call opcode; now we're in business! | + | |
| - | + | ||
| - | == Moving the Stack | + | |
| - | I eventually had to move the stack because the stack is in bank 1 and this means ''& | + | |
| - | + | ||
| - | int main(void) { | + | |
| - | char *msg = " | + | |
| - | return msg[1]; | + | |
| - | } | + | |
| - | + | ||
| - | The solution is to move the SP inside the C REPL only. While we are inside we will do something like: | + | |
| - | + | ||
| - | MOV GLZ, SP | + | |
| - | MOV SP, $FFFF | + | |
| - | PUSH GLZ | + | |
| - | + | ||
| - | <program code here> | + | |
| - | + | ||
| - | POP SP | + | |
| - | RET | + | |
| - | + | ||
| - | + | ||
| - | == Built-in functions | + | |
| - | Once strings worked I added putchar() as a built-in | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | ; Built-in function emitter. | + | |
| - | ; Emits machine code for built-in functions and registers them in the symbol table. | + | |
| - | cc_emit_builtins: | + | |
| - | CALL @cc_builtin_putchar | + | |
| - | RET | + | |
| - | + | ||
| - | ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | + | |
| - | ; putchar(c) | + | |
| - | ; Prints character c to terminal via INT $10 AH=$17 | + | |
| - | cc_builtin_putchar: | + | |
| - | ; Register in symbol table | + | |
| - | LDA [@CC_HERE] | + | |
| - | MOV B, A | + | |
| - | LDELM @cc_bi_putchar | + | |
| - | + | ||
| - | LDAL #0 ; type = 0 (int) | + | |
| - | LDAH #2 ; kind = 2 (function) | + | |
| - | CALL @cc_sym_add | + | |
| - | + | ||
| - | ; Emit prologue | + | |
| - | CALL @cc_emit_push_glz | + | |
| - | CALL @cc_emit_mov_glz_sp | + | |
| - | + | ||
| - | ; Emit: LDA [GLZ+6] (first arg) | + | |
| - | LDA #210 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA #0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA #94 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA #6 | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Emit: LDAH $17 (teletype putchar) | + | |
| - | LDA #0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA #40 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA $17 | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Emit: INT $10 | + | |
| - | LDA #134 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA $10 | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Emit epilogue | + | |
| - | CALL @cc_emit_mov_sp_glz | + | |
| - | CALL @cc_emit_pop_glz | + | |
| - | CALL @cc_emit_ret | + | |
| - | RET | + | |
| - | + | ||
| - | + | ||
| - | ; --- Name strings --- | + | |
| - | cc_bi_putchar: | + | |
| - | .bytes " | + | |
| - | </ | + | |
| - | + | ||
| - | After this, I added getchar(), halt() and yield(), and then things started getting really easy, for example print() is just: | + | |
| - | + | ||
| - | <codify C> | + | |
| - | int print(char *s) { | + | |
| - | int i = 0; | + | |
| - | while (s[i] != 0) { | + | |
| - | putchar(s[i]); | + | |
| - | i = i + 1; | + | |
| - | } | + | |
| - | return 0; | + | |
| - | } | + | |
| - | </ | + | |
| - | + | ||
| - | Wow, so we're writing our library in C now. Things are starting to go well from here! | + | |
| - | + | ||
| - | * [[List of TinyC built-in functions]] | + | |
| - | == Evil Stack Bugs Strike Again | + | === Language semantics |
| - | While not technically a bug, there's a ' | + | * Pointer arithmetic doesn't scale by element size: int *p; p + 1 adds 1 byte, not 3. |
| + | * sizeof would be needed to write portable code; absent. | ||
| + | * Char arithmetic doesn' | ||
| + | * No automatic conversions, | ||
| - | while (condition) { | + | == Roadmap |
| - | .... | + | This is not a roadmap. |
| - | int c = msg[i]; | + | |
| - | .... | + | |
| - | } | + | |
| - | This emits '' | + | === Easy |
| + | * Bitwise operators. | ||
| + | * ++/-- and compound assignment. | ||
| + | * % operator; opcode already exists | ||
| + | * Pre-processor. Separate pass before lexing; probably harder than it sounds. | ||
| - | == break/continue notes | + | === Medium |
| - | # PUSH outer break count (2 bytes) | + | * &&/|| with short-circuit. Needs branch generation. |
| - | # PUSH outer loop top (3 bytes) | + | * Local arrays. int arr[10] allocates 30 bytes on stack, needs frame offset adjustment and array-as-pointer-decay. |
| - | # PUSH loop_top for JMP back (3 bytes) | + | |
| - | # PUSH patch_exit (3 bytes) | + | |
| - | # POP patch_exit for JMP emit, re-push | + | |
| - | # POP loop_top into ILY | + | |
| - | # POP patch_exit for backpatch | + | |
| - | # POP outer loop top, restore | + | |
| - | # POP outer break count, restore | + | |
| + | === Hard/ | ||
| + | * Initialized globals. This is, unfortunately, | ||
| + | * Standard library subset -- strlen, strcmp, memcpy, printf (formatted output is a project of its own). | ||
| + | * Type checking. Proper type expression evaluation, conversion rules. | ||
| + | * Structs. Type representation grows from 1 byte to multi-byte type descriptor. | ||
sd/tinyc_for_the_sd-8516.1776406914.txt.gz · Last modified: by appledog
