sd:tinyc_for_the_sd-8516
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| sd:tinyc_for_the_sd-8516 [2026/04/17 07:09] – appledog | sd:tinyc_for_the_sd-8516 [2026/06/05 16:53] (current) – appledog | ||
|---|---|---|---|
| Line 2: | Line 2: | ||
| == Introduction | == Introduction | ||
| - | This TinyC is more like the " | + | < |
| - | Current Feature Set: | + | TinyC refers to V1 (tinyc.asm) and V2 (tinyc.c), which is the same as V1 but written in C. V1 is intended to compile V2, and then we have self-hosting C. Thus, this TinyC should be more like the " |
| + | |||
| + | V1 Current Feature Set: | ||
| * Expressions with precedence | * Expressions with precedence | ||
| * Local variables | * Local variables | ||
| Line 24: | Line 26: | ||
| This is enough that I was able to start writing functions in C like abs, min, max, strlen, strcmp, atoi/itoa, and so on. | This is enough that I was able to start writing functions in C like abs, min, max, strlen, strcmp, atoi/itoa, and so on. | ||
| + | |||
| + | == Function Library | ||
| + | * [[List of TinyC built-in functions]] | ||
| == History | == History | ||
| - | A self-hosting C compiler had been my goal for quite some time. But I didn't know how to write one. I hacked out a BASIC first, then followed some guides and source code to get a Forth running. But C eluded me for several months. Eventually I was able to leverage what I had learned doing BASIC to get a parser that could compile int main(void) { return 0; } and emitted LDA #0 and RET. The debugger was simply printing the return value. | + | A self-hosting C compiler had been my goal for quite some time. But I didn't know how to write one. I hacked out a BASIC first, then followed some guides and source code to get a Forth running. But C eluded me for several months. Eventually I was able to leverage what I had learned doing BASIC to get a parser that could compile int main(void) { return 0; } and emitted LDA #0 and RET. The debugger was simply printing the return value. The rest of the story is found in the [[TinyC Developer Diary]]. Mainly so I don't forget what I learned, as this project has gotten quite big. |
| - | === Expression Parser | + | == Issues |
| - | I did expressions first because | + | The big issue is that I don't //really// know what I am doing. Meaning, it' |
| - | === Simplest explanation possible | + | What is even better |
| - | The simplest and easiest way is to imagine everything has brackets. So an expression like 2 + 3 * 4 is required to be in the form (2 + (3 * 4)). What the parser has to do is evaluate every () expression. This is done recursively and is the recursive part of " | + | |
| - | < | + | === Compile to disk |
| + | The other solution I am mulling is to rewrite | ||
| - | | + | == What's Missing? |
| + | * Preprocessor: | ||
| + | Structs and unions: cannot define composite types. No member access (., ->). | ||
| + | * Typedef: no type aliases. | ||
| + | * Multi-dimensional arrays and array declarations: | ||
| + | * Initializers: | ||
| + | * String literals as data: strings work only as immediate operands, not as initialized arrays. | ||
| + | * Function pointers: no syntax, no calling convention. | ||
| + | * Variable arguments: no ..., no va_list. | ||
| + | * enum: no constant enumeration. | ||
| + | * static, extern, register, volatile, const: storage class and qualifier keywords missing. | ||
| + | * Multiple return types: only int (well, 24-bit). No void, char, short, long, float, double returns. | ||
| + | * void type: parsed in (void) parameter list only; can't declare void variables, return void, or use void | ||
| - | became | + | === Operator gaps |
| + | * Bitwise ops are critical for systems programming. Without them, no masking, no flag manipulation. | ||
| + | * Logical && and || with short-circuit semantics matter for correctness, | ||
| + | * ++/-- are syntactic sugar but their absence is glaring. | ||
| + | * % (modulo) for any non-trivial arithmetic. | ||
| + | * Compound assignment is just sugar but expected. | ||
| - | (2 + (3 * 4)) | + | === Library |
| + | * No standard library at all. No printf, malloc, strcmp, memcpy, strlen. Programs must be self-contained or call only the six builtins. | ||
| + | * No file I/O from compiled programs (peek/poke don't qualify). | ||
| + | * No dynamic memory. | ||
| - | Now each bracket has a left and a right and an operand. One operation. So let's re-write this as parameters to function calls, because thats how computers work (ex. '' | + | === Compiler quality |
| - | + | * No optimization passes. Every operation goes through BLA via the stack. | |
| - | (+ 2 (* 3 4)) | + | * No type checking. Assigning char * to int is silent. |
| - | + | * No warnings | |
| - | And lets use English to name the functions, as we agreed earlier: | + | * Error messages |
| - | + | * No symbol scoping beyond | |
| - | (add 2 (mul 3 4)) | + | |
| - | + | ||
| - | Now let's move the bracket to after the function name, which follows principles of English language (ex. we say "I am not hungry" | + | |
| - | + | ||
| - | add(2 mul(3 4)) | + | |
| - | + | ||
| - | Now separate parameters with commas, which is another principle of English: | + | |
| - | + | ||
| - | add(2, mul(3,4)) | + | |
| - | + | ||
| - | This is the link between Stack Machines, Forth, LISP, BASIC and C. C is that moment where we deal with things in a way that makes sense in English, but they' | + | |
| - | </ | + | |
| - | + | ||
| - | So the first kind of parser you should write has a pre-processor that encases everyting in parentheses. Then brute force it; just scan for ()'s and emit the innermost functions first. It would end up looking like (2 + ((3 * 4) / 5)). This makes parsing robotic and easy. | + | |
| - | + | ||
| - | There' | + | |
| - | + | ||
| - | Everything which follows is merely a refactor of this exact technique going back to the early days of 1950s computer science when all this stuff was originally invented. | + | |
| - | + | ||
| - | === Three-Level Descent | + | |
| - | The expression parser we use is exactly similar to the above but does it using special rules. It evaluates the left term, then finds an operator, then parses the right side down to the next level. This way, higher precedence operators bind their operands before lower-precedence operators get a chance to. As it recursively unravels, everything eventually gets parsed and evaluated. | + | |
| - | + | ||
| - | ^ Level ^ Function ^ Operators ^ Calls Down To ^ | + | |
| - | | Comparison | '' | + | |
| - | | Additive | '' | + | |
| - | | Multiplicative | ''' | + | |
| - | | Primary | '' | + | |
| - | + | ||
| - | === Example; 2 + 3 * 4 | + | |
| - | * 1. '' | + | |
| - | * 2. '' | + | |
| - | * 3. '' | + | |
| - | * 4. '' | + | |
| - | * 5. '' | + | |
| - | * 6. '' | + | |
| - | * 7. '' | + | |
| - | * 8. '' | + | |
| - | * 9. '' | + | |
| - | * 10. '' | + | |
| - | * 11. '' | + | |
| - | * 12. '' | + | |
| - | * 13. '' | + | |
| - | + | ||
| - | 2 + 3 * 4 ---> | + | |
| - | + | ||
| - | The parser puts the 2 in A then evaluates whatever is in the parentheses, | + | |
| - | + | ||
| - | The key insight comes from step 4: '' | + | |
| - | + | ||
| - | === Emits | + | |
| - | Every operator follows the same code generation template. The left operand is already in register A (from the recursive call). The compiler saves it, evaluates the right operand (which also lands in A), then combines them: | + | |
| - | + | ||
| - | PUSH A ; save left operand on stack | + | |
| - | <code here> | + | |
| - | MOV B, A ; move right operand to B | + | |
| - | POP A ; restore left operand to A | + | |
| - | ADD A, B ; A = left + right (or SUB, MUL, DIV) | + | |
| - | + | ||
| - | This looks inefficient but remember each individual lego block puts it's result into A. So we need to move A around a bit. Actually we can do this: | + | |
| - | + | ||
| - | MOV B, A ; move left operand to B | + | |
| - | <code here> | + | |
| - | ADD A, B ; A = right plus left (or SUB, MUL, DIV) | + | |
| - | + | ||
| - | With a little extra code we can emit into B. A later optimization perhaps. | + | |
| - | + | ||
| - | <code here> | + | |
| - | ADD A, B ; A = left + right (or SUB, MUL, DIV) | + | |
| - | + | ||
| - | This pattern is uniform across all arithmetic operators. The only difference is the final instruction: | + | |
| - | + | ||
| - | === From the C version | + | |
| - | The additive level in the C version illustrates the pattern: | + | |
| - | + | ||
| - | <codify C> | + | |
| - | int parse_additive() { | + | |
| - | parse_term(); | + | |
| - | if (g_error) { return 0; } | + | |
| - | while (1) { | + | |
| - | if (g_tok_type == TOK_PLUS) { | + | |
| - | next_token(); | + | |
| - | emit_push_a(); | + | |
| - | parse_term(); | + | |
| - | emit_mov_b_a(); | + | |
| - | emit_pop_a(); | + | |
| - | emit_add_a_b(); | + | |
| - | } else if (g_tok_type == TOK_MINUS) { | + | |
| - | next_token(); | + | |
| - | emit_push_a(); | + | |
| - | parse_term(); | + | |
| - | emit_mov_b_a(); | + | |
| - | emit_pop_a(); | + | |
| - | emit_sub_a_b(); | + | |
| - | } else { | + | |
| - | break; | + | |
| - | } | + | |
| - | } | + | |
| - | return 0; | + | |
| - | } | + | |
| - | </ | + | |
| - | + | ||
| - | The multiplicative level is structurally identical, with '' | + | |
| - | + | ||
| - | + | ||
| - | === The B in BEDMAS | + | |
| - | Brackets in expressions (parentheses) override precedence by recursing back to the top level. When '' | + | |
| - | + | ||
| - | <codify C> | + | |
| - | // Inside parse_primary: | + | |
| - | if (g_tok_type == TOK_LPAREN) { | + | |
| - | next_token(); | + | |
| - | parse_expr(); | + | |
| - | expect(TOK_RPAREN); | + | |
| - | return 0; | + | |
| - | } | + | |
| - | </ | + | |
| - | + | ||
| - | === Conclusion | + | |
| - | The parser turns the whole (expression) into one number. Either as parse_expr on the whole thing, or as parse_expr on the whole thing with brackets. The expression parser is a big simplification loop. Chunk by chunk, we simplify everything as if it was surrounded by parentheses. The big reveal is that all of this started by writing a parser that pre-processed expressions by surrounding them with brackets. So if you are having any trouble understanding an expression parser just go back to surrounding everything in parentheses and work from there. It will work exactly the same. | + | |
| - | + | ||
| - | === Comparisons (<, ==, >) | + | |
| - | Comparisons sit at the lowest precedence level, handled by '' | + | |
| - | + | ||
| - | The emitted code for '' | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | (load a value into A) | + | |
| - | LDB #5 ; load value to check against | + | |
| - | CMP A, B ; sets CPU flags | + | |
| - | JZ true_case | + | |
| - | LDA #0 ; false | + | |
| - | JMP done | + | |
| - | + | ||
| - | true_case: | + | |
| - | LDA #1 ; true | + | |
| - | done: | + | |
| - | ; A = 0 or 1 | + | |
| - | </ | + | |
| - | + | ||
| - | === Symbol table | + | |
| - | Arguably the symbol table is the core milestone. I had been working on various variable storage systems until this, and this is, I think, the best one so far. | + | |
| - | + | ||
| - | All I wanted from a variable system was the ability to map a string to a string. That was the core idea. If I could do that I could have a variable system. So when the programmer writes '' | + | |
| - | + | ||
| - | But in BASIC, every time a line executes, the interpreter searches for the variable by name. From the beginning. This was very slow. A compiler is different: it resolves names to locations once during compilation and bakes the result into the machine code. The compiled program never sees the symbol table. That's the C way. Compiled, not interpreted. The idea comes directly from the javascript assembler. It inserts dummy values for pointers until the labels can be resolved. AKA back-patching. | + | |
| - | + | ||
| - | === Symbol table record structure | + | |
| - | The symbol table is a flat array of fixed-size records. I was worried | + | |
| - | + | ||
| - | ^ Offset ^ Size ^ Field ^ Description | | + | |
| - | | 0–15 | 16 | Name | Null-terminated, | + | |
| - | | 16 | 1 | Type | 0=int, 1=int*, 2=char, 3=char* | | + | |
| - | | 17 | 1 | Kind | 0=local, 1=param, 2=function, 3=global | | + | |
| - | | 18–19 | 2 | Value | 16-bit: stack offset (locals/ | + | |
| - | + | ||
| - | Note: name size was increased to 15 from 7 because some function names like get_thing() were larger than 7 characters. If we run out of space change this first. | + | |
| - | + | ||
| - | === How it works | + | |
| - | The big reveal is that it doesn' | + | |
| - | + | ||
| - | Three operations manage the table, like NVAR_SET, NVAR_GET and NVAR_CLEAR, here we use SVAR_SET (sym_add), SVAR_GET (sym_find) and SVAR_CLEAR (sym_clear). | + | |
| - | + | ||
| - | * SVAR_SET/ | + | |
| - | ** Appends a new entry. Needs name, and value like NVAR but additionally needs type (ex int, char) and kind (ex function, global, param). | + | |
| - | + | ||
| - | * SVAR_GET/ | + | |
| - | ** Performs a linear scan from the first entry, comparing each stored name against the search string using '' | + | |
| - | + | ||
| - | * SVAR_CLEAR/ | + | |
| - | ** resets the count to zero. Just write a zero at the start to show no variables. | + | |
| - | + | ||
| - | === Scoping using the stack | + | |
| - | The compiler implements function-level scoping by saving the values on the stack and only keeping a handle to them in the symbol table. When parsing a function definition, the current symbol count is saved into '' | + | |
| - | + | ||
| - | === Type vs Kind | + | |
| - | * Kind 0 = Local Variable | + | |
| - | ** A positive integer representing the distance below the frame pointer (GLZ). | + | |
| - | + | ||
| - | * Kind 1 = Params | + | |
| - | ** A positive integer representing the distance above the frame pointer. Accessed as '' | + | |
| - | + | ||
| - | * Kind 2 = Functions | + | |
| - | ** The 16-bit code address where the function's compiled body begins. Used by '' | + | |
| - | + | ||
| - | * Kind 3 = Globals | + | |
| - | ** A fixed memory address in bank 0, allocated from a bump pointer starting at '' | + | |
| - | + | ||
| - | === Compiling '' | + | |
| - | # '' | + | |
| - | # '' | + | |
| - | # It computes the slot address: '' | + | |
| - | # Writes " | + | |
| - | # Increments '' | + | |
| - | + | ||
| - | Next for '' | + | |
| - | + | ||
| - | # '' | + | |
| - | # Slot address: $00F040 + (1 * 12) = $00F04C | + | |
| - | # Writes " | + | |
| - | + | ||
| - | When '' | + | |
| - | + | ||
| - | #. '' | + | |
| - | #. Compares each record' | + | |
| - | #. Finds " | + | |
| - | #. The caller reads the value field at BLY+10, gets 2 | + | |
| - | #. Emits '' | + | |
| - | + | ||
| - | When a new '' | + | |
| - | + | ||
| - | '' | + | |
| - | + | ||
| - | The table is purely compile-time. The compiled program has no idea it exists. Variable names are resolved to numeric stack offsets during compilation and baked into the emitted instructions. | + | |
| - | + | ||
| - | === But why stack frames? | + | |
| - | The stack is used because | + | |
| - | + | ||
| - | When '' | + | |
| - | + | ||
| - | PUSH GLZ ; save old frame pointer (3 bytes on stack) | + | |
| - | MOV GLZ, SP ; GLZ = current stack top. This is our reference point | + | |
| - | SUB SP, #2 ; reserve 2 bytes for ' | + | |
| - | SUB SP, #2 ; reserve 2 bytes for ' | + | |
| - | + | ||
| - | The stack now looks like this (growing downward): | + | |
| - | + | ||
| - | ┌───────────────────────┐ | + | |
| - | | + | |
| - | ├───────────────────────┤ | + | |
| - | GLZ points | + | |
| - | ├───────────────────────┤ | + | |
| - | | + | |
| - | ├───────────────────────┤ | + | |
| - | | + | |
| - | ├───────────────────────┤ | + | |
| - | SP points | + | |
| - | └───────────────────────┘ | + | |
| - | + | ||
| - | '' | + | |
| - | + | ||
| - | But why male models? | + | |
| - | + | ||
| - | (But why on the stack?) It boils down to a memory management issue. You could assign '' | + | |
| - | + | ||
| - | int factorial(int n) { | + | |
| - | if (n == 0) return 1; | + | |
| - | return n * factorial(n - 1); | + | |
| - | } | + | |
| - | + | ||
| - | If '' | + | |
| - | + | ||
| - | The solution is, every call needs its own private copy of '' | + | |
| - | + | ||
| - | == Working through if-else | + | |
| - | After the symbol table, everything else started to fall into place quickly; if was next, and I already had the concept from the expression parser in BASIC. | + | |
| - | + | ||
| - | By this time the language structure was starting to become established so I will write down how I added if as a guide for other keywords. | + | |
| - | + | ||
| - | To get a new language feature like '' | + | |
| - | + | ||
| - | # Add it to the lexer (match keyword). | + | |
| - | # Parse the statement (done in the executor in BASIC, but a separate step here). | + | |
| - | # Emitter -- This is the big diff, since we don't do it ' | + | |
| - | + | ||
| - | The plan is the exact same as assembly (labels); back-patching, | + | |
| - | + | ||
| - | # Emit the condition (cc_parse_expr, | + | |
| - | # Emit CMP A, #0 | + | |
| - | # Emit JZ with a placeholder address. Save CC_HERE before writing it | + | |
| - | # Parse the then-block (emits its code) | + | |
| - | # Go back and patch the JZ address to point here (past the block) | + | |
| - | + | ||
| - | for if-else: | + | |
| - | + | ||
| - | # Emit condition (same) | + | |
| - | # Emit CMP A, #0 (same) | + | |
| - | # JZ placeholder (same) | + | |
| - | # Parse then-block (same) | + | |
| - | # Emit JMP placeholder (skip over else) | + | |
| - | # Patch placeholder to here | + | |
| - | # Parse else-block | + | |
| - | # Patch placeholder to here | + | |
| - | + | ||
| - | ==== Documenting the process | + | |
| - | Just like adding a keyword in BASIC. | + | |
| - | + | ||
| - | 1. Add the tokens. | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | .equ TOK_IF | + | |
| - | .equ TOK_ELSE | + | |
| - | </ | + | |
| - | + | ||
| - | 2. Add the strings. | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | cc_kw_if: | + | |
| - | .bytes " | + | |
| - | + | ||
| - | cc_kw_else: | + | |
| - | .bytes " | + | |
| - | </ | + | |
| - | + | ||
| - | 3. Add the keyword checks in the keyword checks section. | + | |
| - | <codify armasm> | + | |
| - | ; Check " | + | |
| - | LDELM CC_TOK_BUF | + | |
| - | LDFLD @cc_kw_if | + | |
| - | LDAH $02 | + | |
| - | INT $12 | + | |
| - | JZ @cc_nt_kw_if | + | |
| - | + | ||
| - | ; Check " | + | |
| - | LDELM CC_TOK_BUF | + | |
| - | LDFLD @cc_kw_else | + | |
| - | LDAH $02 | + | |
| - | INT $12 | + | |
| - | JZ @cc_nt_kw_else | + | |
| - | </ | + | |
| - | + | ||
| - | And the handlers, | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | cc_nt_kw_if: | + | |
| - | LDAL TOK_IF | + | |
| - | STAL [@CC_TOKEN_TYPE] | + | |
| - | POP I | + | |
| - | RET | + | |
| - | + | ||
| - | cc_nt_kw_else: | + | |
| - | LDAL TOK_ELSE | + | |
| - | STAL [@CC_TOKEN_TYPE] | + | |
| - | POP I | + | |
| - | RET | + | |
| - | </ | + | |
| - | + | ||
| - | We have to add it to the parser: | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | cc_parse_stmt: | + | |
| - | LDAL [@CC_TOKEN_TYPE] | + | |
| - | + | ||
| - | CMP AL, TOK_RETURN | + | |
| - | JZ @cc_parse_return | + | |
| - | + | ||
| - | CMP AL, TOK_INT | + | |
| - | JZ @cc_parse_vardecl | + | |
| - | + | ||
| - | CMP AL, TOK_IDENT | + | |
| - | JZ @cc_parse_assign | + | |
| - | + | ||
| - | CMP AL, TOK_IF | + | |
| - | JZ @cc_parse_if | + | |
| - | + | ||
| - | ; Unknown statement | + | |
| - | LDAL #1 | + | |
| - | STAL [@CC_ERROR] | + | |
| - | LDELM @cc_str_expect_stmt | + | |
| - | LDAH $18 | + | |
| - | INT $10 | + | |
| - | RET | + | |
| - | </ | + | |
| - | + | ||
| - | Finally here is the if parser. Note three important changes from BASIC. | + | |
| - | + | ||
| - | # Instead of manually checking for things like ) the system uses a TOKEN table (TOK_LPAREN, | + | |
| - | # You don't see int05_skip_space everywhere. Instead we have a cc_skip_ws which is included in cc_next_token. This isn't as ' | + | |
| - | # It's an emitter not an interpreter. Instead of doing it " | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | ; cc_parse_if | + | |
| - | ; ' | + | |
| - | ; | + | |
| - | ; Emitted code shape (if only): | + | |
| - | ; < | + | |
| - | ; | + | |
| - | ; JZ patch1 | + | |
| - | ; < | + | |
| - | ; | + | |
| - | ; | + | |
| - | ; Emitted code shape (if/ | + | |
| - | ; < | + | |
| - | ; | + | |
| - | ; JZ patch1 | + | |
| - | ; < | + | |
| - | ; JMP patch2 | + | |
| - | ; | + | |
| - | ; < | + | |
| - | ; | + | |
| - | ; | + | |
| - | cc_parse_if: | + | |
| - | ; Consume ' | + | |
| - | CALL @cc_next_token | + | |
| - | + | ||
| - | ; Expect ' | + | |
| - | LDAL TOK_LPAREN | + | |
| - | CALL @cc_expect | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_done | + | |
| - | + | ||
| - | ; Parse condition expression (result in A) | + | |
| - | CALL @cc_parse_expr | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_done | + | |
| - | + | ||
| - | ; Expect ' | + | |
| - | LDAL TOK_RPAREN | + | |
| - | CALL @cc_expect | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_done | + | |
| - | + | ||
| - | ; Emit: CMP_IMM A, #0 (opcode=91, reg=0, imm16=0, | + | |
| - | LDAL #91 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDAL #0 ; REG_A | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDAL #0 ; imm16 lo = 0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDAL #0 ; imm16 hi = 0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Emit: JZ < | + | |
| - | LDAL #101 ; JZ opcode | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Save CC_HERE — this is where the 3-byte address will go | + | |
| - | LDBLX [@CC_HERE] | + | |
| - | PUSH BLX ; patch1 address saved on stack | + | |
| - | + | ||
| - | ; Emit 3 placeholder bytes | + | |
| - | LDAL #0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | CALL @cc_emit_byte | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Parse then-block | + | |
| - | CALL @cc_parse_block | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_bail1 | + | |
| - | + | ||
| - | ; Check for ' | + | |
| - | LDAL [@CC_TOKEN_TYPE] | + | |
| - | CMP AL, TOK_ELSE | + | |
| - | JZ @cc_pi_else | + | |
| - | + | ||
| - | ; No else = patch1 = CC_HERE (jump past then-block) | + | |
| - | POP BLX ; BLX = patch1 location | + | |
| - | CALL @cc_backpatch | + | |
| - | JMP @cc_pi_done | + | |
| - | + | ||
| - | cc_pi_else: | + | |
| - | ; Consume ' | + | |
| - | CALL @cc_next_token | + | |
| - | + | ||
| - | ; Emit: JMP < | + | |
| - | LDAL #100 ; JMP opcode | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDBLY [@CC_HERE] | + | |
| - | PUSH BLY ; patch2 address saved on stack | + | |
| - | LDAL #0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | CALL @cc_emit_byte | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; We need patch1 now. Swap: | + | |
| - | POP BLY ; BLY = patch2 | + | |
| - | POP BLX ; BLX = patch1 | + | |
| - | PUSH BLY ; re-push patch2 | + | |
| - | + | ||
| - | CALL @cc_backpatch | + | |
| - | + | ||
| - | ; Parse else-block | + | |
| - | CALL @cc_parse_block | + | |
| - | LDAL [@CC_ERROR] | + | |
| - | CMP AL, #0 | + | |
| - | JNZ @cc_pi_bail1 | + | |
| - | + | ||
| - | ; Patch2 = CC_HERE (JMP lands past else-block) | + | |
| - | POP BLX ; BLX = patch2 location | + | |
| - | CALL @cc_backpatch | + | |
| - | + | ||
| - | cc_pi_done: | + | |
| - | RET | + | |
| - | + | ||
| - | cc_pi_bail1: | + | |
| - | POP BLX ; discard patch address | + | |
| - | RET | + | |
| - | </ | + | |
| - | + | ||
| - | For a backpatch helper (in case we need this elsewhere) we just copy HERE: | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | + | |
| - | ;; backpatch helper | + | |
| - | cc_backpatch: | + | |
| - | PUSH ELM | + | |
| - | LDELM [@CC_HERE] | + | |
| - | STELM [FLD] | + | |
| - | POP ELM | + | |
| - | RET | + | |
| - | </ | + | |
| - | + | ||
| - | == while is easy | + | |
| - | Once we have backpatching and the expression machinery, every new control flow construct is just a new arrangement of jumps. '' | + | |
| - | + | ||
| - | == Functions and Pointers and Etc | + | |
| - | To make a long story short I started to feel the pain of a 16 bit symbol table as I had already done a significant amount of work and I was in no mood to rewrite | + | |
| - | + | ||
| - | So I came up with the brilliant idea of just making everything in bank 0. | + | |
| - | + | ||
| - | == Final Boss: Strings | + | |
| - | Strings were the final boss. I am still working on them, kind of. Just fixed some errors this morning. | + | |
| - | + | ||
| - | Adding strings was easy after char *. Same process: add TOK_CHAR. cc_kw_char. etc. It was the usual suspects for bugs again too. The usual pointer clobber bugs. | + | |
| - | + | ||
| - | === Jumping past strings | + | |
| - | The key is to jump past the string. The data lives inside the code. | + | |
| - | + | ||
| - | char *s = " | + | |
| - | + | ||
| - | When we emit the code here we will add a jmp, insert the string, then set s to the address of the string. This will allow us to reference the string like an array (s[6]) and so forth. Buffer overflows! Ahh so that's where this stuff comes from. | + | |
| - | + | ||
| - | s[6] = 133; call opcode; now we're in business! | + | |
| - | + | ||
| - | == Moving the Stack | + | |
| - | I eventually had to move the stack because the stack is in bank 1 and this means ''& | + | |
| - | + | ||
| - | int main(void) { | + | |
| - | char *msg = " | + | |
| - | return msg[1]; | + | |
| - | } | + | |
| - | + | ||
| - | The solution is to move the SP inside the C REPL only. While we are inside we will do something like: | + | |
| - | + | ||
| - | MOV GLZ, SP | + | |
| - | MOV SP, $FFFF | + | |
| - | PUSH GLZ | + | |
| - | + | ||
| - | <program code here> | + | |
| - | + | ||
| - | POP SP | + | |
| - | RET | + | |
| - | + | ||
| - | + | ||
| - | == Built-in functions | + | |
| - | Once strings worked I added putchar() as a built-in | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | ; Built-in function emitter. | + | |
| - | ; Emits machine code for built-in functions and registers them in the symbol table. | + | |
| - | cc_emit_builtins: | + | |
| - | CALL @cc_builtin_putchar | + | |
| - | RET | + | |
| - | + | ||
| - | ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | + | |
| - | ; putchar(c) | + | |
| - | ; Prints character c to terminal via INT $10 AH=$17 | + | |
| - | cc_builtin_putchar: | + | |
| - | ; Register in symbol table | + | |
| - | LDA [@CC_HERE] | + | |
| - | MOV B, A | + | |
| - | LDELM @cc_bi_putchar | + | |
| - | + | ||
| - | LDAL #0 ; type = 0 (int) | + | |
| - | LDAH #2 ; kind = 2 (function) | + | |
| - | CALL @cc_sym_add | + | |
| - | + | ||
| - | ; Emit prologue | + | |
| - | CALL @cc_emit_push_glz | + | |
| - | CALL @cc_emit_mov_glz_sp | + | |
| - | + | ||
| - | ; Emit: LDA [GLZ+6] (first arg) | + | |
| - | LDA #210 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA #0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA #94 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA #6 | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Emit: LDAH $17 (teletype putchar) | + | |
| - | LDA #0 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA #40 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA $17 | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Emit: INT $10 | + | |
| - | LDA #134 | + | |
| - | CALL @cc_emit_byte | + | |
| - | LDA $10 | + | |
| - | CALL @cc_emit_byte | + | |
| - | + | ||
| - | ; Emit epilogue | + | |
| - | CALL @cc_emit_mov_sp_glz | + | |
| - | CALL @cc_emit_pop_glz | + | |
| - | CALL @cc_emit_ret | + | |
| - | RET | + | |
| - | + | ||
| - | + | ||
| - | ; --- Name strings --- | + | |
| - | cc_bi_putchar: | + | |
| - | .bytes " | + | |
| - | </ | + | |
| - | + | ||
| - | After this, I added getchar(), halt() and yield(), and then things started getting really easy, for example print() is just: | + | |
| - | + | ||
| - | <codify C> | + | |
| - | int print(char *s) { | + | |
| - | int i = 0; | + | |
| - | while (s[i] != 0) { | + | |
| - | putchar(s[i]); | + | |
| - | i = i + 1; | + | |
| - | } | + | |
| - | return 0; | + | |
| - | } | + | |
| - | </ | + | |
| - | + | ||
| - | Wow, so we're writing our library in C now. Things are starting to go well from here! | + | |
| - | + | ||
| - | * [[List of TinyC built-in functions]] | + | |
| - | == Evil Stack Bugs Strike Again | + | === Language semantics |
| - | While not technically a bug, there's a ' | + | * Pointer arithmetic doesn't scale by element size: int *p; p + 1 adds 1 byte, not 3. |
| + | * sizeof would be needed to write portable code; absent. | ||
| + | * Char arithmetic doesn' | ||
| + | * No automatic conversions, | ||
| - | while (condition) { | + | == Roadmap |
| - | .... | + | This is not a roadmap. |
| - | int c = msg[i]; | + | |
| - | .... | + | |
| - | } | + | |
| - | This emits '' | + | === Easy |
| + | * Bitwise operators. | ||
| + | * ++/-- and compound assignment. | ||
| + | * % operator; opcode already exists | ||
| + | * Pre-processor. Separate pass before lexing; probably harder than it sounds. | ||
| - | == break/continue notes | + | === Medium |
| - | # PUSH outer break count (2 bytes) | + | * &&/|| with short-circuit. Needs branch generation. |
| - | # PUSH outer loop top (3 bytes) | + | * Local arrays. int arr[10] allocates 30 bytes on stack, needs frame offset adjustment and array-as-pointer-decay. |
| - | # PUSH loop_top for JMP back (3 bytes) | + | |
| - | # PUSH patch_exit (3 bytes) | + | |
| - | # POP patch_exit for JMP emit, re-push | + | |
| - | # POP loop_top into ILY | + | |
| - | # POP patch_exit for backpatch | + | |
| - | # POP outer loop top, restore | + | |
| - | # POP outer break count, restore | + | |
| + | === Hard/ | ||
| + | * Initialized globals. This is, unfortunately, | ||
| + | * Standard library subset -- strlen, strcmp, memcpy, printf (formatted output is a project of its own). | ||
| + | * Type checking. Proper type expression evaluation, conversion rules. | ||
| + | * Structs. Type representation grows from 1 byte to multi-byte type descriptor. | ||
sd/tinyc_for_the_sd-8516.1776409745.txt.gz · Last modified: by appledog
