sd:tinyc_developer_diary

Differences

This shows you the differences between two versions of the page.

sd:tinyc_developer_diary [2026/04/17 14:12] – created appledogsd:tinyc_developer_diary [2026/04/20 04:58] (current) appledog
Line 37: Line 37:
 </blockquote> </blockquote>
  
-So the first kind of parser you should write has a pre-processor that encases everything in parentheses. Then brute force it; just scan for ()'s and emit the innermost functions first. It would end up looking like (2 + ((3 * 4) / 5)). This makes parsing robotic and easy.+So the first kind of parser you should write has a pre-processor that encases everything in parentheses. Then brute force it; just scan for ()'s and emit the innermost functions first. It would end up looking like <nowiki>(2 + ((3 * 4) / 5)).</nowiki> This makes parsing robotic and easy.
  
 There's nothing wrong with parsing like that. It's a good way to emit code. It's reliable and safe. There's nothing wrong with parsing like that. It's a good way to emit code. It's reliable and safe.
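To make the fully parenthesized idea concrete, here is a minimal sketch in standard C. It is not the diary's parser: `eval_string`, `eval_paren`, and `skip_spaces` are hypothetical names, and it evaluates the expression directly instead of emitting code. Because every operation is wrapped in its own parentheses, the recursion never has to think about precedence, and the innermost group always finishes first.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper names; none of these come from the diary. */
static const char *p;                    /* read position in the expression */

static void skip_spaces(void) { while (*p == ' ') p++; }

/* Evaluate either a decimal number or a group of the form "(lhs op rhs)". */
static int eval_paren(void) {
    skip_spaces();
    if (*p != '(') {
        int n = 0;
        assert(isdigit((unsigned char)*p));
        while (isdigit((unsigned char)*p)) { n = n * 10 + (*p - '0'); p++; }
        return n;
    }
    p++;                                 /* consume '(' */
    int lhs = eval_paren();              /* inner groups complete before we continue */
    skip_spaces();
    char op = *p++;
    int rhs = eval_paren();
    skip_spaces();
    assert(*p == ')');
    p++;                                 /* consume ')' */
    switch (op) {
    case '+': return lhs + rhs;
    case '-': return lhs - rhs;
    case '*': return lhs * rhs;
    case '/': return lhs / rhs;          /* integer division; no divide-by-zero check */
    }
    assert(0 && "unknown operator");
    return 0;
}

int eval_string(const char *s) { p = s; return eval_paren(); }
```

A real compiler would emit instructions at the point where this sketch does the arithmetic, but the order of work would be the same.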
Line 675: Line 675:
 # POP outer break count, restore # POP outer break count, restore
  
 +
 +== V2: The Lexer
 +The lexer converts source text into tokens. It maintains a read position into the source buffer, provides one-character lookahead, skips whitespace and comments, and recognizes keywords, identifiers, numbers, strings, character literals, and operators.
 +
 +The character-level helper functions:
 +<codify C>
 +int peek_char() {
 +    if (g_src_ptr >= g_src_end) {
 +        return 0;
 +    }
 +    char *p = g_src_ptr;
 +    return *p;
 +}
 +
 +int next_char() {
 +    if (g_src_ptr >= g_src_end) {
 +        return 0;
 +    }
 +    char *p = g_src_ptr;
 +    int c = *p;
 +    g_src_ptr = g_src_ptr + 1;
 +    return c;
 +}
 +
 +int skip_ws() {
 +    while (1) {
 +        int c = peek_char();
 +        if (c == ' ') { next_char(); }
 +        else if (c == 9) { next_char(); }
 +        else if (c == 10) { next_char(); }
 +        else if (c == 13) { next_char(); }
 +        else { return 0; }
 +    }
 +    return 0;
 +}
 +</codify>
 +
 +From here we will find tokens using next_token() and then go to a sub-lexer depending on the token we find. For example, numbers:
 +
 +<codify C>
 +int lex_number() {
 +    char *buf = 0xF000;
 +    int n = 0;
 +    int i = 0;
 +
 +    // Check for hex: 0x or 0X
 +    if (peek_char() == '0') {
 +        int c0 = next_char();
 +        *(buf + i) = c0;
 +        i = i + 1;
 +        int c1 = peek_char();
 +        if (c1 == 'x') {
 +            lex_hex(buf, i);
 +            return 0;
 +        }
 +        if (c1 == 'X') {
 +            lex_hex(buf, i);
 +            return 0;
 +        }
 +        // Just a leading zero, fall through as decimal
 +    }
 +
 +    // Decimal digits
 +    while (isdigit(peek_char())) {
 +        int c = next_char();
 +        *(buf + i) = c;
 +        i = i + 1;
 +        n = n * 10 + c - '0';
 +    }
 +    *(buf + i) = 0;
 +
 +    g_tok_ival = n;
 +    g_tok_type = TOK_NUMBER;
 +    return 0;
 +}
 +
 +int lex_hex(char *buf, int i) {
 +    int c = next_char();        // consume 'x' or 'X'
 +    *(buf + i) = c;
 +    i = i + 1;
 +    int n = 0;
 +    while (1) {
 +        int c = peek_char();
 +        int digit = hex_digit(c);
 +        if (digit < 0) { break; }
 +        next_char();
 +        *(buf + i) = c;
 +        i = i + 1;
 +        n = n * 16 + digit;
 +    }
 +    *(buf + i) = 0;
 +    g_tok_ival = n;
 +    g_tok_type = TOK_NUMBER;
 +    return 0;
 +}
 +
 +int hex_digit(int c) {
 +    if (c >= '0') {
 +        if (c <= '9') { return c - '0'; }
 +    }
 +    if (c >= 'a') {
 +        if (c <= 'f') { return c - 'a' + 10; }
 +    }
 +    if (c >= 'A') {
 +        if (c <= 'F') { return c - 'A' + 10; }
 +    }
 +    return 0 - 1;
 +}
 +</codify>
 +
 +=== next_token()
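 +next_token() first calls skip_ws() and peeks at the next character, then dispatches on it. For example, the branch that handles '/' and '//' comments: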
 +
 +<codify C>
 +    // '/' or '//' comment
 +    if (c == '/') {
 +        next_char();
 +        if (peek_char() == '/') {
 +            // line comment; skip to end of line
 +            next_char();
 +            while (1) {
 +                int cc = peek_char();
 +                if (cc == 0) { break; }  // end of string
 +                if (cc == 10) { break; } // end of line
 +                next_char();
 +            }
 +            return next_token();    // re-enter tokenizer
 +        }
 +        g_tok_type = TOK_SLASH;
 +        return 0;
 +    }
 +</codify>
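For readers who want to try the dispatch idea outside the tinyc dialect, the following is a rough, self-contained sketch in standard C. It is an assumption-laden stand-in, not the diary's tokenizer: the names `lexer_init`, `next_token_sketch`, `last_number`, and the `TOK_IDENT`, `TOK_EOF`, `TOK_OTHER` values are made up for illustration, and comments are skipped with a loop rather than by re-entering the tokenizer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds; only TOK_NUMBER and TOK_SLASH appear in the diary. */
enum { TOK_EOF, TOK_NUMBER, TOK_IDENT, TOK_SLASH, TOK_OTHER };

static const char *src;                  /* read position */
static int tok_type;
static int tok_ival;

/* Skip blanks and // line comments in a loop. */
static void skip_ws_and_comments(void) {
    for (;;) {
        while (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r') src++;
        if (src[0] == '/' && src[1] == '/') {
            while (*src != '\0' && *src != '\n') src++;
            continue;                    /* more whitespace may follow the comment */
        }
        return;
    }
}

/* Look at the first character and hand off to the matching sub-lexer. */
int next_token_sketch(void) {
    skip_ws_and_comments();
    unsigned char c = (unsigned char)*src;
    if (c == '\0') { tok_type = TOK_EOF; return tok_type; }
    if (isdigit(c)) {                    /* decimal only in this sketch */
        tok_ival = 0;
        while (isdigit((unsigned char)*src)) { tok_ival = tok_ival * 10 + (*src - '0'); src++; }
        tok_type = TOK_NUMBER;
        return tok_type;
    }
    if (isalpha(c) || c == '_') {
        while (isalnum((unsigned char)*src) || *src == '_') src++;
        tok_type = TOK_IDENT;
        return tok_type;
    }
    src++;                               /* single-character token */
    tok_type = (c == '/') ? TOK_SLASH : TOK_OTHER;
    return tok_type;
}

void lexer_init(const char *text) { assert(text != NULL); src = text; }
int last_number(void) { return tok_ival; }
```

Calling `lexer_init` on a string and then `next_token_sketch` repeatedly yields one token kind per call until `TOK_EOF`.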
 +
 +
 +== A classic clobbering bug
 +To do forward declarations, the code used two special global variables to pass information between two parts of the code: KIND and ENTRY. KIND told the code whether the function was already defined (kind 2) or just forward-declared (kind 4). But the same part of the parser that identified a function was being called recursively on that function's arguments, so the function kind was overwritten while the arguments were being parsed. As a result, its address was not being back-patched in the patch loop chain.
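The shape of this bug can be reduced to a few lines of standard C. This is only an illustration under assumptions: `g_kind`, `lookup`, `parse_call_buggy`, and `parse_call_fixed` are hypothetical names, and the second `lookup` call stands in for recursively parsing an argument that is itself a function call. The usual cure, shown in the second function, is to copy the global into a local before recursing.

```c
#include <assert.h>

/* Hypothetical stand-in for the diary's KIND global. */
static int g_kind;

enum { KIND_DEFINED = 2, KIND_FORWARD = 4 };

/* Identifying a function records its kind in the shared global. */
static void lookup(int kind) { g_kind = kind; }

/* Buggy shape: the global is read only after the nested parse has run. */
int parse_call_buggy(int outer_kind, int inner_kind) {
    lookup(outer_kind);
    lookup(inner_kind);      /* stands in for parsing an argument that is a call */
    return g_kind;           /* by now this is the inner function's kind */
}

/* Fixed shape: save the global in a local before the nested parse. */
int parse_call_fixed(int outer_kind, int inner_kind) {
    lookup(outer_kind);
    int kind = g_kind;       /* local copy survives the recursion */
    assert(kind == outer_kind);
    lookup(inner_kind);
    return kind;
}
```

With a forward-declared outer function and an already-defined inner one, the buggy version reports kind 2, so a back-patch keyed on kind 4 would be skipped.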
 +
 +Now I know this sounds simple when I lay it out, but this took me 10 hours to fix. It's just a classic clobber-bug but it's a sign this code is starting to weigh me down. I just want to write some games. I don't understand why this has to be so hard. But even after all this I am pretty sure it is easier than writing a GCC or LLVM back end.
 +
 +== Everything in bank 0!
 +I was such a fool to make the symbol table 16 bit. Now everything has to be in bank 0. But there's not enough space for self-hosting. The code AND the compiled code both are over 32k.
 +
 +I have a plan. What if we put the output ~2k before the source code? The tokenizer and other things seem to output fewer bytes than the source code they consume, so it should work.
 +
 +So if HERE grows into the source area as source gets consumed, we can theoretically use up to 48KB for source code AND compiled code. This requires HERE to stay behind SRC at all times. It destroys the code as it compiles, but that's okay. The two pointers march through the memory, with HERE chasing SRC. The only problem is if the emitter somehow catches up with the source code pointer.
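Here is a toy model of the two chasing pointers in standard C, with the "emitter catches up" case turned into an explicit check. Everything in it is an assumption for illustration: `compile_in_place` and `emit_byte` are hypothetical names, and the "compiler" just drops spaces, copies most bytes, and expands `+` into two bytes so that output can occasionally outgrow input.

```c
#include <assert.h>
#include <stddef.h>

/* Write one output byte, refusing to overwrite source that is still unread. */
static int emit_byte(unsigned char **here, const unsigned char *src, unsigned char b) {
    if (*here >= src) return 0;          /* HERE has caught up with SRC */
    *(*here)++ = b;
    return 1;
}

/* Compile src_len bytes that start gap bytes into mem, writing output at mem.
   Returns the number of bytes emitted, or -1 if output would clobber unread source. */
long compile_in_place(unsigned char *mem, size_t gap, size_t src_len) {
    assert(mem != NULL);
    unsigned char *here = mem;                    /* write pointer */
    const unsigned char *src = mem + gap;         /* read pointer, starts ahead */
    const unsigned char *end = src + src_len;
    while (src < end) {
        unsigned char c = *src++;                 /* consume before emitting */
        if (c == ' ') continue;                   /* toy rule: spaces emit nothing */
        if (c == '+') {                           /* toy rule: '+' expands to 2 bytes */
            if (!emit_byte(&here, src, 'A') || !emit_byte(&here, src, 'D')) return -1;
        } else if (!emit_byte(&here, src, c)) {
            return -1;
        }
    }
    return (long)(here - mem);
}
```

With a 2-byte head start the source `1 + 2` compiles to `1AD2` in place, while `++` with no head start fails on its second output byte, which is the situation the gap is meant to prevent.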
 +
 +At this point I am very close to just rewriting the symbol table to be 32 bit. I mean this is already an insane asylum as it is, but, how hard could rewriting the emitters possibly be? :)
 +
 +== Rewriting the symbol table
 +I guess I'm about to find out how hard it can be.
sd/tinyc_developer_diary.1776435140.txt.gz · Last modified: by appledog
