sd:one_instruction_per_cycle [2026/04/14 06:09] created by appledog

= One Instruction Per Cycle

When discussing emulators, people often bring up the concept of "cycle accurate" emulation, in which the emulator reproduces the exact per-instruction timing of the original hardware.

However, the SD-8516 is not trying to emulate anything, and is thus unconstrained by any need to impose artificial limitations on things like cycles per instruction. It has no need to perform at a certain speed. So how long does it take to execute one instruction?

One cycle.

Every cycle, or "step" of the execution loop, runs exactly one instruction, whatever that instruction is.

The issue is touching memory. Every time an instruction has to touch memory, it slows the system down. One cycle is still one cycle, but if the implementation has to run through an ''if'', the instruction takes about 20-30% extra time, and every memory access costs about as much again.

If I had to guess, I would say that if each operation had a base speed of 10, then a memory access would add 4 and an ''if'' would add 4. This is a very rough estimate. So an instruction that has to fetch an extra word, such as ''LDAB'', ends up costing around 16 instead of 10:

  LDA $2233       ; takes 10 units of time
  LDB $0011       ; takes 10 units of time

  LDAB $00112233  ; takes 16 units of time

Something like this; it's very rough.

Where this gets interesting is in algorithms that use something like repeated SHL to "multiply" a value by a power of two:

  ; Multiply A by 16:
  SHL A
  SHL A
  SHL A
  SHL A

This takes 40 units of time, 10 for each op. But this takes 10 units of time (maybe 11):

  MUL A, #16

So one must always remember that since the execution loop has a fixed cost per instruction, fewer instructions are almost always faster than more.

It also pays hugely to remember which operations set flags. On a low-memory computer such as an Apple ][, BBC Micro, C64, or TRS-80, you would want to know this to save space. Here, you need to know it to save time as well!

Example:

  loop:
  DEC X
  CMP X, #0
  JZ @finish
  JMP @loop

  finish:
  RET

This example program takes 4 instructions and 8 bytes of memory access. Can we do better?

  loop:
  DEC X
  JNZ @loop

  finish:
  RET

This is much faster. First, we know that DEC will set the zero flag when it decrements to zero, so no CMP is needed. Second, we invert the check and use fallthrough to reach the exit point. This inner loop is more than twice as fast as the previous one. The same trick works with LDA as well:

  str_loop:
  LDAL [ELM]
  LDBL [FLD]
  JZ @str_end   ; the load itself set the zero flag: end of string

  CMP AL, BL    ; Are the characters equal?
  JZ @str_loop  ; keep looping while they match

  str_end:
  CMP AL, BL    ; Perform the check again: if BL was 0, this tells us whether AL ended too.
                ; Now we can return the strcmp results.

This shows that if an LD loads a zero, the zero flag is already set and we can branch immediately; we do not need a CMP.

The idea is that fewer instructions are faster.

* On normal CPUs, MUL and DIV are very slow.
** On the SD-8516, a single MUL is faster than several SHLs.

* DIV is likewise faster than several SHRs.

* Remember which operations set flags, so you don't waste time on a CMP.

== Oh help, I'm running out of registers!
There is a situation where it's possible to get confused over register pairing and run out of registers for your calling convention. In this case there's an easy escape hatch: use memory. First, consider the cost of an ordinary load:

  LDA $500 ; loads $500 into A.

The above compiles to ''00 00 05 00''. That's four bytes that need to be read in. Let's say, then, that the cost of this instruction is the cost of loading four bytes from WASM memory. That is the essential cost; the CPU fetch-execute cycle itself has a fixed cost, so we can ignore it or add some constant.

Now consider:

  MOV A, B

This costs three bytes to fetch and execute: one byte for the opcode and one for each register. What about loading from memory?

  LDA [$000000] ; This is a 5-byte instruction.

At five bytes, it's 30-40% slower, on average, than ''LDA $500''. But for scratch values that are infrequently used and not in a hot path, this frees up registers. In other words, feel free to use memory!

Now, if you consider the number of operations and bytes a simple push/pop would use to juggle a register, maybe it's better to keep certain kinds of pointer or scratch value in memory rather than in a register? It would need to be done with consideration, but the option is there.
| + | | ||
