Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| sd:emulation_benchmarks [2026/02/22 14:39] – created - external edit 127.0.0.1 | sd:emulation_benchmarks [2026/05/16 00:24] (current) – appledog | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| = Emulation Benchmarks | = Emulation Benchmarks | ||
| - | I have a working theory that the feel of a system is defined by its limits. Meaning, "this is what we can give you, and no more". Within this space lives the programmer. When constraints are tight, the programmer must work hard and come up with something special just to make something in the first place. When constraints are removed, programmers get lazy. Well, let's be honest: not lazy, but in some ways more productive and in some ways less. | ||
| - | |||
| - | When you have a lot of extra RAM and processing speed, things tend to fall into a 3D open world environment. However, Castlevania SOTN on PlayStation 1 showed that even in a 30 MIPS, 32-bit environment (the SNES was 8/16-bit, 1.4 MIPS) you could make a credible 2D RPG. The thing is, though, Castlevania could do it because Castlevania came from that legacy. To make SOTN without the existing IP behind it would have been considered a mistake at the time. | ||
| - | |||
| - | Therefore, let us attempt to allow emulation of modes and time periods -- NOT of hardware itself. The 8516 has a decent period-accurate " | ||
| - | |||
| - | == Other Fantasy Computers | ||
| - | * SD-8516 clocks in at 125 MIPS on an i7-12700K (Geekbench 6 baseline 2500) | ||
| - | |||
| - | On the same computer, | ||
| - | |||
| - | * SD-8510 runs at 4.5 MIPS | ||
| - | * XR/Station running ASIX, Dhrystone 2.1 shows 9.75 VAX MIPS | ||
| - | |||
| - | |||
| == The Chart | == The Chart | ||
| Legend: | Legend: | ||
| * Green: The SD-8516 is capable of emulating this level of performance on an i7-12700 (Geekbench 6 baseline). | * Green: The SD-8516 is capable of emulating this level of performance on an i7-12700 (Geekbench 6 baseline). | ||
| - | * Light Gray: Although the SD-8516 cannot | + | * Light Green: Capable, but only with extensive use of the PPU and APU. |
| - | * Dark Gray: The SD-8516 cannot emulate this system. This is almost always because it is slower than 50% of the required speed. For example, it is unlikely to achieve Dreamcast-level performance | + | * Light Gray: The SD-8516 cannot |
| - | * Red: The system is unlikely to emulate this level of performance | + | * Dark Gray: The SD-8516 cannot emulate this system. This is almost always because it is designed as a single-threaded system with no 3D acceleration. Modern computers can often run emulated systems at near-native speeds. But even if we pass-through |
| ^ Year ^ System ^ CPU ^ Width ^ Approx MIPS ^ RAM ^ Graphics ^ Audio ^ Notes ^ | ^ Year ^ System ^ CPU ^ Width ^ Approx MIPS ^ RAM ^ Graphics ^ Audio ^ Notes ^ | ||
| Line 35: | Line 20: | ||
| | @green:1987 | Amiga 500 | 68000 @ 7 MHz | 16/32 | 4.5 | 512 KB–1 MB | 320×256×32/ | | @green:1987 | Amiga 500 | 68000 @ 7 MHz | 16/32 | 4.5 | 512 KB–1 MB | 320×256×32/ | ||
| | @green:1991 | SNES | Ricoh 5A22 | 16-bit | 1.5 | 128 KB + VRAM | 256×224×256 | Advanced Sample Playback | Slow CPU; complex PPU & DMA | | | @green:1991 | SNES | Ricoh 5A22 | 16-bit | 1.5 | 128 KB + VRAM | 256×224×256 | Advanced Sample Playback | Slow CPU; complex PPU & DMA | | ||
| - | | @lightgreen:1991 | 386SX | 80386SX @ 25 MHz | 32-bit | 10 | 4 MB | 320×200×256 VGA | SB Pro | Software rendered; cache key for speed | | + | | @green:1991 | 386SX | 80386SX @ 25 MHz | 32-bit | 10 | 4 MB | 320×200×256 VGA | SB Pro | Software rendered; cache key for speed | |
| - | | @lightgreen:1992 | Amiga 1200 | 68EC020 @ 14 MHz | 32-bit | 14 | 2 MB | 320×256×256 | Paula (DMA-driven PCM playback) | Much easier CPU than SNES to emulate | | + | | @green:1992 | Amiga 1200 | 68EC020 @ 14 MHz | 32-bit | 14 | 2 MB | 320×256×256 | Paula (DMA-driven PCM playback) | Much easier CPU than SNES to emulate | |
| - | | @lightgreen:1992 | 486 Gamer | 486DX2-66 | 32-bit | 54 | 8 MB | 640×480×256 (S3 ViRGE) | SB16 | ~20 FPS software Quake; 3D accel emerging | | + | | @green:1992 | 486 Gamer | 486DX2-66 | 32-bit | 54 | 8 MB | 640×480×256 (S3 ViRGE) | SB16 | ~20 FPS software Quake; 3D accel emerging | |
| - | | @lightgray:1994 | PS1 | MIPS R3000A | 32-bit | 30 | 2 MB + 1 MB VRAM | 320×240 | SPU | Geometry-heavy; | + | | @lightgreen:1994 | PS1 | MIPS R3000A | 32-bit | 30 | 2 MB + 1 MB VRAM | 320×240 | SPU | Geometry-heavy; |
| - | | @lightgray:1994 | Pentium 90 | 586, 90 MHz | 32-bit | 90 | 32 MB | 640×480×16M (PCI VGA) | SB AWE32 | Smooth Quake @ 50+ FPS; Win95 ready | | + | | @lightgreen:1994 | Pentium 90 | 586, 90 MHz | 32-bit | 90 | 32 MB | 640×480×16M (PCI VGA) | SB AWE32 | Smooth Quake @ 50+ FPS; Win95 ready | |
| - | | @lightgray:1996 | N64 | MIPS VR4300 | 64-bit | 125 | 4–8 MB | 320×240–640×480 | | Hard due to RSP/RDP synchronization | | + | | @lightgreen:1996 | N64 | MIPS VR4300 | 64-bit | 125 | 4–8 MB | 320×240–640×480 | | Hard due to RSP/RDP synchronization | |
| - | | @gray:1998 | Dreamcast | SH-4 @ 200 MHz | 32-bit | 360 | 16 MB | 640×480 | AICA | Very emulator-friendly architecture | | + | | @lightgreen:1998 | Dreamcast | SH-4 @ 200 MHz | 32-bit | 360 | 16 MB | 640×480 | AICA | Very emulator-friendly architecture | |
| - | | @red:2000 | PS2 | MIPS R5900 | 128-bit SIMD | 6000 + 40 | 32 MB | 640×448 | SPU2 | Emotion Engine 6k MIPS; +40 MIPS for PS1 compat.; VUs dominate | | + | | @darkgray:2000 | PS2 | MIPS R5900 | 128-bit SIMD | 6000 + 40 | 32 MB | 640×448 | SPU2 | Emotion Engine 6k MIPS; +40 MIPS for PS1 compat.; VUs dominate | |
| - | | @red:2001 | GameCube | Gekko @ 485 MHz | 32-bit | 1125 | 24 MB | 640×480 | Flipper DSP | Dolphin emulator gold standard; clean PowerPC arch | | + | | @lightgray:2001 | GameCube | Gekko @ 485 MHz | 32-bit | 1125 | 24 MB | 640×480 | Flipper DSP | Dolphin emulator gold standard; clean PowerPC arch | |
| - | | @red:2017 | Switch | ARM Cortex-A57 | 64-bit | 12,000 | 4 GB | 720p–1080p | | GPU & OS dominate emulation cost | | + | | @darkgray:2017 | Switch | ARM Cortex-A57 | 64-bit | 12,000 | 4 GB | 720p–1080p | | GPU & OS dominate emulation cost | |
| - | + | == What happened? | |
| - | == Observations | + | |
| Looking at the above chart, you will see "what happened" | Looking at the above chart, you will see "what happened" | ||
| Line 61: | Line 45: | ||
| * Branch prediction: predicted branches (e.g., bottom-of-loop) increased throughput. | * Branch prediction: predicted branches (e.g., bottom-of-loop) increased throughput. | ||
| - | These Pentium-specific traits were exploited via Abrash' | + | These Pentium-specific traits were exploited via Abrash' |
| - | + | ||
| - | == SD-8516 Emulation Strategy | + | |
| - | The SD-8516 operates at a GB6 baseline of 70 MIPS. This implies it has operational power similar to a 486DX-66. While development is still ongoing, the strategy will be to allow the programmer to control various system modes that allow the " | + |
| - | + | ||
| - | Then we can allow modes that simulate the C64/128 era, which represents the 1 to 2 MHz, 64k-128k RAM " | + |
| - | + | ||
| - | Finally, we can allow an emulation mode similar to an Amiga 500, 386, or Amiga 1200. This means a semi-advanced workstation; | + | |
| - | + | ||
| - | Next, with some minor graphics acceleration, | + | |
| - | + | ||
| - | However, rewriting in C would blow the lid off performance, | + | |
| - | + | ||
| - | - stabilize the ISA | + | |
| - stabilize a kernal | + |
| - write a BASIC | + |
| - | - INT helper function library | + | |
| - | - profile instructions and try to get MIPS up to 95-100 | + | |
| - | + | ||
| - | The ISA and kernal are essentially stable. I don't see a need to make major changes now. Possibly some block operations or floating point operations can be added, but henceforth this will most likely be done through the INT library. I like to keep a simple ISA. As for the kernal, we have a basic input system that looks and feels like a typical language machine -- similar to a BASIC terminal or Python command line. | + |
| - | + | ||
| - | Functions may get pulled into the kernal later, but right now the main direction forward is to get BASIC up and operational by writing as few INT helper functions as needed. Once we have the BASIC system, video/audio mode switching will be slowly implemented by creating environments in various video modes, each of which will attempt to recreate the feel of an era. But the actual operation will be kept separate, so you can mix and match. It will be beautiful! | + |
| == Profiling Experiments | == Profiling Experiments | ||
| - | Writing assembly language programs | + | Measured on an i7-12700K, a basic loop example executes at 55 MIPS in the WASM version and at 550 MIPS in the C version. However, there' |
| - | ==== Profiling Example | + | === MIPS isn't useful |
| - | The following program illustrates | + | The following program illustrates |
| <codify armasm> | <codify armasm> | ||
| - | .address $010100 | + | .address $000100 |
| - | | + | LDCD $0FFFFFFF |
| - | LDB $0101 | + | |
| - | + | ||
| - | | + | |
| loop: | loop: | ||
| DEC CD ; Decrement CD | DEC CD ; Decrement CD | ||
| - | JNZ @loop ; Jump to loop if CD != 0 | + | JNZ @loop |
| HALT ; Halt when done | HALT ; Halt when done | ||
| </ | </ | ||
| - | This program executes at a certain speed we can call X. It doesn' | + | * WASM version 55 MIPS. |
| - | + | * C version | |
| - | One idea is to increase the number of DEC instructions relative to JNZ and see what happens. In the regular run I got a score of 77 MIPS on my 12600k. Increasing the DEC:JNZ ratio to 10:1 brought us down to 56 MIPS. At 100:1 we got 54 MIPS. | + |
| - | + | ||
| - | On the other side, a program that tests JNZ to DEC 10:1 brings MIPS up to 91. In either case, a nearly 20 MIPS difference. Therefore clearly, JNZ is a much faster operation than DEC, although you would expect DEC to be a lot faster than JNZ! The reason why is that DEC CD is very slow, as it is a dual register DEC. Moving to single register DEC increases the speed by 50-100%: | + | |
| - | + | ||
| - | <codify armasm> | + | |
| - | .address $010100 | + | |
| - | + | ||
| - | LDC #10000 | + | |
| - | LDD #25000 | + | |
| - | + | ||
| - | loop: | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | DEC C | + | |
| - | JNZ loop | + | |
| - | + | ||
| - | ; C reached zero, decrement D | + | |
| - | LDC #10000 | + | |
| - | DEC D | + | |
| - | JNZ loop | + | |
| - | + | ||
| - | ; done | + | |
| - | HALT | + | |
| - | </ | + | |
| - | + | ||
| - | This version | + | |
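The DEC:JNZ ratio measurements reported earlier in this section (77 MIPS at 1:1, 56 MIPS at 10:1, 54 MIPS at 100:1, 91 MIPS at 1:10) can be turned into rough per-instruction costs. The following is a minimal sketch, assuming a simple linear cost model in which every DEC CD costs the same fixed time and every JNZ costs the same fixed time. The function names are hypothetical and not part of the SD-8516 tooling, and the linear model itself is an assumption that real emulator behavior may not follow.

```python
def per_instruction_ns(mix_a, mips_a, mix_b, mips_b):
    """Estimate (t_dec, t_jnz) in nanoseconds from two measured loops.

    Each mix is (number of DEC, number of JNZ) in one loop body, and each
    mips value is the throughput measured for that loop.
    Assumes a linear model: body_time = n_dec * t_dec + n_jnz * t_jnz.
    """
    (d1, j1), (d2, j2) = mix_a, mix_b
    # Time for one loop body in ns: instructions / (MIPS * 1e6) seconds.
    body1 = (d1 + j1) * 1000.0 / mips_a
    body2 = (d2 + j2) * 1000.0 / mips_b
    det = d1 * j2 - d2 * j1
    if det == 0:
        raise ValueError("the two mixes must have different DEC:JNZ ratios")
    # Cramer's rule for the 2x2 linear system.
    t_dec = (body1 * j2 - body2 * j1) / det
    t_jnz = (d1 * body2 - d2 * body1) / det
    return t_dec, t_jnz


def predict_mips(mix, t_dec, t_jnz):
    """Predict throughput for another DEC:JNZ mix under the same model."""
    n_dec, n_jnz = mix
    body_ns = n_dec * t_dec + n_jnz * t_jnz
    return (n_dec + n_jnz) * 1000.0 / body_ns


if __name__ == "__main__":
    # Fit on the 1:1 and 10:1 runs, then compare against the other two runs.
    t_dec, t_jnz = per_instruction_ns((1, 1), 77.0, (10, 1), 56.0)
    print(f"DEC CD ~ {t_dec:.1f} ns, JNZ ~ {t_jnz:.1f} ns")
    print(f"predicted 100:1 -> {predict_mips((100, 1), t_dec, t_jnz):.1f} MIPS (measured 54)")
    print(f"predicted 1:10  -> {predict_mips((1, 10), t_dec, t_jnz):.1f} MIPS (measured 91)")
```

Fitting on the 1:1 and 10:1 runs gives a 100:1 prediction close to the measured 54 MIPS, but the 1:10 prediction comes out well above the measured 91 MIPS. So the linear model only partly explains the numbers, which fits the point that per-instruction timing in an emulator depends on more than the instruction mix.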
| - | + | ||
| - | The following chart indicates the best results out of several runs: | + | |
| - | + | ||
| - | ==== LDA | + | |
| - | + | ||
| - | ^ Instruction ^ Throughput ^ Notes | | + |
| - | | Empty Loop | 97 MIPS | | | + | |
| - | | LDA [$1000]x10 | 90 MIPS | | | + | |
| - | | LDA [$1000]x100 | 95 MIPS | | | + | |
| - | | LDAL [$1000]x20 | 85 MIPS | Not native word size | | + | |
| - | | LDAB [$1000]x20 | 76 MIPS | unexpected! will check code | | + | |
| - | | LDBLX [$1000]x20 | 25 MIPS | array method | | + |
| - | | LDBLX [$1000]x20 | 45 MIPS | switch method | | + | |
| - | | LDBLX [$1000]x20 | 64 MIPS | unified memory reads | | + | |
| - | | LDBLX [$1000]x20 | 73 MIPS | inlined access | | + |
| - | + | ||
| - | ==== Notes on LDA/LDAL | + | |
| - | This is likely a branch prediction and instruction cache artifact in the Web Assembly/ | + | |
| - | With the empty loop, the CPU's branch predictor may be working against speculative execution overhead. Adding a single LDA gives the pipeline something productive to do between branches, potentially hiding some of the branch misprediction penalty or better aligning the instruction stream. | + | |
| - | At 10-20 instructions, | + | |
| - | + | ||
| - | * Increased loop body size may cause instruction cache pressure | + |
| - | * More register pressure in the generated machine code | + |
| - | * Loop overhead becomes proportionally smaller but absolute instruction decode cost increases | + |
| - | + | ||
| - | The LDAL slowdown confirms this - non-native 32-bit operations require more complex codegen, putting additional pressure on the optimizer. | + | |
| - | This is classic JIT behavior: a tiny amount of work can sometimes improve performance by giving the CPU's execution units better scheduling opportunities, | + | |
| - | You might also be seeing alignment effects - the single instruction could be placing the loop branch at an optimal address boundary. | + | |
| - | + | ||
| - | Finally, using LDBLX as a proxy for the process we went through earlier, we achieved a 3x speedup by using a switch versus a map, unifying <u8> memory reads into < | + | |
| - | + | ||
| - | I wouldn' | + | |
| - | ==== DEC | + | The issue occurs when we try to replace the DEC/JNZ pair with LSTEP, |
| - | A loop with 20xDEC had a high mark of 104.7 MIPS. | + | |
| - | ==== PUSH/POP | + | Another example: I had benchmarked kernal 0.7.2 at 750 MIPS, then I switched kernals from 0.7.2 to 0.8.3. This had the effect of putting CASETAB into the hot path. So instead of performing hundreds of JZ and CMP instructions |
| - | * PUSH and POP are slower operations, in the 80-85 MIPS range. | + | |
| - | * But PUSHA/POPA are noticeably slow, in the 27 MIPS range. | + | |
| - | * Using PUSHA/POPA everywhere will kill performance. We saw a 25% increase in speed after moving from PUSHA to push (reg). | + | |
| + | === Conclusion: CISC vs RISC | ||
| + | Time spent dispatching instructions on the hot path is slow, while time spent inside an instruction is fast. That is, just like the WASM version, the C version does best with CISC instructions. MIPS itself is not as important as it seems. What matters most is the quality of the instruction set. | ||
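To make the CISC point concrete, here is a minimal sketch of a toy interpreter that counts dispatched instructions for a DEC/JNZ countdown loop and for the same loop written with one fused instruction. The opcode `FUSED_LOOP` is a hypothetical stand-in, assumed here to decrement the counter and branch while it is nonzero. It is not taken from the SD-8516 ISA, and the real LSTEP may behave differently.

```python
def run(program, counter):
    """Run a tiny program and return how many instructions were dispatched.

    program is a list of (opcode, argument) pairs. Jump targets are list
    indices. This is an illustration only, not the SD-8516 interpreter.
    """
    pc = 0
    dispatched = 0
    zero = False
    while True:
        op, arg = program[pc]
        dispatched += 1  # one trip through the dispatch loop
        if op == "DEC":
            counter -= 1
            zero = counter == 0
            pc += 1
        elif op == "JNZ":
            pc = pc + 1 if zero else arg
        elif op == "FUSED_LOOP":  # hypothetical: decrement and branch in one step
            counter -= 1
            pc = arg if counter != 0 else pc + 1
        elif op == "HALT":
            return dispatched
        else:
            raise ValueError(f"unknown opcode {op!r}")


# Same countdown, written two ways.
SIMPLE = [("DEC", None), ("JNZ", 0), ("HALT", None)]
FUSED = [("FUSED_LOOP", 0), ("HALT", None)]

if __name__ == "__main__":
    n = 1000
    print("DEC/JNZ dispatches:", run(SIMPLE, n))
    print("fused dispatches:  ", run(FUSED, n))
```

The fused version does the same work with about half the dispatches. If wall-clock time drops by less than half, the reported MIPS figure goes down even though the program finishes sooner, which is one way a MIPS number can mislead.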
