Differences

This shows you the differences between two versions of the page.

--- sd:emulation_benchmarks [2026/05/15 23:41] – appledog
+++ sd:emulation_benchmarks [2026/05/16 00:24] (current) – appledog
@@ Line 4: / Line 4: @@
 Legend:
 * Green: The SD-8516 is capable of emulating this level of performance on an i7-12700 (Geekbench 6 baseline).
-* Light Gray: Although the SD-8516 cannot currently emulate this level of performance, it is expected to once graphics acceleration is emulated.
+* Light Green: Capable, but only with extensive use of the PPU and APU.
-* Dark Gray: The SD-8516 cannot emulate this system. This is almost always because it is slower than 50% of the required speed. For example, it is unlikely to achieve Dreamcast-level performance even on the new M5 Pro Max chips (4000+ single core). This level of performance would require a C rewrite and some form of multithreading.
+* Light Gray: The SD-8516 cannot reach this level of performance, but can reach at least 60%, so ports are possible but may be heavily constrained. Optimization in Assembly may be required.
-* Red: The system is unlikely to emulate this level of performance because it's advanced hardware. It is possible that a C rewrite would get close, but would require enthusiast hardware. These benchmarks are for GB6 baseline systems.
+* Dark Gray: The SD-8516 cannot emulate this system. This is almost always because it is designed as a single-threaded system with no 3D acceleration. Modern computers can often run emulated systems at near-native speeds, but even if we pass the GPU through via SDL2 we are unlikely to be able to emulate this class of system because we don't have multi-threading. Given time, computers will likely continue to increase in speed and we will be able to emulate the performance of these systems; if I ever add the ability to multitask, these limitations will almost certainly disappear.
 ^ Year ^ System ^ CPU ^ Width ^ Approx MIPS ^ RAM ^ Graphics ^ Audio ^ Notes ^
@@ Line 27: / Line 27: @@
 | @lightgreen:1996 | N64 | MIPS VR4300 | 64-bit | 125 | 4–8 MB | 320×240–640×480 | | Hard due to RSP/RDP synchronization |
 | @lightgreen:1998 | Dreamcast | SH-4 @ 200 MHz | 32-bit | 360 | 16 MB | 640×480 | AICA | Very emulator-friendly architecture |
-| @red:2000 | PS2 | MIPS R5900 | 128-bit SIMD | 6000 + 40 | 32 MB | 640×448 | SPU2 | Emotion Engine 6k MIPS; +40 MIPS for PS1 compat.; VUs dominate |
+| @darkgray:2000 | PS2 | MIPS R5900 | 128-bit SIMD | 6000 + 40 | 32 MB | 640×448 | SPU2 | Emotion Engine 6k MIPS; +40 MIPS for PS1 compat.; VUs dominate |
-| @red:2001 | GameCube | Gekko @ 485 MHz | 32-bit | 1125 | 24 MB | 640×480 | Flipper DSP | Dolphin emulator gold standard; clean PowerPC arch |
+| @lightgray:2001 | GameCube | Gekko @ 485 MHz | 32-bit | 1125 | 24 MB | 640×480 | Flipper DSP | Dolphin emulator gold standard; clean PowerPC arch |
-| @red:2017 | Switch | ARM Cortex-A57 | 64-bit | 12,000 | 4 GB | 720p–1080p | | GPU & OS dominate emulation cost |
+| @darkgray:2017 | Switch | ARM Cortex-A57 | 64-bit | 12,000 | 4 GB | 720p–1080p | | GPU & OS dominate emulation cost |
@@ Line 45: / Line 45: @@
 * Branch prediction: predicted branches (e.g., bottom-of-loop) increased throughput.
-These Pentium-specific traits were exploited via Abrash's hand-tuned ASM (id386.asm) delivered a 3x speedup over 486DX4-100 and AMD/Cyrix 5x86-133 style CPUs, crushing the clones' weaker floatinf point pipelining and marginalizing them in gaming. Pentium began to dominate the 1996 PC market as Quake's "minimum viable" software 3D benchmark, shifting devs from CPU raster hacks to hardware offload. Next, GLQuake/Voodoo (1996) hit 60+ FPS by rasterizing on GPUs, birthing the 3D acceleration era.
+These Pentium-specific traits were exploited by Abrash's hand-tuned ASM (id386.asm), which delivered a 3x speedup over 486DX4-100 and AMD/Cyrix 5x86-133 style CPUs, crushing the clones' weaker floating-point pipelining and marginalizing them in gaming. Pentium began to dominate the 1996 PC market as Quake's "minimum viable" software 3D benchmark, shifting devs from CPU raster hacks to hardware offload. Next, GLQuake/Voodoo (1996) hit 60+ FPS by rasterizing on GPUs, birthing the 3D acceleration era.
 == Profiling Experiments
-Here are the results of tight-loop experiments featuring benchmarks of one instruction. They are intended as relative results only. Taken on an i7-12700k.
+On an i7-12700k, a basic loop example executes at 55 MIPS in the WASM version and at 550 MIPS in the C version. However, there's an issue if we go beyond this relative benchmark.
-The basic "unrolled LDA $1000" example executes at 95 MIPS in the WASM version and at 675 MIPS in the C version. However, there's an issue if we go beyond this relative benchmark.
 === MIPS isn't useful
-The following program illustrates a baseline:
+The following program illustrates the problem with MIPS:
 <codify armasm>
@@ Line 68: / Line 66: @@
 * WASM version 55 MIPS.
-* C version was 560 MIPS.
+* C version was 550 MIPS.
-But here's the problem with MIPS. LSTEP, a command that does DEC CD and JNZ in one step, performs at 360 MIPS (in the C version). This shows that MIPS is somewhat deceptive as a measurement. The LSTEP command is performing the work of both DEC and JNZ in less time than each; but since it is a relatively slow command in and of itself it lowers the MIPS of the system as a whole. In reality if it was counted as two instructions it would show over ~700 MIPS. If we use a dual stage LSTEP (on two 16 bit registers) it runs in 470 MIPS (940 mips equivalent).
-=== CISC vs RISC
+The issue occurs when we try to replace the DEC/JNZ pair with LSTEP, a command that does DEC CD and JNZ in a single step. Using LSTEP seems to lower performance to 360 MIPS (in the C version). Why? The LSTEP command performs the work of both DEC and JNZ, but since it is a relatively slow instruction it lowers the MIPS of the system as a whole. Yet it is still faster overall to use LSTEP than DEC/JNZ. If LSTEP were counted as two instructions it would show over 700 MIPS compared to DEC/JNZ's 550.
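The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the helper name `equivalent_mips` is made up here, and it assumes the benchmark loop consists of nothing but the loop-control instructions, so that every LSTEP stands in for exactly one DEC plus one JNZ.

```python
def equivalent_mips(measured_mips, ops_replaced):
    """MIPS figure a fused instruction would show if each execution
    were counted as the number of simple instructions it replaces."""
    return measured_mips * ops_replaced

# Figures quoted in the text for the C version.
DEC_JNZ_MIPS = 550  # loop built from DEC + JNZ
LSTEP_MIPS = 360    # same loop built from LSTEP

# Assumption: one LSTEP replaces exactly two instructions (DEC and JNZ).
lstep_equiv = equivalent_mips(LSTEP_MIPS, 2)
speedup = lstep_equiv / DEC_JNZ_MIPS

print(lstep_equiv)        # 720
print(round(speedup, 2))  # 1.31
```

Under that assumption the LSTEP loop does about 31% more useful work per second, even though its raw MIPS figure is lower.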
-This shows that time spent on the hot path is slow, while time spent in the hot path is fast. That is, just like the WASM version, the C version does best with CISC instructions. MIPS itself, is not as important as it seems. What matters is the quality of the instruction set.
-//Using a RISC-like ISA is only a requirement if you are emulating a particular architecture. It is not a good idea for a fantasy computer in general. A fantasy computer does better with CISC instructions.//
+Another example: I had benchmarked kernal 0.7.2 at 750 MIPS, then I switched kernals from 0.7.2 to 0.8.3. This had the effect of putting CASETAB into the hot path. So instead of performing hundreds of JZ and CMP instructions in the INT $10 jumptable, it performed one CASETAB. MIPS dropped to 500 but the system ran twice as fast. That's the real takeaway: despite having a lower number of MIPS, the system runs measurably faster.
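To see why a lower MIPS figure can go with a faster system, it helps to multiply MIPS by runtime and compare how many instructions each kernal needs for the same job. The sketch below uses the 750 and 500 MIPS figures from the text; the 10-second and 5-second runtimes are illustrative placeholders chosen only to match "twice as fast", not measurements, and `instructions_executed` is a hypothetical helper.

```python
def instructions_executed(mips, seconds):
    """Total instructions retired at a given MIPS rate over a runtime."""
    return mips * 1_000_000 * seconds

# Assumed runtimes: only their 2:1 ratio comes from the text.
old_total = instructions_executed(750, 10.0)  # kernal 0.7.2
new_total = instructions_executed(500, 5.0)   # kernal 0.8.3

ratio = new_total / old_total
print(round(ratio, 3))  # 0.333
```

With these numbers the newer kernal finishes the same workload with about one third of the instructions, which is why it wins despite the lower MIPS reading.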
-A final example; I had benchmarked kernal 0.7.2 at 750 MIPS, then I switched kernals to from 0.7.2 to 0.8.3. This had the effect of putting CASETAB into the hot path. This meant that instead of performing hundreds of JZ and CMP instructions per system loop, we performed one CASETAB. MIPS dropped to 550. But I assure you, the system is running much faster. You see, the runtime itself is lower; 10 seconds at 750 MIPS is slower than 5 seconds at 550 mips. That's the real takeaway; despite having a lower number of MIPS, the system runs measurably faster with the new CISC instructions.
+=== Conclusion: CISC vs RISC
+Time spent on the hot path is slow, while time spent in the hot path is fast. That is, just like the WASM version, the C version does best with CISC instructions. MIPS itself is not as important as it seems. What matters most is the quality of the instruction set.
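One way to picture this is a toy interpreter that counts how many instructions it has to fetch and dispatch. The sketch below is not the SD-8516 emulator: the opcode names are borrowed from the text, but the tuple encoding, the `run` function, and the single counter register are all assumptions made for illustration. It assumes that each dispatch carries a fixed interpreter overhead, so fewer dispatches for the same work means less time spent outside the instruction handlers.

```python
def run(program, counter):
    """Run a toy program and return how many instructions were dispatched.

    program is a list of (opcode, argument) tuples; counter plays the
    role of the loop register.
    """
    pc = 0
    dispatched = 0
    while pc < len(program):
        op, arg = program[pc]
        dispatched += 1  # one fetch/decode/dispatch per instruction
        if op == "DEC":
            counter -= 1
            pc += 1
        elif op == "JNZ":
            pc = arg if counter != 0 else pc + 1
        elif op == "LSTEP":
            # Fused form: decrement and conditional jump in one dispatch.
            counter -= 1
            pc = arg if counter != 0 else pc + 1
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return dispatched

two_step = [("DEC", None), ("JNZ", 0)]  # loop built from two instructions
fused = [("LSTEP", 0)]                  # same loop with one instruction

iterations = 1000
print(run(two_step, iterations))  # 2000
print(run(fused, iterations))     # 1000
```

Both programs count the register down to zero, but the fused version goes through the dispatch loop half as often, which is the effect the LSTEP and CASETAB measurements point to.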