Optimizations
From "Programming the 32x FAQ" by Toshiyasu Morita
There are several types of optimization applicable to the 32x:
- Processor-specific optimization: basic optimization of the code for the CPU it runs on.
- Bus usage optimization: the three processors of the 32x (the 68000 and the two SH2s) share the ROM, and the two SH2s share the SDRAM, so there tends to be much bus contention. Reducing bus usage by performing fewer reads and writes helps considerably.
- Idle time optimization: programs typically waste a lot of time waiting for certain hardware events to happen, and these circumstances are often not obvious.
- Hardware usage optimization: there are many useful bits of hardware in the 32x; some are:
  - Autofill mode
  - Run-length mode
  - Scanline start table
  - SH2 DMA
The 68000 fights with the SH2 for ROM access, and tends to hog the ROM since it is slow at accessing memory. Also, since the 68000 has no cache, if the code is in ROM it will access the ROM for every single instruction it executes! The 68000 running code in ROM can seriously bring a 32x to its knees.
When the frame buffer bit is toggled, the swap does not take effect until the next VBLANK. Most games immediately busy-wait for the bit to change state, which wastes CPU time. Quite a bit of that time can usually be recovered if the game code flow is reordered in this fashion:
1. Toggle frame buffer bit
2. Perform AI - player movement, enemy movement
3. Perform math (if 3-D game)
4. Busy wait to make sure frame buffer has swapped
5. Write to frame buffer
6. Go to step 1
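The reordered loop above can be sketched in C. This is a host-side simulation under stated assumptions, not real 32x register code: `fb_requested`, `fb_active`, `sim_advance`, and the tick costs of each stage are all hypothetical stand-ins for the frame buffer bit, VBLANK timing, and the game's actual work.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulation state; on real hardware these would be reads and
   writes of the frame buffer control register, not plain variables. */
enum { TICKS_PER_FRAME = 100, FB_SELECT_BIT = 1 };

static uint16_t fb_requested;                   /* buffer the CPU asked for       */
static uint16_t fb_active;                      /* buffer the hardware reports    */
static unsigned sim_time;                       /* simulated elapsed time         */
static unsigned next_vblank = TICKS_PER_FRAME;  /* time of the next VBLANK        */
static unsigned busy_wait_ticks;                /* time burned in the wait loop   */

/* Advance simulated time; a pending swap takes effect at each VBLANK. */
static void sim_advance(unsigned ticks)
{
    sim_time += ticks;
    while (sim_time >= next_vblank) {
        fb_active = fb_requested;
        next_vblank += TICKS_PER_FRAME;
    }
}

static void toggle_frame_buffer(void) { fb_requested ^= FB_SELECT_BIT; }
static int  swap_done(void)           { return fb_active == fb_requested; }

/* Placeholder workloads with made-up costs. */
static void update_ai(void)   { sim_advance(40); }
static void update_math(void) { sim_advance(30); }
static void draw_frame(void)  { sim_advance(20); }

/* One pass of the reordered loop: useful work runs while the swap is pending. */
void run_frame(void)
{
    toggle_frame_buffer();      /* step 1 */
    update_ai();                /* step 2 */
    update_math();              /* step 3 */
    while (!swap_done()) {      /* step 4: only the leftover time is wasted */
        sim_advance(1);
        busy_wait_ticks++;
    }
    assert(swap_done());        /* never draw into a buffer still on display */
    draw_frame();               /* step 5 */
}
```

With these made-up costs, the first call to `run_frame()` spends 30 ticks in the wait loop. Waiting immediately after the toggle would have spent the full 100 ticks until VBLANK.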
The SH2 has a free bus access cycle on every longword instruction, so try to align your memory reads and writes on longword boundaries. In particular, try to insert register-to-register operations between multiply-and-accumulate instructions.
When writing data to the frame buffer in 256-color mode, try to accumulate at least two pixels in a register and do a word or longword write, rather than writing one byte per pixel.
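As a minimal sketch of this idea, the routine below packs two 8-bit pixels into one 16-bit value and issues a single word write per pair. The names `pack_pixels` and `write_row_packed` are hypothetical, and the byte order (first pixel in the high byte) is an assumption based on a big-endian layout. `dst` stands in for a word-aligned frame buffer pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine two 8-bit palette indices into one 16-bit word.
   Assumption: the left pixel goes in the high byte. */
static uint16_t pack_pixels(uint8_t left, uint8_t right)
{
    return (uint16_t)(((unsigned)left << 8) | right);
}

/* Write pixel_count pixels as pixel_count / 2 word writes.
   pixel_count must be even in this sketch. */
void write_row_packed(volatile uint16_t *dst, const uint8_t *src,
                      size_t pixel_count)
{
    assert(pixel_count % 2 == 0);
    for (size_t i = 0; i < pixel_count; i += 2) {
        dst[i / 2] = pack_pixels(src[i], src[i + 1]);  /* one bus write per pair */
    }
}
```

The same approach extends to four pixels per longword write, which halves the number of bus accesses again.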
If you have a routine which performs many small writes and your CPU is in split-cache mode (2k cache/2k RAM) then try to accumulate your small writes in the on-chip RAM, and then write the data to SDRAM as one big block. This avoids "handing off" the bus back and forth between the SH2s which costs 2-3 clock cycles.
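The staging approach described above might look roughly like the following. This is a sketch under assumptions: `stage_t` and its functions are hypothetical names, and the staging buffer here is an ordinary array, whereas on real hardware it would have to be placed in the SH2's on-chip RAM. `dest` stands in for an SDRAM destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { STAGE_SIZE = 256 };  /* arbitrary size, well under the 2k of on-chip RAM */

typedef struct {
    uint8_t  buf[STAGE_SIZE];  /* stand-in for a buffer in on-chip RAM */
    size_t   used;             /* bytes currently staged               */
    uint8_t *dest;             /* next destination address (SDRAM)     */
} stage_t;

void stage_init(stage_t *s, uint8_t *dest)
{
    assert(dest != NULL);
    s->used = 0;
    s->dest = dest;
}

/* Copy everything staged so far to the destination as one block. */
void stage_flush(stage_t *s)
{
    memcpy(s->dest, s->buf, s->used);
    s->dest += s->used;
    s->used = 0;
}

/* A small write: touches only the staging buffer unless it is full. */
void stage_put(stage_t *s, uint8_t value)
{
    if (s->used == STAGE_SIZE) {
        stage_flush(s);
    }
    s->buf[s->used++] = value;
}
```

A caller would issue many `stage_put` calls and finish with one `stage_flush`, so the shared bus is touched once per block rather than once per small write.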
The performance penalty due to bus contention has been measured at about 6-10%.