Victor edited this page Sep 28, 2021 · 11 revisions

From "Programming the 32x FAQ" by Toshiyasu Morita

There are several types of optimization applicable to the 32x:

CPU optimizations

This is basic processor-specific optimization.

Bus bandwidth optimization

Since the three processors of the 32x share the ROM, and the two SH2s also share the SDRAM, there tends to be a lot of bus contention. Reducing bus usage by performing fewer reads and writes helps considerably.

Salvaging wasted time

Programs typically waste a lot of time waiting for certain hardware events to happen, and these cases are often not obvious.

Using special hardware effectively

There are many useful bits of hardware in the 32x; some are:

  • Autofill mode
  • Run-length mode
  • Scanline start table
  • SH2 DMA

Examples of optimizations:

Move 68000 code into RAM (bus optimization)

The 68000 fights with the SH2 for ROM access, and tends to hog the ROM since it is slow at accessing memory. Also, since the 68000 has no cache, if the code is in ROM it will access the ROM for every single instruction it executes! The 68000 running code in ROM can seriously bring a 32x to its knees.

Salvaging frame buffer swap time (salvaging wasted time optimization)

When the frame buffer bit is toggled, the swap does not take effect until the next VBLANK. Most games immediately busy-wait for the bit to change state, which wastes that time. Quite a bit of CPU time can usually be recovered if the game code flow is reordered in this fashion:

  1. Toggle frame buffer bit
  2. Perform AI - player movement, enemy movement
  3. Perform math (if 3-D game)
  4. Busy wait to make sure frame buffer has swapped
  5. Write to frame buffer
  6. Go to step 1
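
The reordered loop above can be sketched in C as follows. This is only an illustration that runs on a host machine: `fb_ctrl_sim`, `request_swap`, `swap_done`, and the `sim_tick` VBLANK countdown are hypothetical stand-ins for the real frame buffer control register and its timing, and `run_ai`, `run_math`, and `draw_frame` are placeholders for game code.

```c
#include <stdint.h>

#define FB_SELECT_BIT 0x0001u

/* Hypothetical stand-ins: on real hardware fb_ctrl_sim would be the
   memory-mapped frame buffer control register, not a plain variable. */
volatile uint16_t fb_ctrl_sim;   /* bit 0: buffer currently displayed   */
uint16_t fb_requested;           /* bit 0: buffer we asked to display   */
int vblank_countdown;            /* simulated time left until VBLANK    */
int work_done_before_wait;       /* work items overlapped with the swap */
int busy_wait_polls;             /* polls wasted in the busy-wait       */

/* Simulation only: the swap takes effect when the countdown hits zero. */
static void sim_tick(void)
{
    if (vblank_countdown > 0 && --vblank_countdown == 0)
        fb_ctrl_sim = fb_requested;
}

static void request_swap(void)
{
    fb_requested = (uint16_t)(fb_ctrl_sim ^ FB_SELECT_BIT);
    vblank_countdown = 3;
}

static int swap_done(void)
{
    sim_tick();
    return (fb_ctrl_sim & FB_SELECT_BIT) == (fb_requested & FB_SELECT_BIT);
}

/* Placeholder game code; each call lets simulated time advance. */
static void run_ai(void)     { work_done_before_wait++; sim_tick(); }
static void run_math(void)   { work_done_before_wait++; sim_tick(); }
static void draw_frame(void) { /* writes to the back buffer go here */ }

void run_one_frame(void)
{
    request_swap();          /* 1. toggle frame buffer bit            */
    run_ai();                /* 2. AI, player and enemy movement      */
    run_math();              /* 3. math (if 3-D game)                 */
    while (!swap_done())     /* 4. wait only if the swap is not done  */
        busy_wait_polls++;
    draw_frame();            /* 5. now safe to write the frame buffer */
}
```

In this simulation the AI and math steps absorb the time until the swap, so the wait in step 4 falls through without spinning. On hardware the saving depends on how much work fits between the toggle and the next VBLANK.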

Align memory access on longword boundaries (CPU optimization)

The SH2 fetches instructions one longword (two 16-bit instructions) at a time, which leaves a free bus access cycle at each longword boundary, so try to place the instructions that read and write memory on longword boundaries. In particular, try to insert register-to-register operations between multiply-and-accumulate instructions.

Try to keep the SH2s off the bus (CPU optimization)

When writing data to the frame buffer in 256 color mode, try to accumulate at least two pixels in a register and then do a single word or longword write.
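
As a minimal sketch of this idea, the C function below packs four 8-bit pixels into one register-sized value and stores them with a single longword write. The function name `write_pixels_packed` and its parameters are hypothetical, and the sketch assumes the pixel count is a multiple of four and the destination is longword aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Write 8-bit pixels four at a time.
   Assumptions of this sketch: count is a multiple of 4, dst is longword
   aligned, and the CPU is big-endian like the SH2, so the first pixel
   belongs in the most significant byte of the longword. */
void write_pixels_packed(volatile uint32_t *dst, const uint8_t *src,
                         size_t count)
{
    size_t i;

    for (i = 0; i + 3 < count; i += 4) {
        /* Accumulate four pixels in a register... */
        uint32_t packed = ((uint32_t)src[i]     << 24)
                        | ((uint32_t)src[i + 1] << 16)
                        | ((uint32_t)src[i + 2] << 8)
                        |  (uint32_t)src[i + 3];

        /* ...then touch the bus once instead of four times. */
        *dst++ = packed;
    }
}
```

The same pattern with two pixels and a `uint16_t` destination gives the word-write variant.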

If you have a routine which performs many small writes and your CPU is in split-cache mode (2k cache/2k RAM) then try to accumulate your small writes in the on-chip RAM, and then write the data to SDRAM as one big block. This avoids "handing off" the bus back and forth between the SH2s which costs 2-3 clock cycles.
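
A rough sketch of this staging approach in C is shown below. The `stage_t` type and the `stage_init`, `stage_put`, and `stage_flush` functions are hypothetical names, and the 256-byte buffer size is an arbitrary choice. On real hardware the staging buffer would have to be placed in the on-chip RAM (for example through the linker script); here it is ordinary memory so the example runs anywhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STAGE_BYTES 256u   /* arbitrary; must fit in the on-chip RAM */

typedef struct {
    uint8_t  buf[STAGE_BYTES]; /* stand-in for on-chip RAM            */
    size_t   used;             /* bytes currently staged              */
    uint8_t *dst;              /* next destination address in SDRAM   */
    unsigned flushes;          /* number of block writes performed    */
} stage_t;

void stage_init(stage_t *s, uint8_t *sdram_dst)
{
    s->used = 0;
    s->dst = sdram_dst;
    s->flushes = 0;
}

/* Write everything staged so far to SDRAM as one block. */
void stage_flush(stage_t *s)
{
    if (s->used == 0)
        return;
    memcpy(s->dst, s->buf, s->used);
    s->dst += s->used;
    s->used = 0;
    s->flushes++;
}

/* A small write: lands in the staging buffer, not on the shared bus. */
void stage_put(stage_t *s, uint8_t value)
{
    if (s->used == STAGE_BYTES)
        stage_flush(s);
    s->buf[s->used++] = value;
}
```

With this arrangement, 600 one-byte writes reach SDRAM as three block copies rather than 600 separate bus accesses. Remember to call `stage_flush` once at the end so the last partial block is written out.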

The performance penalty due to bus contention has been measured at about 6-10%.
