
Render a software depth buffer in parallel with HW rendering #19748

Merged (28 commits), Dec 21, 2024

Conversation

hrydgard (Owner) commented Dec 18, 2024:

This will fix the lens flare problems in a lot of games: #15923

Many games check if they should draw flares by reading from the depth buffer using the CPU. Unfortunately when we render using the host GPU, it's very, very expensive to read things back.

So instead, this renders an approximate depth buffer into the emulated PSP VRAM directly, using the host CPU.

Approximate - yes, we can take some shortcuts here, while still getting good results. Subpixel precision isn't really needed, and we can probably just skip skinned meshes (although we can of course implement that too if needed).

And there's even more potential trickery we can do to speed it up:

  • Render very small triangles as little squares, eliminating much of the triangle setup cost
  • Skip very small triangles entirely
  • Rasterize at half resolution, writing double pixels (although hoping to not need this)
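To sketch the first of these tricks: if a triangle's bounding box is tiny, just stamp the box with a single depth value and skip edge-function setup entirely. A minimal, hypothetical sketch (the helper name, less-than depth test, and row-major 16-bit buffer are assumptions, not the PR's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: stamp a tiny triangle as a filled bounding box.
// 'depth' is a row-major uint16 buffer with the given stride; vertex
// coordinates are in whole pixels. Returns true if the triangle was small
// enough to be handled here (otherwise fall through to the real rasterizer).
bool StampTinyTriangle(uint16_t *depth, int stride,
                       int x0, int y0, int x1, int y1, int x2, int y2,
                       uint16_t z) {
    int minX = std::min({x0, x1, x2}), maxX = std::max({x0, x1, x2});
    int minY = std::min({y0, y1, y2}), maxY = std::max({y0, y1, y2});
    if (maxX - minX > 2 || maxY - minY > 2)
        return false;  // Too big for the shortcut.
    for (int y = minY; y <= maxY; y++) {
        for (int x = minX; x <= maxX; x++) {
            uint16_t *p = depth + y * stride + x;
            if (z < *p)  // Assumed "less" depth test, write on pass.
                *p = z;
        }
    }
    return true;
}
```

The over-coverage (writing the whole box instead of just the covered pixels) is exactly the kind of approximation that's fine for occlusion queries but wouldn't be for color rendering.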

Since there's only a small subset of games that have any use for this, it will be enabled with the compat.ini setting [SoftwareRasterDepth].

Before we start enabling it for games, this needs a lot of optimization, and also should run on a separate thread (or threads), syncing up on every framebuffer switch or maybe on every DrawSync. Although in debug mode, it should go draw by draw.

Potential inaccuracies:

  • This does not do alpha testing, so cardboard cutouts will block the sun even when they shouldn't. Will only implement this if it's badly needed, as it would add a lot of complexity (we'd need to interpolate UVs and do texturing, more like the full software renderer)
  • As mentioned, skinned meshes are simply skipped.
  • Z clipping to the near plane is not done, triangles are simply discarded if they intersect it

The new GE debugger visualizations helped a lot:

[image: GE debugger visualization]

Note to self: The madd trick from https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/ may be useful for SIMD-ing the triangle setup on SSE.

@hrydgard hrydgard added this to the v1.19.0 milestone Dec 18, 2024
@hrydgard hrydgard marked this pull request as draft December 18, 2024 17:57
case GE_COMP_ALWAYS:
    while (w >= 8) {
        _mm_storeu_si128(ptr, valueX8);
        ptr++;

Contributor:

Nitpick, but incrementing a possibly-misaligned pointer feels a bit iffy (though compilers probably won't do anything weird here), compared to only casting inside _mm_storeu_si128(...).

hrydgard (Owner, Author) replied Dec 20, 2024:

I do think this is fine, actually: _mm_storeu_si128 is meant to handle misaligned pointers, unlike _mm_store_si128 (although on modern CPUs, aligned and unaligned stores perform the same when the address happens to be aligned).

I might check for alignment and do separate loops later, but in all cases in the relevant games that I've seen, this is just used for clearing the background and is aligned.

beta += A1,
gamma += A2)
{
int mask = alpha >= 0 && beta >= 0 && gamma >= 0;

Contributor:

A classic trick is int mask = (alpha | beta | gamma);, followed by if(mask < 0) continue;.
Extends to SIMD by _mm_movemask_ps to check the 4 signs (on x86, not sure about NEON).
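
A minimal scalar illustration of the trick (the SIMD version would OR the four lanes and test the sign bits with _mm_movemask_ps, as noted):

```cpp
#include <cassert>

// A pixel is inside the triangle iff all three edge functions are
// non-negative, i.e. iff OR-ing them leaves the sign bit clear.
// This replaces three comparisons and two logical ANDs with two ORs
// and a single sign test.
inline bool InsideTriangle(int alpha, int beta, int gamma) {
    return (alpha | beta | gamma) >= 0;
}
```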

Contributor:


Addendum: this is what SoftGPU does for NEON:

#elif PPSSPP_ARCH(ARM64_NEON)

float previousDepthValue = (float)depthBuf[idx];

int depthMask;
switch (compareMode) {

Contributor:


Would it make sense to template on compareMode (with added dispatcher function with a runtime argument, that selects one of the templates)?
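
The suggested pattern could look something like this sketch (illustrative enum and function names, not the PR's actual code): the comparison becomes a compile-time constant inside the templated loop, and one runtime switch per row or triangle picks the instantiation.

```cpp
#include <cassert>
#include <cstdint>

enum CompareMode { COMP_NEVER, COMP_LESS, COMP_ALWAYS };  // Illustrative subset.

// Templated inner loop: 'mode' is a template parameter, so the compiler
// folds the switch away and each instantiation is a tight, branch-free loop.
template <CompareMode mode>
static void DepthRowT(uint16_t *row, int w, uint16_t z) {
    for (int x = 0; x < w; x++) {
        bool pass = false;
        switch (mode) {  // Resolved at compile time.
        case COMP_NEVER:  pass = false; break;
        case COMP_LESS:   pass = z < row[x]; break;
        case COMP_ALWAYS: pass = true; break;
        }
        if (pass)
            row[x] = z;
    }
}

// Runtime dispatcher: one switch per call instead of one per pixel.
void DepthRow(uint16_t *row, int w, uint16_t z, CompareMode mode) {
    switch (mode) {
    case COMP_NEVER:  DepthRowT<COMP_NEVER>(row, w, z); break;
    case COMP_LESS:   DepthRowT<COMP_LESS>(row, w, z); break;
    case COMP_ALWAYS: DepthRowT<COMP_ALWAYS>(row, w, z); break;
    }
}
```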

hrydgard (Owner, Author) replied:

Yes, that will come later.

hrydgard (Owner, Author):

@fp64 this is not meant to showcase properly optimized code yet - first step is getting it to work, then I'll SIMD it properly. So no need to review yet, this is in draft mode :)

@hrydgard hrydgard force-pushed the software-depth-proto branch 3 times, most recently from 24be2a7 to 8fa1429 Compare December 20, 2024 14:32
hrydgard (Owner, Author):

CrossSIMD.h is becoming really useful; it was very easy to SIMD-ify DepthRasterClipIndexedTriangles for a 2x+ speed boost.

Now, the main thing taking time is of course DepthRasterTriangle. But before taking on that, I'll have to implement proper cull modes.

@hrydgard hrydgard force-pushed the software-depth-proto branch from e104976 to 2b24230 Compare December 20, 2024 19:35
hrydgard (Owner, Author):

Some progress:

  • Now correctly rasterizes depth of large scenes in Wipeout, Midnight Club etc, fixing lens flare effects
  • My "CrossSIMD" classes for automatic SSE+NEON support are working out very well
  • Everything up to the rasterizer has now been optimized. The rasterizer itself, though, is totally unoptimized, so it eats 15-20% CPU on a fast PC, which isn't good enough.

Unfortunately right now Syphon Filter is broken, but will fix again of course.

hrydgard (Owner, Author):

Alright, the inner loop is now SIMD-optimized and works on x86-64 and ARM.

Unfortunately it's really slow in debug mode (unsurprisingly), so some #pragma optimize might be warranted...

I'm gonna get this in soon, but not enable it in compat.ini just yet; it needs to be multithreaded too first to minimize the performance impact.

@hrydgard hrydgard marked this pull request as ready for review December 21, 2024 12:42
hrydgard (Owner, Author):

@fp64 you're welcome to have another look now :)

(I do know that some things could do with more optimization; for example, the triangle setup should be done for four triangles in parallel, and to support that, the x/y/z arrays should be reorganized.)

@hrydgard hrydgard force-pushed the software-depth-proto branch from 0dd55d9 to 80cb57f Compare December 21, 2024 13:32
@hrydgard hrydgard changed the title Software depth buffer prototype Render a software depth buffer in parallel with HW rendering Dec 21, 2024
hrydgard (Owner, Author):

Alright, I'm gonna get this in as-is. It's disabled by default, but just use compat.ini as mentioned above to try it. You can use the Pixel Viewer in the new GE debugger to inspect the rendered result.

fp64 (Contributor) commented Dec 22, 2024:

Some comments, in no particular order.

// NOTE: This uses a CrossSIMD wrapper if we don't compile with SSE4 support, and is thus slow.

Out-of-date comment? The wrapper is now used regardless (and wouldn't the alternative be bad, as _M_SSE is set to 0x402 on Windows?).

I see that this ignores the top-left rule, same as Intel's Software Occlusion Culling demo does. Probably OK for approximate rasterization. What does the real PSP do, anyway? SoftGPU's IsRightSideOrFlatBottomLine is probably not quite right.

Is there a risk of integer overflow in the edge function computations? If all vertices are inside the 480x272 rectangle, then no (even with subpixel precision), but are triangles clipped to the screen beforehand (which can introduce extra vertices)? And is the resolution always 1x?

Trying to figure out how much SIMDification of triangle setup would gain.
Back when I was testing the speed of my pure-depth-write rasterizer (warning: somewhat buggy, at least top-left stuff is probably wrong) it looked like this (on whatever machine rextester uses to run code):

Tris     CCW Tris  Pixels    AvgArea  Time,us  T/tri,ns  T/px,ns
1000000  499981      39421      0.08  67200.0      67.2   1704.7
100000    49926      15238      0.31   6890.0      68.9    452.2
100000    49926      61009      1.22   8183.0      81.8    134.1
100000    49926     243212      4.87  11112.0     111.1     45.7
10000      4923      94122     19.12   1764.0     176.4     18.7
10000      4923     368064     74.76   3843.0     384.3     10.4
10000      4923    1416831    287.80   8410.0     841.0      5.9
1000        486     515371   1060.43   1913.0    1913.0      3.7
1000        486    1766270   3634.30   4570.0    4570.0      2.6
1000        486    4559285   9381.24   8659.0    8659.0      1.9

This is SSE2 for rasterization (but it uses 2x2 quads, and scalar stores somehow; using store2x64() would be probably faster), and scalar triangle setup (all single-threaded).
The plot doesn't actually fit a nice time = A*tris+B*pixels line,
but very roughly we have ~100ns per triangle plus ~2ns per pixel. That gets us ~1ms at ~10k triangles.

For large triangles I got around x1.4 speedup by splitting them into 8x8 blocks, and special-casing empty and full blocks.
You can test it in my example above by setting

bool use_blocks=false;

It might be cheaper to just step (floating-point) Z, rather than compute it from barycentrics (certainly if you special-case full blocks to not use barycentrics at all). Computing interpolants from barycentrics may reduce register pressure if there are many of them (texcoords, normals, etc.), but we only have Z. There might be an accuracy impact (edge function increments are exact; floating-point dZdX, dZdY might not be), but hopefully small.
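
Since Z is linear in screen space, the gradients are constant per triangle and can be derived once from the plane through the three vertices. A sketch (hypothetical struct and function names):

```cpp
#include <cassert>
#include <cmath>

// Screen-space Z gradients for a triangle: solving z = A*x + B*y + C
// through the three vertices gives constant dZdX = A, dZdY = B, so the
// per-pixel barycentric evaluation collapses to one add per step.
struct ZGradients { float dZdX, dZdY; };

ZGradients ComputeZGradients(float x0, float y0, float z0,
                             float x1, float y1, float z1,
                             float x2, float y2, float z2) {
    // d is twice the signed area; assumed non-zero (degenerate triangles
    // are rejected earlier).
    float d = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    ZGradients g;
    g.dZdX = ((z1 - z0) * (y2 - y0) - (z2 - z0) * (y1 - y0)) / d;
    g.dZdY = ((x1 - x0) * (z2 - z0) - (x2 - x0) * (z1 - z0)) / d;
    return g;
}
```

The rasterizer then adds dZdX per pixel and dZdY per row, which is where the possible (small) accuracy loss relative to exact integer edge stepping comes in.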

Speaking of, the DepthRasterTriangle() interpolates Z, not 1/Z. SoftGPU seems to be doing the same, so I assume this is what PSP does?

Unrolling the inner loop 2x to load/store 128 bits at a time (and not 64 like now) might be a win (Vec8U16 instead of Vec4U16).

Are tileStartX, tileEndX always multiples of 4? Otherwise code may stomp outside scissor, it seems.

About

		// Use a couple Newton-Raphson steps to refine the estimate.
		// May be able to get away with only one refinement, not sure!

My understanding is that VRECPE of 1.0f is 511/512, and after the 1st step you get 1-2^(-18), which is not quite enough if you use it to implement division.
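
A scalar analogue of the refinement (NEON's VRECPS computes the 2 - x*r factor in one instruction; the starting estimate here is the 511/512 value mentioned above):

```cpp
#include <cassert>
#include <cmath>

// Newton-Raphson refinement of a reciprocal estimate: r' = r * (2 - x*r).
// Each step roughly doubles the number of correct bits, so a ~9-bit
// initial estimate gives ~18 bits after one step and full float
// precision after two.
float RecipNR(float x, float initialEstimate, int steps) {
    float r = initialEstimate;
    for (int i = 0; i < steps; i++)
        r = r * (2.0f - x * r);
    return r;
}
```

With x = 1.0f and the 511/512 estimate, one step lands on 1 - 2^(-18), matching the "not quite enough" observation; the second step closes the gap.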

hrydgard (Owner, Author) commented Dec 22, 2024:

Thanks for looking!

Not quite an outdated comment, but I haven't ended up using that.

Since we are drawing directly to PSP VRAM at its native resolution, where the real hardware draws, yes, it's always 1x.

Integer overflow might be a possibility, yes. I should reject triangles if any corner is outside a 4096x4096 box, just like the PSP itself does (it doesn't clip on the sides).
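
That rejection test could be as cheap as one OR and one mask (hypothetical helper; the actual box test in the PR may differ):

```cpp
#include <cassert>

// Reject a triangle if any vertex is outside the 4096x4096 drawing region.
// OR-ing all coordinates together means any negative value (sign bit) or
// any value >= 4096 (bit 12 or above) leaves bits outside the low 12 set.
inline bool OutsideDrawRegion(int x0, int y0, int x1, int y1, int x2, int y2) {
    int ored = x0 | y0 | x1 | y1 | x2 | y2;
    return (ored & ~0x0FFF) != 0;
}
```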

Yeah, doing a hierarchical rasterizer with bigger blocks may be interesting. No triangles are really huge though because we're limited to a resolution of 480x272, but it still may be worth it. It will add some extra complexity though for sure.

All modern GPUs, including ones as far back as the PSP, just interpolate Z (which is z/w) because it's linear in screen space, so you don't need to divide per pixel.

I've already changed it to step Z instead of computing from barycentrics, in #19758. It's slightly more setup though, so it might hurt very small triangles; however, I also started rejecting triangles with an area under 5 pixels (we compute area*2, so the threshold is 10), which seems to be perfectly fine for the purpose and saves a bit of CPU.

I am thinking of moving the bbox calculation out to DepthRasterClipIndexedTriangles, to avoid even queueing up discarded triangles, which will make sure that a SIMD-ified setup always chews on something meaningful.

Because right now, about 70-85% of triangles are (correctly) rejected before we enter the raster loop, which is more than I expected! Either because they're too small, hit a screen border, or are backface culled.

The question with unrolling the inner loop is whether it's better to go to 4x2 blocks or 8x1 blocks. My intuition says the former, although we won't get the benefit of nice 128-bit loads/stores then... And also, not sure if we have enough registers left for it to be a win.
I think 2x2 blocks are likely not worth it - it would be if we could store the depth buffer swizzled, but we can't.

I do need to pay more attention to the tileStartX/scissor, you're right. Though games using odd-sized or odd-positioned scissors are rare, and I don't think any of the relevant games do.

fp64 (Contributor) commented Dec 22, 2024:

Unrolled 8x1 fits nicely into "fully-unrolled 8x8", if one wants to go that way. For small triangles - perhaps not great.
Needing extra regs might hurt, yes.

Just in case you were not aware: there's a neat trick to effectively gain more bits before hitting integer overflow. Since we step in full pixels, the lower bits of the edge functions do not change (and therefore do not carry into the sign bit), and do not need to be stored. With 4-bit subpixel precision that gets us an 8192x8192 render target (instead of 2048x2048 without the trick) using int32 (which would nicely contain the 4096x4096 box you mentioned). It does require int64 at triangle setup, hurting its SIMDability, though. Not sure how important subpixel accuracy is for this task; without it, the render target is a comfortable 32768x32768, fully in int32.
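
A sketch of the compression step under those assumptions (4 subpixel bits, int64 at setup, int32 while stepping; names hypothetical):

```cpp
#include <cassert>
#include <cstdint>

constexpr int SUBPIXEL_BITS = 4;

// With SUBPIXEL_BITS fractional bits, per-pixel edge-function steps are
// multiples of 1 << SUBPIXEL_BITS, so the low bits of the edge value are
// constant during whole-pixel stepping. Shifting them out after setup
// preserves the per-pixel sign test (arithmetic shift keeps the sign,
// and a negative value can never shift to a non-negative one) while
// gaining SUBPIXEL_BITS of overflow headroom. The per-pixel steps are
// shifted down the same way.
int32_t CompressEdgeValue(int64_t edgeAtStart) {
    return (int32_t)(edgeAtStart >> SUBPIXEL_BITS);
}
```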

hrydgard (Owner, Author):

For the purposes of lens flare occlusion, I don't think subpixel is very meaningful at all, really. In the games I've tried, the current setup works perceptually perfectly. But yeah, I've read ryg's blog, so I know about the trick :)
