
Render a software depth buffer in parallel with HW rendering #19748

Merged (28 commits), Dec 21, 2024

Conversation

hrydgard (Owner) commented Dec 18, 2024:

This will fix the lens flare problems in a lot of games: #15923

Many games check if they should draw flares by reading from the depth buffer using the CPU. Unfortunately when we render using the host GPU, it's very, very expensive to read things back.

So instead, this renders an approximate depth buffer into the emulated PSP VRAM directly, using the host CPU.

Approximate - yes, we can take some shortcuts here, while still getting good results. Subpixel precision isn't really needed, and we can probably just skip skinned meshes (although we can of course implement that too if needed).

And there's even more potential trickery we can do to speed it up:

  • Render very small triangles as little squares, eliminating much of the triangle setup cost
  • Skip very small triangles entirely
  • Rasterize at half resolution, writing double pixels (although hoping to not need this)
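To sketch the first of these tricks: if a triangle's bounding box is tiny, just stamp the box with a single depth value and skip edge-function setup entirely. A minimal, hypothetical sketch (the helper name, less-than depth test, and row-major 16-bit buffer are assumptions, not the PR's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: stamp a tiny triangle as a filled bounding box.
// 'depth' is a row-major uint16 buffer with the given stride; vertex
// coordinates are in whole pixels. Returns true if the triangle was small
// enough to be handled here (otherwise fall through to the real rasterizer).
bool StampTinyTriangle(uint16_t *depth, int stride,
                       int x0, int y0, int x1, int y1, int x2, int y2,
                       uint16_t z) {
    int minX = std::min({x0, x1, x2}), maxX = std::max({x0, x1, x2});
    int minY = std::min({y0, y1, y2}), maxY = std::max({y0, y1, y2});
    if (maxX - minX > 2 || maxY - minY > 2)
        return false;  // Too big for the shortcut.
    for (int y = minY; y <= maxY; y++) {
        for (int x = minX; x <= maxX; x++) {
            uint16_t *p = depth + y * stride + x;
            if (z < *p)  // Assumed "less" depth test, write on pass.
                *p = z;
        }
    }
    return true;
}
```

The over-coverage (writing the whole box instead of just the covered pixels) is exactly the kind of approximation that's fine for occlusion queries but wouldn't be for color rendering.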

Since there's only a small subset of games that have any use for this, it will be enabled with the compat.ini setting [SoftwareRasterDepth].

Before we start enabling it for games, this needs a lot of optimization, and also should run on a separate thread (or threads), syncing up on every framebuffer switch or maybe on every DrawSync. Although in debug mode, it should go draw by draw.

Potential inaccuracies:

  • This does not do alpha testing, so cardboard cutouts will block the sun even when they shouldn't. Will only implement this if it's badly needed, as it would add a lot of complexity (we'd need to interpolate UVs and do texturing, more like the full software renderer)
  • As mentioned, skinned meshes are simply skipped.
  • Z clipping to the near plane is not done, triangles are simply discarded if they intersect it

The new GE debugger visualizations helped a lot:

[image: GE debugger visualization]

Note to self: The madd trick from https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/ may be useful for SIMD-ing the triangle setup on SSE.

@hrydgard hrydgard added this to the v1.19.0 milestone Dec 18, 2024
@hrydgard hrydgard marked this pull request as draft December 18, 2024 17:57
case GE_COMP_ALWAYS:
    while (w >= 8) {
        _mm_storeu_si128(ptr, valueX8);
        ptr++;

Contributor:

Nitpick, but incrementing a possibly-misaligned pointer feels a bit iffy (though compilers probably won't do anything weird here), compared to only casting inside _mm_storeu_si128(...).

hrydgard (Owner, Author) replied Dec 20, 2024:

I do think this is fine, actually: _mm_storeu_si128 is meant to handle misaligned pointers, unlike _mm_store_si128 (although on modern CPUs, aligned and unaligned stores perform the same when the address happens to be aligned).

I might check for alignment and do separate loops later, but in all cases in the relevant games that I've seen, this is just used for clearing the background and is aligned.

beta += A1,
gamma += A2)
{
int mask = alpha >= 0 && beta >= 0 && gamma >= 0;

Contributor:

A classic trick is int mask = (alpha | beta | gamma);, followed by if(mask < 0) continue;.
Extends to SIMD by _mm_movemask_ps to check the 4 signs (on x86, not sure about NEON).
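
A minimal scalar illustration of the trick (the SIMD version would OR the four lanes and test the sign bits with _mm_movemask_ps, as noted):

```cpp
#include <cassert>

// A pixel is inside the triangle iff all three edge functions are
// non-negative, i.e. iff OR-ing them leaves the sign bit clear.
// This replaces three comparisons and two logical ANDs with two ORs
// and a single sign test.
inline bool InsideTriangle(int alpha, int beta, int gamma) {
    return (alpha | beta | gamma) >= 0;
}
```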

Contributor:


Addendum: this is what SoftGPU does for NEON:

#elif PPSSPP_ARCH(ARM64_NEON)

float previousDepthValue = (float)depthBuf[idx];

int depthMask;
switch (compareMode) {

Contributor:


Would it make sense to template on compareMode (with added dispatcher function with a runtime argument, that selects one of the templates)?
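
The suggested pattern could look something like this sketch (illustrative enum and function names, not the PR's actual code): the comparison becomes a compile-time constant inside the templated loop, and one runtime switch per row or triangle picks the instantiation.

```cpp
#include <cassert>
#include <cstdint>

enum CompareMode { COMP_NEVER, COMP_LESS, COMP_ALWAYS };  // Illustrative subset.

// Templated inner loop: 'mode' is a template parameter, so the compiler
// folds the switch away and each instantiation is a tight, branch-free loop.
template <CompareMode mode>
static void DepthRowT(uint16_t *row, int w, uint16_t z) {
    for (int x = 0; x < w; x++) {
        bool pass = false;
        switch (mode) {  // Resolved at compile time.
        case COMP_NEVER:  pass = false; break;
        case COMP_LESS:   pass = z < row[x]; break;
        case COMP_ALWAYS: pass = true; break;
        }
        if (pass)
            row[x] = z;
    }
}

// Runtime dispatcher: one switch per call instead of one per pixel.
void DepthRow(uint16_t *row, int w, uint16_t z, CompareMode mode) {
    switch (mode) {
    case COMP_NEVER:  DepthRowT<COMP_NEVER>(row, w, z); break;
    case COMP_LESS:   DepthRowT<COMP_LESS>(row, w, z); break;
    case COMP_ALWAYS: DepthRowT<COMP_ALWAYS>(row, w, z); break;
    }
}
```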

hrydgard (Owner, Author) replied:

Yes, that will come later.

hrydgard (Owner, Author):

@fp64 this is not meant to showcase properly optimized code yet - first step is getting it to work, then I'll SIMD it properly. So no need to review yet, this is in draft mode :)

@hrydgard hrydgard force-pushed the software-depth-proto branch 3 times, most recently from 24be2a7 to 8fa1429 Compare December 20, 2024 14:32
hrydgard (Owner, Author):

CrossSIMD.h is becoming really useful; it was very easy to SIMD-ify DepthRasterClipIndexedTriangles for a 2x+ speed boost.

Now, the main thing taking time is of course DepthRasterTriangle. But before taking on that, I'll have to implement proper cull modes.

@hrydgard hrydgard force-pushed the software-depth-proto branch from e104976 to 2b24230 Compare December 20, 2024 19:35
hrydgard (Owner, Author):

Some progress:

  • Now correctly rasterizes depth of large scenes in Wipeout, Midnight Club etc, fixing lens flare effects
  • My "CrossSIMD" classes for automatic SSE+NEON support are working out very well
  • Everything up to the rasterizer has now been optimized. The rasterizer itself, though, is totally unoptimized, so it eats 15-20% CPU on a fast PC, which isn't good enough.

Unfortunately right now Syphon Filter is broken, but will fix again of course.

hrydgard (Owner, Author):

Alright, the inner loop is now SIMD-optimized and works on x86-64 and ARM.

Unfortunately it's really slow in debug mode (unsurprisingly), so some #pragma optimize might be warranted...

I'm gonna get this in soon, but not enable it in compat.ini just yet; it needs to be multithreaded too first to minimize the performance impact.

@hrydgard hrydgard marked this pull request as ready for review December 21, 2024 12:42
hrydgard (Owner, Author):

@fp64 you're welcome to have another look now :)

(I do know that some things could do with more optimization; for example, the triangle setup should be done for four triangles in parallel, and to support that, the x/y/z arrays should be reorganized.)

@hrydgard hrydgard force-pushed the software-depth-proto branch from 0dd55d9 to 80cb57f Compare December 21, 2024 13:32
@hrydgard hrydgard changed the title Software depth buffer prototype Render a software depth buffer in parallel with HW rendering Dec 21, 2024
hrydgard (Owner, Author):

Alright, I'm gonna get this in as-is. It's disabled by default, but just use compat.ini as mentioned above to try it. You can use the Pixel Viewer in the new GE debugger to inspect the rendered result.

fp64 (Contributor) commented Dec 22, 2024:

Some comments, in no particular order.

// NOTE: This uses a CrossSIMD wrapper if we don't compile with SSE4 support, and is thus slow.

Out-of-date comment? The wrapper is now used regardless (and wouldn't the alternative be bad, as _M_SSE is set to 0x402 on Windows?).

I see that this ignores the top-left rule, same as Intel's Software Occlusion Culling demo does. Probably OK for approximate rasterization. What does the real PSP do, anyway? SoftGPU's IsRightSideOrFlatBottomLine is probably not quite right.

Is there a risk of integer overflow in the edge function computations? If all vertices are inside the 480x272 rectangle, then no (even with subpixel precision), but are triangles clipped to the screen beforehand (which can introduce extra vertices)? And is the resolution always 1x?

Trying to figure out how much SIMDification of triangle setup would gain.
Back when I was testing the speed of my pure-depth-write rasterizer (warning: somewhat buggy, at least top-left stuff is probably wrong) it looked like this (on whatever machine rextester uses to run code):

Tris     CCW Tris  Pixels    AvgArea  Time,us  T/tri,ns  T/px,ns
1000000  499981      39421      0.08  67200.0      67.2   1704.7
100000    49926      15238      0.31   6890.0      68.9    452.2
100000    49926      61009      1.22   8183.0      81.8    134.1
100000    49926     243212      4.87  11112.0     111.1     45.7
10000      4923      94122     19.12   1764.0     176.4     18.7
10000      4923     368064     74.76   3843.0     384.3     10.4
10000      4923    1416831    287.80   8410.0     841.0      5.9
1000        486     515371   1060.43   1913.0    1913.0      3.7
1000        486    1766270   3634.30   4570.0    4570.0      2.6
1000        486    4559285   9381.24   8659.0    8659.0      1.9

This is SSE2 for rasterization (but it uses 2x2 quads, and scalar stores somehow; using store2x64() would be probably faster), and scalar triangle setup (all single-threaded).
The plot doesn't actually fit a nice time = A*tris+B*pixels line,
but very roughly we have ~100ns per triangle plus ~2ns per pixel. That gets us ~1ms at ~10k triangles.

For large triangles I got around x1.4 speedup by splitting them into 8x8 blocks, and special-casing empty and full blocks.
You can test it in my example above by setting

bool use_blocks=false;

It might be cheaper to just step (floating-point) Z, rather than compute it from barycentrics (certainly if you special-case full blocks to not use barycentrics at all). Computing interpolants from barycentrics may reduce register pressure if there are many of them (texcoords, normals, etc.), but we only have Z. There might be an accuracy impact (edge function increments are exact; floating-point dZdX, dZdY might not be), but hopefully small.
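
Since Z is linear in screen space, the gradients are constant per triangle and can be derived once from the plane through the three vertices. A sketch (hypothetical struct and function names):

```cpp
#include <cassert>
#include <cmath>

// Screen-space Z gradients for a triangle: solving z = A*x + B*y + C
// through the three vertices gives constant dZdX = A, dZdY = B, so the
// per-pixel barycentric evaluation collapses to one add per step.
struct ZGradients { float dZdX, dZdY; };

ZGradients ComputeZGradients(float x0, float y0, float z0,
                             float x1, float y1, float z1,
                             float x2, float y2, float z2) {
    // d is twice the signed area; assumed non-zero (degenerate triangles
    // are rejected earlier).
    float d = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    ZGradients g;
    g.dZdX = ((z1 - z0) * (y2 - y0) - (z2 - z0) * (y1 - y0)) / d;
    g.dZdY = ((x1 - x0) * (z2 - z0) - (x2 - x0) * (z1 - z0)) / d;
    return g;
}
```

The rasterizer then adds dZdX per pixel and dZdY per row, which is where the possible (small) accuracy loss relative to exact integer edge stepping comes in.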

Speaking of, the DepthRasterTriangle() interpolates Z, not 1/Z. SoftGPU seems to be doing the same, so I assume this is what PSP does?

Unrolling the inner loop 2x to load/store 128 bits at a time (and not 64 like now) might be a win (Vec8U16 instead of Vec4U16).

Are tileStartX, tileEndX always multiples of 4? Otherwise code may stomp outside scissor, it seems.

About

		// Use a couple Newton-Raphson steps to refine the estimate.
		// May be able to get away with only one refinement, not sure!

My understanding is that VRECPE of 1.0f is 511/512, and after the 1st step you get 1-2^(-18), which is not quite enough if you use it to implement division.
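
A scalar analogue of the refinement (NEON's VRECPS computes the 2 - x*r factor in one instruction; the starting estimate here is the 511/512 value mentioned above):

```cpp
#include <cassert>
#include <cmath>

// Newton-Raphson refinement of a reciprocal estimate: r' = r * (2 - x*r).
// Each step roughly doubles the number of correct bits, so a ~9-bit
// initial estimate gives ~18 bits after one step and full float
// precision after two.
float RecipNR(float x, float initialEstimate, int steps) {
    float r = initialEstimate;
    for (int i = 0; i < steps; i++)
        r = r * (2.0f - x * r);
    return r;
}
```

With x = 1.0f and the 511/512 estimate, one step lands on 1 - 2^(-18), matching the "not quite enough" observation; the second step closes the gap.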

hrydgard (Owner, Author) commented Dec 22, 2024:

Thanks for looking!

Not quite an outdated comment, but I haven't ended up using that.

Since we are drawing directly to PSP VRAM at its native resolution, where the real hardware draws, yes, it's always 1x.

Integer overflow might be a possibility, yes. I should reject triangles if any corner is outside a 4096x4096 box, just like the PSP itself does (it doesn't clip on the sides).
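
That rejection test could be as cheap as one OR and one mask (hypothetical helper; the actual box test in the PR may differ):

```cpp
#include <cassert>

// Reject a triangle if any vertex is outside the 4096x4096 drawing region.
// OR-ing all coordinates together means any negative value (sign bit) or
// any value >= 4096 (bit 12 or above) leaves bits outside the low 12 set.
inline bool OutsideDrawRegion(int x0, int y0, int x1, int y1, int x2, int y2) {
    int ored = x0 | y0 | x1 | y1 | x2 | y2;
    return (ored & ~0x0FFF) != 0;
}
```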

Yeah, doing a hierarchical rasterizer with bigger blocks may be interesting. No triangles are really huge though because we're limited to a resolution of 480x272, but it still may be worth it. It will add some extra complexity though for sure.

All modern GPUs, including ones as far back as the PSP, just interpolate Z (which is z/w) because it's linear in screen space, so you don't need to divide per pixel.

I've already changed it to step Z instead of computing from barycentrics, in #19758. It's slightly more setup though, so it might hurt very small triangles; however, I also started rejecting triangles with an area under 5 pixels (we compute area*2, so the threshold is 10), which seems to be perfectly fine for the purpose and saves a bit of CPU.

I am thinking of moving the bbox calculation out to DepthRasterClipIndexedTriangles, to avoid even queueing up discarded triangles, which will make sure that a SIMD-ified setup always chews on something meaningful.

Because right now, about 70-85% of triangles are (correctly) rejected before we enter the raster loop, which is more than I expected! Either because they're too small, hit a screen border, or are backface culled.

The question with unrolling the inner loop is whether it's better to go to 4x2 blocks or 8x1 blocks. My intuition says the former, although we won't get the benefit of nice 128-bit loads/stores then... And also, not sure if we have enough registers left for it to be a win.
I think 2x2 blocks are likely not worth it - it would be if we could store the depth buffer swizzled, but we can't.

I do need to pay more attention to the tileStartX/scissor, you're right. Though games using odd-sized or odd-positioned scissors are rare, and I don't think any of the relevant games do.

fp64 (Contributor) commented Dec 22, 2024:

Unrolled 8x1 fits nicely into "fully-unrolled 8x8", if one wants to go that way. For small triangles - perhaps not great.
Needing extra regs might hurt, yes.

Just in case you were not aware: there's a neat trick to effectively gain more bits before hitting integer overflow. Since we step in full pixels, the lower bits of the edge functions do not change (and therefore do not carry into the sign bit), and do not need to be stored. With 4-bit subpixel precision that gets us an 8192x8192 render target (instead of 2048x2048 without the trick) using int32 (which would nicely contain the 4096x4096 box you mentioned). It does require int64 at triangle setup, hurting its SIMDability, though. Not sure how important subpixel accuracy is for this task; without it, the render target is a comfortable 32768x32768, fully in int32.
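
A sketch of the compression step under those assumptions (4 subpixel bits, int64 at setup, int32 while stepping; names hypothetical):

```cpp
#include <cassert>
#include <cstdint>

constexpr int SUBPIXEL_BITS = 4;

// With SUBPIXEL_BITS fractional bits, per-pixel edge-function steps are
// multiples of 1 << SUBPIXEL_BITS, so the low bits of the edge value are
// constant during whole-pixel stepping. Shifting them out after setup
// preserves the per-pixel sign test (arithmetic shift keeps the sign,
// and a negative value can never shift to a non-negative one) while
// gaining SUBPIXEL_BITS of overflow headroom. The per-pixel steps are
// shifted down the same way.
int32_t CompressEdgeValue(int64_t edgeAtStart) {
    return (int32_t)(edgeAtStart >> SUBPIXEL_BITS);
}
```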

hrydgard (Owner, Author):

For the purposes of lens flare occlusion, I don't think subpixel is very meaningful at all, really. In the games I've tried, the current setup works perceptually perfectly. But yeah, I've read ryg's blog, so I know about the trick :)
