-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Render a software depth buffer in parallel with HW rendering #19748
Conversation
GPU/Common/DepthRaster.cpp
Outdated
case GE_COMP_ALWAYS: | ||
while (w >= 8) { | ||
_mm_storeu_si128(ptr, valueX8); | ||
ptr++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick, but incrementing possibly-misaligned pointer feels a bit iffy (though compilers probably won't do anything weird here), compared to only casting inside _mm_storeu_si128(...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think this is fine actually, _mm_storeu_si128 is meant to handle misaligned pointers, unlike _mm_store_si128. (although on modern CPUs, they are the same).
I might check for alignment and do separate loops later, but in all cases in the relevant games that I've seen, this is just used for clearing the background and is aligned.
GPU/Common/DepthRaster.cpp
Outdated
beta += A1, | ||
gamma += A2) | ||
{ | ||
int mask = alpha >= 0 && beta >= 0 && gamma >= 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A classic trick is int mask = (alpha | beta | gamma);
, followed by if(mask < 0) continue;
.
Extends to SIMD by _mm_movemask_ps
to check the 4 signs (on x86, not sure about NEON).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addendum: this is what SoftGPU does for NEON:
ppsspp/GPU/Software/Rasterizer.cpp
Line 946 in 53cb014
#elif PPSSPP_ARCH(ARM64_NEON) |
float previousDepthValue = (float)depthBuf[idx]; | ||
|
||
int depthMask; | ||
switch (compareMode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to template on compareMode
(with added dispatcher function with a runtime argument, that selects one of the templates)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. that will come later.
@fp64 this is not meant to showcase properly optimized code yet - first step it getting it to work, then I'll SIMD it properly. So no need to review yet, this is in draft mode :) |
24be2a7
to
8fa1429
Compare
CrossSIMD.h is becoming really useful, it was very easy to SIMD-ify DepthRasterClipIndexedTriangles for a 2x+ speed boost. Now, the main thing taking time is of course DepthRasterTriangle. But before taking on that, I'll have to implement proper cull modes. |
e104976
to
2b24230
Compare
Some progress:
Unfortunately right now Syphon Filter is broken, but will fix again of course. |
Alright, the inner loop is now SIMD-optimized and works on x86-64 and ARM. Unfortunately it's really slow in debug mode (unsurprisingly), so some #pragma optimize might be motivated... I'm gonna get this in soon, but not enable it in compat.ini just yet, it needs to be multithreaded too first to minimize the performance impact. |
@fp64 you're welcome to have another look now :) (I do know that some things can do with more optimization, like the triangle setup should be done for four triangles in parallel, and to support that, the x/y/z arrays should be reorganized). |
Checked the output, the generated assembly is great!
0dd55d9
to
80cb57f
Compare
Alright, I'm gonna get this in as-is. It's disabled by default, but just use compat.ini as mentioned above to try it. You can use the Pixel Viewer in the new Ge debugger to inspect the rendered result. |
Some comments, in no particular order. ppsspp/Common/Math/CrossSIMD.h Line 57 in 80cb57f
Out-of-date comment? Since wrapper is now used regardless (and wouldn't the alternative be bad, as _M_SSE is set to 0x402 on windows?).
I see that this ignores top-left rule, same as Is there a risk of integer overflow in edge function computations? If all vertices are inside 480x272 rectangle, then no (even with supbixel precision), but are triangles clipped to screen beforehand (which can introduce extra vertices)? And is resolution always 1x? Trying to figure out how much SIMDification of triangle setup would gain.
This is SSE2 for rasterization (but it uses 2x2 quads, and scalar stores somehow; using For large triangles I got around x1.4 speedup by splitting them into 8x8 blocks, and special-casing empty and full blocks. bool use_blocks=false; It might be cheaper to just step (floating-point) Z, rather than compute it from barycentrics (certainly, if you special-case full blocks not to check - or at all use - barycentrics at all). Computing interpolants from barycentrics may reduce register pressure if there are many of them (texcoords, normals, etc.) - but we only have Z. There might be accuracy impact (edge function increments are exact; floating-point Speaking of, the Unrolling inner loop 2x to load/store 128 bits at a time (and not 64 like now) might be a win ( Are About // Use a couple Newton-Raphson steps to refine the estimate.
// May be able to get away with only one refinement, not sure! My understanding is that |
Thanks for looking! Not quite an outdated comment, but I haven't ended up using that. Since we are drawing directly to PSP VRAM at its native resolution, where the real hardware draws, yes, it's always 1x. Integer overflow might be a possibility, yes. I should reject triangles if any corner is outside a 4096x4096 box, just like the PSP itself does (it doesn't clip on the sides). Yeah, doing a hierarchical rasterizer with bigger blocks may be interesting. No triangles are really huge though because we're limited to a resolution of 480x272, but it still may be worth it. It will add some extra complexity though for sure. All modern GPUs, including ones as far back as the PSP indeed, just interpolates Z (which is z/w) becaused it's linear in screen space, so you don't need to divide per pixel. I've already changed it to step Z instead of computing from barycentrics, in #19758 . It's slightly more setup though so it might hurt very small triangles, however I also started rejecting triangles with an area under I am thinking of moving the bbox calculation out to DepthRasterClipIndexedTriangles, to avoid even queueing up discarded triangles, that will make sure that a SIMD-ified setup will actually always chew on something meaningful. Because right now, about 70-85% of triangles are (correctly) rejected before we enter the raster loop, which is more than I expected! Either because too small, or hit a screen border, or backface culled. The question with unrolling the innerloop is if it's better to go to 4x2 blocks or 8x1 blocks. My intuition says the former, although we won't get the benefit of nice 128-bit loads/stores then... And also, not sure if we have enough registers left for it to be a win. I do need to pay more attention to the tileStartX/scissor, you're right. Though games using odd-sized or odd-positioned scissors are rare, and I don't think any of the relevant games do. |
Unolled 8x1 fits nicely into "fully-unrolled 8x8", if one wants to go that way. For small triangles - perhaps not great. Just in case you were not aware: there's a neat trick to effectively have more bits before we hit integer overflow - since we step the full pixels, the lower bits of edge functions do not change (and therefore do not carry into the sign bit), and do not need to be stored. With 4-bit subpixel that gets us 8192x8192 render target (instead of 2048x2048 without this trick) using int32 (which would nicely contain 4096x4096 box you mentioned). It does require int64 at triangle setup, hurting its SIMDability, though. Not sure how important subpixel accuracy is for this task. Without it, the render target is comfortable 32768x32768, fully in int32. |
For the purposes of lens flare occlusion, I don't think subpixel is very meaningful at all, really. In the games I've tried, the current setup works perceptually perfectly. But yeah, I've read ryg's blog, so I know about the trick :) |
This will fix the lens flare problems in a lot of games: #15923
Many games check if they should draw flares by reading from the depth buffer using the CPU. Unfortunately when we render using the host GPU, it's very, very expensive to read things back.
So instead, this renders an approximate depth buffer into the emulated PSP VRAM directly, using the host CPU.
Approximate - yes, we can take some shortcuts here, while still getting good results. Subpixel precision isn't really needed, and we can probably just skip skinned meshes (although we can of course implement that too if needed).
And there's even more potential trickery we can do to speed it up:
Since there's only a small subset of games that have any use for this, it will be enabled with the compat.ini setting
[SoftwareRasterDepth]
.Before we start enabling it for games, this needs a lot of optimization, and also should run on a separate thread (or threads), syncing up on every framebuffer switch or maybe on every DrawSync. Although in debug mode, it should go draw by draw.
Potential inaccuracies:
The new GE debugger visualizations helped a lot:
Note to self: The madd trick from https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/ may be useful for SIMD-ing the triangle setup on SSE.