Depth raster TODO list #19757

Open
3 of 7 tasks
hrydgard opened this issue Dec 22, 2024 · 6 comments
Labels
GE emulation Backend-independent GPU issues

@hrydgard
Owner

hrydgard commented Dec 22, 2024

This is about #19748, which solves a number of lens flare issues across various games, at the cost of running an extra Z-only software renderer.

Ideally other games should run and render good depth buffers too, so we get the bugs out of the system, even when they don't have any use for them.

Problematic things (done)

  • Quake II (homebrew), Suicide Barbie - generate a 0xFFFF depth buffer
  • Tekken 6 hangs in a broken display list (I guess we write some memory we shouldn't)

Features:

  • Add a setting to control it under Speed Hacks

Planned optimizations:

  • Hierarchical rasterization for large triangles
  • Raster the screen in tiles, across multiple threads. Ideally we should do binning, although on the other hand, triangle raster time is so dominant that maybe it's ok to just send all draws to all threads.
  • SIMD-ify triangle setup, do four triangles at a time (all the way from the clip function)
  • Queue up draws, run them "in the background" on the threads and only flush on render target switches.
@fp64
Contributor

fp64 commented Dec 22, 2024

Not sure if this comment belongs here, but.
Regarding triangle setup, just had a thought: wouldn't it be possible to just use _mm_madd_epi16 for integer multiplication for edge functions?
Possibly zeroing out high 16 bit of each 32-bit lane of one of the arguments (no need to patch both) - a "sign-retract", if you will.
This assumes all vertex coords are in [-32768;+32767], but you want that regardless, to avoid overflow.
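A minimal sketch of that idea, assuming SSE2 and that every lane value fits in int16. The helper names `mul_lanes_int16_range` and `mul4` are made up for illustration and are not from the PPSSPP code:

```c
#include <emmintrin.h> // SSE2 intrinsics
#include <stdint.h>

// Multiplies the four int32 lanes of a and b, assuming every lane value
// fits in int16. Only b is masked: its high 16 bits become zero, so the
// sign-extension half of a is multiplied by zero and drops out of the sum.
static __m128i mul_lanes_int16_range(__m128i a, __m128i b)
{
    __m128i b_lo = _mm_and_si128(b, _mm_set1_epi32(0xFFFF));
    // Per 32-bit lane: a_lo*b_lo + a_hi*0 == a*b.
    return _mm_madd_epi16(a, b_lo);
}

// Convenience wrapper over plain arrays (hypothetical, for testing).
static void mul4(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul_lanes_int16_range(va, vb));
}
```

The largest possible product, (-32768)*(-32768) = 2^30, still fits in int32, so a single product per lane cannot overflow.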

@hrydgard
Owner Author

Yes, that will probably be fine, because both multiplicands are small. Certainly better than the horror of the workaround function :)

Or maybe it's okay to do the triangle setup in float? Although, I'm sure Fabian has a good reason to stick to int..

Btw, in your https://rextester.com/GDHNO44482 , for the hierarchical traversal, I'm pretty sure that you don't have to do four tests like in your test_rect; it should be possible to bias the edge functions instead and do a single test even at the upper level. Though, I have not tried that :)

@fp64
Contributor

fp64 commented Dec 23, 2024

You can even do:

#include <emmintrin.h> // SSE2 intrinsics

// Returns (a-b)*(c-d)-(e-f)*(g-h) per int32 lane,
// assuming all (...)'s fit into int16.
static __m128i edge_function(__m128i a,__m128i b,__m128i c,__m128i d,__m128i e,__m128i f,__m128i g,__m128i h)
{
    __m128i p=_mm_sub_epi32(a,b);
    __m128i q=_mm_sub_epi32(c,d);
    __m128i r=_mm_sub_epi32(e,f);
    __m128i s=_mm_sub_epi32(h,g); // flipped order, since _mm_madd_epi16 is p*q+r*s, not p*q-r*s.
    // Pack p into the low 16 bits and r into the high 16 bits of each lane (likewise q and s).
    __m128i x=_mm_or_si128(_mm_and_si128(p,_mm_set1_epi32(0xFFFF)),_mm_slli_epi32(r,16));
    __m128i y=_mm_or_si128(_mm_and_si128(q,_mm_set1_epi32(0xFFFF)),_mm_slli_epi32(s,16));
    // Multiplies the int16 pairs and adds the two products within each 32-bit lane.
    return _mm_madd_epi16(x,y);
}

Tested it, seems to work fine.
You can also have a 4-argument version (just p*q-r*s), though that needs one extra negation.
The speedup over the naive SSE2 version seems surprisingly small though (~1.33x), and it's ~1.4 times slower than a straightforward SSE4 version.

Triangle setup/rasterization are done in int pretty much for reasons of exactness: you want to make sure pixels on common edge of 2 triangles are rendered exactly once.
The very same ryg mentioned somewhere (Twitter, I think, so good luck searching that) that you can use float in certain circumstances: you only care about exactness around where the edge function can change sign in the first place (and float gives you much better range - 2^126 vs 2^31, though not precision). But that needs care, and he didn't go into detail.
Normally, float32 products are exact up to 24 bits, so you are worse off than int32.
Now, if you don't care about exactness (as evidenced by not doing top-left rule, as well as skipping small triangles) - perhaps float is alright. I seem to recall that doing rasterizer in float doesn't produce much visible artifacts in practice.
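To illustrate the "exactly once" point, here is a small sketch of a top-left fill rule on integer edge functions. It assumes y-down screen coordinates, samples at integer positions, and triangles wound so that the edge functions are non-negative inside. All names are hypothetical and this is not how the depth raster in the PR is written:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t x, y; } Vec2i;

// Edge function for the directed edge a->b, evaluated at p.
static int32_t edge_eval(Vec2i a, Vec2i b, Vec2i p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// With this winding and y pointing down, a left edge goes up (dy < 0)
// and a top edge is horizontal and goes right (dy == 0, dx > 0).
static bool is_top_left(Vec2i a, Vec2i b)
{
    int32_t dx = b.x - a.x, dy = b.y - a.y;
    return dy < 0 || (dy == 0 && dx > 0);
}

// Samples exactly on an edge count only if that edge is top or left.
// Without the rule, every edge accepts E >= 0.
static bool covers(Vec2i a, Vec2i b, Vec2i c, Vec2i p, bool use_rule)
{
    int32_t bias0 = (use_rule && !is_top_left(a, b)) ? 1 : 0;
    int32_t bias1 = (use_rule && !is_top_left(b, c)) ? 1 : 0;
    int32_t bias2 = (use_rule && !is_top_left(c, a)) ? 1 : 0;
    return edge_eval(a, b, p) - bias0 >= 0 &&
           edge_eval(b, c, p) - bias1 >= 0 &&
           edge_eval(c, a, p) - bias2 >= 0;
}

// Splits the square (0,0)-(4,4) along its diagonal into two triangles and
// returns the largest number of triangles covering any single sample.
static int max_coverage_on_quad(bool use_rule)
{
    Vec2i v00 = {0, 0}, v40 = {4, 0}, v44 = {4, 4}, v04 = {0, 4};
    int worst = 0;
    for (int y = 0; y <= 4; y++) {
        for (int x = 0; x <= 4; x++) {
            Vec2i p = {x, y};
            int n = (int)covers(v00, v40, v44, p, use_rule) +
                    (int)covers(v00, v44, v04, p, use_rule);
            if (n > worst) worst = n;
        }
    }
    return worst;
}
```

With the rule enabled, samples on the shared diagonal belong to exactly one of the two triangles; with plain `>= 0` tests they are hit by both.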

You mean computing edge function at rect center, and comparing it to sum of absolute values of increments for rect half-sides? Oh, yeah, that should work, nice.
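A rough sketch of that center-plus-bias test in plain C, assuming edge functions of the form E(x, y) = A*x + B*y + C and "inside" meaning E >= 0. The center is kept in doubled units so everything stays in integers. The names and the brute-force reference are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int32_t A, B, C; } Edge; // E(x, y) = A*x + B*y + C

static int32_t edge_at(Edge e, int32_t x, int32_t y)
{
    return e.A * x + e.B * y + e.C;
}

// The rect covers samples x0..x0+w-1, y0..y0+h-1.
// 2*E at the rect center, so that odd-sized half extents stay integral.
static int32_t center2(Edge e, int x0, int y0, int w, int h)
{
    return 2 * edge_at(e, x0, y0) + e.A * (w - 1) + e.B * (h - 1);
}

// Sum of absolute increments over the half sides, also in doubled units.
static int32_t bias2(Edge e, int w, int h)
{
    return abs(e.A) * (w - 1) + abs(e.B) * (h - 1);
}

// True when E < 0 at every sample of the rect: a single comparison.
static bool rect_fully_outside(Edge e, int x0, int y0, int w, int h)
{
    return center2(e, x0, y0, w, h) + bias2(e, w, h) < 0;
}

// True when E >= 0 at every sample of the rect.
static bool rect_fully_inside(Edge e, int x0, int y0, int w, int h)
{
    return center2(e, x0, y0, w, h) - bias2(e, w, h) >= 0;
}

// Brute-force reference: counts the samples with E < 0.
static int count_negative(Edge e, int x0, int y0, int w, int h)
{
    int n = 0;
    for (int y = y0; y < y0 + h; y++)
        for (int x = x0; x < x0 + w; x++)
            n += edge_at(e, x, y) < 0;
    return n;
}
```

Since E is linear, its extremes over the rect sit at corners, which is why the center value plus or minus the bias gives the same answer as checking all four corners.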

@hrydgard
Owner Author

Nice! That'll come in handy. I'll stick to integer...

By the way, I was driving today and thinking of rasterization hehe. I had two thoughts:

In your 8x8 raster, if there is a block to the left of the current one, you can reuse the top right and bottom right samples as the new top left and bottom left.
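Roughly what I mean, as an untested-in-practice sketch for a single edge function. It assumes the corners are sampled on block boundaries (x0 + k*size) rather than at the last pixel of each block, since that is what makes neighbouring blocks share them; all names are made up:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t A, B, C; } Edge; // E(x, y) = A*x + B*y + C

static int32_t edge_at(Edge e, int32_t x, int32_t y)
{
    return e.A * x + e.B * y + e.C;
}

// Walks one row of blocks left to right. keep[k] is false when this edge
// is negative at all four corners of block k. Returns the number of edge
// evaluations performed: 2 for the first column, then 2 per block.
static int test_row_with_reuse(Edge e, int x0, int y0, int size,
                               int nblocks, bool keep[])
{
    int evals = 2;
    int32_t tl = edge_at(e, x0, y0);
    int32_t bl = edge_at(e, x0, y0 + size);
    for (int k = 0; k < nblocks; k++) {
        int xr = x0 + (k + 1) * size;
        int32_t tr = edge_at(e, xr, y0);
        int32_t br = edge_at(e, xr, y0 + size);
        evals += 2;
        keep[k] = !(tl < 0 && bl < 0 && tr < 0 && br < 0);
        // The right corners become the next block's left corners.
        tl = tr;
        bl = br;
    }
    return evals;
}

// Reference without reuse: four evaluations per block.
static bool block_kept_direct(Edge e, int x0, int y0, int size)
{
    return !(edge_at(e, x0, y0) < 0 && edge_at(e, x0 + size, y0) < 0 &&
             edge_at(e, x0, y0 + size) < 0 &&
             edge_at(e, x0 + size, y0 + size) < 0);
}
```

That is 2 + 2*n evaluations per row instead of 4*n, at the price of a slightly more conservative test because the corners sit one pixel past the block.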

But also, I don't understand how your 8x8 method with checking corners doesn't miss small triangles, like a tiny one entirely enclosed by a block...

So it feels like rastering at 8x8 centres with bias, and then, in "inside" blocks, checking corners with the 1x1 biases to see if a block is full or partial would be the way to go?

@fp64
Contributor

fp64 commented Dec 23, 2024

The block test looks at 12 bits of data: signs of 3 edge functions at block corners.
The block is considered "empty" iff at least one edge function has negative signs at all 4 corners, i.e. the block is entirely in the "outside" half-plane (since half-plane is a convex shape).
This is conservative: there may be some blocks that aren't discarded as "empty", but are, in fact, empty.
However, blocks that are discarded are really empty.

So a tiny triangle enclosed by a block poses no problem: the signs of individual edge functions at the corners would be different.
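In scalar form the test looks roughly like this. It is a sketch only, assuming edge functions that are non-negative inside the triangle and corners taken at the first and last pixel of the block; the names are illustrative and not the ones from the linked code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t x, y; } Vec2i;
typedef struct { int32_t A, B, C; } Edge; // E(x, y) = A*x + B*y + C

// Edge function for the directed edge a->b, expanded into A, B, C.
static Edge make_edge(Vec2i a, Vec2i b)
{
    int32_t dx = b.x - a.x, dy = b.y - a.y;
    Edge e = { -dy, dx, dy * a.x - dx * a.y };
    return e;
}

static int32_t edge_at(Edge e, int32_t x, int32_t y)
{
    return e.A * x + e.B * y + e.C;
}

// 4 sign bits for one edge: bit i is set when E < 0 at corner i.
// Three edges give the 12 bits in total.
static unsigned corner_sign_bits(Edge e, int x0, int y0, int size)
{
    int x1 = x0 + size - 1, y1 = y0 + size - 1;
    unsigned bits = 0;
    bits |= (unsigned)(edge_at(e, x0, y0) < 0) << 0;
    bits |= (unsigned)(edge_at(e, x1, y0) < 0) << 1;
    bits |= (unsigned)(edge_at(e, x0, y1) < 0) << 2;
    bits |= (unsigned)(edge_at(e, x1, y1) < 0) << 3;
    return bits;
}

// Conservative reject: the block is "empty" iff some edge is negative at
// all four corners, i.e. the block lies in that edge's outside half-plane.
static bool block_is_empty(const Edge edges[3], int x0, int y0, int size)
{
    for (int i = 0; i < 3; i++) {
        if (corner_sign_bits(edges[i], x0, y0, size) == 0xF)
            return true;
    }
    return false;
}
```

For a small triangle such as (2,2), (5,2), (2,5) inside the 8x8 block at (0,0), no single edge is negative at all four corners, so the block is kept; the block at (16,0) is rejected by the hypotenuse edge alone.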

@hrydgard
Owner Author

Ahhh ok, I understand now :) With reuse, checking the corners may practically be as fast as using a rect-center biased check I guess, since with that method we still need to check the corners to see if a block is fully in or partial..

I'm going to play around with this later.
