-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes to UML and DRC backends #13108
base: master
Are you sure you want to change the base?
Conversation
…s bug fixes to be consistent with the C backend
…rious bug fixes to be consistent with the C backend
I think this nomenclature is a bit misleading. On the Xtensa architecture, instructions with exactly these mnemonics return the most significant 32 bits of the product. |
Do you have a better suggestion for naming then? I used the H to mean half here since it's only returning half of the result of the operation. |
I would probably use 'L' (low) as the suffix instead, though that's unfortunately not unambiguous either because ARM uses 'L' for long multiply. Xtensa, for its part, has only one MULL instruction because, like some other RISC architectures, it doesn't bother detecting signed or unsigned overflow. |
Playing off using L then: we don't need to keep the opcode names as minimal as possible especially since it's just a language used for MAME's DRC, so what about something like MULULOW/MULSLOW instead? I doubt that should clash with any existing arch's naming and it feels like it should be unambiguous as to what the opcodes are doing. |
Yours and current fail on my PPC opcode validator on the C backend, but not x64. Looks purely like flag update issues. Mismatch: instr=SUBFMEO, src1=0x7fffffff, src2=0x0 Mismatch: instr=SUBFMEO., src1=0x7fffffff, src2=0x0 Mismatch: instr=SUBFZEO, src1=0x7fffffff, src2=0x0 Mismatch: instr=SUBFZEO., src1=0x7fffffff, src2=0x0 |
@rb6502 How can I run that validator? I have an idea of what might be happening but I need to verify it first. |
You could call |
The PowerPC instruction mnemonics for these kinds of operations are:
It’s possibly just conditioning, but the PowerPC names feel logical for me. |
src/devices/cpu/drcbex64.cpp
Outdated
a.mov(rcx, hireg); | ||
a.cqo(); | ||
|
||
a.cmp(rcx, hireg); | ||
a.pushfd(); | ||
a.pop(hireg); | ||
a.and_(hireg, 0x40); // zero | ||
a.xor_(hireg, 0x40); | ||
a.shl(hireg, 5); // turn into overflow flag | ||
a.or_(dword_ptr(rsp, 0), hireg); | ||
|
||
a.popfd(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are potential issues with storing flags on the stack:
- x86-64 ABIs expect the stack pointer to be static and remain 16-byte aligned outside function prologues and epilogues. Violating this can cause problems with signal handler and interrupts.
pushfd
andpopfd
tend to cause pipeline stalls and hurt performance.
Can you use lahf
and sahf
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x86-64 ABIs expect the stack pointer to be static and remain 16-byte aligned outside function prologues and epilogues. Violating this can cause problems with signal handler and interrupts.
Is that really an issue here compared to any other code using a form of push/pop? I was careful about adding a corresponding pop (or stack pointer adjustment using lea) for every push I added so the stack should always end up in the same place it started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s not that the stack needs to be left where it started, it’s that the stack pointer must not move outside function prologues and epilogues. You just don’t push/pop in the main bodies of functions on x86-64.
This kind of requirement isn’t unique to x86-64 ABIs. PowerPC ABIs require R1 to always be aligned and always point to the previous R1 value (so you need to use stwu
or stdu
to create the stack frame, and can’t move the stack pointer in the main function body) and AArch64 enforces stack alignment in hardware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used r10 and r11 as scratch registers to handle juggling register values. I double checked and for Windows r10 and r11 considered volatile so I think it should be safe to use them. The header of the x64 DRC also says for Linux/MacOS the r10 and r11 registers are volatile.
x86 uses pushfd/popfd a lot more and there aren't really any readily available scratch registers to make it easier to implement so I don't think it's worth the time to optimize that for x86. Someone else can handle that if they think it's worth the effort though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used r10 and r11 as scratch registers to handle juggling register values. I double checked and for Windows r10 and r11 considered volatile so I think it should be safe to use them. The header of the x64 DRC also says for Linux/MacOS the r10 and r11 registers are volatile.
Yep, R10 and R11 are volatile in SysV and Windows calling conventions. (R10 is static link on entry for nested functions in SysV, but that isn’t relevant for us).
x86 uses pushfd/popfd a lot more and there aren't really any readily available scratch registers to make it easier to implement so I don't think it's worth the time to optimize that for x86. Someone else can handle that if they think it's worth the effort though.
I agree, not worth the effort at this point. The stack only needs to be aligned at the point of calls for i686, so isn’t overly important, just room for optimisation.
// this assumes that the flags from the low 32-bit op are on the stack | ||
// and the flags from the high 32-bit op are live | ||
a.pushfd(); // pushf | ||
a.mov(ecx, ptr(esp, 4)); // mov ecx,[esp+4] | ||
a.or_(ecx, ~0x40); // or ecx,~0x40 | ||
a.and_(ptr(esp, 0), ecx); // and [esp],ecx | ||
a.popfd(); // popf | ||
a.lea(esp, ptr(esp, 4)); // lea esp,[esp+4] | ||
a.pushfd(); | ||
|
||
a.mov(ecx, dword_ptr(esp, 4)); // zero flag | ||
a.or_(ecx, ~0x40); | ||
a.and_(dword_ptr(esp, 0), ecx); | ||
|
||
a.popfd(); | ||
a.lea(esp, ptr(esp, 4)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably more complex to do given the code is already structured around using the stack, but this would get a performance benefit from avoiding pushfd
/popfd
, too.
1090e24
to
d0a4937
Compare
I rewrote the logic for the C backend's SUBB (32-bit and 64-bit) to match the results of x86/x64. I checked the i386 CPU's SBB32 implementation and that was using a uint64_t to check for the carry flag which makes sense for 32-bit but wouldn't work for the 64-bit version of SUBB, so I ended up just checking for the specific case where x+carry would result in 0 when carry was non-0 (for the x=0, carry=0 case). The results between x86, x64, C backend, and ARM (sbcs) are all agreeing with the results of a few edge cases I tested against, including the above mention 0xffffffff - 0x7fffffff with carry set. If the PPC instruction still fails in the test program then I think it needs to be handled in the frontend. |
d0a4937
to
eb697c5
Compare
Never seen that function before. I tested it out and it worked when I had x32dbg/x64dbg attached to mame.exe and did nothing without the debugger attached, so definitely an improvement. Thanks. |
@@ -1484,7 +1484,7 @@ void drcbe_x64::op_nop(Assembler &a, const instruction &inst) | |||
|
|||
void drcbe_x64::op_break(Assembler &a, const instruction &inst) | |||
{ | |||
a.int3(); | |||
smart_call_r64(a, (x86code *)(uintptr_t)&osd_break_into_debugger, rax); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to put a pointer to a string for the function argument in RCX/RDI (depending on the calling convention) for the argument to osd_break_into_debugger
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeaaah completely forgot it took a parameter.
I ended up just implementing trampoline functions because these commands will be called very infrequently so performance isn't an issue, and it lets the compiler handle anything that it needs to. For x86 in particular it was doing stack pointer adjustments in the prologue and epilogue instead of just putting it at esp like I implemented initially, so it seems the safer choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trampoline for x86 needs to do that to keep the stack aligned because it’s a non-leaf function. The stack will already be aligned if you haven’t done any unbalanced pushes/pops, so you can just place the value at (%esp) in the same way that op_callc
does.
src/devices/cpu/drcbex64.cpp
Outdated
void break_into_debugger() | ||
{ | ||
static const char *message = "break from drc"; | ||
osd_break_into_debugger(message); | ||
} | ||
|
||
void drcbe_x64::op_break(Assembler &a, const instruction &inst) | ||
{ | ||
smart_call_r64(a, (x86code *)(uintptr_t)&osd_break_into_debugger, rax); | ||
a.int3(); | ||
smart_call_r64(a, (x86code *)(uintptr_t)&break_into_debugger, rax); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn’t need the trampoline function, e.g. see what op_callc
does:
void drcbe_x64::op_callc(Assembler &a, const instruction &inst)
{
// validate instruction
assert(inst.size() == 4);
assert_any_condition(inst);
assert_no_flags(inst);
// normalize parameters
const parameter &funcp = inst.param(0);
assert(funcp.is_c_function());
be_parameter paramp(*this, inst.param(1), PTYPE_M);
// skip if conditional
Label skip = a.newLabel();
if (inst.condition() != uml::COND_ALWAYS)
a.short_().j(X86_NOT_CONDITION(inst.condition()), skip); // jcc skip
// perform the call
mov_r64_imm(a, Gpq(REG_PARAM1), (uintptr_t)paramp.memory()); // mov param1,paramp
smart_call_r64(a, (x86code *)(uintptr_t)funcp.cfunc(), rax); // call funcp
// resolve the conditional link
if (inst.condition() != uml::COND_ALWAYS)
a.bind(skip); // skip:
}
Also, missing one const
for the pointer.
You should be able to do something like:
static char const *const message = "break from drc";
mov_r64_imm(a, Gpq(REG_PARAM1), (uintptr_t)message);
smart_call_r64(a, (x86code *)(uintptr_t)&break_into_debugger, rax);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned it in a different comment, but that's what I was doing for x64 but x86 was doing something a bit different so I thought it'd be easier to implement them the same way and leave it up to the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, should be:
static char const *const message = "break from drc";
mov_r64_imm(a, Gpq(REG_PARAM1), (uintptr_t)message);
smart_call_r64(a, (x86code *)(uintptr_t)&osd_break_into_debugger, rax);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. 7da0d8e
25b8ab9
to
7da0d8e
Compare
Just to make it more permanent for the PR, RB tested the PPC test again with the new SUBB implementation and it appears to have cleared up all of the failing tests: "ppctest passes 100% now with the C backend, same as it did with x64. Great!" |
I don’t have any other objections to the code change itself at this point. I do agree with @ajrhacker that people familiar with “real” CPU architectures are likely to be confused by using “H” in mnemonics to mean an instruction calculates the least significant part of the result. In general “real” architectures use “h”/“l” suffixes of some sort for instructions that yield the upper and lower parts of the result of an operation. |
I don't really have any particular preference for naming, so I'll change it to |
Do you actually need two instructions, though? The least significant half of the full multiplication result should be identical for signed and unsigned multiplication. |
The previous implementation wanted to calculate the status flags differently based on whether it was meant to be a 32x32=32 (where the output register for the lower half matched the upper half) or 32x32=64 (when the lower and upper half had different output registers) multiplication. For example, if the top half register wasn't specified in the MULU/MULS opcode then the sign bit would be set if the multiplication resulted in 0x00000001ffffffff and only the 0xffffffff is being returned. When the top half register was specified it would set the sign bit to 0 for the same input values. I think it warrants splitting off the commands because I think having status flags calculated based on only the lower 32-bits when the upper half's register isn't specified isn't all that explicit in what it's going to do, and that unintuitiveness resulted in the MULS opcode in the C backend being broken for like the last 12 years. |
src/devices/cpu/uml.cpp
Outdated
OPINFO3(MULUH, "!muluh", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Unsigned 32x32=32 and 64x64=64 multiplication (overflow set based on 32x32=64 calculation but zero and sign based on 32-bit result) | ||
OPINFO3(MULULW, "!mululw", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Unsigned 32x32=32 and 64x64=64 multiplication (overflow set based on 32x32=64 calculation but zero and sign based on 32-bit result) | ||
OPINFO4(MULS, "!muls", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Signed 32x32=64 and 64x64=128 multiplication | ||
OPINFO3(MULSH, "!mulsh", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Signed 32x32=32 and 64x64=64 multiplication (overflow set based on 32x32=64 calculation but zero and sign based on 32-bit result) | ||
OPINFO3(MULSLW, "!mulslw", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Signed 32x32=32 and 64x64=64 multiplication (overflow set based on 32x32=64 calculation but zero and sign based on 32-bit result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m pretty tired at the moment, but the comment doesn’t make sense: “overflow set based on 32x32=64 calculation”.
It shouldn’t be possible for a 32*32->64 multiply to overflow. The largest possible values will always fit in a 64-bit result.
For unsigned:
- 0xffff'ffff * 0xffff'ffff = 0xffff'fffe'0000'0001
For signed:
- 0x7fff'ffff * 0x7fff'ffff = 0x3fff'ffff'0000'0001
- 0x7fff'ffff * 0x8000'0000 = 0xC000'0000'8000'0000
- 0x8000'0000 * 0x8000'0000 = 0x4000'0000'0000'0000
So is “overflow” representing something other than arithmetic overflow? Because if it represents arithmetic overflow, it will always be zero, and the results will be the same for the signed and unsigned versions of this instruction (which is why e.g. PowerPC only has mullw
, since results for the lower half are identical for signed/unsigned, while it has mulhw
and mulhwu
because the upper half needs different handling).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going based off the x86 documentation because the flags are calculated the same way:
If the high-order bits of the product are 0, the CF and OF flags are cleared; otherwise, the flags are set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x86 flags are pure evil.
This is sign-extending and checking if the result is zero? Why not just sub
instead and not pushf at all?
…On Mon, Dec 23, 2024, 6:32 AM Vas Crabb ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/devices/cpu/drcbex64.cpp
<#13108 (comment)>:
> + a.mov(rcx, hireg);
+ a.cqo();
+
+ a.cmp(rcx, hireg);
+ a.pushfd();
+ a.pop(hireg);
+ a.and_(hireg, 0x40); // zero
+ a.xor_(hireg, 0x40);
+ a.shl(hireg, 5); // turn into overflow flag
+ a.or_(dword_ptr(rsp, 0), hireg);
+
+ a.popfd();
I used r10 and r11 as scratch registers to handle juggling register
values. I double checked
<https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170#x64-register-usage>
and for Windows r10 and r11 considered volatile so I think it should be
safe to use them. The header of the x64 DRC also says for Linux/MacOS the
r10 and r11 registers are volatile.
Yep, R10 and R11 are volatile in SysV and Windows calling conventions.
(R10 is static link on entry for nested functions in SysV, but that isn’t
relevant for us).
x86 uses pushfd/popfd a lot more and there aren't really any readily
available scratch registers to make it easier to implement so I don't think
it's worth the time to optimize that for x86. Someone else can handle that
if they think it's worth the effort though.
I agree, not worth the effort at this point. The stack only needs to be
aligned at the point of calls for i686, so isn’t overly important, just
room for optimisation.
—
Reply to this email directly, view it on GitHub
<#13108 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABWSFM5ALXNHLMPLHBV4BIL2HANG3AVCNFSM6AAAAABUBXIAVSVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDKMRQGY4TMMJYG4>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
@peterferrie You're commenting on old code that has since been changed. Did you read the code at all beyond the snippet shown in the email message? You need context to actually understand what the code was trying to do and that snippet didn't give you the required context. Having said that, I looked over the code again and did find a better way to simplify the logic for that specific part of the code. |
I was commenting on what was in front of me at the time. Now that I see
lahf/sahf being used for flags, I think you need to try that again.
…On Wed, Dec 25, 2024, 3:30 AM 987123879113 ***@***.***> wrote:
@peterferrie <https://github.com/peterferrie> You're commenting on old
code that has since been changed. Did you read the code at all beyond the
snippet shown in the email message? You need context to actually understand
what the code was trying to do and that snippet didn't give you the
required context.
Having said that, I looked over the code again and did find a better way
to simplify the logic for that specific part of the code.
—
Reply to this email directly, view it on GitHub
<#13108 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABWSFM5MT3E5IGGAUOP5QGD2HKJMZAVCNFSM6AAAAABUBXIAVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRRHA2DSMBSHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I've been working on a project involving the DRC lately and found several bugs and inconsistencies when trying to compare results so I tried to fix bugs and make things more consistent between all of the backends.
Most of the fixes are to mostly unused code paths so I'm not expecting it to fix many, if any, of the known DRC issues.
I'd appreciate it if people could test the DRC in various games or operating systems before merging so I'm only sending this as a draft PR for now. I've tested mostly SH-4 (Naomi) and PPC (Firebeat) games and I'm not noticing any differences.
a.test(ecx, ecx); // test ecx,ecx
. I didn't get them all because there's too many so mostly just on the code I touched. The comments almost 99% of the time are exactly the same as the assembly code it's beside so they aren't needed.