Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to UML and DRC backends #13108

Draft
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

987123879113
Copy link
Contributor

I've been working on a project involving the DRC lately and found several bugs and inconsistencies when trying to compare results so I tried to fix bugs and make things more consistent between all of the backends.

Most of the fixes are to mostly unused code paths so I'm not expecting it to fix many, if any, of the known DRC issues.

I'd appreciate it if people could test the DRC in various games or operating systems before merging so I'm only sending this as a draft PR for now. I've tested mostly SH-4 (Naomi) and PPC (Firebeat) games and I'm not noticing any differences.

  • Added new opcodes
    • BREAK: Hardware breakpoint that can be caught in a debugger to see exactly what assembly is being generated at any given time. (Warning: will crash the program if used without a debugger attached so use carefully)
    • SETFLGS: Set the hardware flags to a specific state. Meant for debugging such as testing opcodes in various states
    • MULUH and MULSH: Unsigned and signed multiplication where the result is only the lower half and the status flags are calculated based on the lower half. Previously the MULU and MULS opcodes would change how status flags were calculated based on whether the upper half output destination was set to something different than the lower half output destination so I decided to split them off to make their intended usages and status register calculations more explicit.
  • General fixes
    • Standardized how shifts are calculated between the various backends so status flags and results should be the same now
  • UML fixes
    • Fixed broken multiplication and division in the simplifier
    • Fixed shift calculations in the simplifier to work the same as the backends
    • Changed GETFLGS definition such that it will not change status flags in the process of reading out the status flags (and made appropriate changes in backends)
  • C backend
    • Some opcodes wouldn't be recognized depending on the surrounding operation flag usages so I added the cases where I encountered unrecognized opcodes
    • Fixed various broken flag calculations
    • Fixed output when a shift is called with a shift of 0 (previously wouldn't set the output value at all)
  • x86 backend
    • Many of the 64-bit operations were outright broken so I fixed those up as much as I could
    • Fixes to status flag calculations
  • x64 backend
    • Fixed some unsafe code in the FNEG implementation that'd cause segfaults if the case was actually used anywhere (it wasn't)
    • Fixes to status flag calculations
  • Added some extra documentation in uml.cpp for opcodes with cryptic namings and some status flag behavior
  • Removed a lot of useless comments in the format of a.test(ecx, ecx); // test ecx,ecx. I didn't get them all because there's too many so mostly just on the code I touched. The comments almost 99% of the time are exactly the same as the assembly code it's beside so they aren't needed.

@ajrhacker
Copy link
Contributor

MULUH and MULSH: Unsigned and signed multiplication where the result is only the lower half

I think this nomenclature is a bit misleading. On the Xtensa architecture, instructions with exactly these mnemonics return the most significant 32 bits of the product.

@987123879113
Copy link
Contributor Author

I think this nomenclature is a bit misleading. On the Xtensa architecture, instructions with exactly these mnemonics return the most significant 32 bits of the product.

Do you have a better suggestion for naming then? I used the H to mean half here since it's only returning half of the result of the operation.

@ajrhacker
Copy link
Contributor

I would probably use 'L' (low) as the suffix instead, though that's unfortunately not unambiguous either because ARM uses 'L' for long multiply. Xtensa, for its part, has only one MULL instruction because, like some other RISC architectures, it doesn't bother detecting signed or unsigned overflow.

@987123879113
Copy link
Contributor Author

Playing off using L then: we don't need to keep the opcode names as minimal as possible especially since it's just a language used for MAME's DRC, so what about something like MULULOW/MULSLOW instead? I doubt that should clash with any existing arch's naming and it feels like it should be unambiguous as to what the opcodes are doing.

@rb6502
Copy link
Contributor

rb6502 commented Dec 22, 2024

Yours and current fail on my PPC opcode validator on the C backend, but not x64. Looks purely like flag update issues.

Mismatch: instr=SUBFMEO, src1=0x7fffffff, src2=0x0
expected: dest=0x7fffffff, XER=0xe0000000, CR=0x0
got: dest=0x7fffffff, XER=0x20000000, CR=0x0
opcode is 0x7C6305D0

Mismatch: instr=SUBFMEO., src1=0x7fffffff, src2=0x0
expected: dest=0x7fffffff, XER=0xe0000000, CR=0x50000000
got: dest=0x7fffffff, XER=0x20000000, CR=0x40000000
opcode is 0x7C6305D1

Mismatch: instr=SUBFZEO, src1=0x7fffffff, src2=0x0
expected: dest=0x80000000, XER=0x0, CR=0x0
got: dest=0x80000000, XER=0xc0000000, CR=0x0
opcode is 0x7C630590

Mismatch: instr=SUBFZEO., src1=0x7fffffff, src2=0x0
expected: dest=0x80000000, XER=0x0, CR=0x80000000
got: dest=0x80000000, XER=0xc0000000, CR=0x90000000
opcode is 0x7C630591

@987123879113
Copy link
Contributor Author

@rb6502 How can I run that validator? I have an idea of what might be happening but I need to verify it first.

@cuavas
Copy link
Member

cuavas commented Dec 23, 2024

BREAK: Hardware breakpoint that can be caught in a debugger to see exactly what assembly is being generated at any given time. (Warning: will crash the program if used without a debugger attached so use carefully)

You could call osd_break_into_debugger for this, which will check for an attached debugger before blowing up on macOS and Windows.

@cuavas
Copy link
Member

cuavas commented Dec 23, 2024

I think this nomenclature is a bit misleading. On the Xtensa architecture, instructions with exactly these mnemonics return the most significant 32 bits of the product.

Do you have a better suggestion for naming then? I used the H to mean half here since it's only returning half of the result of the operation.

The PowerPC instruction mnemonics for these kinds of operations are:

  • mullw (multiply low word) – Computes the least significant 32 bits of the 64-bit product of two 32-bit integers.
  • mulhw (multiply high word) – Computes the most significant 32 bits of the 64-bit product of two 32-bit integers.
  • mulhwu (multiply high word unsigned) – Computes the most significant 32 bits of the 64-bit product of two unsigned 32-bit integers.

It’s possibly just conditioning, but the PowerPC names feel logical for me.

Comment on lines 3748 to 3759
a.mov(rcx, hireg);
a.cqo();

a.cmp(rcx, hireg);
a.pushfd();
a.pop(hireg);
a.and_(hireg, 0x40); // zero
a.xor_(hireg, 0x40);
a.shl(hireg, 5); // turn into overflow flag
a.or_(dword_ptr(rsp, 0), hireg);

a.popfd();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are potential issues with storing flags on the stack:

  • x86-64 ABIs expect the stack pointer to be static and remain 16-byte aligned outside function prologues and epilogues. Violating this can cause problems with signal handler and interrupts.
  • pushfd and popfd tend to cause pipeline stalls and hurt performance.

Can you use lahf and sahf instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x86-64 ABIs expect the stack pointer to be static and remain 16-byte aligned outside function prologues and epilogues. Violating this can cause problems with signal handler and interrupts.

Is that really an issue here compared to any other code using a form of push/pop? I was careful about adding a corresponding pop (or stack pointer adjustment using lea) for every push I added so the stack should always end up in the same place it started.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not that the stack needs to be left where it started, it’s that the stack pointer must not move outside function prologues and epilogues. You just don’t push/pop in the main bodies of functions on x86-64.

This kind of requirement isn’t unique to x86-64 ABIs. PowerPC ABIs require R1 to always be aligned and always point to the previous R1 value (so you need to use stwu or stdu to create the stack frame, and can’t move the stack pointer in the main function body) and AArch64 enforces stack alignment in hardware.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

25b8ab9

I used r10 and r11 as scratch registers to handle juggling register values. I double checked and for Windows r10 and r11 considered volatile so I think it should be safe to use them. The header of the x64 DRC also says for Linux/MacOS the r10 and r11 registers are volatile.

x86 uses pushfd/popfd a lot more and there aren't really any readily available scratch registers to make it easier to implement so I don't think it's worth the time to optimize that for x86. Someone else can handle that if they think it's worth the effort though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used r10 and r11 as scratch registers to handle juggling register values. I double checked and for Windows r10 and r11 considered volatile so I think it should be safe to use them. The header of the x64 DRC also says for Linux/MacOS the r10 and r11 registers are volatile.

Yep, R10 and R11 are volatile in SysV and Windows calling conventions. (R10 is static link on entry for nested functions in SysV, but that isn’t relevant for us).

x86 uses pushfd/popfd a lot more and there aren't really any readily available scratch registers to make it easier to implement so I don't think it's worth the time to optimize that for x86. Someone else can handle that if they think it's worth the effort though.

I agree, not worth the effort at this point. The stack only needs to be aligned at the point of calls for i686, so isn’t overly important, just room for optimisation.

Comment on lines 422 to +431
// this assumes that the flags from the low 32-bit op are on the stack
// and the flags from the high 32-bit op are live
a.pushfd(); // pushf
a.mov(ecx, ptr(esp, 4)); // mov ecx,[esp+4]
a.or_(ecx, ~0x40); // or ecx,~0x40
a.and_(ptr(esp, 0), ecx); // and [esp],ecx
a.popfd(); // popf
a.lea(esp, ptr(esp, 4)); // lea esp,[esp+4]
a.pushfd();

a.mov(ecx, dword_ptr(esp, 4)); // zero flag
a.or_(ecx, ~0x40);
a.and_(dword_ptr(esp, 0), ecx);

a.popfd();
a.lea(esp, ptr(esp, 4));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more complex to do given the code is already structured around using the stack, but this would get a performance benefit from avoiding pushfd/popfd, too.

@987123879113
Copy link
Contributor Author

I rewrote the logic for the C backend's SUBB (32-bit and 64-bit) to match the results of x86/x64. I checked the i386 CPU's SBB32 implementation and that was using a uint64_t to check for the carry flag which makes sense for 32-bit but wouldn't work for the 64-bit version of SUBB, so I ended up just checking for the specific case where x+carry would result in 0 when carry was non-0 (for the x=0, carry=0 case). The results between x86, x64, C backend, and ARM (sbcs) are all agreeing with the results of a few edge cases I tested against, including the above mention 0xffffffff - 0x7fffffff with carry set. If the PPC instruction still fails in the test program then I think it needs to be handled in the frontend.

@987123879113
Copy link
Contributor Author

You could call osd_break_into_debugger for this, which will check for an attached debugger before blowing up on macOS and Windows.

Never seen that function before. I tested it out and it worked when I had x32dbg/x64dbg attached to mame.exe and did nothing without the debugger attached, so definitely an improvement. Thanks.

@@ -1484,7 +1484,7 @@ void drcbe_x64::op_nop(Assembler &a, const instruction &inst)

void drcbe_x64::op_break(Assembler &a, const instruction &inst)
{
a.int3();
smart_call_r64(a, (x86code *)(uintptr_t)&osd_break_into_debugger, rax);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to put a pointer to a string for the function argument in RCX/RDI (depending on the calling convention) for the argument to osd_break_into_debugger?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeaaah completely forgot it took a parameter.

I ended up just implementing trampoline functions because these commands will be called very infrequently so performance isn't an issue, and it lets the compiler handle anything that it needs to. For x86 in particular it was doing stack pointer adjustments in the prologue and epilogue instead of just putting it at esp like I implemented initially, so it seems the safer choice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trampoline for x86 needs to do that to keep the stack aligned because it’s a non-leaf function. The stack will already be aligned if you haven’t done any unbalanced pushes/pops, so you can just place the value at (%esp) in the same way that op_callc does.

src/devices/cpu/drcbex86.cpp Outdated Show resolved Hide resolved
Comment on lines 1594 to 1599
void break_into_debugger()
{
static const char *message = "break from drc";
osd_break_into_debugger(message);
}

void drcbe_x64::op_break(Assembler &a, const instruction &inst)
{
smart_call_r64(a, (x86code *)(uintptr_t)&osd_break_into_debugger, rax);
a.int3();
smart_call_r64(a, (x86code *)(uintptr_t)&break_into_debugger, rax);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn’t need the trampoline function, e.g. see what op_callc does:

void drcbe_x64::op_callc(Assembler &a, const instruction &inst)
{
    // validate instruction
    assert(inst.size() == 4);
    assert_any_condition(inst);
    assert_no_flags(inst);

    // normalize parameters
    const parameter &funcp = inst.param(0);
    assert(funcp.is_c_function());
    be_parameter paramp(*this, inst.param(1), PTYPE_M);

    // skip if conditional
    Label skip = a.newLabel();
    if (inst.condition() != uml::COND_ALWAYS)
        a.short_().j(X86_NOT_CONDITION(inst.condition()), skip);                        // jcc   skip

    // perform the call
    mov_r64_imm(a, Gpq(REG_PARAM1), (uintptr_t)paramp.memory());                        // mov   param1,paramp
    smart_call_r64(a, (x86code *)(uintptr_t)funcp.cfunc(), rax);                        // call  funcp

    // resolve the conditional link
    if (inst.condition() != uml::COND_ALWAYS)
        a.bind(skip);                                                               // skip:
}

Also, missing one const for the pointer.

You should be able to do something like:

    static char const *const message = "break from drc";
    mov_r64_imm(a, Gpq(REG_PARAM1), (uintptr_t)message);
    smart_call_r64(a, (x86code *)(uintptr_t)&break_into_debugger, rax);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned it in a different comment, but that's what I was doing for x64 but x86 was doing something a bit different so I thought it'd be easier to implement them the same way and leave it up to the compiler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should be:

    static char const *const message = "break from drc";
    mov_r64_imm(a, Gpq(REG_PARAM1), (uintptr_t)message);
    smart_call_r64(a, (x86code *)(uintptr_t)&osd_break_into_debugger, rax);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. 7da0d8e

@987123879113
Copy link
Contributor Author

Just to make it more permanent for the PR, RB tested the PPC test again with the new SUBB implementation and it appears to have cleared up all of the failing tests: "ppctest passes 100% now with the C backend, same as it did with x64. Great!"

@cuavas
Copy link
Member

cuavas commented Dec 23, 2024

I don’t have any other objections to the code change itself at this point.

I do agree with @ajrhacker that people familiar with “real” CPU architectures are likely to be confused by using “H” in mnemonics to mean an instruction calculates the least significant part of the result. In general “real” architectures use “h”/“l” suffixes of some sort for instructions that yield the upper and lower parts of the result of an operation.

@987123879113
Copy link
Contributor Author

I don't really have any particular preference for naming, so I'll change it to MULULW and MULSLW to keep a more familiar naming while also keeping it tied to the existing MULU and MULS commands in the DRC.

@cuavas
Copy link
Member

cuavas commented Dec 23, 2024

I don't really have any particular preference for naming, so I'll change it to MULULW and MULSLW to keep a more familiar naming while also keeping it tied to the existing MULU and MULS commands in the DRC.

Do you actually need two instructions, though? The least significant half of the full multiplication result should be identical for signed and unsigned multiplication.

@987123879113
Copy link
Contributor Author

Do you actually need two instructions, though? The least significant half of the full multiplication result should be identical for signed and unsigned multiplication.

The previous implementation wanted to calculate the status flags differently based on whether it was meant to be a 32x32=32 (where the output register for the lower half matched the upper half) or 32x32=64 (when the lower and upper half had different output registers) multiplication. For example, if the top half register wasn't specified in the MULU/MULS opcode then the sign bit would be set if the multiplication resulted in 0x00000001ffffffff and only the 0xffffffff is being returned. When the top half register was specified it would set the sign bit to 0 for the same input values.

I think it warrants splitting off the commands because I think having status flags calculated based on only the lower 32-bits when the upper half's register isn't specified isn't all that explicit in what it's going to do, and that unintuitiveness resulted in the MULS opcode in the C backend being broken for like the last 12 years.

Comment on lines 180 to 182
OPINFO3(MULUH, "!muluh", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Unsigned 32x32=32 and 64x64=64 multiplication (overflow set based on 32x32=64 calculation but zero and sign based on 32-bit result)
OPINFO3(MULULW, "!mululw", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Unsigned 32x32=32 and 64x64=64 multiplication (overflow set based on 32x32=64 calculation but zero and sign based on 32-bit result)
OPINFO4(MULS, "!muls", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Signed 32x32=64 and 64x64=128 multiplication
OPINFO3(MULSH, "!mulsh", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Signed 32x32=32 and 64x64=64 multiplication (overflow set based on 32x32=64 calculation but zero and sign based on 32-bit result)
OPINFO3(MULSLW, "!mulslw", 4|8, false, NONE, SZV, ALL, PINFO(OUT, OP, IRM), PINFO(IN, OP, IANY), PINFO(IN, OP, IANY)) // Signed 32x32=32 and 64x64=64 multiplication (overflow set based on 32x32=64 calculation but zero and sign based on 32-bit result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m pretty tired at the moment, but the comment doesn’t make sense: “overflow set based on 32x32=64 calculation”.

It shouldn’t be possible for a 32*32->64 multiply to overflow. The largest possible values will always fit in a 64-bit result.

For unsigned:

  • 0xffff'ffff * 0xffff'ffff = 0xffff'fffe'0000'0001

For signed:

  • 0x7fff'ffff * 0x7fff'ffff = 0x3fff'ffff'0000'0001
  • 0x7fff'ffff * 0x8000'0000 = 0xC000'0000'8000'0000
  • 0x8000'0000 * 0x8000'0000 = 0x4000'0000'0000'0000

So is “overflow” representing something other than arithmetic overflow? Because if it represents arithmetic overflow, it will always be zero, and the results will be the same for the signed and unsigned versions of this instruction (which is why e.g. PowerPC only has mullw, since results for the lower half are identical for signed/unsigned, while it has mulhw and mulhwu because the upper half needs different handling).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going based off the x86 documentation because the flags are calculated the same way:
If the high-order bits of the product are 0, the CF and OF flags are cleared; otherwise, the flags are set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x86 flags are pure evil.

@peterferrie
Copy link
Member

peterferrie commented Dec 25, 2024 via email

@987123879113
Copy link
Contributor Author

@peterferrie You're commenting on old code that has since been changed. Did you read the code at all beyond the snippet shown in the email message? You need context to actually understand what the code was trying to do and that snippet didn't give you the required context.

Having said that, I looked over the code again and did find a better way to simplify the logic for that specific part of the code.

@peterferrie
Copy link
Member

peterferrie commented Dec 26, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants