FCVT doesn't necessarily round to zero, so the result
might be inaccurate if we use it. To ensure correct
rounding, we use FCVTS from double FPR to 32-bit GPR.
Unfortunately, FCVTS can't do double FPR to single FPR.
Pretty much the same optimization we did for AVX, although slightly more
constrained because we're stuck with the two-operand instruction where
destination and source have to match.
We could also specialize the case where registers b, c, and d are all
distinct, but I decided against it since I couldn't find any game that
does this.
Before:
66 0F 57 C0 xorpd xmm0,xmm0
66 41 0F C2 C1 06 cmpnlepd xmm0,xmm9
41 0F 28 CE movaps xmm1,xmm14
66 41 0F 38 15 CC blendvpd xmm1,xmm12,xmm0
44 0F 28 F1 movaps xmm14,xmm1
After:
66 0F 57 C0 xorpd xmm0,xmm0
66 41 0F C2 C1 06 cmpnlepd xmm0,xmm9
66 45 0F 38 15 F4 blendvpd xmm14,xmm12,xmm0
AVX has a four-operand VBLENDVPD instruction, which allows for the first
input and the destination to be different. By taking advantage of this,
we no longer need to copy one of the inputs around and we can just
reference it directly, provided it's already in a register (I have yet
to see this not be the case).
Before:
66 0F 57 C0 xorpd xmm0,xmm0
F2 41 0F C2 C6 06 cmpnlesd xmm0,xmm14
41 0F 28 CE movaps xmm1,xmm14
66 41 0F 38 15 CA blendvpd xmm1,xmm10,xmm0
F2 44 0F 10 F1 movsd xmm14,xmm1
After:
66 0F 57 C0 xorpd xmm0,xmm0
F2 41 0F C2 C6 06 cmpnlesd xmm0,xmm14
C4 C3 09 4B CA 00 vblendvpd xmm1,xmm14,xmm10,xmm0
F2 44 0F 10 F1 movsd xmm14,xmm1
HostIsInstructionRAMAddress uses XCheckTLBFlag::OpcodeNoException,
so we should also use XCheckTLBFlag::OpcodeNoException when reading,
to ensure that we use the IBAT (as opposed to the DBAT) for both.
Doesn't support triggering interrupts when the thermal threshold is
exceeded, but allows polling for temperature information.
The THRM[123] registers are documented in most PPC datasheets, see e.g.
this PPC750CX one: http://datasheets.chipdb.org/IBM/PowerPC/750/750cx_um3-17-05.pdf
Remove the warning:
warning: offsetof within non-standard-layout type ‘JitBlock’ is conditionally-supported
JitBlock contains non-trival types now. Split the fields with trival
types that needs to be access from JIT code into JitBlockData structure.
Changed several enums from Memmap.h to be static vars and implemented Get functions to query them. This seems to have boosted speed a bit in some titles? The new variables and some previously statically initialized items are now initialized via Memory::Init() and the new AddressSpace::Init(). s_ram_size_real and the new s_exram_size_real in particular are initialized from new OnionConfig values "MAIN_MEM1_SIZE" and "MAIN_MEM2_SIZE", only if "MAIN_RAM_OVERRIDE_ENABLE" is true.
GUI features have been added to Config > Advanced to adjust the new OnionConfig values.
A check has been added to State::doState to ensure savestates with memory configurations different from the current settings aren't loaded. The STATE_VERSION is now 115.
FIFO Files have been updated from version 4 to version 5, now including the MEM1 and MEM2 sizes from the time of DFF creation. FIFO Logs not using the new features (OnionConfig MAIN_RAM_OVERRIDE_ENABLE is false) are still backwards compatible. FIFO Logs that do use the new features have a MIN_LOADER_VERSION of 5. Thanks to the order of function calls, FIFO logs are able to automatically configure the new OnionConfig settings to match what is needed. This is a bit hacky, though, so I also threw in a failsafe for if the conditions that allow this to work ever go away.
I took the liberty of adding a log message to explain why the core fails to initialize if the MIN_LOADER_VERSION is too great.
Some IOS code has had the function "RAMOverrideForIOSMemoryValues" appended to it to recalculate IOS Memory Values from retail IOSes/apploaders to fit the extended memory sizes. Worry not, if MAIN_RAM_OVERRIDE_ENABLE is false, this function does absolutely nothing.
A hotfix in DolphinQt/MenuBar.cpp has been implemented for RAM Override.
Utilizing constexpr, we can eliminate the need to construct the tables
at runtime and just do all the work at compile-time. Making for less
moving parts overall.
The general structure is more or less the same, however rather than one
single initialization function, each table is built off an immediately
executed lambda function. This is nice, since it narrows the scope of
the table building logic down to the tables that actually need it.
Similar to what we do for addx. Since we're calculating b - a and
because subtraction is not communitative, we can only apply this when
source register a holds the constant.
Before:
45 8B EE mov r13d,r14d
41 83 ED 08 sub r13d,8
After:
45 8D 6E F8 lea r13d,[r14-8]
We can get away with skipping the addition when we know we're dealing
with a constant zero. Just a MOV will suffice in this case.
Once again, we don't bother to add separate handling for when overflow
is needed, because no titles would ever hit that path during my testing.
Before:
8B 7D F8 mov edi,dword ptr [rbp-8]
83 C7 00 add edi,0
After:
8B 7D F8 mov edi,dword ptr [rbp-8]
ADD has a smaller encoding for immediates that can be expressed as an
8-bit signed integer (in other words, between -128 and 127). MOV lacks
this compact representation.
Since addition allows us to swap the source registers, we can always get
the shortest sequence here by carefully checking if we're dealing with a
small immediate first. If we are, move the other source into the
destination and add the small immediate onto that. For large immediates
the reverse is preferrable.
Before:
41 BE 40 00 00 00 mov r14d,40h
44 03 75 A8 add r14d,dword ptr [rbp-58h]
After:
44 8B 75 A8 mov r14d,dword ptr [rbp-58h]
41 83 C6 40 add r14d,40h
Before:
44 8B 7D F8 mov r15d,dword ptr [rbp-8]
41 81 C7 00 68 00 CC add r15d,0CC006800h
After:
41 BF 00 68 00 CC mov r15d,0CC006800h
44 03 7D F8 add r15d,dword ptr [rbp-8]
When the source registers are a simple register and a constant zero and
overflow isn't needed, emitting LEA is kinda silly.
This will occasionally save a single byte for certain registers due to
how x86 encoding works. More importantly, LEA takes up execution
resources while MOV does not.
Before:
41 8D 7D 00 lea edi,[r13]
After:
41 8B FD mov edi,r13d
When the destination register matches a source register, the other
source register contains zero, and overflow isn't needed, the
instruction becomes a nop and we don't need to emit anything.
We could add specialized handling for the case where overflow is needed,
but none of the titles I tried would hit this path.
Before:
83 C7 00 add edi,0
After:
No functional change, just simplify some repeated logic in the case
where we're dealing with exactly one immediate and one simple register
when overflow isn't needed.
The old logic would always emit LEA when both sources are in a register
and OE is disabled. However, ADD is still preferable when one of the
sources matches the destination.
Before:
45 8D 6C 35 00 lea r13d,[r13+rsi]
After:
44 03 EE add r13d,esi
"ppcState{}" is stored in the .data segment, which means the full ~4 MB
is stored in the executable.
"ppcState" is stored in the .bss segment, which means it only stores a
note that tells it to allocate and zero ~4 MB at runtime.
This was a huge speedup with disabled fastmem, but it still requires the fastmem arena.
So let's disable it for now, even if this commit has a huge performance hit with disabled fastmem.