The conditions were in reverse order (maybe someone was reading the
PowerPC manual and forgot about IBM's bit numbering), and additionally
the two conditions for unsigned comparison were wrong.
Fixes https://bugs.dolphin-emu.org/issues/14054.
Accomplished using `run-clang-tidy` with `performance-move-const-arg,performance-unnecessary-value-param,modernize-pass-by-value`.
Changed arguments to const references, removed them where inappropriate (e.g. sink parameters). Same with std::move.
Manually reviewed each change to make sure that it makes sense, and do something more appropriate if possible.
Unmapped on the physical level, not the MMU level.
Fixes booting Game Boy Interface. Previously, Game Boy Interface thought
it was running on a Wii because accessing MEM2 didn't raise a PI
interrupt, and as a result tried to exit to the Homebrew Channel in a
way Dolphin's HLE doesn't recognize. (Dolphin's HLE catches jumps to
0x80001800, but GBI is running without address translation at this point
and therefore jumps to 0x00001800 instead.)
Jit64::dcbz's fast path bypasses the dcache, so we shouldn't use it if
accurate dcache is turned on. This fixes the graphical corruption that
would occur in Mario Kart Wii's menu FMVs with accurate dcache.
JitArm64 never had this problem, because it implements dcbz in a
different way. It calls EmitBackpatchRoutine, which already has a check
for accurate dcache.
Some users seem to be under the impression that the panic alert is
saying that enabling MMU will fix the issue, but that's not what it
actually says. Let's try to make this a bit clearer.
This makes JitBaseBlockCache::ErasePhysicalRange around 50% faster and
PPCAnalyzer::Analyze around 40% faster. Rogue Squadron 2's notoriously
laggy action of switching to and from cockpit view is made something
like 20-30% faster by this, though this is a very rough measurement.
This is a very small libary, and as I understand it, it was more or less
developed for Dolphin.
This moves the two relevant files from Externals to Common, changes the
namespace to Common, reformats the code, and adds Dolphin copyright
notices. The change in copyright notice and license was approved by
AdmiralCurtiss.
JIT code related to Branch Watch was emitted if the debugging UI was active: the emitted code would dynamically check whether Branch Watch is active.
However, this causes two problems:
1. It decreases performance by just having the debugging UI enabled
2. It clutters the host assembly in the JIT tab, making it harder to read (unaware readers will wonder what these instructions are for)
With this PR, code related to Branch Watch is emitted only if Branch Watch itself is active, fixing the issues above.
The JIT cache will now be wiped whenever the feature is toggled, causing a slight stutter. However, this isn't the kind of feature that is toggled over and over, so IMO it is an acceptable trade-off.
Removing and readding every page table mapping every time something
changes in the page table is very slow. Instead, let's generate a diff
and ask Memmap to update only the diff.
Page table mappings are only used when DR is set, so if page tables are
updated when DR isn't set, we can wait with updating page table mappings
until DR gets set. This lets us batch page table updates in the Disney
Trio of Destruction, improving performance when the games are loading
data. It doesn't help much for GameCube games, because those run tlbie
with DR set.
The PowerPCState struct has had its members slightly reordered. I had to
put pagetable_update_pending less than 4 KiB from the start so AArch64's
LDRB (immediate) can access it, and I also took the opportunity to move
some other members around to cut down on padding.
Previously we've only been setting up fastmem mappings for block address
translation, but now we also do it for page address translation. This
increases performance when games access memory using page tables, but
decreases performance when games set up page tables.
The tlbie instruction is used as an indication that the mappings need to
be updated.
There are some accuracy downsides:
* The TLB is now effectively infinitely large, which matters if games
don't use tlbie when modifying page tables.
* The R and C bits for page table entries get set pessimistically rather
than when the page is actually accessed.
No games are known to be broken by these inaccuracies, but unfortunately
the second inaccuracy causes a large performance regression in Rogue
Squadron 3. You still get the old, more accurate behavior if Enable
Write-Back Cache is on.
Yellow squiggly lines begone!
Done automatically on .cpp files through `run-clang-tidy`, with manual corrections to the mistakes.
If an import is directly used, but is technically unnecessary since it's recursively imported by something else, it is *not* removed.
The tool doesn't touch .h files, so I did some of them by hand while fixing errors due to old recursive imports.
Not everything is removed, but the cleanup should be substantial enough.
Because this done on Linux, code that isn't used on it is mostly untouched.
(Hopefully no open PR is depending on these imports...)
If all inputs to an fmadds instruction (including cousins like fmsubs,
fnmadd...) are single-precision, then the result is identical between a
double-precision calculation with an error-free transform (whether the
calculation is fused or not) and a single-precision FMA instruction
(must be fused). So as a performance optimization in JitArm64, if we
were going to use double precision with EFT but the inputs are singles,
instead we'll use a normal single-precision FMA instruction without
anything extra. This lets us skip both the EFT and double-to-single
conversions.
Also renaming `inaccurate_fma` to `nonfused` because it's confusing that
`inaccurate_fma` and `m_accurate_fmadds` have such similar names
despite controlling separate things.
If result_reg is set to a temporary register instead of VD because of
accurate NaNs, there's no need to allocate a secondary temporary
register because of inaccurate FMA.
When we're emulating single-precision FMA using an FMA instruction,
there's no precision benefit from using a double-precision instruction,
assuming all inputs are single-precision. But when we're emulating
single-precision FMA using separate multiplication and addition
instructions, there is.
This change increases the precision of inaccurate FMA to the same level
as Jit64, which matters since the only reason we have the inaccurate
FMA mode is for sync compatibility with Jit64.
Combined with the previous commit, this brings the TrampolineInfo struct
down to 48 bytes. This matters, because Jit64 has a big
std::unordered_map where it stores many megabytes of TrampolineInfo
entries.
The key saving comes from shrinking the len member from u32 to u16. It
should be safe to even turn it into a u8, but going that far brings no
additional savings due to how the padding works out.
If we're on an x64 CPU that doesn't have the MOVBE extension, trying to
SwapAndStore a host register results in that register's value getting
clobbered with the swapped value. Jit64::stX and Jit64::stXx detect this
case, and if necessary, emit a MOV to a register that's fine to clobber.
This logic was broken by the merge of PR 12134. Jit64::stX and
Jit64::stXx were assuming that if RegCache::IsImm returns true for a
guest register, calling RegCache::Use or RegCache::BindOrImm for that
guest register would result in an immediate. However, PR 12134 made it
possible for a guest register to have both a host register and an
immediate in the register cache at the same time. When this happens,
RegCache::IsImm returns true, yet RegCache::Use and RegCache::BindForImm
return an RCOpArg whose Location returns a host register. (To make it
extra confusing, RCOpArg::IsImm calls RegCache::IsImm if the RCOpArg
came from RegCache, so RCOpArg::IsImm returns true!)
To fix this, in cases where Jit64::stX and Jit64::stXx explicitly need
an immediate to avoid having to emit an extra MOV, let's call
RegCache::Imm32 so that we're certain that we're getting an immediate.
This fixes an issue on older x64 CPUs that manifested as e.g. completely
broken graphics in Spyro: Enter the Dragonfly.
The constant propagation PR made it so that a guest register can be
present in the register cache as both a host register and an immediate
at the same time. If such a guest register is requested from the
register cache, the register cache prefers returning it as a host
register. However, RCOpArg::IsImm still returns true in this case. This
is confusing, especially since OpArg::IsImm does not return true if the
RCOpArg is converted into an OpArg.
This commit makes RCOpArg::IsImm check whether RCOpArg::Location returns
an immediate, so that RCOpArg::IsImm returns false when a host register
is being used. Code that wants to know whether an immediate exists in
the register cache rather than whether an immediate is currently being
used should call RegCache::IsImm instead.
We have an optimization where the guest carry flag is kept in the host
carry flag between certain back-to-back pairs of integer instructions.
If the second instruction falls back to the interpreter, then
FallBackToInterpreter should flush the carry flag to m_ppc_state,
otherwise the interpreter reads a stale carry flag and at some later
point Jit64 trips the "Attempt to modify flags while flags locked!"
assertion.
An alternative solution would be to not store the guest carry flag in
the host carry flag to begin with if we know the next instruction is
going to fall back to the interpreter, but knowing that in advance is
non-trivial. Since interpreter fallbacks aren't exactly intended to be
super optimized, I went for the flushing solution instead, which is how
JitArm64 already works. In most cases, the emitted code shouldn't even
differ between these two solutions.
Note that the problematic situation only happens if the first integer
instruction doesn't fall back to the interpreter but the second one
does. This used to be impossible because there's no "JIT disable"
setting that's granular enough to disable some integer instructions but
not all, but with the constant propagation PR, it's possible if constant
propagation is able to entirely evaluate the first instruction but not
the second.
Call `UseNoImm` instead of `Use` on parameter `a` of `MultiplyImmediate`
since `Ra` gets passed to `IMUL` which asserts that parameter is not an
immediate.
The optimizations for subfcx introduced in #13852 also apply to subfx.
Rather than duplicating the logic, we merge the handlers, like we did
in #10120 for x86.
When BindToRegister is called, the register cache marks the relevant
guest register as no longer containing an immediate. However, subfcx was
calling GetImm after BindToRegister. This led to a lot of panic alerts
after 2995aa5be4 added an assert to GetImm to check that the passed-in
register is an immediate.
Both before and after 2995aa5be4, the actual value of the immediate
wasn't overwritten by BindForRegister, only the fact that the register
is an immediate. Because of this, the emitted code happened to work
correctly.
Like Jit64, JitArm64 now keeps track of the location of a guest register
using three booleans: Whether it is in ppcState, whether it is in a host
register, and whether it is a known immediate. The RegType enum remains
only for the purpose of keeping track of what format FPRs are stored in
in host registers.