The main slow-down with the C65GS is the wait-state on reading chipram. I had tried various ways to suppress the wait-state at its root cause in the FPGA dual-ported block RAM, without luck. Then it occurred to me this morning that I could make a single-port shadow RAM that shadows all of chipram. Writing to chipram then writes to both, and reads by the CPU are sourced from the shadow RAM, with no wait-state.
So, as a reminder, here is the state of affairs before today's improvements:
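To make the shadow RAM idea concrete, here is a minimal software sketch of the behaviour. It is only a toy model in Python, not the actual FPGA design, and the class and method names (`ShadowedChipRam`, `cpu_read`, `video_read`) are made up for illustration. It assumes the dual-ported chipram stays in place for whatever else reads it, while the CPU only ever reads the shadow copy.

```python
class ShadowedChipRam:
    """Toy model of chipram plus a shadow copy (names are hypothetical)."""

    def __init__(self, size):
        # The original dual-ported RAM; reading it costs the CPU a wait-state.
        self.chipram = bytearray(size)
        # The single-port shadow copy that the CPU reads instead.
        self.shadow = bytearray(size)

    def write(self, addr, value):
        # Every write goes to both RAMs, so the two can never disagree.
        value &= 0xFF
        self.chipram[addr] = value
        self.shadow[addr] = value

    def cpu_read(self, addr):
        # CPU reads are served from the shadow copy (no wait-state in hardware).
        return self.shadow[addr]

    def video_read(self, addr):
        # Assumed here: other readers keep using the original chipram.
        return self.chipram[addr]


ram = ShadowedChipRam(128 * 1024)
ram.write(0x0400, 0x41)
print(hex(ram.cpu_read(0x0400)), hex(ram.video_read(0x0400)))
```

The trade-off is the obvious one: the design spends a second copy of the memory to buy a faster read path, and it only works because every write is mirrored.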
Removing the wait state on chipram by implementing the shadow RAM had quite a nice impact:
Function calls are about 30% faster, and RAM operations in general are all moderately improved, as might be expected. This also made BoulderMark quite a bit faster.
In the process I realised what should have been obvious to me: implied/accumulator mode single-byte instructions were still taking two cycles, and could easily be reduced to one cycle. This makes NOPs run at an amazing 71x, and pushes the overall rating up a little to 26.9x:
BoulderMark now indicates just over 55x. I am still at a loss as to why the machine is so much faster than a stock C64 for BoulderMark, but the same phenomenon is visible with the latest version of the Chameleon, which gets a rating of around 14,000 (see http://wiki.icomp.de/wiki/C64_Benchmarks). That's a mystery that will have to remain for now.
In the meantime, I have a couple more ideas to improve performance that I will try.