Friday, January 5, 2018

Curious problems and that Super Games banked cartridge

While testing cartridges I am sometimes hitting a rather strange set of symptoms if I have a cartridge plugged in.  What makes it especially weird is that it only happens if the cartridge is an Ultimax cartridge, and that the problem persists after the cartridge has been removed, until the board has been powered off for a few seconds.

1. Kickstart gets stuck loading the C65 ROM with all sorts of SD card read-errors. Or it could be that Kickstart is not running correctly in some way.

2. Kickstart thinks that some of the Nexys 4's switches are set, even though this is on a MEGA65 r1 PCB, which has those input lines wired to ground in the VHDL, i.e., it is not possible to set those lines at all on this machine.

I am not sure if it is related, but the USB serial interface also stops working, or stops sending one particular hexadecimal digit, typically one that has several one bits in a row in its ASCII representation, presumably because of a timing error.
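To make the "several one bits in a row" idea concrete, here is a small sketch that lists the longest run of one bits in the ASCII code of each hexadecimal digit. It assumes the monitor sends upper-case digits (as in its output), and it looks only at the eight data bits, ignoring the start and stop bits on the wire. It is an illustration only, and does not say which digit actually goes missing.

```python
def longest_run_of_ones(value: int) -> int:
    """Return the length of the longest run of consecutive 1 bits in value."""
    longest = current = 0
    while value:
        if value & 1:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
        value >>= 1
    return longest


def runs_for_hex_digits(digits: str = "0123456789ABCDEF") -> dict:
    """Map each hex digit character to the longest run of ones in its ASCII code."""
    return {d: longest_run_of_ones(ord(d)) for d in digits}


if __name__ == "__main__":
    for digit, run in runs_for_hex_digits().items():
        print(f"{digit} = {ord(digit):08b}  longest run of ones: {run}")
```

By this simple measure, 7, 8 and 9 have the longest runs (three bits each), while A, B, D and E have no adjacent one bits at all.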

The serial interface problems suggest to me that the problem might be timing related.  The SD card problems could also be timing related, for example, with the FPGA generating clock speeds that are too high or too low, or just plain jittery. The FPGA design meets total timing closure, so timing closure in the design can be eliminated as a potential cause.

That this should be caused by the presence of a particular type of C64 cartridge is rather odd.

Then sometimes I see the board think that the switches are set while booting, but when I check them after exit from the Hypervisor, they have cleared.  This has happened specifically with the Super Games cartridge. My gut feeling is that this is a useful piece of information, and that it will be very useful to figure out when the switch inputs go back to all zeros (even though, as I say, they ought to be tied to ground in the VHDL in the first place).

Ah, now I have something interesting:  The incorrect reading of these two (and several other nearby) registers happens only when in the Hypervisor (indicated by the H- at the end of the register information lines), as the little log of serial monitor interaction below shows. This would also explain the SD card errors, since the problem seems to be reading strange values from the SD card registers, rather than actual problems with the SD card.

First, we start with the MEGA65 in the hypervisor. By writing to $D67F, we trigger exit from the hypervisor back to normal operation (now with -- at the end of the register information lines). Examining the contents of the switch registers ($D6F0 and $D6F1) we see that they are correctly holding zero:

80FC 11 22 33 00 BF BEFF 4000 3F00 AD 7C D6 24 00 ..E..I... ..P 11 -FF H-

8200 11 22 33 00 BF BEFF 4000 3F00 4C 00 82 24 00 ..E..I... ..P 11 R01 H-
.Sd67f 1


8100 11 22 33 00 00 01FF 4000 8F00 4C 00 82 24 00 ..E..I... ..P 11 -FF --
 :777D6F0 00 00 1F E1 FF A0 00 00 00 02 00 7F 7F 80 00 C0

We then switch back to the hypervisor by writing to $D67F again. This time we have to provide the full address, as the Hypervisor trap registers are not visible normally. Now we get a clue: The serial monitor hit a timeout waiting for the CPU to respond.  The CPU is either stuck doing something, or takes longer to respond than it should. Much, much longer, i.e., 65,535 or more cycles instead of no more than about 100 (2x 1MHz cycles), even in the worst case scenario where the memory access is to the cartridge interface, and has just missed the last rising edge of the 1MHz clock.

8102 11 22 33 00 00 01FF 4000 8F00 44 45    26 00 ..E..IZ.. ..P 11 -00 --
.sffd367f 1
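To put those cycle counts in perspective, here is a quick back-of-the-envelope sketch. The 50MHz CPU clock is an assumption, inferred from the "about 100 cycles for 2x 1MHz cycles" figure above. The function names are made up for this example.

```python
CPU_HZ = 50_000_000   # assumed CPU clock, inferred from 100 cycles = 2x 1MHz cycles
SLOW_HZ = 1_000_000   # the 1MHz cartridge-side clock


def worst_case_cycles(cpu_hz: int, slow_hz: int, slow_cycles: int = 2) -> int:
    """CPU cycles spent waiting for slow_cycles periods of the slow clock."""
    return slow_cycles * (cpu_hz // slow_hz)


def cycles_to_microseconds(cycles: int, cpu_hz: int) -> float:
    """Convert a CPU cycle count to microseconds at the given clock rate."""
    return cycles * 1_000_000 / cpu_hz


if __name__ == "__main__":
    expected = worst_case_cycles(CPU_HZ, SLOW_HZ)
    timeout = 65_535
    print("worst-case cartridge access:", expected, "cycles")
    print("timeout threshold:", cycles_to_microseconds(timeout, CPU_HZ), "microseconds")
    print("timeout is roughly", timeout // expected, "times the worst case")
```

Under that assumption, the timeout corresponds to about 1.3 milliseconds, several hundred times longer than even the slowest legitimate cartridge access.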


However, we do enter the Hypervisor with this, but then we see that our problem of non-zero values in these registers returns. Most curiously, it takes a little while for the problem to fully recur. We also see the 04's turning up in other registers where they shouldn't.


80FC 11 22 33 00 BF BEFF 4000 3F00 44 45    26 00 ..E..IZ.. ..P 11 -FF H-
 :777D6F0 00 04 1F E5 FF A4 04 04 00 04 00 7F 7F 80 00 C4
 :777D6F0 04 04 1F E5 FF A4 04 04 00 06 00 7F 7F 80 00 C4
 :777D6F0 04 04 1F E5 FF A4 04 04 04 04 04 7F 7F 80 00 C4
 :777D6F0 04 04 1F E5 FF A4 04 04 00 06 00 7F 7F 80 00 C4
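One way to see exactly what has changed is to XOR the good dump from normal mode against one of the bad dumps from Hypervisor mode. The sketch below does that for the first bad line. The dump strings are copied from the logs above, and the helper names are just for this example.

```python
def parse_dump(line: str) -> list:
    """Parse a monitor memory line like ':777D6F0 00 04 ...' into byte values."""
    fields = line.split()
    return [int(field, 16) for field in fields[1:]]  # skip the address field


def changed_bits(good: list, bad: list) -> list:
    """XOR corresponding bytes so that only the differing bits remain set."""
    return [g ^ b for g, b in zip(good, bad)]


GOOD = ":777D6F0 00 00 1F E1 FF A0 00 00 00 02 00 7F 7F 80 00 C0"
BAD = ":777D6F0 00 04 1F E5 FF A4 04 04 00 04 00 7F 7F 80 00 C4"

if __name__ == "__main__":
    diff = changed_bits(parse_dump(GOOD), parse_dump(BAD))
    print(" ".join(f"{x:02X}" for x in diff))
    # prints: 00 04 00 04 00 04 04 04 00 06 00 00 00 00 00 04
```

Every byte that differs has bit 2 ($04) flipped. The one byte that differs by $06 is at offset 9, which also varies between the bad dumps themselves, so it may simply be a register that changes on its own.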

This all makes me think that there is something funny in the CPU's IO reading logic, which is presumably tickled in some way by the presence of a cartridge. As Hypervisor mode changes the behaviour, it must be some logic that is differentially handled between Hypervisor mode and normal mode.

I tried removing the cartridge, so that the cartridge control lines would float back high, and thus tell the MEGA65 that it doesn't have a cartridge inserted anymore. However, that didn't help.  It did show me that my cartridge line probing function fails to properly charge the lines high before re-scanning, which could cause problems for things like the Action Replay, which fiddle with those lines to change cartridge modes mid-stream.

Looking through the CPU's address resolution logic, there isn't any obvious smoking gun for the cause of this problem. What I need to do next is to see if I can reproduce the problem in simulation.  I also have a suspicion that when the problem is happening, the CPU is reading from the cartridge port when accessing these IO locations, or at least thinks it is in some kind of half-hearted way. However, I have been able to confirm that the CPU doesn't take longer in either mode than the other, so there is no reason to suspect that the CPU thinks it is accessing the cartridge port in these cases.

What is also really curious, is that this problem survives FPGA reconfigurations, which are supposed to clear all internal state in the FPGA.  This is even when I have removed the cartridge before triggering the reconfiguration, so the newly loaded FPGA bitstream has no idea that a cartridge had even been inserted.

A brief power off (~0.5 seconds) also doesn't fix it. In fact, removal of power for at least 5 seconds seems to be required!  So where is the state that is being preserved? Even if Xilinx ISE were generating a bad bitstream, it would presumably not be able to cause the keeping of state in this kind of way. And yet there aren't really any external components that seem likely candidates for holding such state for so long.

Anyway, the next step is indeed to see if I can't get this problem to show up in simulation in some way.

My best guess otherwise is that one of the expansion pins is the cause, through the lack of pull-up resistors causing signals to be interpreted as asserted, when they are not supposed to be.

Meanwhile, I had some previous tweaks synthesising, including using only phi2 for cartridge access for the CPU. Trying those with the Super Games cartridge, after a couple of false starts, it is now working in its entirety.  At first, only Silicon Syborgs would start, but after reinserting it a couple of times, all games were working.  So that one might have been a case of contacts still being a bit dodgy after cleaning with alcohol. This is quite possible, since alcohol won't remove corrosion, and this is the cartridge that was coated in a goodly layer of what looks like Outback sand.  So, it is rather nice that we have a memory banked cartridge working on the MEGA65 -- this is quite a milestone, as it involves both IO and memory, and in a dynamic manner.   Here are some screen shots:

Also, all the Ultimax cartridges are working again now. But again, I don't know if this is a freak of synthesis, or if the changes I made actually fixed it. The uncertainty is the really annoying part. What I have done, and am currently synthesising, is a feature where I can disable the chip-select lines on most of the internal devices on the IO bus of the MEGA65, so that if the problem is simultaneous driving of the bus, it will get picked up. As part of this I have also converted a couple of the IO devices that were checking the address internally to use a chip-select line instead, so that the whole thing is a bit simpler, and a bit more regular.  Hopefully I can find and fix the problem fairly quickly.
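The idea behind the chip-select test can be sketched in a few lines. This is only a toy model and not the actual VHDL: the device names and address ranges below are invented for illustration, with one deliberate overlap standing in for the kind of bug being hunted.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical devices and address ranges, chosen only for illustration.
DEVICES: Dict[str, Callable[[int], bool]] = {
    "dev_a": lambda addr: 0xD600 <= addr <= 0xD6FF,
    "dev_b": lambda addr: 0xD6F0 <= addr <= 0xD6FF,  # deliberately overlaps dev_a
    "dev_c": lambda addr: 0xD700 <= addr <= 0xD7FF,
}


def drivers(addr: int, disabled: Iterable[str] = ()) -> List[str]:
    """Names of the enabled devices whose chip-select is asserted for addr."""
    off = set(disabled)
    return sorted(name for name, selects in DEVICES.items()
                  if name not in off and selects(addr))


def has_contention(addr: int, disabled: Iterable[str] = ()) -> bool:
    """True if more than one enabled device would drive the bus at addr."""
    return len(drivers(addr, disabled)) > 1


if __name__ == "__main__":
    print(drivers(0xD6F0))                    # both overlapping devices respond
    print(has_contention(0xD6F0, ["dev_b"]))  # disabling one removes the clash
```

If the bad reads go away when one particular device's chip-select is disabled, that device becomes the prime suspect.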

1 comment:

  1. If you can't explain it, it hasn't been fixed! ;)