Sunday, 29 March 2015

Well, I can load the ROM into the DDR RAM now ...

I'll refrain from saying, again, how much I despise DDR RAM compared with SRAM or SDRAM.  There.

Now, the good news is that I have managed to work around the DDR controller write bugs (which are probably my own fault, but also currently eluding my available time and brain power to track down and fix).

The write bugs are really quite bizarre.

Here is a routine that does work to load data into the DDR RAM:

; copy sector into slow ram
ldx #$00
rr1:
buggyddrwriteretry1:
lda $de00,x
sta $4000,x
cmp $4000,x
bne buggyddrwriteretry1
inx
bne rr1

rr2:
buggyddrwriteretry2:
lda $df00,x
sta $4100,x
cmp $4100,x
bne buggyddrwriteretry2
inx
bne rr2

This routine copies a 512-byte sector loaded from the SD card into a piece of the DDR RAM that has been mapped to $4000-$5FFF.

As you can see I have to retry writes, just in case they don't work.  This sort of bug is easy to deal with.

The following is the earlier version of the routine that does not work, instead causing the same data to be written to $41xx and $40xx due to mysterious factors (or some glaring bug in the routine that I am incapable of seeing):

; copy sector into slow ram
ldx #$00
rr1:
buggyddrwriteretry1:
lda $de00,x
sta $4000,x
cmp $4000,x
bne buggyddrwriteretry1

buggyddrwriteretry2:
lda $df00,x
sta $4100,x
cmp $4100,x
bne buggyddrwriteretry2
inx
bne rr1

The only practical difference is that the broken version alternates between the two pages on every byte, whereas the working version finishes one page before starting the other.  Bizarre.

Anyway, the good news is that the fixed version of the routine does seem to correctly load the ROM.  Kickstart now attempts to boot the C65 ROM, but this is thwarted by a read bug in my DDR sub-system.  Below is a piece of what happens. Note that each PC value shown is the value after the instruction bytes on the right have been executed.  I'll add some commentary to help explain what is going on.

PC   A  X  Y  Z  B  SP   MAPL MAPH LAST-OP     P  P-FLAGS   RGP uS IO
CBAA 00 FF 01 B3 A5 01FB E300 B300 28       65 00 .VE..I.C. ..P 13 -00 --

The CPU has just done a PLP above, and the PC is now at $CBAA. This is in the C65 "Interface ROM" at $C800-$CFFF that acts as the glue between the C64-mode, C65-mode and internal drive DOS parts of the ROM.  Nothing surprising here.


PC   A  X  Y  Z  B  SP   MAPL MAPH LAST-OP     P  P-FLAGS   RGP uS IO
F0C1 00 FF 01 B3 A5 01FD E300 B300 60       65 00 .VE..I.C. ..P 13 -00 --

Now we have executed an RTS, returning control to $F0C1 in the C64-mode Kernal, which is in the process of working out whether to stay in C64 mode, or boot into C65 mode.  Again, nothing looks surprising here.  However, it is worth showing the bytes of the ROM at this point:

.df0c0
 :777F0C0 CB A2 FF AD 11 D0 10 FB A9 08 CD 12 D0 90 06 AD

The PC is now at $F0C1, so we expect the next instruction opcode to be $A2, i.e., LDX #$FF.  Let's see what happens:

PC   A  X  Y  Z  B  SP   MAPL MAPH LAST-OP     P  P-FLAGS   RGP uS IO
F0C2 A5 FF 01 B3 A5 01FD E300 B300 7B       E5 00 NVE..I.C. ..P 13 -00 --

Er, $7B is not $A2. This is really Not Good.  

And there, my friends, is the current read bug with the DDR controller.   

The good news is that this seems to be a glitch with the DDR read cache, which should be fairly easy to fix.  Although part of me dreads that it might be the DDR controller returning the wrong line of data from the DDR RAM.

The next instruction after this erroneous $7B (TBA) instruction is $00 (BRK).  That is, $F0C1-$F0C2 is being read as $7B $00.  Assuming that the bytes must occur at $xxx1 & $xxx2 in the ROM file to show up in the cache at these offsets (the cache reads 16-byte wide lines from the DDR controller), a quick search through the ROM file reveals that the only instance that fits this constraint is at $CBC1-$CBC2:

0000cbc0: 00 7b 00 5c 00 3d 00 2e 00 16 00 07 00 a2 22 16

This is rather interesting.  The DDR memory access immediately before $F0C1 was to $CBAA, which is suspiciously close to $CBCx. The DDR cache I have implemented is 8KB in size, consisting of 512 lines of 16 bytes each.  This means that $800x and $A00x would map to the same cache line, and evict each other from the cache.  However, $F0Cx and $CBCx should not map to the same cache line.

That is, even if the cache logic somehow erroneously was showing a stale cache line, it shouldn't be able to show the cache line for $CBCx.
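
To make the cache-line argument concrete, here is a minimal sketch of the index calculation. It assumes a direct-mapped cache whose line index is simply address bits 4 to 12; the post only gives the cache geometry (512 lines of 16 bytes), so the exact bit selection and the name `cache_line_index` are assumptions for illustration.

```python
LINE_BYTES = 16   # bytes per cache line (from the post)
NUM_LINES = 512   # lines in the 8KB cache (from the post)


def cache_line_index(addr):
    """Assumed direct-mapped index: drop the 4 offset bits, keep the next 9."""
    return (addr // LINE_BYTES) % NUM_LINES


# Compare the addresses discussed above.
for addr in (0x8000, 0xA000, 0xF0C1, 0xCBC1):
    print(f"${addr:04X} -> line ${cache_line_index(addr):03X}")
```

Under that assumption $800x and $A00x do collide, while $F0Cx and $CBCx land on different lines, which is why a merely stale line cannot explain the bad read.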

So, there is either some horrible bug in my cache logic, or it is also possible that the DDR controller is returning the wrong line of data (again, quite possibly due to my poor DDR controller implementation rather than an intrinsic fault in the Xilinx DDR low-level controller or anything else done by someone else).

What is interesting, is that given enough time, it seems to start returning the correct data for the cache row.  That is, if the CPU were to hang around for a few cycles (quite possibly hundreds or thousands of cycles), it all seems to catch up and start delivering the right data.

When a read is requested from the DDR cache, it waits until the DDR controller is presenting the correct row of data, which is tested by examining the top 23 bits of the requested address and comparing those bits to the cache line address field in the data delivered from my DDR controller.  Only when they match does the cache logic let the CPU resume and read the data.

The cache is implemented as a dual-port memory between the DDR controller and the CPU, with the DDR controller being the only side that can write, and the CPU the only side that can read.  This acts to avoid all the cross-domain clocking problems between the two.

So in theory, there should not be any glitches with reading the data that might cause trouble.  This is especially true since the CPU checks the cache line ID as described above, so even if the DDR controller thinks we have asked for something else, the CPU will realise, and persist in requesting the line it was after until the DDR controller gets it right.

This leads me to the frustrating conclusion that the DDR controller is supplying data that it thinks is the right data, but is in fact from a different memory line...  So more DDR controller pain awaits me.

Thursday, 26 March 2015

GHDL Simulation is working again, making it much easier to track down DDR controller bugs

Oh a happy day it will be when the DDR controller is working and I can ignore it for, hopefully, a very long time.  That day feels closer now that I am able to simulate my model in GHDL again.

Early in the year I upgraded GHDL from 0.29 to 0.31 so that I would get meaningful error messages.  However, I discovered a bug in GHDL that prevented me from being able to simulate.

The GHDL maintainers were very good and managed to replicate the problem, fix it in the up-stream version, and also explain a very simple work-around for the meantime.

Somehow I missed the email notification that contained the work-around information, so all this time in between I have been unable to simulate, which has made it rather harder to track down the bugs with the DDR controller.

So a bit of hacking yesterday and I now have the model simulating in GHDL again.

This let me work out exactly what was going wrong with the most recent DDR controller bug I had identified.  That bug is that after writing to the DDR memory, the CPU would read the opcode for the next instruction from the DDR memory, instead of from wherever it should have read it.

So, I fired up the simulator and was able to quickly reproduce the bug with a little loop:

loop:  LDA #$99
       STA $4000  ; this address was mapped to the DDR memory
       INC $D020
       JMP loop

When I simulated it, I could see the following.  The bug is that $CF is read from the DDR RAM, instead of $EE (the opcode for INC) being read.  As a result entirely the wrong instruction was executed as can be seen in this sanitised excerpt from the simulator:

$8114 A9 99     lda  #$99          A:99 X:40 Y:00 Z:3F SP:BEFF P:A4 $01=F5
$8116 8D 00 40  sta  $4000         A:99 X:40 Y:00 Z:3F SP:BEFF P:A4 $01=F5 
$8119 CF 20 D0  bbs4 $20,$81EC     A:99 X:40 Y:00 Z:3F SP:BEFF P:A4 $01=F5 

A little poking around revealed that the correct value was being read, but that the CPU state machine was getting a bit confused while waiting for the DDR RAM to acknowledge the write, and when it saw a read from the DDR RAM it kept that instead.  There is a flag that clears the CPU's intent to read once it has the value it is after, but that wasn't being used in one particular place.  So I added the necessary two lines as follows:

      if mem_reading='1' then
        memory_read_value := read_data;
      end if;


... and bingo, it was suddenly working properly:

$8114 A9 99     lda  #$99          A:99 X:40 Y:00 Z:3F SP:BEFF P:A4 $01=F5 
$8116 8D 00 40  sta  $4000         A:99 X:40 Y:00 Z:3F SP:BEFF P:A4 $01=F5 
$8119 EE 20 D0  inc  $D020         A:99 X:40 Y:00 Z:3F SP:BEFF P:24 $01=F5
$811C 4C 14 81  jmp  $8114         A:99 X:40 Y:00 Z:3F SP:BEFF P:24 $01=F5

Oh how nice it is to know that you have fixed a bug before starting a 1-2 hour FPGA synthesis run.

After synthesis, I was able to confirm that the bug was fixed. However, there is at least one more bug to fix, as it seems that when the hypervisor exits to pass control to the ROM stored in DDR RAM, the reset vector is not being read from the DDR RAM.  But now that I can simulate again, this bug shouldn't be able to hide for long...


Wednesday, 25 March 2015

Hypervisor Upgrading and DDR Memory Lockup-fix

I continue to pull my hair out with the DDR controller -- but it does seem that I have fixed the bug that would cause the DDR controller to lock up forever until a reset.  It still doesn't always write a byte when asked, but even those events are now <1% of the time.  With the lock up bug fixed, I might be able to re-enable infinite write retries in the CPU state machine, and make writing appear reliable.  So this is good progress.

Meanwhile, I have been working on the Hypervisor upgrade process, and have that all nicely working now.  On cold-start, the hypervisor checks for KICKUP.G65, and if present, loads it and replaces itself with the contents of that file.  It then sets the one-time "hypervisor upgraded" hardware flag, and jumps into the entry point of the new hypervisor, which, seeing that flag, knows not to try to upgrade itself.  I guess I could have also had it do a checksum and, if the new and current versions are identical, just continue without replacing itself, but I didn't think of that until later :/
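
The cold-start decision described above can be sketched roughly as follows. This is only a model of the control flow: the function `cold_start` and the dict standing in for the SD card are hypothetical, and the real hypervisor is 6502-family code that reads the card directly.

```python
UPGRADE_FILE = "KICKUP.G65"  # file name taken from the post


def cold_start(upgraded_flag, sd_card, running_image):
    """Return (image_to_run, upgraded_flag) after the cold-start check.

    sd_card is a hypothetical dict of file name -> bytes.
    """
    if upgraded_flag:
        # Already upgraded since power-on: do not try again.
        return running_image, True
    new_image = sd_card.get(UPGRADE_FILE)
    if new_image is None:
        # No upgrade file present, so keep the built-in hypervisor.
        return running_image, False
    # Replace ourselves, set the one-time flag, enter the new image.
    return new_image, True


# First pass loads the upgrade; the new image's own cold start sees the flag.
image, flag = cold_start(False, {UPGRADE_FILE: b"new"}, b"old")
image, flag = cold_start(flag, {UPGRADE_FILE: b"new"}, image)
print(image, flag)
```

The one-time flag is what stops the freshly loaded hypervisor from reloading itself forever.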

So now the main barriers to actually getting it working on the DDR board are to make writing reliable, and to fix a new little (but very important) bug with DDR writing, which causes the next CPU instruction fetch to read from DDR memory instead of from whatever it should have been reading from.  Hopefully once this is fixed, I might finally be out of the woods and able to run the C65 and C64 ROMs again.  I will feel a lot better once I reach that point.

Friday, 20 March 2015

Getting closer on the DDR memory front

The DDR2 memory interface continues to amaze me in the fascinating ways that it almost works.

Today, I had the thought to make the CPU wait until the DDR controller can confirm that it has the correct address and data to write, which is tested by reflecting back to the CPU so that it can indeed verify that the DDR controller has the right data.  Only after this round-trip verification does the CPU ask the DDR controller to do the write.

This has confirmed that the DDR controller appears to receive the correct data when asked to write.

This had me a bit perplexed, because when I tried to write a single byte, nothing would show up.

Then I wrote a few consecutive bytes, and noticed that all but the last three writes occurred.  A little more poking around revealed that these last three writes were not actually lost, but each write would appear in the read data after another write was performed.  That is, the DDR RAM appears to always be three memory writes behind where it should be.  The following serial monitor session shows this, where the various writes appear progressively as more writes are scheduled:

Here I ask for five writes to occur:

m8000000
 :8000000 01 02 03 04 05 06 07 08 09 0A 0B 0C 22 22 B1 FA
.s8000000 11 22 33 44 55

Using the new debug registers I created to verify the last write address and value, I can see that $400000B = $55, the last value I asked to be written, and the LSB-first address in the slowram is $0000004, which is correct:

.m4000000
 :4000000 3F 35 01 00 00 5E 51 00 0D 00 2F 55 04 00 00 00

Now comes the surprise: the last three values written don't appear:

.m8000000
 :8000000 11 22 03 04 05 06 07 08 09 0A 0B 0C 22 22 B1 FA

... so let's write $FF to $800000F and see what happens, and also make sure that it is $FF that gets written:

.s800000f ff
.m4000000
 :4000000 3F 35 01 00 00 60 61 00 0E 00 30 FF 0F 00 00 00

So we see that $33 has appeared in $8000002, but $FF has not (yet) appeared in $800000F!

.m8000000
 :8000000 11 22 33 04 05 06 07 08 09 0A 0B 0C 22 22 B1 FA

So let's continue this pattern, and see that as I write bytes, previous writes progressively appear, but always 3 writes behind:

.s800000e ee
.m4000000
 :4000000 3F 35 01 00 00 62 71 00 0F 00 31 EE 0E 00 00 00
.m8000000
 :8000000 11 22 33 44 05 06 07 08 09 0A 0B 0C 22 22 B1 FA
.s800000d dd
.m8000000
 :8000000 11 22 33 44 55 06 07 08 09 0A 0B 0C 22 22 B1 FA
.s800000c cc
.m8000000
 :8000000 11 22 33 44 55 06 07 08 09 0A 0B 0C 22 22 B1 FF
.s800000b bb
.m8000000
 :8000000 11 22 33 44 55 06 07 08 09 0A 0B 0C 22 22 EE FF

At this point I wondered if it was a cache issue, so I asked to see a line of memory that would cause the cache for $800000x to be invalidated:

.m8010000 
 :8010000 45 00 81 67 30 00 86 94 41 72 C0 00 81 38 62 12
.m8000000
 :8000000 11 22 33 44 55 06 07 08 09 0A 0B 0C 22 22 EE FF

No change, so it seems that the cache probably isn't involved.  But again, if I start writing to memory, I see the previously dispatched writes suddenly show up:

.s800000a aa
.m8000000
 :8000000 11 22 33 44 55 06 07 08 09 0A 0B 0C 22 DD EE FF
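
The "three writes behind" pattern in the session above can be reproduced with a toy model in which each write only lands once three further writes have been issued. This is purely a behavioural sketch: the name `replay` and the queue are assumptions for illustration, and it says nothing about where in the real controller the writes are actually being held up.

```python
from collections import deque

# First line of slow RAM before any writes, copied from the monitor session.
INITIAL = bytes.fromhex("0102030405060708090A0B0C2222B1FA")
FIRST_FIVE = list(enumerate([0x11, 0x22, 0x33, 0x44, 0x55]))


def replay(initial, writes, lag=3):
    """Apply (offset, value) writes, each landing only after `lag`
    further writes have been issued; return visible memory as hex."""
    mem = bytearray(initial)
    pending = deque()
    for offset, value in writes:
        pending.append((offset, value))
        if len(pending) > lag:
            landed_offset, landed_value = pending.popleft()
            mem[landed_offset] = landed_value
    return " ".join(f"{b:02X}" for b in mem)


print(replay(INITIAL, FIRST_FIVE))
print(replay(INITIAL, FIRST_FIVE + [(0xF, 0xFF)]))
```

The two printed lines correspond to the dumps taken after the first five writes and after the extra write of $FF to $800000F.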

So while this is all rather annoying and mysterious, it is nice to see that memory is being written in the right places, and that it does all eventually turn up.  But then as a test I filled the first 1MB of slow ram...

The first time I tried it, it seemed to fill memory with $00, as I asked, and there were no odd values.

But then when I tried again to fill with $FF, I saw that about half of the bytes still had $00 in them.

Filling a 3rd time with $00 again, some of those $FF's remained.  Then when I tested performing single writes, the DDR memory was only one write behind, instead of three.

So there is still some other strange thing going on apart from delayed writes.

What might work, as detlef suggested a few days ago, is to have the CPU read back the memory location immediately after writing it, and if the correct value isn't there, issue the memory write again.  Really slow and really annoying (because I will have to adjust the CPU state machine), but it might just work.  I'm running out of alternative ideas, however.
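
As a rough sketch of that write, read back, retry idea, here is a toy model against a memory that silently drops some writes. The class `FlakyRam`, its 50% drop rate and the retry limit are all made up for illustration; the real fix would live in the CPU state machine, not in software.

```python
import random


class FlakyRam:
    """Toy memory that silently drops a fraction of writes (rate is made up)."""

    def __init__(self, size, drop_rate, seed=0):
        self.mem = bytearray(size)
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def write(self, addr, value):
        if self.rng.random() >= self.drop_rate:
            self.mem[addr] = value

    def read(self, addr):
        return self.mem[addr]


def write_verified(ram, addr, value, max_tries=1000):
    """Write, read back, and retry until the value sticks; return tries used."""
    for attempt in range(1, max_tries + 1):
        ram.write(addr, value)
        if ram.read(addr) == value:
            return attempt
    raise RuntimeError(f"write to {addr:#x} never verified")


demo_data = bytes(range(256))
demo_ram = FlakyRam(len(demo_data), drop_rate=0.5)
total_tries = sum(write_verified(demo_ram, a, v) for a, v in enumerate(demo_data))
print("all bytes verified:", bytes(demo_ram.mem) == demo_data, "tries:", total_tries)
```

Note that this only helps with dropped writes; a write that is merely delayed behind later writes would fail the immediate read-back and be reissued.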

Improving the hypervisor environment

I have a project student who will be doing some work on the hypervisor this year.  The focus of his project is to make a hypervisor environment which is easy to use, and can be used to teach some computer architecture and operating systems concepts, without being as complex as a modern architecture.

In preparation for this, I have been thinking about what basic hardware and firmware features are needed to support this, and allow him to work on the hypervisor without having to resynthesise the whole FPGA every time he makes a small change.  This has also served as a welcome distraction from the DDR memory controller problems I have been facing.

The first thing is that we need a way for traps to the hypervisor to be distinguishable from a hardware reset. Until now, the two have entered the same vector, and thus a trap to the hypervisor was basically a way of asking the machine to reset.

The hypervisor state registers map to $D640-$D67F.  In hypervisor mode, these can be used to modify the CPU and memory mapper state.  From a running program they are used to trap into the hypervisor.  Previously only $D67F would trap into the hypervisor in this way.  The bottom 6 address bits are used to calculate the entry vector in the hypervisor, resulting in entry points at $8000-$80FC, i.e. using four-byte multiples.  I could have used 3 bytes to save a little space, but I didn't want to have to implement a multiply-by-3 in the CPU address decode logic.  Hardware reset enters at $8100, so there are 64 traps plus reset.
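
The vector calculation is just a mask and a shift. A minimal sketch using the addresses given above (the function name `trap_entry` is only for illustration; in hardware this is part of the CPU address decode logic):

```python
TRAP_REG_FIRST = 0xD640   # first hypervisor trap register
TRAP_REG_LAST = 0xD67F    # last hypervisor trap register
HYPERVISOR_BASE = 0x8000  # entry point for trap 0
RESET_ENTRY = 0x8100      # hardware reset enters just past the 64 trap slots


def trap_entry(reg_addr):
    """Entry address for a trap raised by touching reg_addr."""
    if not TRAP_REG_FIRST <= reg_addr <= TRAP_REG_LAST:
        raise ValueError(f"${reg_addr:04X} is not a trap register")
    # Bottom 6 address bits select the trap; each slot is 4 bytes wide.
    return HYPERVISOR_BASE + ((reg_addr & 0x3F) << 2)


for reg in (0xD640, 0xD641, 0xD67F):
    print(f"${reg:04X} -> ${trap_entry(reg):04X}")
```

A shift by two is free in hardware, which is the reason for choosing four-byte slots over three-byte ones.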

These traps provide a mechanism that is a cross between the INT instruction on x86 (which triggers an "interrupt" using a vector at an address computed from the argument), and the SYSCALL type instruction found on many CPU types, that traps to the privileged mode, and passes some arguments through.  The key difference is that on the 45GS10, the trap is straight through to hypervisor mode, and all CPU state is automatically saved and restored for you.  Thus even though we are only at 48MHz, it is theoretically possible to perform somewhere around 8 million hypervisor traps per second -- provided you don't want them to actually do anything.

The final addition is a write-once flag that can be used by the hypervisor to see if it has been upgraded since power-on.  Even CPU reset doesn't clear the flag.  This is used to allow the kickstart ROM to know when it should try to replace itself by loading a specially named file (KICKUP.65) from the SD card.  This will allow new kickstart ROMs to be easily loaded by putting them on the SD card, and then power-cycling the FPGA board.

Monday, 9 March 2015

I still don't like passing signals between clock domains

I am still working on the DDR memory interface for the C65GS.

It feels like it should have been finished ages ago.  However, as I have gone along I have been progressively reminded of various perils of passing signals between different clock domains.

The problem all comes about because the DDR memory controller operates at something like 145MHz, which is some fraction of the DDR bus speed required for the DDR RAM on the Nexys4DDR board.  The key point is that this is not the same as the 48MHz that the CPU runs at.

This means that you can't just pass signals, and assume that they will be read correctly. For example, if you want to ask the DDR controller to read something, you need to make sure that all the address lines have settled before they get sampled by the DDR controller.  The same goes for asking for it to write to memory.  Then there are similar sorts of fun getting results back in the other direction.

To simplify the problem, I have used a simple dual-port memory, i.e., one side can read and the other side can write, so that the DDR controller can provide data to the CPU.  This has the useful side-effect that I can easily construct a cache for the DDR RAM, so that the CPU doesn't have to wait for it so often.  This is what I have done, creating an 8KB cache, that operates with something like 4 wait states.

After lots of frustrating fiddling, I now have the DDR memory interface working most of the time. Some reads and some writes get missed, which suggests that there are still some cross-domain timing glitches that are causing grief.   One in every several hundred memory writes doesn't stick. Inserting some more latches between the CPU and the DDR controller will hopefully settle that down.  Alternatively, I could skip adding extra delay, and instead make the process of writing to DDR memory include a verification step, repeating the write if it doesn't work.  This is probably preferable.

There are also some remaining niggly intermittent problems that look to me like the DDR memory is not communicating with the FPGA correctly, i.e., the temperature-compensated termination calibrations that the DDR memory chip does may not be happening often enough, or something similar.  When this happens every other byte is often $00, and writing to any address seems to get ignored.  This last problem is the most annoying, because I don't really have any idea of how to fix it, nor a way to verify that it has been properly corrected.  I would feel very bad if people were buying these Nexys boards only to find that they have frequent problems running the C65GS bitstream on them.  I'll start looking into this once I have the other memory interface problems solved.

Thursday, 5 March 2015

Reworking the DDR memory controller

In between other things, I have been reimplementing the DDR memory controller wrapper, as I attempt to get DDR memory working.

The Digilent-supplied wrapper, while good, had a lot of the signal control logic spread around, rather than my preferred (but probably horrible to other people) style of keeping related logic together.  This made it hard for me to actually follow what was going on.

This also gave me the opportunity to switch it to using an 8-bit wide data bus instead of 16-bit, simplifying logic in the CPU a little.

It also allowed me to make the DDR wrapper cache the 16-byte memory line once read, so that subsequent reads could be serviced from the cached copy.  The cache gets updated by writes, to ensure consistency.

I also modified the wrapper so that it accurately reports to the CPU when it has finished servicing a request, so that there is no longer any need to tune the number of wait states applied to slow ram.  Together with the little cache, this means that the DDR-based slowram will hopefully be faster on average than the old slowram was -- at least once I get it working.

In the process of debugging the caching code (which still has a bug in it), I also realised that much of my trouble with the DDR memory controller was that it was not registered to the CPU clock, and so it was possible for glitching to occur.  I noticed this because the funny memory behaviour I was seeing looked very much like cross-clock glitching, where the CPU would sometimes read the previous byte value from the memory, because the memory data hadn't been updated quickly enough, as can be seen below.

 :8000000 92 92 21 01 01 9B 4C 19 19 03 16 16 08 62 55 55
 :8000010 92 2C 21 21 31 9B 9B 19 39 03 03 00 08 08 55 01
 :8000020 92 92 21 01 31 31 4C 19 19 03 16 00 00 62 55 55
 :8000030 01 2C 21 21 31 9B 4C 4C 39 03 03 00 08 62 62 01
 :8000040 92 2C 2C 01 31 31 4C 19 39 39 16 00 00 62 55 01
 :8000050 01 2C 21 01 01 9B 4C 4C 39 03 16 16 08 62 62 01
 :8000060 92 2C 2C 01 31 9B 9B 19 39 39 16 00 08 08 55 01
 :8000070 92 92 21 01 01 9B 4C 19 19 03 16 16 08 62 55 55
 :8000080 92 2C 21 21 31 9B 9B 19 39 03 03 00 08 08 55 01
 :8000090 92 92 21 01 31 31 4C 19 19 03 16 00 00 62 55 55
 :80000A0 01 2C 21 21 31 9B 4C 4C 39 03 03 00 08 62 62 01
 :80000B0 92 92 2C 01 31 31 4C 19 39 39 16 00 00 62 55 01
 :80000C0 01 2C 21 21 01 9B 4C 4C 39 03 16 16 08 62 62 01
 :80000D0 92 2C 2C 01 31 9B 9B 19 39 39 16 00 08 08 55 01
 :80000E0 01 92 21 01 01 9B 4C 19 19 03 16 16 08 62 55 55

Taking a look at this memory dump it can be seen that the various byte values sometimes turn up one memory access late.  It is also possible that in some instances one or more of the data lines turns up late, while others are okay, although there isn't really clear evidence of this. Indeed, if you look at each column, there are only two values in each column, and no hybrid ones. Presumably one is the "on time" value, and the other is the "late" value, i.e., the value from the previous read.  Based on this, we can conclude that the correct values would be:

92 2C 21 01 31 9B 4C 19 39 03 16 00 08 62 55 01

However, that is really just speculation until I resynthesize with the ram_done signal delayed by a cycle to make sure that the data turns up first.
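
The column-by-column reasoning above can be checked mechanically. The sketch below assumes the "late by one access" model: a byte that turns up late appears one column to the right, so the on-time value of a column is the one it shares with the next column (wrapping from column 15 to column 0, since the same 16 bytes repeat). It uses the first seven rows of the dump, which already contain both values for every column; the function name `infer_on_time` is just for illustration.

```python
# First seven rows of the dump above, address prefixes removed.
DUMP = """\
92 92 21 01 01 9B 4C 19 19 03 16 16 08 62 55 55
92 2C 21 21 31 9B 9B 19 39 03 03 00 08 08 55 01
92 92 21 01 31 31 4C 19 19 03 16 00 00 62 55 55
01 2C 21 21 31 9B 4C 4C 39 03 03 00 08 62 62 01
92 2C 2C 01 31 31 4C 19 39 39 16 00 00 62 55 01
01 2C 21 01 01 9B 4C 4C 39 03 16 16 08 62 62 01
92 2C 2C 01 31 9B 9B 19 39 39 16 00 08 08 55 01
"""


def infer_on_time(dump):
    """Pick, per column, the value that also shows up (late) in the next column."""
    rows = [line.split() for line in dump.strip().splitlines()]
    columns = [set(col) for col in zip(*rows)]
    result = []
    for i, values in enumerate(columns):
        shared = values & columns[(i + 1) % len(columns)]
        if len(shared) != 1:
            raise ValueError(f"column {i} is ambiguous: {sorted(values)}")
        result.append(next(iter(shared)))
    return result


print(" ".join(infer_on_time(DUMP)))
```

On this data the sketch arrives at the same 16 bytes as the by-eye reading, though it is only as good as the late-by-one assumption behind it.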

Also, there is the question of why the same 16 bytes of data are being presented every time, instead of the correct 16 bytes being read.  Writing to the DDR memory is also not working with my wrapper yet, so I need to debug that as well.  But at last I feel like I am making progress, and will be able to get the DDR RAM working properly soon.