Sunday 5 November 2023

Fixing Yet Another HyperRAM Bug

Grargle! I thought I had fixed all the bugs with the HyperRAM / expansion RAM interface, but it looks like at least one still remains.

The 585_test program passes, but my older hyperramtest.c program fails to detect the RAM size correctly, and the read stability test quickly reveals something of the nature of the problem:  The first two bytes read during a linear read are correct, but half of the subsequent bytes are wrong, as it alternates between displaying the correct byte, and then repeating that byte instead of presenting the next one.  Looking at it more closely, the problem seems to be oriented around cache lines, that is, the first two bytes of a cache line will be correctly read.... or something like that.

So let's start by reducing it down to a simplest failing case: Writing values to $8000000 -- $8000002 and then reading them sequentially looks like it should trigger it, like this:

Okay, so we have a minimum failing test case. Now to subject it to simulation, and confirm that we can reproduce it there.

Hmm.. simulation of just the slow memory interface and hyperram alone doesn't reproduce the error. So I'm guessing it requires CPU involvement.

Well, simulating with the CPU as well still doesn't cause it to show up, so it must be something with marginal timing.

I believe the timing problem is on the read side, rather than when writing.

What I would really like to establish, is whether the problem is in slow_devices, hyperram, or the timing constraints on the physical pins of the FPGA that connect to the HyperRAM.

One way to start trying to peel this back, is to use the debug registers I built into the HyperRAM controller at $BFFFFFx. These are helpful here, because they don't have any latency due to the HyperRAM, and we know that the values read at those addresses can't be messed about by any potentially dodgy communications with the HyperRAM chip.  In particular, $BFFFFF2 (controller mode information), $BFFFFF3 (write latency), $BFFFFF4 (extra write latency) and $BFFFFF5 (read time adjust) are all read-writeable registers, that don't get changed by the controller when it is operating.

So I wrote this little program to test the stability of reading these registers:

If it's working correctly, we should see only the same value in each column. But instead we see this:

I.e., successive pairs of reads seem to read the same value, then the following read does the right thing, but its value then gets read again on the next transaction.

In other words, this tells us that the problem is not in the HyperRAM controller, but rather in the interface between the HyperRAM controller and the slow_devices, or between the slow_devices and the CPU.

We can eliminate the latter as a likely source, because we can read from other regions of the slow_devices memory without seeing such effects.  That is, the problem very likely lies in the interaction of the HyperRAM with the slow_devices.

The communication of results between these two occurs using a data ready toggle, which naturally has 2 states, and thus makes me a bit suspicious that it might be involved, since the problem we see is very much related to pairs of successive reads.

What would be nice, would be to be able to read the status of the ready toggle and the expected value of that toggle from the slow_devices module, so that we can see if it's getting confused in some way. There is supposed to be some nice unmapped memory in the slow_devices module that basically works like a slab of read-only registers, but it's not doing the right thing, which might be a clue. I'll have to think on that.

Meanwhile, I've modified my little program above to do a memory access between each one displayed to test and confirm some of my suspicions:

If we access memory location $8000000 between each iteration, so that each displayed read is from a separate pair of reads, then it either works properly like this, or always displays the contents of $8000000, depending which way around the pairing is acting:

But if I access a location that is still on the slow_devices module, but not in the HyperRAM, say, an address on the cartridge port interface, then the doubling remains:

This makes it much more likely that the problem is in the interface between the HyperRAM and slow_devices.  It's just really annoying that it doesn't show up under simulation.

So what else can I figure out from the behaviour I am able to observe to help track it down? Interestingly, writing different values to $BFFFFF2 changes things.

$E0 is what it was, and behaves as above. But $01 or $02 cause the doubling-up to return, even with the read of $8000000 between. Those enable fast command and fast read mode. It looks like bit 7 has to be set for the problem to go away via the read to $8000000. 

Ok, how about if we read from $BFFFFF0 instead of $8000000? In that case, the value in $BFFFFF2 doesn't matter at all -- it always reads correctly, provided that extra PEEK is in there. This test is interesting, because it does not involve touching the actual HyperRAM chip at all -- it's all just internal registers in hyperram.vhdl. So if the HyperRAM chip is not involved, it can't be the problem.  This makes me increasingly confident that the problem is in the communications between the slow_devices and hyperram modules.

The hyperram module drives the toggle at 162MHz, and it is caught by slow_devices in the 81MHz clock domain. That _shouldn't_ be a problem, because we are using a toggle rather than a strobe.  But who knows what funniness might be going on. It might be glitching, for example. In which case adding a drive stage to the toggle on the export side might help to fix that.  Actually, related to that, the ability to select between the hyperram and SDRAM means that there is a bit of combinatorial logic at the top level that multiplexes between the data ready toggle from these two sources -- that could also be adding a little bit of delay that might be causing some havoc on the latching side in the slow_devices.

Another little test I could do, would be to write an assembly routine that does these accesses, and times whether the bad reads are timing out in slow_devices, and thus take longer. If so, that would tell us that the toggle line is not being seen to change.

I'm adding a debug register at $F000000 that will let me check those toggle lines directly. That shows nothing untoward.  I've added an extra signal to that register that samples the toggle in the block that actually uses it, in case it is being set with some delay.

Actually, while thinking about that, I realised that I have another nice way to diagnose where things are going wrong: Switching from HyperRAM to SDRAM.  If the doubling still happens, it's in slow_devices. If it doesn't still happen, it _must_ be in hyperram.vhdl. ... and the verdict is, it must be in hyperram.vhdl.

Next thing to try is to add a debug register to slow_devices.vhdl that will let me see if the data ready toggle is arriving before the data does.

What I did instead, was add a 1 cycle delay to the toggle, so that the data value would be set up a full cycle early, so that if there were any clock phase issues between the source 162MHz clock and the destination 81MHz clock, the toggle would definitely not be noticed before the data was made available.
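To illustrate why delaying the toggle can help, here is a minimal Python model of a toggle-based data-ready handshake. It is only a sketch and not the actual VHDL: the function name `run_reads` and the two delay parameters are hypothetical stand-ins for whatever skew exists between the two clock domains.

```python
def run_reads(values, toggle_delay, data_delay):
    """Model a consumer that latches the data bus whenever the toggle changes.

    For each read, the producer puts the new value on the bus `data_delay`
    cycles in, and flips the ready toggle `toggle_delay` cycles in.
    Both delays are illustrative assumptions, not measured figures.
    """
    data = 0x00        # value currently visible on the data bus
    toggle = 0         # current level of the data-ready toggle
    last_seen = 0      # toggle level the consumer saw last
    results = []
    for value in values:
        for cycle in range(max(toggle_delay, data_delay) + 2):
            if cycle == data_delay:
                data = value          # producer updates the bus
            if cycle == toggle_delay:
                toggle ^= 1           # producer signals "data ready"
            if toggle != last_seen:   # consumer notices the change
                last_seen = toggle
                results.append(data)  # ...and latches whatever is on the bus
    return results


# Toggle seen before the data is valid: every read returns a stale value.
print([hex(v) for v in run_reads([0x11, 0x22, 0x33], toggle_delay=0, data_delay=1)])
# Toggle delayed until after the data is valid: reads are correct.
print([hex(v) for v in run_reads([0x11, 0x22, 0x33], toggle_delay=1, data_delay=0)])
```

This model only shows stale data being latched when the toggle wins the race. It does not try to reproduce the exact pairing pattern seen on the real hardware.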

This has got reading working fine in the general sense, without the doubling of data. However, when I run the hyperramtest.prg, it seems to trigger it to cause the problem. Perhaps because a memory read times out or something.

Actually, the problem there seemed to be it was fiddling with the cache settings of the HyperRAM controller, which was upsetting things. I've patched hyperramtest.prg to not do that, and it then passes the main test.

So that has things a bit further, but now the "mis-write test" in hyperramtest.c is consistently failing in a curious way: Writing to the HyperRAM continues to work (which I can verify by reloading the core and checking its contents), but reading from it ceases to work.  Hmm.. the latest synthesis run doesn't have this problem, so presumably I fixed it along the way. I'm just doing a final synthesis run to be 100% sure... and with that new bitstream it's also fine. So whatever that problem was, it was just a bad bitstream build, either because I hadn't merged all the changes in, or the Vivado randomness of synthesis was causing dramas again.

Anyway, so far as I can tell now, the HyperRAM controller is now rock-solid, and doing all that it should. So hopefully that's the end of that until we have the time and energy to upgrade the HyperRAM controller to use the higher-performance one that Michael built for MiSTer cores.

Saturday 28 October 2023

Reading Amiga disks in the MEGA65

Geehaf asked some questions about how to read Amiga disks in the MEGA65, so I started looking into what's involved.  

Note that this is for reading Amiga disks from the MEGA65 core, and has no impact on any future Amiga core that would allow you to run Amiga software on the MEGA65. That is quite a separate activity, that we look forward to seeing, but isn't what I'm working on today.

First, we have to understand that the MEGA65 floppy interface hardware can read and write any format at all -- because the FDC "chip" is implemented inside the FPGA, which has direct connections to all the FDC cable signals.  This also means that when an Amiga core comes out for the MEGA65, it will be able to read and write Amiga disks natively in the internal drive just fine.  In fact, with only minor work, it will be able to read and write Amiga HD disks at 2x the speed that the Amiga could, because instead of slowing the drive down, we can just artificially slow down the magnetic flux signals.

So, that all means that we just need to improve the FDC in the MEGA65 core if we want the MEGA65 to be able to read Amiga disks.  (We'll get to writing a bit later).

Specifically, we need to understand how the Amiga writes tracks of data out to the disk.  I found this nice (and hopefully correct) description here:

The following are details about how the bits on the Commodore-Amiga
disk are actually written.

    Gross Data Organization:

        3 1/2 inch (90mm) disk
        80 cylinders/160 tracks

    Per-track Organization:

        Nulls written as a gap, then 11 or 22 sectors of data.
        No gaps written between sectors.

    Per-sector Organization:

        All data is MFM encoded.  This is the pre-encoded contents
        of each sector:

            two bytes of 00 data    (MFM = $AAAA each)
            two bytes of A1*        ("standard sync byte" -- MFM
                                     encoded A1 without a clock pulse)
                                    (MFM = $4489 each)
            one byte of format byte (Amiga 1.0 format = $FF)
            one byte of track number
            one byte of sector number
            one byte of sectors until end of write (NOTE 1)
                [above 4 bytes treated as one longword
                 for purposes of MFM encoding]
            16 bytes of OS recovery info (NOTE 2)
                [treated as a block of 16 bytes for encoding]
            four bytes of header checksum
                [treated as a longword for encoding]
            four bytes of data-area checksum
                [treated as a longword for encoding]
            512 bytes of data
                [treated as a block of 512 bytes for encoding]

    The track number and sector number are constant for each particular
    sector. However, the sector offset byte changes each time we rewrite
    the track.

    The Amiga does a full track read starting at a random position on the
    track and going for slightly more than a full track read to assure
    that all data gets into the buffer. The data buffer is examined to
    determine where the first sector of data begins as compared to the
    start of the buffer. The track data is block moved to the beginning
    of the buffer so as to align some sector with the first location in
    the buffer.

    Because we start reading at a random spot, the read data may be
    divided into three chunks: a series of sectors, the track gap, and
    another series of sectors. The sector offset value tells the disk
    software how many more sectors remain before the gap. From this the
    software can figure out the buffer memory location of the last byte
    of legal data in the buffer. It can then search past the gap for the
    next sync byte and, having found it, can block move the rest of the
    disk data so that all 11 sectors of data are contiguous.

So, this tells us that Amiga disks use only 2 sync bytes, instead of the usual 3 that PC and 1581 disks use. Then we expect to see a $FF byte, followed by track and sector numbers, and if we're a bit lazy, 25 bytes of stuff we don't have to deal with.  

One of them is used to indicate where on the track the sector has been written, 16 are a magic stash for the Amiga DOS to put stuff, which we really can ignore, so far as I know, and 8 more are CRCs for the header and data block, which we will ignore for now.  That's partly because I don't know the CRC that the Amiga used (maybe standard CRC32), partly because for testing we just don't need to check it, and finally because reading Amiga disks in the MEGA65 in this easy way is not a high-stakes exercise: The occasional missed CRC error won't cause us great grief.  If you want the certainty, then do the track-at-once flux reading using DMAgic, and check the CRCs in that.

That all said, this doesn't mean that we can't implement the CRC check, just that I'm not bothering to implement them right now.
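As a quick cross-check of the byte counts, the logical sector layout from the quoted description can be written down as a table and the offsets computed from it. This is just a sketch in Python; the field names and the helper functions are my own labels, and the sizes are taken directly from the description above.

```python
# (field name, size in bytes) in the order given by the quoted description
AMIGA_SECTOR_FIELDS = [
    ("zero bytes", 2),
    ("sync bytes (A1*)", 2),
    ("format byte", 1),
    ("track number", 1),
    ("sector number", 1),
    ("sectors until gap", 1),
    ("OS recovery info", 16),
    ("header checksum", 4),
    ("data checksum", 4),
    ("data", 512),
]


def field_offsets(fields):
    """Return {name: (offset, size)} for consecutively packed fields."""
    offsets = {}
    position = 0
    for name, size in fields:
        offsets[name] = (position, size)
        position += size
    return offsets


def bytes_between(fields, after, before):
    """Count the bytes between the end of field `after` and the start of `before`."""
    offsets = field_offsets(fields)
    end_of_after = offsets[after][0] + offsets[after][1]
    return offsets[before][0] - end_of_after


# Bytes to skip between the sector number and the start of the data block
SKIP_BYTES = bytes_between(AMIGA_SECTOR_FIELDS, "sector number", "data")
print(SKIP_BYTES)
```

The skip count works out as 1 + 16 + 4 + 4 = 25 bytes, which is where the figure of 25 comes from.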

It would be great to find an Amiga disk in my pile, and see if I can read it, else that will be up to Geehaf to try. Also, it would be great to get the raw flux of a track of an Amiga disk, so that I can make a VHDL test case, especially if my modifications don't work first time.

So, speaking of those modifications, it's actually a really small and simple change, assuming I have understood everything correctly.  It's all in mfm_decoder.vhdl:

First up we check for the Amiga 2x SYNC + $FF prefix:

      elsif byte_valid_in='1' then
        sync_count <= 0;
        if sync_count = 2 then
          if byte_in = x"FF" then
            -- Amiga sector marker.
            -- See info block at top of file for how we have to handle these
            state <= AmigaTrackNum;
          end if;
        end if;
        if sync_count = 3 then
          -- First byte after a sync
          report "TRACKINFO: Post Sync byte = $" & to_hstring(byte_in);
          if byte_in = x"FE" then
            -- Sector header marker


Then we just have to add some state machine states for the reading:

            when AmigaTrackNum =>
              seen_track <= byte_in;
              seen_side <= x"00"; -- Amiga disks don't encode the side
              sector_size <= 512; -- Amiga disks have fixed sector size
              state <= AmigaSectorNum;
            when AmigaSectorNum =>
              seen_sector <= byte_in;
              amiga_skip_bytes <= 25;
              state <= AmigaSkipBytes;
            when AmigaSkipBytes =>
              if amiga_skip_bytes /= 0 then
                -- Still skipping the header bytes we don't use
                amiga_skip_bytes <= amiga_skip_bytes - 1;
              else
                if (target_any='1')
                  or (
                    (to_integer(target_track) = to_integer(seen_track))
                    and (to_integer(target_sector) = to_integer(seen_sector))) then
                  -- This is the sector we are looking for
                  found_track <= seen_track;
                  found_sector <= seen_sector;
                  found_side <= seen_side;
                  sector_found <= '1';
                  seen_valid <= '1';
                  byte_count <= 0;
                  state <= AmigaSectorData;
                else
                  -- Not our sector: go back to hunting for sync marks
                  seen_valid <= '0';
                  state <= WaitingForSync;
                end if;
              end if;
            when AmigaSectorData =>
              if (byte_count = 0) and (seen_valid='1') then
                first_byte <= '1';
              else
                first_byte <= '0';
              end if;
              byte_out <= byte_in;
              byte_valid <= seen_valid and byte_valid_in;
              if byte_count < sector_size then
                byte_count <= byte_count + 1;
              else
                -- We ignore CRC for Amiga disks for now.
                crc_error <= '0';
                -- Report end of sector
                sector_end <= '1';
                sector_found <= '0';
              end if;

And that's all there is to it. Everything else is done by the existing machinery that handles the decoding of PC/1581 MFM sectors and MEGA65 RLL sectors.  If nothing else, it makes me feel warm and fuzzy about the disk reading architecture that I implemented, that adding another disk format is that simple. 

Now to wait for a bitstream to build, and see if I have got lucky, and it just works first time :)

Okay, bitstream is built, and I found an Amiga disk (actually an original set of Workbench 3.0 disks), and tried reading it, but unfortunately no sectors are being seen.  Now to find out why.


I used the src/tests/floppytest.prg built from mega65-tools repo to read the raw flux of track 0.  So now I have a 64KB file of flux inversion intervals, that I can try to decode off-line.  src/tools/mfm-decode.c is the utility I wrote for the 1581 disk format originally, and it shouldn't be too hard to modify to handle the Amiga format. That way I can know if my disk is good, and if so, use it as the basis for an automated test of the Amiga disk support in mfm_decoder.vhdl.

I've started by adding support for data field type $FF to mean Amiga sectors, and I am seeing it report 542 bytes between pairs of sync bytes.  28 header bytes plus 512 data bytes = 540 bytes. Add in the two bytes of $00 written before the sync bytes (which thus also seem to appear at the end of a sector), and we get the total of 542 bytes. So that looks good. However, the track and sector numbers seem to not be valid. I'm seeing the following for the first four bytes of the header for the various sectors that appear in my track dump:

$f0 $03 $f1 $01
$f0 $03 $f1 $10
$f0 $12 $f1 $01
$f0 $12 $f1 $10
$f0 $01 $f1 $23
$f0 $01 $f1 $32
$f0 $10 $f1 $23
$f0 $10 $f1 $32
$f0 $21 $f1 $01
$f0 $21 $f1 $10
$f0 $30 $f1 $01
$f0 $03 $f1 $01
$f0 $03 $f1 $10

As there are only 11 sectors on the track, the last 2 entries are where we have looped back around to the start of the track.

Okay, those do not look like what I expected, with one byte of track number, and another byte of sector number.  Let's convert them to binary, and see if we get any more clues that way.

11110000 00000011 11110001 00000001
11110000 00000011 11110001 00010000
11110000 00010010 11110001 00000001
11110000 00010010 11110001 00010000
11110000 00000001 11110001 00100011
11110000 00000001 11110001 00110010
11110000 00010000 11110001 00100011
11110000 00010000 11110001 00110010
11110000 00100001 11110001 00000001
11110000 00100001 11110001 00010000
11110000 00110000 11110001 00000001

Now, allowing for the fact that we don't know the order the sectors were written to the track, we don't know which of those is sector 0 (or does the Amiga count them from sector 1?). It would be really handy to have some more detailed information about low-level Amiga disk layout stuff than that page I already found.

Okay, now I have a clue: The odd bits are encoded, then the even bits, for groups of 32 bits. Reading around the place, like, here, for example, I am reminded that the Amiga did some funky tricks for the disk encoding/decoding stuff. This is because the Amiga's floppy controller is quite "dumb", and uses the CPU to find the sectors in the raw flux data etc.  As MFM requires two bits per data bit encoded, they used a clever hack of encoding all the odd bits, then the even bits, so that they could mask out the bits not being encoded, and instead insert the necessary MFM clock bits where required, without having to shuffle everything around the place.
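To make the encoding side concrete, here is a small Python sketch of the two ingredients: splitting a 32-bit value into its odd and even bits, and MFM-encoding a run of data bits. The function names are my own, and it works bit by bit rather than with the mask-based trick the Amiga used, so treat it as an illustration of the result rather than of the Amiga's actual method.

```python
def mfm_encode(value, nbits, prev_bit=0):
    """MFM-encode the low `nbits` bits of `value`, most significant bit first.

    Each data bit becomes a (clock, data) pair. The clock bit is 1 only
    when both the previous data bit and the current data bit are 0.
    """
    raw = 0
    for i in range(nbits - 1, -1, -1):
        bit = (value >> i) & 1
        clock = 1 if (prev_bit == 0 and bit == 0) else 0
        raw = (raw << 2) | (clock << 1) | bit
        prev_bit = bit
    return raw


def split_odd_even(longword):
    """Split a 32-bit value into 16 odd bits (31, 29, ..., 1) and 16 even bits (30, 28, ..., 0)."""
    odd = 0
    even = 0
    for i in range(15, -1, -1):
        odd = (odd << 1) | ((longword >> (2 * i + 1)) & 1)
        even = (even << 1) | ((longword >> (2 * i)) & 1)
    return odd, even


print(hex(mfm_encode(0x00, 8)))
print(hex(mfm_encode(0xA1, 8)))
```

With these definitions a zero byte encodes to $AAAA, which agrees with the format description quoted earlier. A normally encoded $A1 comes out as $44A9, one clock bit different from the $4489 sync pattern, which is the "A1 without a clock pulse" mentioned there.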

Anyway, this means we need to re-interleave the bits together, before examining them for track and sector info. Also, the _entire_ 512 byte data block is also similarly encoded.  So we will need a buffer of some sort to demunge them, or modify the sdcardio.vhdl to have a function to "demangle Amiga sector", or just pass the raw sector to the user to process however they wish.

But let's start with the 4 bytes of track/sector info, and de-interleave those bits back together.  As I only need to do this once, I'll just make an emacs macro that does it, and paste the result below:


Okay, now that looks way more sensible. Convert those back to hex, and we get:

$FF $01 $00 $0B
$FF $01 $01 $0A
$FF $01 $02 $09
$FF $01 $03 $08
$FF $01 $04 $07
$FF $01 $05 $06
$FF $01 $06 $05
$FF $01 $07 $04
$FF $01 $08 $03
$FF $01 $09 $02
$FF $01 $0A $01
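For anyone wanting to repeat the re-interleaving without an emacs macro, here is a short Python sketch that does the same thing to the header bytes from my track dump. The function name `reinterleave` is just my label. It assumes the first two bytes hold the odd bits and the last two hold the even bits, which is what the results above imply.

```python
def reinterleave(b0, b1, b2, b3):
    """Rebuild four bytes from two bytes of odd bits (b0, b1) and two of even bits (b2, b3)."""
    odd = (b0 << 8) | b1
    even = (b2 << 8) | b3
    value = 0
    for i in range(15, -1, -1):
        # Each odd bit is followed by the even bit that sits just below it
        value = (value << 2) | (((odd >> i) & 1) << 1) | ((even >> i) & 1)
    return tuple((value >> shift) & 0xFF for shift in (24, 16, 8, 0))


# First four header bytes of each of the 11 sectors, as read from the track dump
RAW_HEADERS = [
    (0xF0, 0x03, 0xF1, 0x01), (0xF0, 0x03, 0xF1, 0x10),
    (0xF0, 0x12, 0xF1, 0x01), (0xF0, 0x12, 0xF1, 0x10),
    (0xF0, 0x01, 0xF1, 0x23), (0xF0, 0x01, 0xF1, 0x32),
    (0xF0, 0x10, 0xF1, 0x23), (0xF0, 0x10, 0xF1, 0x32),
    (0xF0, 0x21, 0xF1, 0x01), (0xF0, 0x21, 0xF1, 0x10),
    (0xF0, 0x30, 0xF1, 0x01),
]

for raw in RAW_HEADERS:
    fmt, track, sector, remaining = reinterleave(*raw)
    print(f"${fmt:02X} ${track:02X} ${sector:02X} ${remaining:02X}")
```

Run over those 11 rows, it reproduces the list of hex values above: the sector number counts up from $00 to $0A while the sectors-until-gap byte counts down from $0B to $01.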

So, it looks like after the MFM sync bytes, it indeed has the $FF to indicate Amiga, but then has the $FF also encoded in that first long-word of data.  No drama. We can easily accommodate that. Also the bit munging I can handle fairly readily in VHDL as well. Of course, anyone who has worked deeply on the Amiga would have known all that, but I never did, so it was an interesting journey of discovery for me.

Now, all this discovery says to me that if I enable the FDC in the MEGA65 to identify any sector, not just the targeted sector, then it should still see the Amiga sectors, just that it will think that they are all track $F0 and sector $01.

Next step is to implement the bit munging in VHDL, and then use the track dump I made as a test vector to let me more quickly make sure I have it right, and fix any silly bit order bugs.

Okay, making the simulation test was, as usual, a good idea. I found several annoying little bugs quite quickly, and have it synthesising again now... and now the disk read test sees the Amiga sectors. No idea if it reads the data in them or not yet. That will have to wait for Geehaf's investigation. 

Note that it might be some time before this feature will end up in a MEGA65 release core, as there is still some way to go.  But it's a nice start.

Monday 23 October 2023

Hardware-Accelerated IEC Serial Interface - Part 1

One of the main unfinished aspects of the C65 ROM is the IEC serial interface.  From memory, it is basically copied from the C128, and massaged for the C65's different hardware.  However, it is not particularly well adapted for the 3.5MHz mode, and has some quite nasty timing bugs.  Those bugs are amplified on the MEGA65, where the CPU can run at 1MHz, 2MHz, 3.5MHz or 40.5MHz. The core problem, as I recall, is that it counts rasters to determine some of the delays, and if a raster happens to tick over soon into one of those delays, it can cause a very much smaller delay than intended.
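The raster-counting problem is easy to see with a little arithmetic. The following Python sketch is not taken from the ROM: the function name is hypothetical, and the raster line period of roughly 64 microseconds is an assumed round figure used only for illustration.

```python
RASTER_PERIOD_US = 64.0  # assumed approximate duration of one raster line


def raster_counted_delay(start_offset_us, rasters_to_wait, period_us=RASTER_PERIOD_US):
    """Return the real elapsed time when a delay is made by counting raster changes.

    `start_offset_us` is how far into the current raster line the wait begins.
    The first raster change arrives after only the remainder of that line.
    """
    time_to_first_change = period_us - start_offset_us
    return time_to_first_change + (rasters_to_wait - 1) * period_us


# Waiting for one raster change, started at the very beginning of a line
print(raster_counted_delay(0.0, 1))
# The same wait, started just before the line ends
print(raster_counted_delay(63.0, 1))
```

Under these assumptions, the same "wait one raster" delay can last anywhere from a whole line down to almost nothing, depending on where in the line it begins.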

Getting this kind of code 100% can be a bit of a pain, because it's one of those situations where software is trying to pretend to be hardware: i.e., offer precise and reliable timing characteristics.  The code also takes a lot of space in the ROM. So it would be nice if we can resolve this.

It would also be great to have the MEGA65 support JiffyDOS out of the box, without the need for a (currently non-existent) C65 ROM that contains JiffyDOS. We do have an open implementation of the JiffyDOS protocol in the OpenROMs project, but it would still take more space in the ROM.

Another nice benefit, is that it would make it much easier to write software for the MEGA65 that doesn't use the ROM, as talking to IEC serial peripherals will basically use a few POKEs in place of a few JSRs.

So given all the above, I wondered about the plausibility of implementing the IEC serial protocol directly in VHDL, and providing a simple set of registers for controlling this.  It sounds simple, but as with everything with the IEC serial protocol, things are never quite as simple as one might hope.

The first step was to create iec_serial.vhdl, and begin implementing various IEC serial operations, and build this in a little toy bitstream that has only the IEC interface and serial UART communications, so that I can manipulate it easily from my Linux machine. This also has the benefit that I can synthesise a bitstream much, much faster, in about 5 to 10 minutes, instead of around an hour. I am looking at upgrading the FPGA build machine I have access to, but until then, this is a massive advantage.

Here is my initially planned memory map for the IEC serial hardware accelerator. It's likely to change a bit, but you get the general idea:

$D697 -- IRQ control

$D697.7 AUTOIEC:IRQFLAG Interrupt flag. Set if any IRQ event is triggered.
$D697.6 AUTOIEC:IRQRX Set if a byte has been received from a listener.
$D697.5 AUTOIEC:IRQREADY Set if ready to process a command
$D697.4 AUTOIEC:IRQTO Set if a protocol timeout has occurred, e.g., device not found.
$D697.3 AUTOIEC:IRQEN Enable interrupts if set
$D697.2 AUTOIEC:IRQRXEN Enable RX interrupt source if set
$D697.1 AUTOIEC:IRQREADYEN Enable TX interrupt source if set
$D697.0 AUTOIEC:IRQTOEN Enable timeout interrupt source if set

$D698 - IEC serial interface status (when reading)

$D698.7 AUTOIEC:STNODEV Device not present
$D698.6 AUTOIEC:STNOEOF End of file
$D698.5 AUTOIEC:STSRQ State of SRQ line
$D698.4 AUTOIEC:STVERIFY Verify error occurred
$D698.3 AUTOIEC:STC State of CLK line
$D698.2 AUTOIEC:STD State of DATA line
$D698.1 AUTOIEC:STTO Timeout occurred
$D698.0 AUTOIEC:STDDIR Data direction when timeout occurred.

$D698 - IEC serial interface command (when writing)

The list of commands I will support will likely vary, but are currently in two main groups: Commands for debugging/bitbashing the IEC bus, and actual IEC operations:

Bit-bashing / Debug operations:

$41 (capital A) - Set ATN line to +5V
$61 (lower-case A) - Set ATN line to 0V
$43 (capital C) - Set CLK line to +5V
$63 (lower-case C) - Set CLK line to 0V
$44 (capital D) - Set DATA line to +5V
$64 (lower-case D) - Set DATA line to 0V
$53 (capital S) - Set SRQ line to +5V
$73 (lower-case S) - Set SRQ line to 0V
$52 (capital R) - Set RESET line to +5V
$72 (lower-case R) - Set RESET line to 0V

Those are all already implemented.  The remaining commands are the IEC operations:

$00 - Abort any currently running command
$30 - Send data byte with ATN asserted (i.e., ATN at 0V)
$31 - Send normal data byte
$32 - Receive byte
$33 - Send EOI without sending a byte
$34 - Send a byte with EOI
$35 - Turn-around from talk to listen

Only $00 and $30 are implemented so far, and I'm currently debugging $30.  But more on that after we have discussed the other registers...

$D699 - Data byte (read for received byte, write to set byte to transmit)

$D69A - IEC serial peripheral device information (fully writeable)

$D69A.7 AUTOIEC:DIPRESENT Device is present
$D69A.5-6 AUTOIEC:DIPROT Device protocol (00=1541, 01=C128/C65 FAST, 10=JiffyDOS(tm), 11=both)
$D69A.4 AUTOIEC:DIATN Device is currently held under attention
$D69A.0-3 AUTOIEC:DIDEVNUM Lower 4 bits of currently selected device number
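As a sketch of how software might pick apart the $D69A device information register, here is a small Python decoder for the bit layout listed above. The function and dictionary key names are hypothetical, and it assumes that bit 6 is the more significant of the two protocol bits. Since the memory map may still change, treat it as illustrative only.

```python
# Protocol field values from the register description above
PROTOCOLS = {
    0b00: "1541",
    0b01: "C128/C65 FAST",
    0b10: "JiffyDOS",
    0b11: "both",
}


def decode_device_info(value):
    """Decode a byte read from $D69A into its individual fields."""
    return {
        "present": bool(value & 0x80),               # bit 7: DIPRESENT
        "protocol": PROTOCOLS[(value >> 5) & 0x03],  # bits 5-6: DIPROT
        "under_attention": bool(value & 0x10),       # bit 4: DIATN
        "device_number_low": value & 0x0F,           # bits 0-3: DIDEVNUM
    }


print(decode_device_info(0xC8))
```

For example, $C8 would decode as a present device, JiffyDOS protocol, not under attention, with device number low bits of 8.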

The device information register is writeable, so that if for some reason you want to use a different protocol than is automatically detected, you can.  The auto-detection of C128 fast, and JiffyDOS protocols differ in various ways, and are quite interesting topics.  Both come into play when sending a byte under attention, which is also the first command that I have been working on.

The C128 fast serial protocol detection works by having the controller send a $FF byte using the SRQ line as the clock, before sending the first bit under attention.  The device doesn't acknowledge this immediately, but rather just remembers it. It then knows that it is free to send bytes to the computer using the fast serial protocol.

Well, that's how it works when the controller initiates communications.  If a disk drive has been asked to TALK, then when the controller releases the ATN line, and before sending any bits of data, it does much the same in the other direction, except this time, the byte will be $00, because the device is holding the DATA line at 0V.

JiffyDOS works differently: The detection is done at much the same time, i.e., during the sending of a byte under attention.  However, instead of doing the SRQ thing, it instead delays for a few hundred micro-seconds when sending the 7th bit, and checks to see if the device pulls the data line to 0V. If it does, then it supports JiffyDOS. 

A device can of course, support both, and I'm going to support both in this thing.

Now, to debug all this, I need a convenient way to see what is going on.  I have only a two-channel oscilloscope, but have 4 lines I need to monitor.  Also, monitoring the lines externally doesn't let us see whether the controller or a device is holding a line at 0V.

To solve this, I am adding a temporary debug BRAM to the IEC controller that logs the state of the four lines as output by the controller, and also the SRQ, DATA and CLK pins as input. This means that we can tell if the controller has released a line, but a device is holding it at 0V, but not the other way around.  We also track the state of the RESET line. It also tracks the state in the IEC controller state machine, so that I can easily visualise exactly what is going wrong and where.

I have set this up so that it takes a sample slightly faster than 1MHz, as this is the fastest that my controller or a device will change the lines. Thus a 4KB BRAM gets us about 4ms, which is longer than a single byte transfer. If necessary, I can add extra BRAMs to test longer transactions.

I have this more or less working now, and now need to make a tool to read this data out, and visualise it.  My hacky test bitstream framework isn't particularly fast, so I might have to think about ways I can speed it up from the ~20 bytes per second of debug RAM it is currently accessible at with the various overheads of what I hacked together.  But we'll start by just seeing how it goes.
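The numbers here are simple enough to check with a few lines of Python. This sketch assumes one byte of BRAM per sample and a sample rate of exactly 1MHz, neither of which is quite what the hardware does, so the results are only ballpark figures. The function names are my own.

```python
def capture_window_ms(bram_bytes, sample_rate_hz, bytes_per_sample=1):
    """Length of time covered by one full capture buffer, in milliseconds."""
    samples = bram_bytes / bytes_per_sample
    return samples / sample_rate_hz * 1000.0


def readout_seconds(bram_bytes, bytes_per_second):
    """Time needed to read the whole buffer back at a given transfer rate."""
    return bram_bytes / bytes_per_second


print(f"capture window: {capture_window_ms(4096, 1_000_000):.2f} ms")
print(f"readout time:   {readout_seconds(4096, 20):.0f} s")
```

So a 4KB buffer covers roughly 4ms of bus activity, but at around 20 bytes per second it takes a few minutes to read back, which is why a faster readout path would be welcome.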

Probably what would be most useful is to render a PNG image showing the waveform of all the signals, and indicating whether it is a device or the controller (or controller + device) that is holding each line at 0V.

With a bit of hacking, I have this working, generating PNG files of the traces for me:

Red indicates that the controller has pulled a line low (or both are pulling it low), while blue indicates that a device is pulling a line low. Black means neither is pulling the signal low. The yellow lines indicate the voltage visible on each line. The numbers underneath indicate which state the state machine was in at the time. This makes it really easy to match the events back to the VHDL.
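The colour coding follows from what the debug log can and cannot see. Here is a minimal Python sketch of that rule. The function names are hypothetical and this is not the code of my actual rendering tool.

```python
def trace_colour(controller_pulls_low, line_reads_low):
    """Pick the trace colour for one sample of one IEC line.

    The log only knows what the controller is driving and what the pin reads,
    so "controller only" and "controller + device" cannot be told apart.
    """
    if controller_pulls_low:
        return "red"    # controller (and possibly a device too) holds it low
    if line_reads_low:
        return "blue"   # controller has released it, so a device must hold it low
    return "black"      # nobody is pulling the line low


def line_is_high(controller_pulls_low, device_pulls_low):
    """On an open-collector bus, the line is high only if nobody pulls it low."""
    return not (controller_pulls_low or device_pulls_low)


print(trace_colour(True, True), trace_colour(False, True), trace_colour(False, False))
```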

Clearly I have some problems here, because RESET and ATN are both being asserted to 0V, as are in fact CLK and DATA.  I think there are some problems with the polarity sense of some or all of the lines, which I'll confirm with a multi-meter.

I'm away for a couple of days, so don't have a MEGA65 or 1541 with me here. I'd like to continue testing the controller however. So the best option for that is to implement a 1541 in VHDL, and use that to test with.  VHDL simulation is slow, however, so I'd like to skip the bulk of the initialisation stuff that the 1541 ROM does.  So I'm looking at a disassembly of the 1541 ROM to see how I might be able to patch it to achieve this.

So let's trace through:

1541 reset vector points to $EAA0. 

$EAA7 - $EB21 performs various tests of the ROMs and RAM. We can skip those, and go directly to $EB22. This is good, because those tests will take quite a large number of cycles.

$EB22 - $EB39 sets up VIA registers, so we need to keep that.

$EB3A - $EB4A sets the device number from the pads/jumpers, so we need to keep that.

$EB4B - $EBD4 sets up pointer tables, we need to keep that. However, along the way, it calls $F259 to initialise the disk controller. I'm suspecting that that might be a bit slow. We'll take a look at that in a moment.

$EBD5 - $EBE3 sets the initial DOS status to 73, and releases the CLK, DATA and ATNA lines. So we should keep that. But, hang on a minute -- what is the ATNA line? Does this mean that a 1541 can assert ATN ? (the C64, or MEGA65 for that matter, can't read it, however, so this is just an interesting curiosity).

$EBE4 then calls $E780 to see if we need to boot a system routine. Not quite sure what that means either yet. We'll add it to our list of things to investigate.

$EBE7 is the next instruction, and this routine is labeled "Idle loop. Wait for something to do." This seems to be the place we want to get to, for the ROM to start responding to commands.

So let's look at those routines at $F259 and $E780 to check for any nasty surprises.

$F259 is the controller initialisation, which then bleeds into checking the job queue, and doing the next thing.  It looks like it has to execute perhaps a hundred or so instructions. It then exits via $F99C which can take a while if the drive is active, but looks to be fairly benign if the drive is idle.

$E780 lets a 1541 load a file from disk into its own RAM and execute it on power on. I never knew that this was possible, without a computer commanding it via the & DOS command.  However, apparently, if you power a 1541 on with CLK and DATA both grounded, it will do exactly that. Anyway, for our purposes we don't need that. We just need to make sure that it _doesn't_ happen, as it will just waste a lot of cycles.  So we need to make sure that within the first few cycles of simulation, CLK and DATA are released high.

Now, in terms of read sense of the CLK and DATA lines, we can tell this from the ROM as well:  Both are inverted, i.e., reading a bit value of 0 means that the line is at 5V.  Clock is on port $1800 bit 2, and data on bit 0 of the same port.

So I think that's everything we need to know right now. 

Based on the above, if I patch the 1541 ROM to have NOP instructions ($EA) in $EAA7 - $EB21, that should save the bulk of boot time, before the drive is ready to listen.
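
As a rough sketch of that patch (in Python, purely for illustration), the following overwrites the self-test range with NOPs in a ROM image. It assumes a single 16KB image mapped at $C000 - $FFFF; the names `ROM_BASE` and `patch_out_self_tests` are made up for this example, and a block of zeroes stands in for the real ROM contents.

```python
ROM_BASE = 0xC000  # assumption: one 16KB 1541 ROM image mapped at $C000-$FFFF
NOP = 0xEA         # 6502 NOP opcode


def patch_out_self_tests(rom: bytes, start: int = 0xEAA7, end: int = 0xEB21) -> bytes:
    """Return a copy of rom with addresses start..end (inclusive) replaced by NOPs."""
    if len(rom) != 0x4000:
        raise ValueError("expected a 16KB ROM image")
    patched = bytearray(rom)
    lo, hi = start - ROM_BASE, end - ROM_BASE
    patched[lo:hi + 1] = bytes([NOP]) * (hi - lo + 1)
    return bytes(patched)


dummy_rom = bytes(0x4000)  # stand-in for the real ROM contents
patched_rom = patch_out_self_tests(dummy_rom)
print(patched_rom.count(NOP), "bytes replaced with NOP")
```

The CPU still has to step through the NOPs at 2 cycles each, but that is a few hundred cycles rather than a full ROM and RAM test.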

Now, in terms of implementing the 1541 in my VHDL test framework, I already have a partly implemented 1541 that was intended for use as a soft virtual 1541 in the MEGA65 (which I still intend to do).  So it's probably best to just finish that off.  It already has ROMs and RAMs mapped, and a dummy CPU, and 6522 VIA chips from another open-source project.  The good thing is that for the CPU, I don't need to implement all opcodes just yet, just the ones used in the 1541 ROM.  Doing it this way will also bring us closer to having the internal virtual 1541 implemented.

I've implemented a bunch of the instructions, and the CPU on the 1541 progresses, but ends up in a loop reporting a drive error. Now, when the 1541 boots up, it does declare the "error" 73 that contains the drive identification string. That looks to be what is happening here, because I can see the call to ERRTS0, that declares a drive error with track and sector both set to zero, which is what happens with the initial "error" 73.

However, because I haven't yet patched out the drive ROM and RAM tests, it takes a long time to reach that point. So I'll patch those out first, and then re-run the test to see whether it sets the error and gets to the main loop of the drive routine.

Ok, so we reach the code to set error 73 after 456 instructions.  However, it never finds the error message. I'm guessing this is because the ($nn,X) addressing mode might be broken.

I fixed those problems, and now after 2954 instructions it finds the tokenised error 73 message at $E718.  The 1541 ROM saves space by not storing the error messages as plain text, but rather has bytes that represent some of the common words that occur in the error messages. This means that it's a bit more complicated to trace the code through.

From there it gets to $E73D, and then starts to detokenise and output the error message.  The message gets written out using some STA ($A5),Y instructions, which should make it fairly easy to trace through, and verify that the 6502 CPU is working correctly for this part.

Let's trace those instructions:

Instr#:2968 PC: $E763, A:43, X:00, Y:03, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:2983 PC: $E763, A:42, X:00, Y:04, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:2998 PC: $E763, A:4D, X:00, Y:05, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3013 PC: $E763, A:20, X:00, Y:06, SP:3D NVxBDIZC=00100111,  Decoding i_sta, mode = m_inny
Instr#:3028 PC: $E763, A:44, X:00, Y:07, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3043 PC: $E763, A:4F, X:00, Y:08, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3058 PC: $E763, A:53, X:00, Y:09, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3073 PC: $E763, A:20, X:00, Y:0A, SP:3D NVxBDIZC=00100111,  Decoding i_sta, mode = m_inny
Instr#:3088 PC: $E763, A:56, X:00, Y:0B, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3103 PC: $E763, A:32, X:00, Y:0C, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3118 PC: $E763, A:2E, X:00, Y:0D, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3133 PC: $E763, A:36, X:00, Y:0E, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3148 PC: $E763, A:20, X:00, Y:0F, SP:3D NVxBDIZC=00100111,  Decoding i_sta, mode = m_inny
Instr#:3163 PC: $E763, A:31, X:00, Y:10, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3178 PC: $E763, A:35, X:00, Y:11, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3193 PC: $E763, A:34, X:00, Y:12, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3208 PC: $E763, A:31, X:00, Y:13, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3217 PC: $E6E5, A:2C, X:00, Y:14, SP:43 NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3235 PC: $E6B8, A:30, X:00, Y:15, SP:3F NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3241 PC: $E6B8, A:30, X:00, Y:16, SP:41 NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3245 PC: $E6EF, A:2C, X:00, Y:17, SP:43 NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3263 PC: $E6B8, A:30, X:00, Y:18, SP:3F NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:3269 PC: $E6B8, A:30, X:00, Y:19, SP:41 NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny

In other words, it writes out CBM DOS V2.6 1541,00,00. So far, so good!
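
Reading the message off by hand gets tedious, so here is a minimal sketch of how the stored text could be pulled out of such a trace automatically. It assumes the trace line format shown above, and that the stored bytes fall in the range where PETSCII and ASCII agree (upper-case letters, digits and punctuation). The helper name `stored_text` is made up for this example.

```python
import re

# Picks the A and Y register values out of a trace line for an STA (zp),Y store.
STORE = re.compile(
    r"A:([0-9A-F]{2}), X:[0-9A-F]{2}, Y:([0-9A-F]{2}).*i_sta, mode = m_inny"
)


def stored_text(trace: str) -> str:
    """Rebuild the buffer text from the (Y, A) pairs of each indirect-indexed store."""
    cells = {}
    for line in trace.splitlines():
        m = STORE.search(line)
        if m:
            cells[int(m.group(2), 16)] = int(m.group(1), 16)
    return "".join(chr(cells[y]) for y in sorted(cells))


sample = """\
Instr#:2968 PC: $E763, A:43, X:00, Y:03, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:2983 PC: $E763, A:42, X:00, Y:04, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
Instr#:2998 PC: $E763, A:4D, X:00, Y:05, SP:3D NVxBDIZC=00100101,  Decoding i_sta, mode = m_inny
"""
print(stored_text(sample))  # prints "CBM"
```

Note that Y starts at 3 in the trace, so the first three bytes of the buffer are written somewhere else and don't show up here.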

From there, it heads back to $EBDA, and then to $EBE7, which is the idle loop for the 1541:

Instr#:3291 PC: $EBE7, A:00, X:00, Y:19, SP:45 NVxBDIZC=00100110,  Decoding i_cli, mode = m_impl

So let's get back to why we were doing all this: I am trying to make hardware-accelerated IEC bus control for the MEGA65, and to debug that, it's handy to have an IEC bus client device in VHDL, so that I can debug it all under simulation, cycle by cycle.

Now that the CPU gets far enough, we need to wire-in the 6522 VIA IO chips of the 1541, so that we can actually connect it to the bus of our hardware-accelerated IEC controller.  Let's look at how this is done on the 1541:

This shows the connection of VIA1 at $1800 to the IEC bus:

Port B Bit 0 - DATA (input)
Port B Bit 1 - DATA (output)
Port B Bit 2 - CLK (input)
Port B Bit 3 - CLK (output)
Port B bit 4 - ATNA (output??)
Port B bit 7 - ATN (input)
CA1 - ATN (input)

Well, the first thing I noticed here, is that the 1541 seems to be able to assert the ATN line, which I never knew was possible. This is for one of the early board revisions, and this capability might have been removed on later revisions.

Otherwise, it looks pretty sensible. ATN is tied to both PB7 and CA1, presumably so that ATN can be easily read on the data port, but can also trigger an interrupt via the CA1 line.

The CLK and DATA output lines will pull the IEC lines low when those signals are high, thanks to the 7406 inverters, or at least that's how it looks to me at first glance. Yes, looking at the 1541 ROM at $EBE8 we can see that to release all IEC lines, it ANDs $1800 with #$E5, i.e., clearing bits 1, 3 and 4.
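
To double-check that mask arithmetic, here is a small sketch using the bit assignments listed above. The constant and function names are mine, not from the ROM or the VHDL, and ATNA is deliberately left out of `drive_pulls` because I haven't yet worked out exactly what it does.

```python
# VIA1 port B ($1800) bit assignments, as listed above
DATA_IN, DATA_OUT = 0x01, 0x02
CLK_IN, CLK_OUT = 0x04, 0x08
ATNA, ATN_IN = 0x10, 0x80

RELEASE_MASK = 0xE5  # value the ROM ANDs into $1800 to release the bus


def release_all(port_b: int) -> int:
    """Clear the DATA, CLK and ATNA output bits, leaving the other bits alone."""
    return port_b & RELEASE_MASK


def drive_pulls(port_b: int) -> dict:
    """Which lines the drive pulls to 0V: an output bit of 1 pulls the line low."""
    return {"DATA": bool(port_b & DATA_OUT), "CLK": bool(port_b & CLK_OUT)}


assert RELEASE_MASK == 0xFF & ~(DATA_OUT | CLK_OUT | ATNA)
print(drive_pulls(0x0A))               # both outputs set, so both lines pulled low
print(drive_pulls(release_all(0x0A)))  # nothing pulled low after the AND
```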

So let's hook those all up, and add some debug logic to track the IEC bus state in simulation.  One of the nice things about doing this in simulation, is that we can see who is pulling a line low at any given time, which really helps with protocol debugging.
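
The idea of tracking who is pulling a line low can be sketched in a few lines of Python. This is only a toy model of an open-collector line, not the VHDL debug logic itself, and the class name `OpenCollectorLine` is made up for the example.

```python
class OpenCollectorLine:
    """A bus line that reads 0 if anyone pulls it low, and 1 otherwise."""

    def __init__(self, name: str):
        self.name = name
        self.pulling = set()  # names of the parties currently pulling the line low

    def pull_low(self, who: str) -> None:
        self.pulling.add(who)

    def release(self, who: str) -> None:
        self.pulling.discard(who)

    @property
    def level(self) -> int:
        return 0 if self.pulling else 1

    def __str__(self) -> str:
        if not self.pulling:
            return f"{self.name}=1"
        return f"{self.name}=0({'+'.join(sorted(self.pulling))})"


data = OpenCollectorLine("DATA")
data.pull_low("DRIVE")
data.pull_low("C64")
print(data)  # DATA=0(C64+DRIVE)
data.release("DRIVE")
print(data)  # DATA=0(C64)
data.release("C64")
print(data)  # DATA=1
```

On real hardware only the `level` is observable; the set of who is pulling is exactly the extra information that simulation gives us.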

After a few hours debugging all the plumbing, and getting all the inversions of signals along the way correct, the VHDL 1541 now correctly triggers an IRQ when the ATN line is pulled low by the controller.  Things then go to pot, because I hadn't implemented interrupts properly, let alone the RTI instruction. So I better fix those, then see how it goes.  

Okay, with that, we are now catching the ATN signal, and starting to receive a byte. The routine to handle an ATN request at $E85B is reached. So now to trace that through...

We get to $E884, which is the point where the ATN service routine calls the IEC read byte routine.  Now we are getting to the juicy part, where we can get useful information about what is going on.

We then get to ACP03 at $EA0B, which is the loop to receive each bit of the byte. This seems to go through correctly several times, before getting stuck. Looking closer, it looks like the IEC transmission is getting out of sync with the receiver quite badly.  This is why simulation is good, because I can now tweak my IEC trace generator program to parse the output of the simulation file, and draw me a nice trace of the signals -- and because under simulation we know who is pulling a line low, we can see _exactly_ what is going on, and going wrong.

I've opted for a textual output instead, partly because I can have it tell me more of the story that way. I have it interleave information about the state of the bus line, including who is pulling which lines low, as well as the IEC bus state in the IEC controller, as well as where the 1541 ROM is currently executing. This gives a great deal of transparency as to what is going on.  This is what we are seeing at the moment:

  +12117.426 : ATN=0, DATA=1, CLK=1
       +0.025 : ATN=0, DATA=1, CLK=0(C64)
            iec_state = 122
            iec_state = 123
$E85B            1541: Service ATN from C64
     +295.341 : ATN=0, DATA=0(DRIVE), CLK=0(C64)
       +0.074 : ATN=0, DATA=0(DRIVE), CLK=1
$E9C9            1541: ACPTR (Serial bus receive byte,pc)
      +93.064 : ATN=0, DATA=1, CLK=1
            iec_state = 128
            iec_state = 129
       +0.099 : ATN=0, DATA=1, CLK=0(C64)
            iec_state = 131
       +4.370 : ATN=0, DATA=0(C64), CLK=0(C64)
            iec_state = 132
            iec_state = 133
      +14.988 : ATN=0, DATA=0(C64), CLK=1
            iec_state = 135
      +20.001 : ATN=0, DATA=1, CLK=0(C64)
            iec_state = 137
       +5.012 : ATN=0, DATA=0(C64), CLK=0(C64)
            iec_state = 139
      +14.988 : ATN=0, DATA=0(C64), CLK=1
$EA0B            1541:   Receive bit of serial bus byte
            iec_state = 141
      +20.001 : ATN=0, DATA=1, CLK=0(C64)
            iec_state = 143
       +5.012 : ATN=0, DATA=0(C64), CLK=0(C64)
$EA1A            1541:   Got bit of serial bus byte
            iec_state = 145
      +14.988 : ATN=0, DATA=0(C64), CLK=1
            iec_state = 147

The +numbers are the time in microseconds since the last change to the bus lines.
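
Because the times are deltas, it can help to turn them into running totals when comparing against a timing diagram. A minimal sketch, assuming the line format shown above (the function name `absolute_times` is made up, and time zero is simply whatever bus change preceded the excerpt):

```python
import re

# A bus-change line looks like "   +295.341 : ATN=0, DATA=0(DRIVE), CLK=0(C64)"
EVENT = re.compile(r"^\s*\+(\d+\.\d+) : (.*)$")


def absolute_times(trace: str):
    """Return (microseconds since start of excerpt, bus state) for each bus change."""
    t = 0.0
    events = []
    for line in trace.splitlines():
        m = EVENT.match(line)
        if m:
            t += float(m.group(1))
            events.append((round(t, 3), m.group(2)))
    return events


sample = """\
  +12117.426 : ATN=0, DATA=1, CLK=1
       +0.025 : ATN=0, DATA=1, CLK=0(C64)
            iec_state = 122
$E85B            1541: Service ATN from C64
     +295.341 : ATN=0, DATA=0(DRIVE), CLK=0(C64)
"""
for when, state in absolute_times(sample):
    print(f"{when:12.3f}  {state}")
```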

What we can see from the above, is that the controller (marked as "C64") is waggling the data lines sending data bits, before the 1541 is ready to receive the first bit of data.

The problem is that we are releasing CLK too fast after seeing a drive pull DATA low.  The timing diagrams don't claim that any particular delay is required here, but clearly it is required. Looking at the C64's ROM disassembly, it uses a 1ms delay. So we have to wait at least 1ms before releasing CLK. That will be easy to fix.

I've fixed a bunch of problems there, but am finding that the 20usec for each half of the clock for a bit seems to be too fast for bytes sent under ATN to the 1541.  The loop in the 1541 is from $EA0B to $EA28.

First, it clears the EOI flag (5 cycles), then reads from $1800 to check the CLK line (minimum 8 cycles), then it grabs and stashes the bit (12 cycles).  Then it checks the state of the ATN line (19 cycles), does a de-bounced read of the clock line (16 cycles), decrements the bit counter and branches back to the top (8 cycles).  

In other words, the 1541 ROM requires at least 5 + 8 + 12 + 19 + 16 + 8 = 68 cycles to read a bit under ATN.  That's way more than the 20 + 20 = 40 listed in the reference guide. This totally explains why it's missing bits.
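
The same budget written out as a quick calculation, using the cycle counts above and assuming the 1541's CPU runs at 1MHz, so that one cycle is one microsecond:

```python
CPU_HZ = 1_000_000  # assumption: the 1541's 6502 is clocked at 1MHz

# Cycle counts for one pass through the receive loop at $EA0B - $EA28
steps = {
    "clear EOI flag": 5,
    "read $1800 and check CLK": 8,
    "grab and stash the bit": 12,
    "check ATN": 19,
    "de-bounced read of CLK": 16,
    "decrement bit counter and branch": 8,
}

total_cycles = sum(steps.values())
usec_per_bit = total_cycles * 1_000_000 / CPU_HZ
spec_usec_per_bit = 20 + 20  # both halves of the clock, per the reference guide

print(total_cycles, "cycles =", usec_per_bit, "usec per bit")
print("shortfall vs reference guide:", usec_per_bit - spec_usec_per_bit, "usec")
```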

70 usec would seem to be the bare minimum that we should try, to allow for some clock drift between host and device.  And, with that, I do now get the byte sent completely, and acknowledged by the drive. That's great!

Now, does the drive receive the byte that I think I sent to it? The completed byte gets loaded from location $85 at $EA2B, so I can see that in my VHDL simulation as well.  I am writing $28 to call device 8 to attention.  However, what gets loaded from $85 is not $28. Instead, we seem to load $AB.  

Hmm.. $AB = 10101011 in binary, which doesn't really look like any real relation of $28. The bits get stored using an ROR $85 instruction at $EA18.  Indeed I am seeing that sequence of bits that makes $AB get received and stashed. So the CPU looks to be doing the right thing.  Am I sending the right bits? Yes, it looks like I am.  So it must be something funny in my quickly cobbled together 6502 for the drive.  I'm suspecting the LSR A instruction as a likely culprit. Yup: Spotted a stupid error: The carry flag was being set from bit 0 of the last memory access, rather than bit 0 of the accumulator.  With that fixed, the VHDL 1541 correctly receives the byte $28 that I sent to it :)
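
For reference, this is how the LSR A / ROR $85 pair is meant to behave, as a simplified Python sketch. It ignores the port read and inversion that the real ROM does, and just assumes the sampled bit is sitting in bit 0 of the accumulator. IEC sends the least significant bit first, so after eight rotates the first bit received ends up in bit 0.

```python
def lsr_a(a: int):
    """LSR A: result is A shifted right, carry comes from bit 0 of A."""
    return (a >> 1) & 0x7F, a & 1


def ror_mem(value: int, carry_in: int):
    """ROR: carry goes into bit 7, the old bit 0 falls out into carry."""
    return ((carry_in << 7) | (value >> 1)) & 0xFF, value & 1


def receive_byte(bits_lsb_first) -> int:
    """Assemble a byte the way the receive loop does, one bit per pass."""
    byte = 0
    for bit in bits_lsb_first:
        _, carry = lsr_a(bit)           # sampled bit is in bit 0 of A (simplification)
        byte, _ = ror_mem(byte, carry)  # rotate it into the top of the byte
    return byte


bits_for_28 = [(0x28 >> i) & 1 for i in range(8)]  # LSB first
print(f"${receive_byte(bits_for_28):02X}")  # prints "$28"
```

With the carry taken from the wrong place, `receive_byte` would assemble whatever bits happened to be in the last memory access instead, which is exactly the sort of garbage I was seeing.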

Okay, so we are now at the point where we can send bytes under attention.  Or rather, at least one byte under attention.  I'm thinking I will continue the testing by asking the VHDL 1541 to send the contents of the error channel, i.e., to talk on channel 15.  This will require sending multiple bytes under ATN, as well as doing a turn-around to listener for the controller, and then receiving bytes from the IEC bus. Specifically, we have to send $48 then $6F under ATN, do a turn-around to listen, and then read bytes from the IEC bus.
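
Those command bytes follow the usual IEC pattern of a command base OR'd with the device or channel number. A small sketch (the function names are just for illustration):

```python
def listen(device: int) -> int:
    """LISTEN command byte: $20 + device number."""
    if not 0 <= device <= 30:
        raise ValueError("device number out of range")
    return 0x20 | device


def talk(device: int) -> int:
    """TALK command byte: $40 + device number."""
    if not 0 <= device <= 30:
        raise ValueError("device number out of range")
    return 0x40 | device


def secondary(channel: int) -> int:
    """Secondary address byte: $60 + channel number."""
    if not 0 <= channel <= 15:
        raise ValueError("channel out of range")
    return 0x60 | channel


print(f"${listen(8):02X}")                        # $28, as sent earlier
print(f"${talk(8):02X} ${secondary(15):02X}")     # $48 $6F
```

The device number is a parameter here, so the same helpers work for whatever device number the drive is actually configured for.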

Actually, it needs to be $4B then $6F, because the VHDL 1541 doesn't have its "dip switches" set to device 8 by default, but rather device 11.  With that, I can confirm that the 1541 is indeed seeing that it needs to talk, and having its secondary address set using my nice little program that parses the VHDL simulation output, to show what the simulation is trying to do, and what the 1541 is doing, by peeking at its program counter.  This is what I see as output now (with the boring bit-by-bit stuff cut out for brevity):

       +0.006 : ATN=1, DATA=1, CLK=1
$EBE7            1541: Enter IDLE loop
            iec_state = 100
            iec_state = 120
   +12117.426 : ATN=0, DATA=1, CLK=0(C64)
            iec_state = 121
$E85B            1541: Service ATN from C64
     +295.366 : ATN=0, DATA=0(DRIVE), CLK=0(C64)
            iec_state = 122
     +724.587 : ATN=0, DATA=0(DRIVE), CLK=1
            iec_state = 124
$E9C9            1541: ACPTR (Serial bus receive byte)
$EA2B            1541: ACP03A+17 Got all 8 bits
$E89F            1541: Received TALK command for device

$E9C9            1541: ACPTR (Serial bus receive byte)
$E9CD            1541: ACP00A (wait for CLK to go to 5V)
$E9CD            1541: ACP00A (wait for CLK to go to 5V)
            iec_state = 100
           iec_state = 122
    +2603.796 : ATN=0, DATA=0(DRIVE), CLK=1
            iec_state = 124
      +50.643 : ATN=0, DATA=1, CLK=1
            iec_state = 125
$E9DF            1541: ACP00 (saw CLK get released)
     +54.767 : ATN=0, DATA=0(DRIVE), CLK=0(C64)
$EA2B            1541: ACP03A+17 Got all 8 bits
$E8BE            1541: Received secondary address

So next step is to implement the turn-around from talker to listener, and add that to the test, ready for actually reading data.

The IEC turn-around is a special jiggle of the CLK and DATA lines so that the two devices talking switch who is the sender.

At the start, the computer is holding CLK low, and the device is holding DATA low.  Then the computer pulls DATA low as well, and then soon after releases CLK.  The device being asked to talk then pulls CLK low, and then soon after releases DATA.  At this point, DATA is being held low by the computer, to say that it's not yet ready to receive a byte of data, and the device doing the talking is also holding CLK low to say that it doesn't yet have data ready to send. But before all that, we have to release ATN and wait at least 20usec.

In short, it's really not that complicated, even though at first glance it sounds a bit byzantine. And indeed the implementation is quite simple:

            -- Wait 20 usec, release ATN, wait 20usec
            -- Computer pulls DATA low and releases CLK.
            -- Device then pulls CLK low and releases DATA.
          when 200 => micro_wait(20);
          when 201 => a('1'); micro_wait(20);
          when 202 => d('0'); c('1'); micro_wait(20);
          when 203 => milli_wait(64); wait_clk_low <= '1';
          when 204 =>
            if iec_clk_i = '0' then
              report "IEC: TURNAROUND complete";
              iec_state <= iec_state + 2;
            else
              -- Timeout
              report "IEC: TURNAROUND TIMEOUT: Device failed to turn-around to talker within 64ms";
              iec_state_reached <= to_unsigned(iec_state,12);
              iec_state <= 0;
              iec_devinfo <= x"00";
              iec_status(1) <= '1'; -- TIMEOUT OCCURRED ...
              iec_status(0) <= '1'; -- ... WHILE WE WERE TALKING
              iec_busy <= '0';
              -- Release all IEC lines
            end if;
          when 205 =>

            -- Device is present
            iec_devinfo(7) <= '1';
            iec_busy <= '0';

            -- Device is now talking
            iec_dev_listening <= '0';
            -- We are no longer under attention
            iec_under_attention <= '0';
            iec_devinfo(4) <= '1';
            iec_state_reached <= to_unsigned(iec_state,12);
            iec_state <= 0;

After this, when I run a test using it, we see that the 1541 is ready and waiting to send a byte:

      +54.347 : ATN=0, DATA=0(DRIVE), CLK=0(C64)
$EA2B            1541: ACP03A+17 Got all 8 bits
$E8BE            1541: Received secondary address
            iec_state = 200
            iec_state = 201
    +2638.711 : ATN=1, DATA=0(DRIVE), CLK=0(C64)
            iec_state = 202
      +20.001 : ATN=1, DATA=0(C64+DRIVE), CLK=1
$E8F1            1541: TURNAROUND (Serial bus wants to become talker)
            iec_state = 203
      +30.149 : ATN=1, DATA=0(C64), CLK=1
      +19.235 : ATN=1, DATA=0(C64), CLK=0(DRIVE)
            iec_state = 204
            iec_state = 206
$E909            1541: TALK (Serial bus wants to send a byte)
     +125.534 : ATN=1, DATA=0(C64), CLK=1

The state changes in the trace above tell the story: 

1. We start with ATN asserted (=0V) at the end of sending the secondary address, which the 1541 correctly receives.  

2. Then the computer releases the ATN line (ATN=1). 

3. The computer then waits 20usec and asserts DATA to 0V, and releases the CLK line.  DATA was already asserted by the drive, so now both are asserting it (this ability to see who is pulling a line low is only possible under simulation like this, and is tremendously helpful).

4. The drive then stops pulling DATA low, leaving the computer with sole control of it, and then pulls the CLK line low -- it is now the talker, and the computer is now the listener.

5. Shortly thereafter, the 1541 drive reaches the code where it is ready to send the first byte.

I am expecting that first byte to be the digit 7 (=$37 in PETSCII), i.e., the first digit of the drive status message on boot. But we won't know for sure, until we implement IEC byte reception in our controller. But that's going to have to wait for the next blog post.

Saturday 21 October 2023

Debugging SDRAM and HyperRAM cache and other related things

The R4 board has both the 8MB HyperRAM that is on the R3/R3A board, as well as a new 64MB SDRAM.  For the R4 board, I have made the bitstream allow run-time selection between the two, but with the 8MB HyperRAM active by default, so that people don't accidentally make software that won't work on an R3/R3A board.

The problem is that in the process of doing this, it seems that I have tickled a bug in the HyperRAM access that is causing test_585 and test_710 to both fail.  test_710 I can't find the source for, but that's ok, as I have test_585 source here, and can confirm that it fails.

First, a bit of background about how HyperRAM and SDRAM accesses work:

Unlike the 384KB of fast chip RAM in the MEGA65, the HyperRAM or SDRAM are external to the FPGA and the processor.  They are in fact connected via the same mechanism that is used to access the cartridge port.  This adds some extra cycles to each access, making it quite slow. To combat this slowness, the expansion RAM interface presents an 8 byte cache line to the CPU, to allow it to read successive addresses much more quickly, with just a single wait-state, allowing 20MB/sec for linear reads.

Like all caches, things are never quite as easy as one would like.  Writes to the expansion RAM bypass the cache, and are written directly to the expansion RAM.  This leaves the problem that the cache line is not updated, but still contains the old value. Thus if you read, write and then again read from addresses that are all in the same cache line (i.e., within the same 8-byte-aligned range of 8 bytes), the second read will read the original value.

Well, except that this shouldn't happen, because when the expansion RAM is written to, the cache line is marked as invalid. And indeed, that does work fine.
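
Here is a toy Python model of that behaviour, with and without the invalidate-on-write step. It is only meant to illustrate the read, write, read sequence described above; the class name and the 64-byte RAM size are arbitrary, and the real logic lives in the VHDL.

```python
class CachedExpansionRAM:
    """Single 8-byte read cache line in front of a RAM; writes bypass the cache."""

    LINE = 8

    def __init__(self, invalidate_on_write: bool = True):
        self.ram = bytearray(64)
        self.line_base = None  # base address of the cached line, or None if invalid
        self.line = b""
        self.invalidate_on_write = invalidate_on_write

    def read(self, addr: int) -> int:
        base = addr - addr % self.LINE
        if self.line_base != base:  # miss: fetch the whole aligned 8-byte line
            self.line_base = base
            self.line = bytes(self.ram[base:base + self.LINE])
        return self.line[addr - base]

    def write(self, addr: int, value: int) -> None:
        self.ram[addr] = value & 0xFF  # goes straight to RAM, not into the line
        if self.invalidate_on_write:
            self.line_base = None


for invalidate in (False, True):
    ram = CachedExpansionRAM(invalidate_on_write=invalidate)
    before = ram.read(1)
    ram.write(1, 0x7B)
    after = ram.read(1)
    print(f"invalidate={invalidate}: before=${before:02X} after=${after:02X}")
```

Without the invalidate, the second read still returns $00 even though $7B is sitting in the RAM; with it, the line is refetched and the new value appears.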

However, we do see some other odd effects.

If we write $00, $FF, $55 and $AA in turn to successive bytes of the expansion RAM, we get some rather strange effects:

Notice how only the first byte in the block of 8 holds its address.  The others seem to magically get the $AA value, even before it has been written. And that's actually a clue there:  It turns out that this particular issue is caused by the SDRAM having different timing compared to what was expected: It has one more cycle delay before a value is read.  

We can see this more clearly if I make a program that writes all zeroes to a block of 8 addresses, and then sets exactly one of them to something else, to see where it gets read back:

Notice that in all cases the first 2 bytes are read back as $99, no matter where $99 is written within the 8-byte block, and that the $99 that has really been written has appeared two addresses late.  As the SDRAM reads 16 bits per cycle, this corresponds to that single cycle delay I mentioned before.

That one cycle delay is something that I already know about, so I have made this switchable in the bitstream by clearing bit 5 in $FFD37FE.  If we do this, we now see the writes being read back in the correct addresses:

Note that this is all with the SDRAM, rather than the HyperRAM. That's okay, as we will work through the SDRAM, and then move to the HyperRAM, once the issues are resolved with SDRAM access. Most likely the issues are common to the two, anyway, so fixing for one will fix for the other.

Okay, so let's try our first program again, that writes $00, $FF, $55 and $AA to each of the first 8 addresses.

Well, that's an improvement: It now works fine for the even addresses. It's just the odd addresses that fail.  Remember I said before that the SDRAM has a 16-bit wide bus? I'm finding it quite suspicious that all the failing bytes are for the same half of those 16 bits.

But we know that writing a value to one of the addresses only after having just cleared the rest of them doesn't exhibit this problem, as we saw when we last ran the program starting at line 100.  So it's not as simple as the odd addresses not writing or reading properly from the SDRAM.

Our next clue is that the output from running the program changes based on whether we have just run the program at line 100 or not:

This, I think, is the cache playing up. I can test this theory by forcing the cache line to invalidate by reading from an address not within the cache line.

Aha! Now it's fine, because I force the cache to invalidate with line 35.

Ah, it's not about odd or even addresses, it's about the 2nd consecutive address within a cache line, as the following shows:

So let's see if I can reduce the failing case to something as simple as possible, so that I can more easily test it under simulation of the VHDL.

This is as simple as it gets, while still reproducing the problem:

This should print 1, not 123, but because the cache line for $8000000-$8000007 has been read by the PEEK in line 320, when we PEEK again in 340, we see the old data. Note that the new value has been written to the SDRAM, it's just that the cache line has not been refreshed.

Playing around with things, this incorrect behaviour happens only when slow_prefetch_enable is enabled in the CPU.  Enabling or disabling the cache line in the CPU doesn't have any effect. That is, it isn't the super-fast short-cut logic for the external RAM specifically that is playing up, but rather the logic for prefetching data from expansion devices generally. This also means that this bug is not specific to the SDRAM, but applies also to HyperRAM.

So, POKE $FFD37FE,$14 fixes that problem for the SDRAM, and then the SDRAM seems to work perfectly for me.

HyperRAM still has problems, essentially not working when selected. I'm strongly suspecting a plumbing problem with the HyperRAM connection into the rest of the design.

Yup: Found one problem there: hyperram_data_ready_toggle isn't plumbed into the HyperRAM module.  That will require a synthesis run, so we will investigate that more tomorrow.

Okay, with that, I am now getting data out of the HyperRAM again.  However, it is shifted by one byte when reading back, as we can see here:

This bug is proving a bit more interesting, i.e., not having an obvious cause.  The HyperRAM controller has some debug registers at $BFFFFFx and $B0000xx that let us get some idea of what is going on.

Those registers suggest that the HyperRAM is getting the correct data written to it, and that the cache lines as read in the HyperRAM controller are being read correctly.  So it's presumably happening in the way that the bytes are being fed out to the expansion RAM controller and the CPU.

Here's an interesting little program I wrote to look into this behaviour:

The first big fat clue is that the value written to $8111111 is what gets read from $8000000. This says to me that there is some funny thing going on with the data presentation from the HyperRAM to the CPU when reading byte by byte.  

Note that this is all with the HyperRAM cache line logic disabled. i.e., we should just be getting slowish single byte transactions from the CPU being fed back via the expansion RAM controller. Thus we can exclude the cache from being part of the problem. Something is causing it to return the byte values one transaction late, it would seem. 

We can confirm this theory by shifting the range of bytes read up a few bytes, so that they aren't even aligned to a cache line 8-byte boundary, and observe whether we are seeing the exact same effect, which we do:

This all also suggests that if we read an address twice in a row, it should magically look right. Let's try that:

And look at that -- it's magically fixed :) Well, rather, we have worked around the problem with this little trick. Clearly we need to fix this at the root cause.  

So we now know that the symptom is that the byte from the previous read request is the byte value that is returned.  Now to figure out why this happens.

This is the relevant snippet of VHDL:

      if data_ready_strobe_hold = '0' then
        if fake_data_ready_strobe='1' then
          report "asserting data_ready_strobe via fake_data_ready_strobe";
        end if;
        data_ready_strobe <= fake_data_ready_strobe;
        if fake_data_ready_strobe='1' then
          report "DISPATCH: holding data_ready_strobe via fake data = $" & to_hstring(fake_rdata);
          rdata <= fake_rdata;
          rdata_hi <= fake_rdata_hi;
        end if;
        report "data_ready_strobe delayed by 1 cycle to ensure data setup";
        report "asserting data_ready_strobe";
        data_ready_strobe <= '1';
        data_ready_toggle_out <= not data_ready_toggle_int;
        data_ready_toggle_int <= not data_ready_toggle_int;
      end if;
      data_ready_strobe_hold <= '0';

All that stuff about fake_ prefixed signals is because to meet timing closure, the HyperRAM controller needed an extra drive cycle to push to rdata and the strobe line.  This was done at the time by "faking" the output with the fake_ signals, and then presenting them a cycle later, and with the strobe signal being delayed by a further cycle, to make sure that the data lines had had time to settle in the previous cycle. This is required because the HyperRAM runs at 2x the clock rate of the CPU, and so we need to make sure that everything looks consistent from the slower perspective of the CPU.

Anyway, I'm highly suspicious that the problem lies here somewhere, especially since the logic doesn't seem to make a great deal of sense to me right now!

So what do we really need to do, whenever we have new data to present?

1. Present the data to rdata and rdata_hi.

2. On the following cycle, assert the data_ready_strobe signal for 2 clock cycles, to make sure it gets accepted by the 1/2 clock speed receiver. Also toggle data_ready_toggle.
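
To see why the strobe needs to be two fast cycles wide, here is a toy Python model of a receiver that only samples on every second cycle of the fast clock. It is a sketch of the reasoning rather than of the actual VHDL, and the function names are made up.

```python
def receiver_sees_strobe(start: int, width: int, phase: int, cycles: int = 16) -> bool:
    """True if a receiver sampling every 2nd fast cycle catches the strobe pulse."""
    return any(start <= t < start + width for t in range(phase, cycles, 2))


def receiver_sees_toggle(change_at: int, phase: int, cycles: int = 16) -> bool:
    """True if the receiver sees the toggle line differ from its previous sample."""
    last = 0
    for t in range(phase, cycles, 2):
        level = 1 if t >= change_at else 0
        if level != last:
            return True
        last = level
    return False


for phase in (0, 1):
    one = [receiver_sees_strobe(s, 1, phase) for s in range(1, 9)]
    two = [receiver_sees_strobe(s, 2, phase) for s in range(1, 9)]
    tog = [receiver_sees_toggle(s, phase) for s in range(1, 9)]
    print(f"phase {phase}: 1-cycle strobe always seen: {all(one)}, "
          f"2-cycle strobe always seen: {all(two)}, toggle always seen: {all(tog)}")
```

A one-cycle strobe is missed whenever it lands between two samples, while a two-cycle strobe or a toggle is caught regardless of the phase relationship between the two clocks.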

Ok, so I have done that, and resynthesised.  Now the behaviour of the HyperRAM and the SDRAM is the same when caching is disabled.  However, it's still incorrect: The data from the previous transaction is still being read, instead of the correct data.

Let's warm up the unit tests I wrote for the SDRAM controller back when I was implementing it, and see what those can tell us.

==== Summary ========================================================================================
pass lib.tb_sdram_controller.SDRAM starts busy, and becomes ready                 (16.3 seconds)
pass lib.tb_sdram_controller.Can read config registers                            (13.3 seconds)
fail lib.tb_sdram_controller.Write and read back single bytes                     (12.5 seconds)
fail lib.tb_sdram_controller.Write and read back single bytes from different rows (12.6 seconds)
fail lib.tb_sdram_controller.Cache line correctly populated                       (12.8 seconds)
pass 2 of 5
fail 3 of 5

Well, that's certainly confirmation that something is wrong.  Wrong in simulation is good, because it can be iterated quickly to find out the cause.

Only the first 6 bytes of the 8-byte cache line are being read.   After a bit of hunting around, I found the cause: The byte mask bits for the SDRAM were not being set for every cycle during the read of a cache line.  This doesn't explain some of the other issues we are seeing, but it is certainly a real issue that we need to fix. I am synthesising a fix for it now to test.

So now that the old tests are all working, it's time for me to try implementing the access sequence that is resulting in reading from the previous transaction.

In our latest test program, I did:

POKE $8111111,$77
A = PEEK($8123456)
B = PEEK($8111111)
C = PEEK($8111111)

The values of B and C should both be $77.

To be more complete, we can make it like this:

POKE $8111111,$77
POKE $8123456,$99
A = PEEK($8123456)
B = PEEK($8111111)
C = PEEK($8111111)

What we now see is that B = $99 and C = $77

So now to make a VHDL test that can reproduce this effect. Well, we already have read and write tests that operate on just the SDRAM controller, and those pass.  So the problem must be in either the slow_devices or gs4510 modules.

I've patched up the test_hyperram.vhdl and simulatehyperram2 Makefile targets to work again, updating GHDL to version 4.0.0 in the process.     

Found the first part of the problem: The timeout logic in slow_devices.vhdl was broken, causing all reads to time out.  This explains the behaviour of seeing the result of the previous read, as that would be what the connected device would be presenting from last time.
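To see why a broken timeout produces exactly that pattern, here is a minimal behavioural sketch in Python (not the VHDL). It assumes every read times out and that the CPU is handed whatever byte the device last presented; the class and method names are made up for illustration.

```python
class StaleReadBus:
    """Toy model: every read times out, so the CPU sees the byte left
    over from the previous read. All names here are hypothetical."""

    def __init__(self):
        self.mem = {}
        self.last_presented = 0x00  # whatever the device drove last

    def poke(self, addr, value):
        self.mem[addr] = value & 0xFF

    def peek(self, addr):
        # The read times out: return the stale byte from last time...
        stale = self.last_presented
        # ...and only afterwards does the device present the real data.
        self.last_presented = self.mem.get(addr, 0x00)
        return stale


bus = StaleReadBus()
bus.poke(0x8111111, 0x77)
bus.poke(0x8123456, 0x99)
a = bus.peek(0x8123456)
b = bus.peek(0x8111111)
c = bus.peek(0x8111111)
print(f"A=${a:02X} B=${b:02X} C=${c:02X}")
```

In this model B comes back as $99 and C as $77, the same as the BASIC test shows on hardware.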

Now the problem is that it gets into an infinite loop retrying the read.  Fixed that, and now it is back to timing out, but only after the allowed 64 clock cycles.

Now to find out why it doesn't realise that the value has been read.  

hyperram.vhdl publishes the memory read value with 27 cycles to go, but it never gets detected by slow_devices.vhdl.  I'm guessing the data ready toggle is not being properly plumbed through.

Well, that looks promising: With it correctly plumbed, it seems to be working now. The test runs to completion, with no read errors.  So time to cook a bitstream while I sleep...

Meanwhile, I got a fix for the GHDL bug from their great maintainer, Tristan, so I can try simulation again to track it down.

To do this, I have put some test code in the reset entry point of the hypervisor:


        ;; get CPU state sensible

    ;; XXX - Debug HyperRAM read bug
    ;; The following in BASIC tickles it:
    ;;     POKE $8111111,$77
    ;;     POKE $8123456,$99
    ;;     A = PEEK($8123456)
    ;;     B = PEEK($8111111)
    ;;     C = PEEK($8111111)
    ;; For simplicity, we will use two different addresses in the same
    ;; page of HyperRAM, but in different cache rows.
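    lda #$00        ;; build the 32-bit pointer $08000000 in $00-$03, low byte first
    sta $00
    sta $01
    sta $02
    lda #$08        ;; top byte is $08, as the target address is $8000000
    sta $03
    ldz #$00
    lda #$77
    sta [$00],z     ;; write $77 to $8000000
    lda #$99
    ldz #$08
    sta [$00],z     ;; write $99 to $8000008 (the next cache row)
    lda [$00],z     ;; A: read $8000008, expect $99
    ldz #$00
    lda [$00],z     ;; B: read $8000000, expect $77
    lda [$00],z     ;; C: read $8000000 again, expect $77
    foopl    jmp foopl     ;; spin here so the simulation has a fixed end point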

In theory, this should do just what the BASIC code does, and give equivalent results.  I don't have hardware with me right now, or I would make a test program that I can run on that with exactly these instructions in it to test the behaviour.

At 3285ns I can see the slow_devices.vhdl module accepting the initial write request to address $8000000.

And then I found an interesting clue: the busy flag from the HyperRAM is staying asserted.  It looks like it is continuously doing cache pre-fetches or something. This would explain the behaviour of the bug: The HyperRAM gets around to servicing the read or write request some time later, after the slow_devices module has given up waiting. In that situation, the slow_devices module returns the last byte that the HyperRAM presented. That is, we would consistently see the value from the previous read transaction.

So next step will be for me to figure out why the HyperRAM is staying busy.

Well, it might be a red herring after all, as the HyperRAM requires a 150usec start up delay to get itself organised. 

I disabled the 150usec delay in the HyperRAM, and now can simulate it... except that the bug doesn't show up.  

So I ported the program to a C64 mode program and ran it there, and it also doesn't tickle the bug. How annoying. But also a bit of a relief, because a bug that showed up there but could not be reproduced in simulation would be much harder to track down.

So what do we need to do to make the bug show up on real hardware, using a program that we can also run under simulation?

Is it that we need a big fat time delay between the reads? That doesn't seem to make a difference.

I've gone back to looking at the test_585 code that does still fail reliably.  I now want to make a super minimal assembly program from that that fails on real hardware. 

So, that is failing for me. But after some hair pulling as I pulled that apart, it's now also failing for the simple BASIC test.

This is really frustrating, because it makes it feel like something is unstable, and depending on glitches, which means it might not be detectable under simulation.

The overall symptom remains one where reading from expansion RAM -- whether HyperRAM or SDRAM -- results in the CPU being given the value from the previous read, rather than the current read.

Now, I did make a fix for one potential cause of that issue, which I thought was already in the bitstream I am testing. But it's possible that the commit ID on the bitstream is wrong, if I made those changes to the source code while the bitstream was building.  The way to test this is to build a fresh bitstream, and see how that behaves...

I'm hopeful that will fix it, because a couple of things about the behaviour right now suggest that it is a bug in slow_devices, not in the HyperRAM or SDRAM controllers:

1. The error is related to a single byte shift. Both HyperRAM and SDRAM effectively have 16-bit interfaces, so the expected shift from an error there would be 2 bytes.

2. The error happens with _both_ HyperRAM and SDRAM, and looks identical on both.  It seems fairly unlikely that I'd have functionally identical bugs in both controllers. Not impossible, but unlikely.
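The first point can be illustrated with a small Python sketch. It models a reader that receives the previous transfer's contents instead of the current one, at two different transfer sizes. The function name and the sample data are made up; it only demonstrates the granularity argument, not the actual controllers.

```python
def read_with_stale_unit(data, unit):
    """Return what a reader sees if every `unit`-byte transfer delivers
    the previous transfer's contents instead of its own."""
    out = bytearray(len(data))  # the first transfer has no predecessor: zeros
    out[unit:] = data[:-unit]   # everything else arrives one transfer late
    return bytes(out)


data = bytes(range(0x10, 0x18))
byte_shift = read_with_stale_unit(data, 1)  # error at the 8-bit level
word_shift = read_with_stale_unit(data, 2)  # error at the 16-bit level
print(data.hex(" "))
print(byte_shift.hex(" "))
print(word_shift.hex(" "))
```

An error inside a 16-bit interface moves the data by two bytes, whereas the symptom here is a one-byte shift, which points at the byte-oriented side of the design.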

Anyway, we will know in an hour or two...

Bitstream built, and I now have a new clue:  On cold boot, it behaves properly, but after running the test, it gets into a state where the read is consistently one byte out, as before.  So I'm guessing some internal state somewhere is getting screwed up.  This explains the intermittent behaviour I have been seeing. So now to make a sequence that works from cold to trigger it exactly, so that I can investigate it under simulation...

Using test_585, the bug is typically triggered and then occurs persistently after somewhere between 20 and 400 loop iterations. So the probability of the event on any given read is quite low, which is a bit annoying.

My best guess is that it is sensitive to the expansion RAM delivering a byte at some critical time.

Hmm.. I just had a stupid bug in my test code, where for some reason the sta [$nn],Z was being done as a sta ($nn),Z, so writing over the pointer, rather than to the expansion RAM. That caused me endless "fun" until I realised what I had done.

Now, digging more, I have found that I can trigger the problem by repeatedly reading and writing to a single address, provided that the bottom 3 bits are in the range 5 - 7.  If they are in the range 0 - 4, then the problem never shows up.
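A tiny Python helper for sorting test addresses into the two groups. Treating the bottom 3 bits as the offset within an 8-byte cache line is an assumption on my part, and the function names are made up.

```python
CACHE_LINE_BYTES = 8  # assumed: bottom 3 address bits select the byte within a line


def line_offset(addr):
    """Bottom 3 bits of the address."""
    return addr & (CACHE_LINE_BYTES - 1)


def can_trigger(addr):
    """True for the 'bad' addresses, i.e. bottom 3 bits in the range 5 - 7."""
    return line_offset(addr) >= 5


for addr in (0x8111111, 0x8123456):
    print(f"${addr:07X}: offset {line_offset(addr)}, can trigger: {can_trigger(addr)}")
```

Of the two addresses in the BASIC test, $8111111 has offset 1 and $8123456 has offset 6, so only the second falls in the range that can trigger the problem.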

But it still takes some variable number of iterations with a "bad" address, before the problem is triggered.  However, that number of iterations is now typically many fewer. It should now be fast enough to trigger, that I can test if I can reproduce it under simulation.

I'm wondering if one of the data ready toggles is being handled incorrectly, so that the expected and actual sense of the line get out of sync. This would cause exactly the kind of issue we are seeing. This theory should be testable, by checking whether, after the problem occurs, it can be flipped back off again by continuing to attempt to trigger it.

Well, I ended up making a new bitstream with the toggle handling reworked instead, after I looked through the code for probable causes.  The toggle was only being updated when the slow_devices module was actively waiting for RAM reads. Thus if a read finished after the slow_devices module gave up waiting, then as soon as it scheduled the next read, it would think the result was immediately valid, and would thus return the result of the previous read.  In other words, this was indeed a bug of the type I had suspected.
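Here is a behavioural Python sketch of that failure mode, not the actual VHDL. The class names, the `track_when_idle` flag and the read latencies are all made up for illustration; the only things taken from the real design are the idea of a data ready toggle and a read timeout.

```python
class Device:
    """Toy expansion RAM: flips `toggle` whenever a read completes."""

    def __init__(self, mem, latencies):
        self.mem = mem
        self.latencies = iter(latencies)  # cycles each read takes (made up)
        self.toggle = False
        self.rdata = 0x00
        self.pending = None               # [cycles_left, addr] or None

    def start_read(self, addr):
        self.pending = [next(self.latencies), addr]

    def tick(self):
        if self.pending is None:
            return
        self.pending[0] -= 1
        if self.pending[0] == 0:
            self.rdata = self.mem[self.pending[1]]
            self.toggle = not self.toggle
            self.pending = None


class SlowDevices:
    """Toy bus master that watches the device's data ready toggle."""

    def __init__(self, dev, timeout, track_when_idle):
        self.dev = dev
        self.timeout = timeout
        self.track_when_idle = track_when_idle
        self.seen = dev.toggle            # last toggle state we acted on

    def read(self, addr):
        dev = self.dev
        dev.start_read(addr)
        result = None
        for _ in range(self.timeout):
            dev.tick()
            if dev.toggle != self.seen:   # looks like "data ready"
                self.seen = dev.toggle
                result = dev.rdata
                break
        if result is None:                # timed out: hand back the stale byte
            result = dev.rdata
        while dev.pending is not None:    # idle time: a late read may finish here
            dev.tick()
        if self.track_when_idle:          # the fix: keep following the toggle
            self.seen = dev.toggle
        return result


def run(timeout, track_when_idle):
    mem = {0: 0xA0, 1: 0xA1, 2: 0xA2, 3: 0xA3}
    dev = Device(mem, latencies=[5, 30, 5, 5])  # the second read is slow
    bus = SlowDevices(dev, timeout, track_when_idle)
    return [bus.read(addr) for addr in range(4)]


buggy = run(timeout=24, track_when_idle=False)
fixed = run(timeout=24, track_when_idle=True)
relaxed = run(timeout=128, track_when_idle=True)
for name, values in (("buggy", buggy), ("fixed", fixed), ("relaxed", relaxed)):
    print(name, [f"${v:02X}" for v in values])
```

In this model a single late read leaves the buggy version permanently one read behind ($A0, $A0, $A1, $A2). With the toggle tracked while idle, the timed-out read is still wrong but the following reads recover ($A0, $A0, $A2, $A3). A longer timeout removes the remaining error as well.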

With that change, some errors are still occurring, but when they do, it goes back to reading values correctly afterwards. The read timeouts were set quite low -- possibly too low for HyperRAM, and possibly even for the SDRAM as well. So I have increased those from 24 to 128 cycles at 81MHz, which should be plenty.
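For a sense of scale, the two timeout values convert to wall-clock time as follows. This is a quick Python calculation that assumes the timeout counter runs directly from the 81MHz clock; the function name is made up.

```python
CLOCK_HZ = 81_000_000  # assumed: the timeout counter is clocked at 81MHz


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a number of clock cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz


old_ns = cycles_to_ns(24)
new_ns = cycles_to_ns(128)
print(f"24 cycles  = {old_ns:.0f} ns")
print(f"128 cycles = {new_ns:.0f} ns")
```

That is roughly 296 ns before and 1580 ns after, so any read that needs more than about 0.3 usec would have tripped the old timeout.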

And it looks like that combination of fixes has done it!

Now, the SDRAM still has a problem with unreliable initialisation and clock/data line relative timing, which I will need to tackle.  However, that can wait for another day, as the SDRAM itself works fine, and is not required for the MEGA65 core -- I implemented it really only to test it.