The original Nexys4 boards had a kind of RAM called PSRAM, which is a DRAM dressed up to pretend to be an SRAM. PSRAM is wonderfully simple to control, and so it was easy to support. Then Digilent couldn't source the PSRAM chip any more, or got requests to have larger memory on the Nexys4 boards, and switched to a DDR2 DRAM. In comparison with the PSRAM, DDR DRAM is a horror to write a reliable controller for, and after considerable effort to do so, I gave up in disgust.
Then as Trenz were designing the MEGA65 main board, they said that there is a new standard for something like PSRAM called Hyper RAM (not to be confused with a Hypervisor, which is used to virtualise hardware to allow running guest operating systems, including on the MEGA65 to run the C65 or C64 ROMs as a "guest" on the MEGA65 hardware). Hyper RAM has become popular because it offers much higher bandwidth than PSRAM could, up to 333MB/sec versus about 10MB/sec for PSRAM, and because it uses fewer pins (but note that we won't be getting anything like that performance just yet).
So Trenz added an 8MB Hyper RAM chip to the MEGA65 mainboard as our additional RAM. I did some initial work to write a controller for the Hyper RAM, but initially couldn't communicate with it. This was complicated by two things: First, the Hyper RAM interface is a bit more complicated than the PSRAM interface, which is how it gets both the higher bandwidth and the reduced pin count. Second, with the Hyper RAM chip on the mainboard of the MEGA65, there weren't any test points to probe with the oscilloscope to see if I had it right. Thus it got pushed to the back-burner for a while, while we addressed other problems.
I now have it working, as you can see in the following photo, but it has been a bit of an adventure to get there:
Now, speaking of adventures, as many of you know, the family and I are living in the middle of the Australian Outback this year. So I thought I would just let you know that we are safely bunkered down in isolation here like most of the rest of you. The main difference is that we are lucky enough to have a very big backyard to walk around in compared with most folks: The property here is about 640 square kilometres (something like 250 square miles), so it doesn't really feel as much like quarantine as it might otherwise.
The other difference for us is that the Arkaroola Wilderness Sanctuary normally has lots of guests this time of year, but that's of course not happening right now. So for probably the first time in 40 years Arkaroola is very quiet at this lovely time of year. We are well into the Southern Hemisphere Autumn right now, which at Arkaroola means temperatures averaging around 25C during the day and about 15C overnight. It's lovely weather to camp out on our own back verandah and enjoy watching dawn break over the mountains:
Apart from that, everything here is running pretty normally. We still get our medical attention through the amazing Royal Flying Doctor Service. So when we all needed to get our Flu immunisations, the folks running the sanctuary gave them a call to get an appointment for all 18 people living on the sanctuary to get their jabs. The only difference is that instead of driving to the nearest doctor's rooms or waiting around and peeking through the curtains to see if the doctor was here yet, we drove to the nearest airstrip, and knew the doctor was arriving by the familiar drone of their Pilatus PC12:
Fortunately I've not needed to actually ride in one of their planes, although I have ridden in a PC12 about 10 years ago -- actually to this very same air strip. But that's a whole other story. They are a very fun and functional little plane, with a jet turbine actually spinning the propeller, and so for such a little plane are really fast, cruising at something like 550km/hour.
But anyway, I just wanted to reassure you that we are safely holed up here, so development of the MEGA65 can continue. So let's get back to the Hyper RAM stuff.
The Hyper RAM interface (we'll just call it HRAM from here, to save my typing fingers) is, on the surface of it, very simple:
HR_D : 8-bit data/address bus
HR_CS0 : Chip select/attention line
HR_CLK : Clock line (some have differential clock lines, but we can safely ignore that)
HR_RESET : Active low reset line
HR_RWDS : Read/write data strobe
The tricky bits are that HR_D is used for both data and addresses, and that the HR_RWDS line behaves quite differently between read and write. Oh, and because it is really a dressed-up DRAM, and the HRAM chip does the DRAM refreshing internally, the latency for a read or write request can vary, and you have to know what the possible latencies will be by reading (or configuring) some special registers in the HRAM.
Oh, yes, and although the HRAM has an 8-bit data bus, it is really 16-bit RAM internally, and so you still have to do complete 16-bit transactions, which are DDR, that is, data is clocked on both the rising and falling edge of the clock. You can fortunately mask writes on a byte-by-byte basis though, so that you can write a single byte. This behaviour is also controlled by the HR_RWDS line, making it something a bit like a Swiss-Army knife, with lots of different functions that you have to be a bit careful of, if you don't want to accidentally cut yourself.
So let's start at the simple end of things, which is writing the configuration registers that control the latency settings of the HRAM. Here is what the wave-form looks like:
(You might need to click on the image to see a bigger version of it)
There's a fair bit to explain here, so let's get started. The hr_clk_p line is the positive half of the differential clock, so just think about that as being the HR_CLK described earlier. The hr_cs0 line is the chip-select/attention line. This gets pulled low at the start of a new transaction. The HRAM then immediately drives the HR_RWDS line high (bottom line), to tell us that "extra latency" will be applied, because the HRAM is internally busy with a DRAM refresh right now. But in this case, we will ignore it, because we are writing to a configuration register (but we will explain it further down in this blog post).
The command to the HRAM is then sent over 3 complete clock cycles, with the HRAM accepting one of the six bytes on each rising and each falling edge of the clock. So in this example, the command bytes are $60 at 5,547 usec, $00 at 5,559 usec, $01 at 5,571 usec, and then $00 for the remaining three bytes. Thus the complete command word is $600001000000. The command is sent most-significant-bit first. The first few bits (i.e., the highest numbered bits) of the command are:
Bit 47 - Read (1) or Write (0)
Bit 46 - Access to memory (0) or Configuration registers (1)
Bit 45 - Linear (1) or wrapped (0) access
Bit 44 - 16 - Upper address bits of the access
Bit 15 - 3 - Reserved
Bit 2 - 0 - The lower 3 bits of the address, counted in 16-bit words, not bytes
Thus we can deduce that this request is a write to the configuration registers, with a byte address of $00001000. This is the address of the CR0 register which contains the settings for latency for all other transactions. CR0 is a 16-bit register, and we write $8FE6 into it. The second $8F is only there because my state machine is wonky. The $FA is also just a debug value that my controller was producing so that I knew that it knew that it had finished the transaction.
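To make the bit layout above a little more concrete, here is a minimal Python sketch that packs and unpacks the 48-bit command word. The function names are made up for illustration (the real controller is FPGA logic, not Python), and bit 45 is passed through as a plain flag; it is set in both example commands in this post.

```python
def encode_command(read, register_space, burst_bit, byte_address):
    """Pack the fields listed above into a 48-bit HRAM command word."""
    word_address = byte_address >> 1   # the HRAM counts 16-bit words, not bytes
    upper = word_address >> 3          # upper address bits, placed in bits 44-16
    lower = word_address & 0x7         # lower 3 address bits, placed in bits 2-0
    return ((int(read) << 47) | (int(register_space) << 46)
            | (int(burst_bit) << 45) | (upper << 16) | lower)


def command_bytes(command):
    """Split the command into the six bytes sent on successive clock edges."""
    return list(command.to_bytes(6, "big"))


def decode_byte_address(command):
    """Recover the byte address from a command word."""
    upper = (command >> 16) & ((1 << 29) - 1)   # bits 44-16
    lower = command & 0x7                       # bits 2-0
    return ((upper << 3) | lower) << 1


# The two commands discussed in this post.
cr0_write = encode_command(read=False, register_space=True, burst_bit=1,
                           byte_address=0x1000)
mem_read = encode_command(read=True, register_space=False, burst_bit=1,
                          byte_address=0x1008)
print(f"${cr0_write:012X}")
print(f"${mem_read:012X}")
```

Running this reproduces the $600001000000 command word from the register write above, and gives $A00001000004 for a read from byte address $00001008.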
The observant among you will notice that HR_RWDS has gone from being driven high to going tri-state just before the last byte of the command has been sent. This is because the HRAM stops indicating the latency status before the bus master might need to drive the HR_D lines (or the HR_RWDS line), so as to avoid cross-driving any signals.
So, that's all pretty straightforward. Where it gets a bit more entertaining is when you try to access the memory of the HRAM, because now we DO need to worry about the latency. We'll start with reading, because that case is actually a bit simpler.
So here we see a similar start, where we pull HR_CS0 low, and then send out a command, this time $A00001000004, which corresponds to reading from address $00001008. This time, because the HRAM knows that it will be supplying data to the bus master, instead of tri-stating HR_RWDS, it keeps it low until the read latency has expired. The bus master doesn't need to know what the latency will be, as the HRAM will keep HR_RWDS low until it has the first word of data ready. Then it pulses HR_RWDS to mark the two bytes of each word that it returns. In this case, $00001008 contains $FF, $00001009 contains $49, and the next seven addresses all also contain $49. So that's not too complicated.
This ability to read many bytes in a single transaction is key to HRAM's ability to deliver high throughput. If this weren't the case, then about 4 clock cycles would be wasted on every access, just fiddling with HR_CS0 and sending the command -- let alone the latency for the HRAM to dig up the required data out of its DRAM, in this case another two clock cycles. In short, a transaction to read a single byte would require about eight clock cycles. Thus at the maximum clock speed of 166 MHz, HRAM would only be able to deliver about 20MB/second. However by reading lots of data at a time, it is possible to get relatively close to the theoretical maximum.
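The arithmetic here can be checked with a rough model. The sketch below assumes a fixed overhead of seven clock cycles per transaction (chip select, command and latency; chosen so that a single-byte read costs the "about eight" cycles quoted above) plus one clock per 16-bit word of data. Both the function names and that overhead figure are illustrative assumptions, not datasheet values.

```python
import math


def read_cycles(n_bytes, overhead_cycles=7):
    """Clock cycles for one read transaction returning n_bytes."""
    # DDR moves two bytes per clock, so the data phase takes ceil(n/2) clocks.
    return overhead_cycles + math.ceil(n_bytes / 2)


def throughput_mb_per_sec(n_bytes, clock_mhz=166.0, overhead_cycles=7):
    """Sustained throughput if every transaction reads n_bytes."""
    return clock_mhz * n_bytes / read_cycles(n_bytes, overhead_cycles)


for n in (1, 8, 64, 1024):
    print(n, round(throughput_mb_per_sec(n), 1))
```

With these assumptions a single-byte read manages about 20MB/sec, while long bursts approach the 333MB/sec ceiling, because the fixed overhead is spread over more and more data.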
Now, in the case of the MEGA65, we can't get anywhere near that throughput, because our clock speed internally is not that high, and because the version of the HRAM chip we are using is actually only capable of a 100MHz maximum clock speed anyway. As we need to synchronise with the MEGA65's existing internal ~40MHz CPU clock, we can only use 2x that for ~80MHz as our maximum clock rate.
And for now at least, it's actually even worse: I haven't had the time to work out how to use the DDR signal drivers in the FPGA, so I am simulating the DDR access using a 40x4 = 160MHz clock, where four clock cycles are required for each HRAM clock, resulting in a 40MHz effective clock rate. This is because when writing bytes to the HRAM, the data has to be set up between clock edges, but when reading data back from the HRAM, it is synchronised on the clock edges. So the state machine alternates between ticking the clock and reading or writing data. This could be improved in the future, but as I have the HRAM working at an acceptable speed, it has been pushed to the back burner.
What is clear in any case, is that having about ten clock cycles per access (by the time we take the HRAM internal latency wait cycles into account) will result in horrible performance, even if we were able to double the clock speed to 80MHz. This is because the CPU will still be waiting at least ten cycles (or five, if we can use the FPGA DDR resources to double the clock speed) for EVERY read. So a three byte instruction would take at least 3 x 10 (or 3 x 5) clock cycles.
The reality is actually somewhat worse, because the HRAM logically connects to the "slow device" bus in the MEGA65's memory architecture, so that it doesn't complicate the CPU's inner loop to the point where we would have to drop the CPU speed. It costs one cycle in each direction for the CPU to ask the slow device bus, and for the slow device bus to ask the HRAM, and then another cycle each for the answers to get communicated back. Thus the real access time is something like 1 + 1 + 10 + 1 + 1 = 14 cycles. Even if we double the HRAM clock via DDR to 80MHz, it would still be 1 + 1 + 5 + 1 + 1 = 9 cycles. Thus while getting the DDR stuff working would still be a great idea, it would represent only a modest improvement.
What we really need is some kind of cache, so that we can take advantage of the HRAM's ability to read multiple bytes in a single transaction, and then to be able to look at the data we have recently read as often as possible, instead of having to wait for a whole HRAM transaction each time. This has the ability to completely remove the 10 (or 5) cycles, reducing the access time from the cache to 1 + 1 + 0 + 1 + 1 = 4 cycles. That's sounding much better. In fact, we can do even better, if we make the cache accessible from the slow device bus, we can reduce this to 1 + 1 = 2 cycles. Of course, if the cache doesn't have the data, then we will still need to do the HRAM transaction, but if we can do that less often, then we will still see a significant speed up.
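The cycle counts in the last two paragraphs all follow the same pattern, which can be written down as a tiny model. This is only a sketch of the arithmetic in the text; the function and parameter names are invented for illustration.

```python
def access_cycles(hram_cycles, bus_hops=2):
    """CPU-visible cycles for one read.

    bus_hops is the number of one-cycle hand-offs on the way to the data
    (CPU to slow device bus, slow device bus to HRAM controller); the answer
    pays the same number of cycles on the way back.
    """
    return bus_hops + hram_cycles + bus_hops


print(access_cycles(10))             # full HRAM transaction at 40MHz
print(access_cycles(5))              # the same with a doubled HRAM clock
print(access_cycles(0))              # hit in a cache inside the HRAM controller
print(access_cycles(0, bus_hops=1))  # hit in a cache on the slow device bus
```

These give 14, 9, 4 and 2 cycles respectively, which makes it clear why a cache hit matters far more than doubling the HRAM clock.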
When writing data, we can do things a little differently: The CPU can (and does) just hand the write request to the slow device bus, which in turn hands it to the HRAM controller to dispatch in the background. The CPU can then continue with whatever it was doing. If that doesn't immediately require another write to the HRAM, the CPU can get on with useful work. This is called "latency hiding": The latency is still there for writing to the HRAM, but we can hide it, by allowing the CPU to get on with other work.
We don't have stacks of spare RAM in the FPGA, as we are already using every single 4KB BRAM block. Also, caches are most effective when they have high "associativity". That's just a fancy way of saying that the cache can more effectively hold bits of memory from all over the place, without having to discard any of the others when reading a new one because too many of the bits of the addresses of the already cached data and the newly fetched data are identical. Also, an 8-bit computer with a huge cache just wouldn't feel right to us. The net result is that the MEGA65's HRAM controller has two cache rows of eight bytes each, with two-way associativity. Since the associativity and the number of rows are identical, this means that you can have any two areas of eight bytes. Don't worry if that sounds tautological, as with such a tiny cache it basically is a tautology. The bottom line is that the two cache rows can be used independently, e.g., if you were copying data from one area of the HRAM to another, or executing code in the HRAM, while also reading or writing data in it.
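A behavioural sketch of such a cache might look like the following. It models only the idea of two independent eight-byte rows; the class name is hypothetical, and the replacement policy (evict the least recently used row) is an assumption for the sake of the example, since the post doesn't say which row gets replaced on a miss.

```python
ROW_BYTES = 8


class TinyReadCache:
    """Two rows of eight bytes, either of which can hold any aligned row."""

    def __init__(self, backing, rows=2):
        self.backing = backing   # bytes-like object standing in for the HRAM
        self.max_rows = rows
        self.rows = []           # (row_base, data) pairs, most recently used last
        self.hits = 0
        self.misses = 0

    def read(self, address):
        base = address - (address % ROW_BYTES)
        for i, (row_base, data) in enumerate(self.rows):
            if row_base == base:
                self.hits += 1
                self.rows.append(self.rows.pop(i))   # mark as most recently used
                return data[address - base]
        # Miss: fetch the whole row in one (slow) HRAM transaction.
        self.misses += 1
        data = bytes(self.backing[base:base + ROW_BYTES])
        if len(self.rows) == self.max_rows:
            self.rows.pop(0)     # assumed policy: drop the least recently used row
        self.rows.append((base, data))
        return data[address - base]


# Interleaved reads from two different areas, as when code and data both
# live in the HRAM.
hram = bytes(range(256)) * 2
cache = TinyReadCache(hram)
for i in range(8):
    cache.read(0x000 + i)
    cache.read(0x100 + i)
print(cache.hits, cache.misses)
```

Because each area gets its own row, the sixteen interleaved reads cost only two HRAM transactions; with a single row, every read would evict the other area's data and miss.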
So let's see what that cache gets us, with a simple speed benchmark I wrote as I was implementing this:
The top half of the display just describes the HRAM's internal settings. This was before I enabled "variable latency" and reduced the HRAM's internal latency. The 20MB/sec is for fast chip ram to fast chip ram copies, i.e., testing the MEGA65's main memory as a comparison. The copies and fills are all using the MEGA65's DMA controller. We then have "cache enabled" and "cache disabled" results for the HRAM in various scenarios, including when we copy between the HRAM and the MEGA65's main memory, which we expect to be a fairly common situation, e.g., when software uses the HRAM in REU or GeoRAM compatibility mode.
Let's start by looking at the "cache disabled" figures to see just how bad the situation is, if we don't do anything to hide the read or write latency. Indeed, it is pretty horrible. Copying from Extra RAM, i.e., the HRAM, to Fast RAM is probably the most representative here for reading, as the time to write to the Fast RAM is only one cycle per byte. Here we see only 1.6MB/sec, which given we know the clock speed is 40MHz, and only one cycle is lost to the Fast RAM write, means that the HRAM reads are taking something horrible like 23 cycles. This isn't too surprising, because with the default HRAM latency, plus the 1 + 1 + 10 + 1 + 1 transaction overhead, we quickly rack up the cycles. A C128 can read memory faster than this!
Now, if we look at the "cache enabled" section, we can see that things are already quite a bit better: We get 5MB/sec, i.e., around three times faster. This is because with an 8 byte cache, we only need to pay the 23 cycles one out of eight times. The other seven out of eight times we only need to pay 1 + 1 + 1 + 1 = 4 cycles (the short-cut cache in the slow device controller was not implemented at this point in time).
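As a sanity check on those benchmark figures, here is the same reasoning as a few lines of Python. It is a back-of-envelope estimate using the cycle counts quoted above (23 cycles for a miss, 4 for a hit, 1 for the Fast RAM write), not a measurement, and the function names are invented for illustration.

```python
def cached_read_cycles(miss_cycles=23, hit_cycles=4, row_bytes=8):
    """Average cycles per byte when one read per cache row misses."""
    return (miss_cycles + (row_bytes - 1) * hit_cycles) / row_bytes


def copy_mb_per_sec(read_cycles_per_byte, clock_mhz=40.0, write_cycles=1):
    """HRAM to Fast RAM copy rate: each byte costs one read plus one write."""
    return clock_mhz / (read_cycles_per_byte + write_cycles)


print(round(copy_mb_per_sec(23), 1))                    # cache disabled
print(round(copy_mb_per_sec(cached_read_cycles()), 1))  # cache enabled
```

This estimates roughly 1.7MB/sec without the cache and 5.4MB/sec with it, which is in the same ballpark as the measured 1.6MB/sec and 5MB/sec.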
The next step was to reduce the HRAM's internal latency settings as much as possible. This reduced the HRAM's latency from being 6x2 = 12 cycles all the time, to instead being 3 cycles most of the time, and 3x2 = 6 cycles otherwise. This helped quite a lot, even without the cache, since it is just fundamentally reducing the time taken for each transaction:
I mentioned earlier that the HRAM is internally using 16-bit words. At first I didn't properly accommodate this, and would try to make the write transactions as fast as possible, by aborting the write after writing the single byte I needed to write. This gave the rather weird effect that writing to even addresses worked just fine, but writing to odd addresses gave, well, rather odd results. For example, the byte would get written to the correct location, and then often to the next odd address, or to an odd address somewhere else in the memory. Eventually I realised that this was because the HRAM internally requires every transfer to be a multiple of 16 bits, and if you don't obey this when writing, then very odd things indeed can happen.
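The fix amounts to always sending both bytes of the 16-bit word and masking the one that should be left alone. The sketch below shows the idea under two assumptions: that the even byte address is the first byte of each word (which matches the read example earlier, where $00001008 came back first), and that HR_RWDS held high during a data byte tells the HRAM to ignore it, which is my understanding of how HyperBus write masking works. The function name is hypothetical.

```python
def single_byte_write(byte_address, value):
    """Build a complete one-word write that changes only one byte.

    Returns (word_address, beats), where beats is a list of two
    (data, masked) pairs, one per clock edge. masked=True means HR_RWDS
    is held high for that byte so the HRAM leaves it unchanged.
    """
    word_address = byte_address >> 1
    if byte_address & 1:
        # Odd address: mask the first byte, write the second.
        beats = [(0x00, True), (value, False)]
    else:
        # Even address: write the first byte, mask the second.
        beats = [(value, False), (0x00, True)]
    return word_address, beats


print(single_byte_write(0x1008, 0xFF))
print(single_byte_write(0x1009, 0x49))
```

Either way the transaction is a whole number of 16-bit words, so the HRAM never sees a half-finished word.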
Once I had the writes working reliably, I realised that improving the write performance would require the implementation of a "write scheduler", that could collect multiple writes, and pack them into a single transaction. In practice, this means another structure that looks very much like the read cache, but is instead for collecting writes. It marks each byte that needs to be written in a little bitmap, and as soon as the HRAM bus is idle, it begins writing the transaction out. There is even a bit of logic (that could do with some further optimisation) that allows writes to be added into a transaction that has already started, provided that they don't arrive too late. Like the read cache, this potentially allows up to 8 bytes to be written in a single transaction, and should give a similar speed up, and indeed it does:
Writing is now effectively the same speed as reading, because it is being turned into a similar number of grouped transactions. It was actually really pleasing to get this write scheduler working, as it can have many subtle corner-cases, but in practice, it was fairly simple to get working.
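The core of the write scheduler idea, collecting bytes for one eight-byte row and marking them in a bitmap, can be sketched like this. It is a simplified model with made-up names: it always emits the whole row with unwanted bytes masked, and it leaves out the logic for adding writes to a transaction that is already in flight.

```python
ROW_BYTES = 8


class WriteCollector:
    """Gathers pending writes to one aligned eight-byte row."""

    def __init__(self):
        self.base = None
        self.data = bytearray(ROW_BYTES)
        self.dirty = 0   # bit i set means byte i of the row must be written

    def add(self, address, value):
        """Merge a write; return False if it belongs to a different row."""
        base = address - (address % ROW_BYTES)
        if self.base is None:
            self.base = base
        elif self.base != base:
            return False   # caller must flush first
        offset = address - base
        self.data[offset] = value
        self.dirty |= 1 << offset
        return True

    def flush(self):
        """Return (base, beats) for a single transaction, then reset.

        beats holds one (data, masked) pair per byte; masked bytes are
        skipped by the HRAM.
        """
        if self.base is None:
            return None
        beats = [(self.data[i], ((self.dirty >> i) & 1) == 0)
                 for i in range(ROW_BYTES)]
        result = (self.base, beats)
        self.base, self.dirty = None, 0
        self.data = bytearray(ROW_BYTES)
        return result


collector = WriteCollector()
collector.add(0x1008, 0xAA)
collector.add(0x100B, 0xBB)
accepted = collector.add(0x1010, 0xCC)   # different row, so it is refused
transaction = collector.flush()
print(accepted, transaction)
```

Two writes to the same row end up in one transaction instead of two, which is where the speed-up comes from.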
Now, if I had infinite time to optimise things, I would go further, for example, by:
1. Figuring that DDR thing out, so that I can effectively halve the remaining HRAM latency.
2. Monitoring reads from a cache line, and when we see linear reads getting towards the end of a cache row, scheduling the reading of the next cache row.
3. Actually reading double the data that fits in a cache line, and having it waiting in the wings to do a quick cache update when reading linearly.
4. Similarly for writing, if we see that the next writes are following the current write transaction, just extending the current write transaction.
5. Seeing if I can make the cache lines wider.
6. Using the "wrapped" feature of the HRAM to begin reads at the exact byte in a cache row, to shave a couple of cycles off random reads of bytes that are near the end of a cache row.
7. Generally optimising transaction timing by removing some of the glitching at the end of transactions, which is simply wasting time.
8. Figuring out if I can push the clock up to 100MHz exactly, although that will involve some kind of clock-domain crossing mechanism, and those are often more trouble than they are worth.
However, we don't have infinite time, and I have a bunch of other tasks to tick off for the MEGA65, so 8MB/sec will have to do for now. In any case, it is certainly much better than the original ~1.6MB/sec. So I'll take the 5x gain, and call it quits for the moment. There is of course nothing stopping me (or someone else) revisiting this in the future to get nice speed improvements out of the HRAM. With those approaches above, it should be possible to perhaps double or better the current speed.
For those interested in following the adventure of getting the HRAM to this point, it's all documented in the GitHub issue.