Wednesday, 1 July 2020

Built-in Sprite Editor Progress

Just a quick post to report on progress on the integrated sprite editor that we are building as a plug-in to the Freeze Menu.  The reason for doing this is that unlike BASIC 7 on the C128, the C65's BASIC 10 doesn't include a sprite editor.

We have already made this plug-in framework, where the Freeze Menu can call helper programmes for the audio mixer.  So it was really just a case of making the sprite editor itself.

And this is where one of our volunteers, Hernán, comes in: He has been working on the actual editor.  Together we merged that in as a plug-in to the Freeze Menu, and hooked up all the plumbing, so that exiting the sprite editor returns to the Freeze Menu.  We also jointly implemented simultaneous joystick and mouse support, so that it is super comfortable to use.  To enter the sprite editor, you simply press S from the familiar Freeze Menu.  You then get something like this:

You can then use the keyboard cursor keys to move around, and toggle pixels. You can also see that there is provision for multi-colour sprite modes.   Now, the screenshot tool doesn't currently show sprites, which I only just realised while pasting the screen-shots in. This is annoying, because it means you can't see the sprite pointer in the following, where I have used the mouse to move the cursor and draw:

One of the nice things that has come from this is that we have made library functions for using the mouse as part of the mega65-libc. Hernán also wrote simple console output library functions for drawing the display.  This means that we have a new set of APIs that are easy for other programmers to use in their own programmes. When I get the chance, I'll start documenting the library functions to be included in the MEGA65 Book, and likely in a smaller separate MEGA65 Cross Developer's Reference Guide.

Anyway, the editor is now at the point where the hard parts are mostly done. Next steps will be to make it actually edit the sprites in the frozen programme's memory, rather than in the memory of the sprite editor process.

Sunday, 28 June 2020

Ultrasonic communications for the MEGAphone: Testing speaker and microphone performance

This might sound like a rather strange thing to be working on, but there is a reason.  The NLnet Foundation has taken an interest in the MEGAphone as a secure and "sovereign" communications device, that can play a role in civil society.  This largely aligns with what I have previously said about the need for such self-sovereign communications systems in the face of the coming digital winter.  

In short, NLnet have kindly agreed to fund a body of work on advancing the MEGAphone, and by implication, advancing the MEGA65 project. This means that I will be spending a considerable amount of time in several pieces between now and early next year working on the MEGAphone.  This means I am going to be spending more time working on the MEGA65 as a whole than I otherwise would have, so it's a positive all round.  But I understand that some folks are only interested in the MEGA65 as a retro-computing platform.  In which case, feel free to pay less attention to these posts. That said, they will still be very much focused on fun retro development, and solving very much real world problems with such a system. After all, it's not every day that someone makes an 8-bit smart-phone that can communicate via ultra-sound!

If borders open, and international travel becomes feasible again, this will be done in  Darmstadt with the rest of the MEGA65 team, but until then, I'll continue to work from here, since there isn't really any alternative.
Speaking of the effects of COVID19, this has also slightly tweaked the work plan, as NLnet are keen for us to look at how the idea of a sovereign and fully open phone can be used to produce a more privacy-protecting and less error-prone form of contact tracing. In particular, they are interested to know how practical implementing near ultra-sound communications would be, to see if it makes sense as an alternative to bluetooth-based proximity detection. Ultrasound has some nice theoretical advantages, like not working through walls or other barriers that are likely to also be effective barriers against virus transmission. Thus it has the potential to reduce the false-positive rate.

Exploring ultra-sonic communications is something that I have been wanting to do for a while, and was already in the plans for the MEGAphone as a means of resilient communications.  Those who remember when I first talked about the MEGAphone might recall that it already has a bunch of weird communications modes available, including an IR LED that can probably turn TVs off from 200m away.   It also has not just one microphone, but an array of 4.

The microphone array was intended for making it easier to cancel background noise, and detect the direction a sound is coming from, so that speaker-phone mode could do various party tricks with it. However, having read about ultrasonic communications around that time, I chose microphones that are sensitive well into the ultrasound range.  So, it's quite possible that we may be able to achieve ultrasonic communications with the MEGAphone.  And that is the first milestone in the NLnet project:

1. Assess near ultrasound capability of existing MEGAphone prototypes.
There already exist several revision 1 hardware prototypes of the MEGAphone, that form the basis for forward activity on the project. These include MEMS microphones and amplified speaker output functionalities that have the potential for near ultrasound communications. The purpose of this sub-task is to assess the feasibility of this, and gain an understanding of what may be possible. It is entirely possible that this will prove infeasible with the current hardware, in which case the reasons and potential remedies for this will be considered for inclusion in a future revision of the MEGAphone hardware.
  • Examine the existing components of the MEGAphone r1 and r1b prototype hardware, in particular the MEMS microphones and speakers, to determine their theoretical suitability for near ultra-sound communications.
  • Determine the ultrasonic frequencies and bandwidth that are likely to be possible to use, and consider the constraints that this is likely to place on any protocol designed to use this facility.
So my first goal here is to look at the speakers and microphones in the MEGAphone, and see what their theoretical properties are, and whether they have the potential to be used for ultrasonic communications, and if so, what frequency bands we expect to be able to use. Of course, we could also consider dedicated ultra-sound communications hardware, but the goal here is to use existing speakers and microphones, rather than increasing the bill of materials of a phone.  So let's get started:


The MEGAphone R1 prototypes have four MEMS microphones. As physically very small structures, they have a natural resonance that is well into the ultrasonic frequencies. The ones that we are using are the SPW0690LM4H.  According to Table 2 of that datasheet, they have a Resonant Frequency Peak at 26kHz.  Also, Table 2 tells us that at 15kHz they are 3dB more sensitive than at 1kHz.  Thus we can expect that they will likely be competent up to probably around 40kHz or so. Figure 9 provides some more information, showing the frequency response:

Here we see a peak at 26kHz as promised. We also see that sensitivity is quite reasonable all the way to the 80kHz limit they show, and in fact from ~72kHz the sensitivity is above the sensitivity of the acoustic range.  So while somewhere around 26kHz would be ideal, with a benefit of close to +20dB versus the acoustic band, any ultrasonic frequency up to at least 80kHz should be usable.  So the receive side is not likely to be a problem.


The speaker in the MEGAphone is a CMS-40504N-L152.  As this is not on the PCB, we can in principle easily swap it out for another.  However, hopefully we won't need to do that, because these speakers are simply fantastic for the MEGAphone. They are super-loud and take up to 2W for loud ringing and playing games, and have good frequency response thanks to their relatively large 40mm diameter. And for all that, they are only ~5mm thick.

Here we have a less positive prospect: It claims a maximum frequency of 7kHz.  This will presumably be the maximum frequency with reasonably flat response, above which it will presumably roll off.  Digging through the datasheet we find:

Well, this is much better than feared.  Yes, above 7kHz, it is 10dB below the lower frequencies. But it isn't a flat roll-off. Rather there are a couple of interesting peaks, the first at ~9kHz, and then a second at what looks to be ~18kHz, after which there is a similar big drop-off.  Unfortunately, they don't show any higher frequencies on the table. This is a bit unfortunate, as 18kHz is still audible for some younger people (I can still just hear 17.4kHz, and maybe a bit higher, and I am in my 40s).

But I am just about willing to bet that there is another peak at ~3x9kHz = ~27kHz, which would nicely coincide with our microphone's peak sensitivity. If possible, that would be a particularly effective combination. The question is whether the speaker is still loud enough, and whether we would start to get audible distortion.  This will require some experimentation.

But to summarise: 18kHz should be fine, and if we can confirm it, ~27kHz has potential.  27kHz also has the advantage that it is safer at high volume levels than frequencies below 20kHz. There is still the residual problem that some animals are quite sensitive even around 30kHz, cats in particular (although it looks like tuna fish won't be bothered in the least).

Either way, the frequency response around these frequencies is a bit sharp, so the bandwidth available is likely to be quite narrow -- perhaps less than 1kHz.  This will affect what wave-forms we might try to use for a communications protocol.  For example, frequency shift keying might be problematic, because we have to allow for variation of the peak resonant frequency among the speakers, a bit of Doppler shift, and so on. Similarly chirped spread spectrum would be problematic, due to the very narrow frequency band available for the chirps.
Although for contact tracing, we can probably assume that if you are moving past someone fast enough to cause a large Doppler shift, you are probably unlikely to catch anything from them -- unless perhaps you are on a fast merry-go-round near a stationary person for a long time.  Thus we can probably ignore the problem of large Doppler shifts due to high velocities.

Would other speakers be any better?

There are certainly other speakers that have higher rated frequencies, and thus might be capable of higher frequencies.  We'll examine that a bit later, if it turns out to be necessary. This is because we need only a few metres range, and if we can achieve this with the existing speaker that is optimised for our other needs, then there is no point changing it.  To determine that, we need to consider our total link budget, similar to if we were using radio.

Link-Budget Estimate

Let's now think about what the link budget at 18kHz and 27kHz would be. We will start by examining the microphone sensitivity and maximum speaker volume, and then subtract for the free-space path loss as the ultrasound disperses through the air.

According to the speaker data-sheet, we should be able to produce a signal of ~94dB SPL (sound pressure level).

The free-space path loss through air is relatively minor for the frequencies and distances that we are concerned about. According to this calculator, the path loss will be <1dB per metre. However, the total path loss due to the inverse square law etc. is more like 20dB over 1m, ~34dB over 5m and ~41dB at 10m.
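
As a rough check of those figures, here is a minimal Python sketch that adds inverse-square spreading loss to air absorption. The 0.1m reference distance and the 0.1dB per metre absorption figure are assumptions, picked only because they land close to the numbers quoted above; neither is taken from the speaker data-sheet or the calculator.

```python
import math

REF_DISTANCE_M = 0.1        # assumed distance at which the speaker SPL is specified
ABSORPTION_DB_PER_M = 0.1   # assumed air absorption, inside the <1dB per metre bound

def path_loss_db(distance_m):
    """Inverse-square spreading loss relative to REF_DISTANCE_M, plus absorption."""
    spreading = 20.0 * math.log10(distance_m / REF_DISTANCE_M)
    absorption = ABSORPTION_DB_PER_M * distance_m
    return spreading + absorption

for d in (1.0, 5.0, 10.0):
    print(f"{d:4.1f} m: {path_loss_db(d):.1f} dB")
```

With a different reference distance the whole curve simply shifts up or down, so the absolute numbers are only as good as that assumption.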

Finally, we have the sensitivity of the MEMS microphone to consider. Here I am having to stretch my understanding of the data-sheet -- so if anyone reading this spots any errors in my reasoning, please let me know, so that I can fix it. 

The MEMS microphones indicate a signal-to-noise ratio of 82dB for near-ultrasound frequencies (around 20kHz) with a 94dB SPL (sound pressure level) sound source. We know that this should improve as we approach the resonance at 26kHz, so we will assume that we can detect down to 94 - 82 = 12dB.

Pulling this together, at a range of 10 metres, we should have a link budget of 94dB SPL - 41dB - 12dB = 41dB.  If we allow 20 or 30 dB for multi-path interference and all the usual horrors of propagation, we should have a signal with an amplitude of 10 dB to 20 dB, i.e., 10x to 100x the background noise level.  In reality, this might be somewhat worse due to interference from ambient noise sources, and muffling effects, such as having the device in your pocket or bag.  That said, if we are aiming for 1.5m instead of 10m, there is better than 10dB margin to be gained.  Also, all of this assumes an omni-directional sound source, which will not be true -- but it is a starting point.
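
The same budget can be written out as a few lines of Python. This is only a sketch using the figures quoted in this post; the function names are made up for illustration.

```python
def link_margin_db(source_spl_db, path_loss_db, noise_floor_db, fade_margin_db=0.0):
    """What is left after path loss, receiver noise floor and a fade allowance."""
    return source_spl_db - path_loss_db - noise_floor_db - fade_margin_db

def db_to_power_ratio(db):
    """Convert a dB figure to a plain power ratio."""
    return 10.0 ** (db / 10.0)

SOURCE_SPL_DB = 94          # speaker output
PATH_LOSS_10M_DB = 41       # total path loss at 10m
NOISE_FLOOR_DB = 94 - 82    # reference SPL minus the microphone SNR

for fade in (0, 20, 30):
    margin = link_margin_db(SOURCE_SPL_DB, PATH_LOSS_10M_DB, NOISE_FLOOR_DB, fade)
    print(f"fade allowance {fade:2d} dB -> margin {margin} dB "
          f"(~{db_to_power_ratio(margin):.0f}x noise power)")
```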

However, overall, it seems like we should have a positive link budget at the end, with an SNR of somewhere around 10dB to 20dB.  So for now, we don't need to immediately try a different speaker. 

Constraints on Protocol Design

So it sounds like we should have a signal of a few hundred Hz bandwidth, and with a link budget of 10 -- 20 dB.  Within this constraint, there is still considerable freedom for selecting a solution.  Optimising that would require considerable effort and expertise beyond what we possess, or what is in this case actually required: This is because contact tracing requires only very modest data transfer rates.  Each beacon can potentially be only a dozen or so bytes long, and need only be sent every minute or so.  Given most guidelines are for 15 minutes of close proximity, this would allow significant redundancy in the beaconing, to help reduce the probability of false negatives. 

If we allow for 15% channel efficiency for a simple ALOHA approach, and up to 100 people within range of each other at a time, this means that each beacon must consume less than 0.15% of each time step. For a 1 minute beaconing interval, this corresponds to an air-time of 90 milliseconds.  Assuming we need a 32-bit synchronisation preamble and a 64-bit random token, this would require a data rate of ~96 bits / 90 milliseconds = ~1kbit / second.  This seems totally achievable within the expected channel characteristics.
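
For anyone who wants to play with the numbers, the air-time calculation can be sketched in Python as follows. The function name and parameters are made up for illustration; the inputs are the figures from the paragraph above.

```python
def beacon_requirements(channel_efficiency, stations, interval_s, bits_per_beacon):
    """Return (air-time per beacon in seconds, required bit rate in bits/second)."""
    # Each station gets an equal share of the usable fraction of the channel.
    airtime_s = channel_efficiency / stations * interval_s
    return airtime_s, bits_per_beacon / airtime_s

airtime, bit_rate = beacon_requirements(
    channel_efficiency=0.15,   # simple ALOHA allowance
    stations=100,              # people within range of each other
    interval_s=60.0,           # one beacon per minute
    bits_per_beacon=32 + 64,   # preamble plus token
)
print(f"air-time per beacon: {airtime * 1000:.0f} ms")
print(f"required data rate:  {bit_rate:.0f} bits/second")
```

Halving the number of people in range, or doubling the beacon interval, halves the required data rate.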

If the token consisted of a 48-bit unique token and 16-bit CRC, this would provide for robustness in the protocol, while still maintaining a very low false-positive rate.  We can use the Birthday Paradox to compute the beacon collision rate with 2^48 tokens and not more than 2^32 simultaneous users of the system:  The probability of any two users using the same beacon at the same time would be ~1/(2^(16/2)) = ~1/256. Thus we would expect of the order of 10^1 colliding beacons per day, globally.  Assuming that less than 1% of users of the system test positive for the virus on any given day, this would result in ~10^(1-2) = ~1/10 false positive situations per day, i.e., of the order of one person per week being erroneously told that they had been in contact with someone with the virus.  This is almost certainly below the noise floor of the viral testing procedures and various other factors. Thus we accept this false positive rate.

As can be read below, a channel bandwidth of around 1kHz seems quite possible -- provided that the other technical barriers can be resolved.  Thus while careful protocol design and implementation would be necessary, the channel bandwidth would not impose any particularly troublesome constraints on the protocol design.

In fact, the very short range of communications that would be likely realised would result in rather more relaxed constraints than described above.  This could allow for a reduction in data rate and/or longer packets containing more information, which could be used to effectively eliminate the remaining false-positive rate.

Conclusion and Recommendation

For the full detail on how the following conclusions were reached, read the "Appendix: Experimental Verification" section below, but the TL;DR version:

1. It is possible to perform bi-directional near-ultrasonic communications using the existing hardware of the MEGAphone prototypes.
2. There is likely sufficient bandwidth to implement a realistic contact tracing facility.
3. My suspicions about the real-world practicality of this led me to go beyond the scope of the milestone and actually test the components, which identified several important factors, including:
4. The audible artefacts, limited communications range, and by implication, the power budget, of near-ultrasonic communications in this context are rather problematic. 
5. It would seem that for all of its problems, Bluetooth is a better solution, due to its vastly superior performance in terms of range, lack of audible artefacts and existing wide-spread deployment.
6. I do not recommend further pursuit of creating a near-ultrasonic contact tracing system using mobile phone hardware at this point in time.  This conclusion does not impact on the creation of bespoke devices that avoid these problems through careful design.
7. While not well suited to a contact tracing use-case, we have established near ultra-sound as a quite feasible means of digital communications using off-the-shelf mobile phone parts. This may be of assistance to people operating in areas where electromagnetic communications are denied or heavily surveilled.  

Appendix: Experimental Verification

It's all well and good to say that the above should work. But as we know, saying should in computer science usually means "won't despite all expectation". Or as my German friends like to remind me "eigentlich ist die stärkste Verneinung", i.e., "Actually" or "In Reality" is the strongest contradiction.  Thus we want to make sure that we are not barking up the wrong tree, and want to at least observe and characterise the MEGAphone's actual near ultra-sound performance.

My approach here is to produce an ultra-sound tone using the speaker, and then pick that tone up using the microphone array.  This will be done on the MEGAphone using some software I have written for this purpose.

The first step is to make sure that the MEGAphone microphones really are sensitive to ultra-sound.  For this, I used a tone generator app on my boring Android phone to make sure that I can pick it up.  The MEGAphone R1 prototype has one dead MEMS microphone, so I had to fiddle to pick one that was working, and that was conveniently located where I could get to it. This let me test that I could receive a 20kHz tone. The signal to noise ratio (SNR) I didn't measure, because I don't know the actual SPL loudness of the phone's speaker at that frequency, or what funny filtering the Android phone has.  I did discover that my daughter can hear up to 19kHz quite fine, when she started to complain while I was testing ;)

So now the next step is to modify the MEGA65 test programme I wrote so that it can also produce the tone.  This will be fairly easy, as the MEGA65's audio output is continuously integrated using a pulse density encoder (PDR), so sample rates up in the ultrasound range shouldn't be a problem.

Also, running the CPU at 40MHz gives us plenty of horse power to do this -- or at least I hope so. I will run a non-maskable interrupt at some multiple of the target frequency, driving a small sine curve table of values at maximum volume.
The interrupt loop will need to be quite tight, but it should be okay. Something like:

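   LDY $FD        ; index of the next sine table entry
   LDA ($E0),Y    ; fetch that entry via the table pointer at $E0/$E1
   STA DAC        ; write it out (DAC is a placeholder for the audio sample register)
   INY            ; step to the next entry
   TYA
   AND #$0F       ; wrap around after 16 entries
   STA $FD        ; remember the index for the next interrupt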

That will take something like ~35 cycles, including the interrupt entry and exit, plus a few cycles jitter while the CPU finishes whatever instruction it was executing when the interrupt is triggered. So close to 40 cycles = 1uSec. Thus our over-sampled frequency can be up to 1MHz.  If we have 16 entries in our sine table, then this allows up to 1MHz / 16 = ~62.5kHz as our maximum effective frequency.  That should be sufficient, since we only need 1/2 that.  Thus about 1/2 the CPU time will be spent on the interrupt, leaving the other half to read the microphone samples and visualise them.
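
The cycle budget above is easy to re-run for other table sizes or interrupt costs. A minimal Python sketch, assuming a round 40MHz CPU clock and a hypothetical helper name:

```python
CPU_HZ = 40_000_000   # nominal CPU clock used in the estimate above

def max_tone_hz(cycles_per_interrupt, table_entries, cpu_hz=CPU_HZ):
    """Highest tone we can play if every interrupt outputs one table entry."""
    interrupt_rate_hz = cpu_hz / cycles_per_interrupt
    return interrupt_rate_hz / table_entries

for entries in (8, 16, 32):
    print(f"{entries:2d}-entry table: {max_tone_hz(40, entries) / 1000:.1f} kHz")
```

A shorter table buys a higher tone frequency at the cost of a rougher approximation of the sine wave.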

Ideally we would synchronise the sample reading with the writing, as this would give us a display that is always synchronised with the tone we are transmitting.  So I might rework the above to read and stash the microphone samples, as well. That will eat more CPU time, though, so I'll probably have to optimise it carefully.  It might be, that instead of using an NMI to drive this, that I just have a tight loop that does both, and interleaves this with doing the visualisation.  The goal here is not something that is perfect, but rather, something that demonstrates that we can produce and receive a signal.

As I was thinking about it overnight, it occurred to me that I might just be able to use the SID chip implementations in the MEGA65 to produce a tone at the target frequency.  This would avoid all the CPU timing complications.  The catch is that the SID doesn't do pure sine waves, but only triangle, saw-tooth and square wave waveforms.

The problem here is that this means that it will introduce harmonics.  I don't know if the harmonics will only be at higher frequencies -- and thus not audible -- or whether there will also be audible harmonics on the lower end.  I'm not a signal processing expert, but my intuition and little bit of radio experience tells me that we can expect at least heterodyning from reflections. That is, the signal will mix with its reflections, and this will produce the sum and difference of each frequency present in the signal.  If any of those products are loud enough, it could produce an audible artefact. The good news is that this is, by definition, testable.  So I can live with that.

So, coming back to producing tones with the SID chip, I need to find out what the maximum frequency that the SID can produce is. Fortunately I have done a bit of mucking about inside the VHDL SID implementation we are using, so I know, for example, that it internally uses a clock in the 10s of MHz to generate everything, so tens of kHz shouldn't be a problem. Also, the SID chip's frequency generation formulae are well known, so we should just be able to work out the correct register settings, and again, just try to produce something.

Let's start with working out the register settings... and here we hit a snag: Although the internals of the SID chip are capable of much higher frequencies, the registers for the frequency generators can't go above about 4kHz, because the 16-bit frequency values don't offer enough dynamic range.
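
The limit falls straight out of the usual SID frequency formula, where the output frequency is the 16-bit register value times the SID clock, divided by 2^24. The sketch below assumes the MEGA65's SIDs keep the register semantics and nominal clock rates of the original chip (985,248Hz for PAL, 1,022,727Hz for NTSC); that is an assumption for illustration, not something checked against the VHDL.

```python
PAL_CLOCK_HZ = 985_248      # nominal C64 PAL SID clock (assumed to apply here)
NTSC_CLOCK_HZ = 1_022_727   # nominal C64 NTSC SID clock (assumed to apply here)

def sid_frequency_hz(register_value, clock_hz):
    """Output frequency for a 16-bit SID frequency register value."""
    return register_value * clock_hz / (1 << 24)

def sid_register_for(frequency_hz, clock_hz):
    """Register value needed for a frequency; may not fit in 16 bits."""
    return round(frequency_hz * (1 << 24) / clock_hz)

for name, clock in (("PAL", PAL_CLOCK_HZ), ("NTSC", NTSC_CLOCK_HZ)):
    print(f"{name}: max {sid_frequency_hz(0xFFFF, clock):.0f} Hz")

needed = sid_register_for(18_000, PAL_CLOCK_HZ)
print(f"18kHz would need register value {needed}, fits in 16 bits: {needed <= 0xFFFF}")
```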

Back to the drawing board then. The 16-bit digital sample registers on the MEGA65 will be much more effective, and allow us to generate close to a sine-wave, as previously discussed. The only problem is feeding them fast enough. It might be possible to make a really tight CPU loop to do this, but it will be a bit of a pain for playing the tone while also processing the incoming signal. 

What I am thinking for a solution to this is to add Amiga-style intelligent audio DMA logic, so that I can point the machine to a sample table, and have it play the sine curve over and over at arbitrary sample frequencies.  It will also be handy for the MEGA65 retro-computer in any case.  For this, we will need the following information for each channel:

1. Base address of the sample data (28-bit address)
2. Length of the sample (16-bit number of samples)
3. Sample frequency
4. Volume level (maybe)
5. Flag to select 4, 8 or 16-bit samples (maybe)

Those last two are maybes, because they aren't strictly required, but will give more flexible output and make more effective use of memory, respectively.

The next question is where to locate them in the system. Unlike the Amiga, the MEGA65 has the CPU always being the bus-master.  This means that it will end up in the CPU one way or another.  There is a DMA controller in the CPU already, and that could be hacked to provide the means of setting things up.

It also actually reminds me of another big problem for the MEGA65 at least:  DMA jobs on the MEGA65 are not interruptible. This means that if I implement this sample playback method, the sound will pause whenever a DMA job is running.  With a bit of clever pre-fetch and buffering, I can probably hide this problem for all but the longest running of DMA jobs.  Given a CPU frequency of ~40MHz, and the maximum 64KB DMA copy requiring ~130Kcycles (or ~260Kcycles for swaps, when they are implemented), that corresponds to about 7 milliseconds.  If the programmer were disciplined, and broke the DMA jobs down into 1KB pieces, then we can get that figure down by close to two orders of magnitude, to ~0.1 milliseconds, which corresponds to a sample frequency >8kHz.  That sounds like a reasonable situation for now, and I can always make the DMA jobs interruptible by the audio data fetch sometime down the track.
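
To get a feel for how long the audio would stall, the estimate can be sketched in Python. The cycles-per-byte figures are inferred from the numbers above (roughly 2 cycles per byte for a copy, 4 for a swap) and the 40MHz clock is a round figure, so treat the results as ballpark only.

```python
CPU_HZ = 40_000_000   # round figure for the CPU clock

def dma_stall_s(job_bytes, cycles_per_byte, cpu_hz=CPU_HZ):
    """Time for which an uninterruptible DMA job would block audio fetches."""
    return job_bytes * cycles_per_byte / cpu_hz

for label, size in (("64KB", 64 * 1024), ("1KB", 1024)):
    for kind, cycles_per_byte in (("copy", 2), ("swap", 4)):
        stall = dma_stall_s(size, cycles_per_byte)
        print(f"{label} {kind}: {stall * 1000:.3f} ms stall, "
              f"unbuffered sample rate up to ~{1 / stall:.0f} Hz")
```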

Anyway, all of the above means that a buffer of even just one sample per channel should allow decent audio, even with DMAs running.  If I make the buffer even just 4 or 8 samples deep, then much higher frequency audio should be possible -- allowing us to produce ultra-sound while running any little DMA jobs that we might need in the test programmes I will need to write.

Since I've just talked myself into using the DMA controller, I might just make the control for this via unused extended DMA job options, rather than having more memory mapped registers.  It also means that setting up a particular sample will be easier in practice, because you will just be able to trigger the relevant audio DMA job, rather than having to stuff a pile of registers.  I'm liking this plan even more... except it will be a pain for freezing.  So I'll have to find some spare memory-mapped register space, after all. Oh well, it sounded like a great idea.

Time to get cracking: First I need to set up the data structures, and create the memory mapped registers to access them. Those are $D720-$D75F, with $10 registers per channel.  To that I have added all the behind-the-scenes plumbing that does the sample fetching, volume control and mixing.  The whole setup is not over complicated, but simulation is still the best way to ensure correct behaviour. And that's where the problem has arisen...
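
For reference, the register block divides up as in this small Python sketch. Only the base address, the block size and the $10 stride come from the text above; the helper name is made up, and the meaning of each individual register within a channel is deliberately not shown.

```python
AUDIO_DMA_BASE = 0xD720     # first audio DMA register
REGS_PER_CHANNEL = 0x10     # registers per channel
CHANNELS = 4                # $D720-$D75F holds four channels

def channel_base(channel):
    """Address of the first register belonging to the given channel (0-3)."""
    if not 0 <= channel < CHANNELS:
        raise ValueError("channel must be in the range 0-3")
    return AUDIO_DMA_BASE + channel * REGS_PER_CHANNEL

for ch in range(CHANNELS):
    first = channel_base(ch)
    last = first + REGS_PER_CHANNEL - 1
    print(f"channel {ch}: ${first:04X}-${last:04X}")
```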

When I simulate it, GHDL is not incrementing various counters in this new stuff correctly.  This seems to be because it thinks that lines are being cross-driven.  The trouble is, all the inputs to the calculations don't seem to have any undefined or other invalid values. This led to a whole rabbit-hole of investigation that took several full days to explore, trying to get backtraces working in GHDL, so that I can see where the meta-values are being produced, as well as dealing with GHDL crashing during compiling the code for synthesis, among others.

After various adventures, I got GHDL built using the GCC back-end with at least rudimentary back-trace support.  So now I am working through the process of finding and eliminating the meta-value problems, so that I can hopefully get to the one that is causing my simulation problems.  Simultaneously, I have been making small incremental changes to the VHDL and synthesising, to progressively inch towards making the audio DMA stuff actually work.  The whole idea of simulation was to avoid this slow process, but here we are with both running neck and neck, as to which will yield results first.

One thing I realised I needed to add, is a mechanism to prevent the audio DMA from hogging the bus. To do this, I have implemented a hold-off timer, that prevents any two audio DMA cycles occurring with less than 8 cycles between them. Combined with the fact that the DMA cycles cannot interrupt a CPU instruction, this should allow for reasonable processor performance, even when the DMA rate is set reasonably high. Also, it means that we should be able to get around 2 million audio DMA operations per second.  Since we need only one audio DMA channel for the ultrasound, this should be plenty.

Well, that all took longer than intended, and I'm not yet sure that it is all 100% correct. There are some niggling CPU timing problems, that mean that the audio DMA only works reliably when the CPU is set to 40MHz, and not in the Hypervisor. More precisely, it probably will get upset if the CPU is running in anything that is not the main memory.  Anyway, for now, those are reasonable limitations that I can work within.

The audio DMA system works by adding a 24-bit fixed-point fractional increment to the sample counter on every CPU cycle. When it reaches 1, then it's time for a new sample. This means that the sample rate will be CPU CLOCK SPEED * FRACTION.  If for simplicity, we call the fraction a simple 24 bit integer, and substitute the CPU speed of 40.5MHz in, we get:

SAMPLE RATE = 40,500,000 * (SPEED / 2^24)

And thus by rearranging, we can find the SPEED value required for any particular sample rate:

SPEED / (2^24) = SAMPLE RATE / 40,500,000

SPEED = SAMPLE RATE * (2^24) / 40,500,000

SPEED = SAMPLE RATE * 0.414252
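
Here is the same rearrangement as a minimal Python sketch, together with a toy model of the accumulator to confirm that the resulting value really does produce the intended number of samples. The function names are made up, and the model only mirrors the description above (add SPEED every CPU cycle, fetch a sample on each carry out of 24 bits); it is not derived from the VHDL.

```python
CPU_HZ = 40_500_000
FRACTION_BITS = 24

def speed_for_sample_rate(rate_hz):
    """SPEED value (24-bit integer) that gives the requested sample rate."""
    return round(rate_hz * (1 << FRACTION_BITS) / CPU_HZ)

def sample_rate_for_speed(speed):
    """Sample rate that a given SPEED value produces."""
    return CPU_HZ * speed / (1 << FRACTION_BITS)

def count_samples(speed, cycles):
    """Step the accumulator once per CPU cycle and count the carries."""
    mask = (1 << FRACTION_BITS) - 1
    accumulator = 0
    samples = 0
    for _ in range(cycles):
        accumulator += speed
        if accumulator > mask:      # carry out of 24 bits: time for a new sample
            accumulator &= mask
            samples += 1
    return samples

# A 27kHz tone played from a 16-entry table needs 27,000 * 16 samples per second.
speed = speed_for_sample_rate(27_000 * 16)
print(f"SPEED = {speed} (${speed:06X})")
print(f"actual rate = {sample_rate_for_speed(speed):.2f} Hz")
print(f"samples in 10 ms = {count_samples(speed, CPU_HZ // 100)}")
```

Because SPEED has to be a whole number, the achieved rate differs very slightly from the requested one, which matters little for a test tone.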

This means that we can achieve sample rates all the way up to the CPU clock speed, and down to about 2.4Hz.  In practice, it is limited to around 1 - 2MHz due to the bus saturation limit, and also because the audio cross-bar mixer effectively places an upper-limit on the sample rate that will come out the speaker.  If the audio cross-bar is limiting the frequency too much, we can make a bypass for that, so that we can increase our upper frequency limit.

For a little test, I have made a 16-entry 8-bit Sine table, so that I can produce a pure tone for calibrating frequency and testing the volume at varying ultrasonic frequencies.  It sounds generally ok, but even I can hear that it isn't a pure tone. So I might increase the sample count and go to 16-bit samples, so that it sounds better.
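
A table like this can be generated with a few lines of Python. This sketch assumes signed 8-bit samples with a peak of 127; whether the hardware wants signed or unsigned data at this point is an assumption, so flip the encoding if needed.

```python
import math

def sine_table(entries=16, amplitude=127):
    """One full sine cycle as signed integers in the range -amplitude..amplitude."""
    return [round(amplitude * math.sin(2 * math.pi * i / entries))
            for i in range(entries)]

def as_bytes(samples):
    """Encode signed samples as two's complement bytes, ready to load into memory."""
    return bytes(s & 0xFF for s in samples)

table = sine_table()
print(table)
print(as_bytes(table).hex())
```

With only 16 steps per cycle the staircase is fairly coarse, which is one reason a longer table sounds cleaner.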

Well, that didn't work.  I even tried it on the MEGAphone prototype, in case it was the audio output circuitry. However, fiddling around, I did discover something important: The distortion changes based on the code the CPU is running.  I even made a little loop that confirmed that the opcode of an instruction is ending up in the audio stream, by changing the bytes I am playing in the sample loop, and finding the silent point occurred when the bytes all matched the opcode of the loop.  More investigation reveals that it isn't just opcode bytes that can show up, but seemingly any CPU memory access.  Also, I was seeing that just occasionally the CPU would pick up a byte from the audio DMA data, and jump off into lala land as a result.

This is rather annoying and a bit worrying, as it means that there is a bigger problem with bus timing than I had expected. I already knew that I had to be careful with not allowing the audio DMAs in hypervisor mode, and only at 40MHz because of funny business. And it now seems that these problems are much more significant than I had hoped.  But because they don't show up in simulation, they are a bit of a pain to track down. 

What I might do here is a bit of a pragmatic solution, and make the bus wait an extra cycle when starting an audio DMA so that the value we want REALLY shows up, and then allow the bus to settle for a cycle back on whatever the CPU was asking for before the DMA.  Another slightly more elegant solution would be to use the dead read wait states in the CPU. But that is rather more complex to implement.

Taking a look at the synthesis logs, it looks like the audio DMA stuff has pushed the tolerance for timing closure on the memory controller in the CPU out the window -- which would explain just about everything.  The trick is how to simplify things back down, so that the logic becomes shallow enough again to get closure. 

It looks like it might be easier to use the read wait-state cycles after all.  Those were all reading from address $000002, so they were easy to find.  I've now gone through and refactored all the audio DMA scheduling into a more generic "background DMA" framework, where the sample LSB and MSB reads for the four channels are considered 8 separate DMA targets. It also means we can add other interesting background DMA actions in the future.

For not the first time in this adventure, I have had something that ALMOST works, but not quite. Sometimes it would simulate, but not synthesise due to multiple drivers, or there would be funny corner-cases where it wouldn't correctly realise when the shadow RAM was reading the correct location for the background DMA activity.  The whole interplay for reading the shadow RAM with effectively zero waitstates is just a pain, but we have to work with it for now. 

I think I now have it so that whenever the shadow RAM is not being used, the CPU does a background DMA read, and correctly latches the data into the audio registers.  While I was doing all that, I also overhauled the audio mixer to use signed samples the whole way through, so that mixing the audio can be done more easily, and without introducing DC biases like unsigned samples do.
It now runs under simulation, with the background DMA reads happening when the CPU reads from IO, or has a wait-state: Basically it makes the background DMA activity the default, and it is only if a different address is presented to the shadow RAM bus that it does something else.  So the moment of proof comes now, while I wait for it to synthesise again...

And again. And a few times after that. But I now finally have a nicely working DMA audio engine.  In fact, it's now good enough that it can play Amiga MOD files with a crude little tracker that I wrote to test it.  Which actually works quite nicely. It can already play a reasonable variety of MOD files, but doesn't yet support most of the MOD effects -- only tempo and instrument volume.

But it does already support repeating samples, too, since I need that for the ultrasound testing, because it would be a pain to have to keep feeding sample data in. Instead, I can just play a 32-byte sine wave loop indefinitely.  The result is this:

From there, I have been working on a little test programme that just plays a continuous sine wave tone, with variable frequency.  The Audio DMA can, in theory, play a sample every tick on the 40.5MHz CPU clock. However, in practice it is limited to about 1/16th of that, i.e., about 2.53MHz. Still not bad.  Our sine wave sample is 32 bytes long, so that means we have a maximum of around 80KHz.  This is comfortably well above our target range of 20 - 30KHz -- and remember that this is not the frequency at which it can play a horrible square wave, it is the frequency at which it can play a pretty nice sine wave -- and all this without the CPU needing to do anything once we set it going.  So I'm pretty happy that we have the audio sub-system in place that will let us produce ultrasonic frequencies. 
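The arithmetic behind that ~80KHz figure can be checked with a few lines of C. This is only a sketch: the clock rate, the 1/16th practical limit and the 32-byte table length come from the text above, while the function and constant names are made up for illustration.

```c
#include <assert.h>

/* Figures taken from the text above; the names are illustrative only. */
#define CPU_CLOCK_HZ      40500000UL  /* 40.5MHz CPU clock */
#define DMA_RATE_DIVISOR  16UL        /* practical limit: about 1/16th of the clock */
#define SINE_TABLE_LENGTH 32UL        /* samples in one full cycle of the sine wave */

/* Highest tone frequency (in Hz) when one pass over the table is one cycle. */
unsigned long max_tone_hz(unsigned long clock_hz,
                          unsigned long divisor,
                          unsigned long samples_per_cycle)
{
    assert(divisor != 0 && samples_per_cycle != 0);
    return clock_hz / divisor / samples_per_cycle;
}
```

With the figures from the text, 40,500,000 / 16 gives 2,531,250 samples per second, and dividing by 32 samples per cycle gives 79,101Hz, i.e., the "around 80KHz" quoted above.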

Now, back to that test programme: It plays the 32-byte sine wave in an infinite loop, and lets us vary the frequency and volume.  It also shows a nice little oscilloscope display of what it thinks it is playing:

Here we can see it set to ~40KHz, above what we need, and the 40MHz 8-bit CPU is doing a fine job of displaying this in real-time.  It is currently fixed with respect to the sweep time, with the 256 pixel wide display corresponding to about 130 microseconds as a natural consequence of the tight sample reading loop. It took a little bit of fiddling to get the programme right, though:

First, I had to synchronise the sweep to always start at the beginning of a loop of the sample. This helps to hold the display with a fairly steady phase. 

Second, I was using DMA to update the display with a double-buffered arrangement I had made for another programme. However, that can't be used here, because foreground DMAs, such as the double-buffer copying, cause the audio DMA in the background to pause. This was causing quite horrible audio artefacts as the tone would be interrupted tens of times per second for a number of milli-seconds.  Eliminating the double-buffer and all other foreground DMAs fixed that problem.  This is one of the things about this audio sub-system that leave it clearly in the 8-bit home computer class, and not in the multimedia PC class where the Amiga lives: There are no prioritised media DMA slots to ensure stable audio at all times.  Demo writers thus get to have fun working out how they can do parts with cool DMA effects *and* still have nice digital audio playing in the background.

Third, even when I fixed those problems, there was still a funny artefact where there are spikes in the audio playback. There are some hints that this might be a glitch when the next sample begins to play. However, the audio hardware is so fast (the PDM is driven at 40.5MHz, giving ~25ns intervals) that it is hard to capture this reliably on my oscilloscope. The result is wave-forms that look like this:

This problem is manageable, however, as the sine wave is still otherwise intact, and produces an acoustically decent tone.   The glitches do produce some audible artefacts when driving ultrasonic frequencies, but this is all at tolerable levels for the test.  So I'll worry about fixing that later, unless it does turn out to be problematic.

The next step is to add the reading of the microphone data.  Hopefully this will go smoothly, as I have already established that the microphone is sensitive well past 20KHz, and according to the data-sheet, all the way up to the maximum 79KHz that we would realistically generate.

Getting the code running on the MEGAphone prototype was easy, as it is fully compatible with the desktop MEGA65 that I have been developing it on.  The only trick was I had to remember how to control the amplifier on the MEGAphone for the speakers so that I can get the volume loud enough for testing.  This is controlled by $FFD7035 and $FFD7036, where $00 = +24dB, $40 = 0dB and $60 = -24dB and $FF = mute.  It was set to $60 by default and was way too quiet. 

I did some initial testing with $00 (+24dB) and it seems to work, although I think it might send the amplifier into over-current shutdown after a while. That would be ok for this application, as we only need a very low duty-cycle.  What is more of a problem is that at +24dB there are a lot of audible artefacts which I will need to investigate. Once the kids have gone to sleep I'll start experimenting again to see if +0dB is loud enough to be detected at a decent range.

So, in terms of initial tests, I'm initially testing only at very close range, and having to deal with some distortion caused by a bug in the audio DMA subsystem that I have yet to figure out: Basically if the CPU is running, then the waveform is distorted.  I'll likely deal with that by implementing a special "play sine wave" mode for the audio subsystem, so that we can side-step the whole DMA thing altogether. But it's still good enough that we can work out where the resonances are that are good enough to receive anything.  So:

There is a peak between about 20.0KHz and 20.2KHz, that ramps up more slowly on the low side, and falls off quite quickly on the high-side.  Then there are also a few peaks in the 25KHz to 30KHz range, but they are lower than the 20.1KHz centred peak. Above 30KHz I haven't seen any signs of life as yet. But this DMA distortion problem is getting worse, the higher the frequency, so I think I need to implement the "play pure sine wave" function, and then test it again.  But I must say: I'm pleasantly surprised that the speaker is in fact producing any energy at all above 20KHz.  Whether it has enough range, and whether it can do so with quiet enough audible artefacts remains an open question, though.

I implemented the sine table ROM to avoid the distortion.  There are still audible artefacts, presumably because the MEGA65's audio output is PDM, i.e., lots of 1s and 0s rather than a true sine wave.  A side-effect of fixing this is that the peak response shifted down from ~20.1KHz to around 18KHz.  There is some response up around 26KHz-30KHz, but it is quite weak. 

It looks like the next step required will be to somehow filter the signal, so that we get closer to a true sine curve.  The data-sheet for the amplifier suggests using ferrite beads on both leads to the speaker, and a 470uF capacitor to GND on the high-side.  At that point, though, we cease to be working with the existing MEGA65 hardware without modification, and we are already well past the scope of this work unit.

What I will still do, is attempt to more adequately quantify the range of ultrasonic frequencies that respond, and what the fall-off looks like, so that we can identify the likely usable frequency range for communications.  Minor speed-bump, though, in the form of the speaker lead breaking:

Fixed that. Next step was to improve the ultrasound test programme to show some kind of frequency domain break-down.  An FFT would be ideal, but a fair bit of work to produce, when I just want to be able to see if there is noticeable energy at a frequency. So I made a simple programme that superimposes the sample train over itself at all possible time deltas, and then measures the energy of the result.  If the delay corresponds to a periodicity of the signal, then it will result in a higher amplitude signal.  I added a bit of filtering to it to deal with harmonics etc, and give or take a few wrinkles, it produces fairly clean peaks for the frequencies at which energy is present:

The red peaks correspond to the periods at which energy is present, i.e., further left is higher frequency, and further right is lower frequency.   The very tall line-like peak on the left edge is an artefact of measuring energy by superimposing waves and looking for reinforcement, as it will basically pick up when adjacent samples don't zero-cross.  Thus it should be ignored. But we can see the next peak corresponds to 1 cycle of the ~76KHz resonance of the microphone. The other peaks come and go a fair bit, and are less robust, and may correspond to some weaker lower frequency enveloping of the 76KHz resonance.
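For anyone wanting to try the same trick, the superimpose-and-measure idea can be sketched in C roughly as follows. This is not the actual test programme: the function names are hypothetical, it assumes signed 8-bit samples, and it leaves out the extra filtering for harmonics mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Energy of a signed sample train superimposed on a copy of itself that is
 * delayed by `lag` samples. Only the overlapping n - lag samples are summed.
 */
long superimposed_energy(const signed char *samples, size_t n, size_t lag)
{
    long energy = 0;
    size_t i;

    assert(lag < n);
    for (i = 0; i + lag < n; i++) {
        long sum = (long)samples[i] + (long)samples[i + lag];
        energy += sum * sum;
    }
    return energy;
}

/*
 * Return the lag in [min_lag, max_lag] with the highest mean energy per
 * overlapping sample. The first such lag wins if there is a tie.
 */
size_t strongest_period(const signed char *samples, size_t n,
                        size_t min_lag, size_t max_lag)
{
    size_t lag, best_lag = min_lag;
    long best_mean = -1;

    assert(min_lag <= max_lag && max_lag < n);
    for (lag = min_lag; lag <= max_lag; lag++) {
        long mean = superimposed_energy(samples, n, lag) / (long)(n - lag);
        if (mean > best_mean) {
            best_mean = mean;
            best_lag = lag;
        }
    }
    return best_lag;
}
```

When the lag equals the period, each sample is added to an identical one, so the amplitude doubles and the energy quadruples. At half a period the two copies cancel. Very small lags also score highly, simply because neighbouring samples are similar, which is the left-edge artefact described above, and is why the search here starts from a caller-chosen min_lag rather than from zero.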

So I think we are finally set to try various frequencies and measure at which frequencies we reliably see energy, and to delineate that set of frequencies as those which are potentially usable for ultrasound communications on the existing MEGAphone hardware, and thus to give us sufficient characterisation of the available channel bandwidth.

I'll work with the amplifier set to $40, which is some way off the loudest setting, and the speaker leaning against the MEGAphone prototype about 1cm away from the phone.  If we can't pick up a signal under those conditions, then we'll assume it won't be usable. But if we can, then we can repeat the test with some increased distance and power, and see what kind of range is possible.  To support this, I have improved my test programmes to work out the period of the sample frequency requested, and to calculate a running average of the estimated power at that frequency.  In this way, I have an objective and quantitative means of comparing the power at different frequencies, even if the units are undefined because of the estimation process.  The display with this looks like this:

The light blue vertical line is the period we are looking for. This corresponds to exactly one full wave of the tone being played (visualised by the white points). The received return via the microphone is the yellow points, and the red line is the power at each period.

So we can see that there is some power at the target frequency here, as there is an observable periodicity in the yellow waveform, which is then reflected by the red peak at that point. The running average is then calculated and displayed in the lighter red coloured text. For each sampling, I will collect 63 samples.  I'll also do a quick variance check by doing three successive collections with a single setting, so that we can get an estimate of how reliable the readings are. It was quite pleasing to be able to implement all of this directly on the MEGAphone's 40MHz 8-bit CPU.
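The two calculations involved here, the period of the requested tone in samples and the running average of the power readings, might look something like this in C. The names are hypothetical, and the microphone sample rate is left as a parameter, because the text doesn't state it exactly (the 256-pixel, ~130 microsecond sweep mentioned earlier suggests something near 2 million samples per second, but that is only an inference).

```c
#include <assert.h>

/* Period of a tone in samples, rounded to the nearest whole sample. */
unsigned long period_in_samples(unsigned long sample_rate_hz,
                                unsigned long tone_hz)
{
    assert(tone_hz != 0);
    return (sample_rate_hz + tone_hz / 2) / tone_hz;
}

/* Cumulative average of the power readings collected so far. */
struct running_average {
    unsigned long sum;
    unsigned int count;
};

void running_average_add(struct running_average *avg, unsigned long reading)
{
    avg->sum += reading;
    avg->count++;
}

unsigned long running_average_value(const struct running_average *avg)
{
    return avg->count ? avg->sum / avg->count : 0;
}
```

The power units are arbitrary, as noted above, so the averages are only meaningful when compared against each other, e.g., across the three successive collections at a single setting.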

Okay, so having batched a bunch of runs at different frequencies, we find the following:

1. Below ~19.4KHz the waveform is very loud.
2. Between ~19.4KHz and 19.7KHz it's not quite as loud, but still very loud, perhaps 1/2 as loud as just below 19.4KHz.
3. From 19.8KHz to 20.4KHz it's about 1/2 as loud again, with relatively flat response.
4. From 20.4KHz to about 21.4KHz it is about as loud as between 19.4KHz and 19.7KHz again.
5. From 21.4KHz it drops off very sharply, with no discernible signal beyond 21.7KHz.

So assuming we want to keep high enough frequency to not annoy people, there is probably about 1.4KHz centred on 20.7KHz that would be usable.  As tempting as it would be from a link-margin perspective, dropping down to below 19.4KHz is probably not feasible.

Increasing the amplifier level from 0dB to +12dB yielded a discernible waveform at distances of around 30 - 40 cm at 19.4KHz -- i.e., where the performance of the system is very good.  The amplifier can go to +24dB, which would thus be expected to perhaps deliver the ~1.5m range, but at a power consumption during transmission of greater than 2W.  It seems rather unlikely that 1.5m range would be obtainable in practice over the whole band, although somewhat shorter range may well be possible, or it may be possible to increase the amplifier power further.
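As a rough sanity check on that ~1.5m estimate, here is a small C sketch of the scaling. It rests on an assumption that is not from the measurements above: that the received amplitude falls off in inverse proportion to distance, so that every extra 6dB of gain (very nearly a doubling of amplitude) roughly doubles the range. The function name is made up for illustration.

```c
#include <assert.h>

/*
 * Estimated range after adding gain, ASSUMING amplitude falls off as
 * 1/distance, so each extra 6dB roughly doubles the range.
 * Only whole multiples of 6dB are handled, to keep the sketch integer-only.
 */
unsigned long scaled_range_cm(unsigned long range_cm, unsigned int extra_gain_db)
{
    assert(extra_gain_db % 6 == 0);
    while (extra_gain_db >= 6) {
        range_cm *= 2;
        extra_gain_db -= 6;
    }
    return range_cm;
}
```

Going from +12dB to +24dB is another 12dB, so the 30 - 40 cm observed would become roughly 120 - 160 cm under this assumption, which is in line with the ~1.5m figure. Real rooms, directional speakers and the uneven frequency response will all make the true result worse than this.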

But whatever the frequency and amplifier arrangement, some kind of filtering is going to be required, to prevent the audible artefacts that are present across the whole frequency range tested.

That said, we have clearly proven that the microphones we are using are sensitive to ultrasound. Thus it could be possible to use a smaller speaker with improved ultrasonic properties to further boost performance.

But as I reflect on all this, it seems to me that there are a bunch of problems, that together make me a little sceptical about the utility of such a system in practice:

1. Transmit power of 2W or more is likely required to produce a useful range over a wide-enough ultrasonic band.  Even at a low duty-cycle, this will create a significant power consumption.
2. The speakers we use, and most ultrasonic speakers, are rather directional, making detection of proximity rather unreliable. It was quite fiddly to get the results described above, demonstrating that this is not just a theoretical problem.
3. Point (2) is made worse if you have the device in a bag or pocket.
4. The audible artefacts are REALLY annoying.
5. Some people can likely hear up around 20KHz, and even the occasional short chirp of a packet is going to be very annoying to listen to.
6. Bluetooth is already widely available, requires no new hardware, and for all its problems, has much better performance than I have been able to observe here. In particular, the lack of risk of audible artefacts seems compelling in the circumstances.
7. The need for contact tracing apps seems to have somewhat dissipated, at least for the time being.

However, as a further communications channel between devices when faced with a hostile RF environment, it would seem to have potential.  In that case, it would likely be possible to mask the audible artefacts by, for example, playing music or other sound while the ultrasonic transfer occurs.  In this context the difficulty of creating an ultrasonic signal that can travel great distances becomes a strength, because it means that an effective jamming effort would require considerable proximity. In contrast, 2.4GHz Bluetooth or Wifi are rather easy to jam from a distance. 

Thus, while this wasn't the primary objective of this investigation, it has revealed that such near-ultrasonic communications is quite possible using rather conventional components found in smart-phones and similar devices.  It also means that from a privacy perspective we must take care, as it is similarly possible for devices to communicate via ultrasound without a user's knowledge, e.g., to exfiltrate data across air-gaps.

Wednesday, 17 June 2020

Pre-Ordering for the MEGA65 Developer's Kits (DevKits) is nigh!

Goodness me, it's been a long time that we have been working on the MEGA65 now. But we are finally at the super-exciting end of things where things are starting to happen more quickly.  And that includes this week, when pre-ordering of the DevKits will open over at Trenz Electronic.

So let's talk about what the DevKits are, and what the price will be. 

But before I continue, I just want to be careful to explain that the pricing of the DevKits is not representative of what the price of the final machine will be.  There are a number of reasons for this: 

First up, at this early stage of things, we don't have the price of the various components optimised.  We are buying smaller quantities, and learning how to do everything right. That all translates to increased costs for the DevKits. We hope to be able to sell the final machines for less than it costs us to manufacture the parts for a DevKit, for example.

Second, the DevKit release is partly to help early adopters get hold of the machine and start writing software, documentation and other goodies for the community, and partly to provide us the cash-flow to get the production machines out ready as soon as possible.  This means covering the cost of designing and printing the packaging, user guide and other goodies.  While we are already well advanced on some of that, there are some very real costs that we need to cover.

Third, somewhat counter-intuitively, the acrylic cases of the DevKits are probably more expensive than what the injection moulded cases of the production machines will cost.  This all comes back to the wonder of injection moulding: You pay a fortune up front, and then you can produce the best cases at the best price for a long time after.  That's where we are aiming to be for the production machines.

All up, expect that the DevKits will be quite noticeably more expensive than the production machines.  Also, remember that the MEGA65 is an open-source computer being produced on a non-profit basis: None of the MEGA65 team earn anything from the sales. It all goes to cover costs and support the completion of the machine.

But let's get back to talking about the DevKits: One of our big goals is to increase the number of people who can develop on, and contribute to the MEGA65.  We see this as a once-in-a-generation opportunity to help shape and be part of the story of The Last 8-bit Computer that never was, and now will be.  There may be other 8-bit systems in the future, but as the spiritual successor and completion of the C65, we think the MEGA65 has a pretty special role to play -- and we'd love to have more folks help us make it as awesome and exciting as we can. 

We want it to be a machine with a variety of software and good compatibility and rock-solid performance when it is ready to arrive under people's Christmas trees, birthday present piles and Retro Rooms, so that we can recreate that "Christmas 1982" feeling one more time for the community. To achieve this, we need as many people as possible to contribute in a variety of ways, whether helping with documentation, C65-fixing existing software, writing new games, programmes and tools, or contributing to the VHDL and operating system software of the MEGA65 itself. 

It's a radically different model to most computers around today, that are more like mountains upon which we gaze, or perhaps at most seek to climb.  But we want the New Zealand model, where geography isn't just something you look at, but is rather more of a participatory sport.  So too with the MEGA65, it's by participating in the story that you have the most fun, and can share the most joy with the community. Come. Be part of the story with us.

Frequently Asked Questions

Q: What's a DevKit?
A: A development kit aimed at developers so they can start coding software for the machine and even help shaping the final product before it is released.

Q: Why does it look so different?
A: The case is made in a way it can be produced in small batches before injection moulds are finished. Its transparency helps finding out if the smoke stays inside the chips.

Q: I am a collector not a developer.
A: The DevKits have laser-engraved Logos and serial numbers to make them unique. DevKits usually are great collector's items.

Q: I only want to play with it!
A: The DevKit is like a "real" MEGA65 only in a preliminary form. You might encounter hiccups but you can always (soft-)update it.

Q: I do not like the floppy.
A: The DevKits come as (hence the name) kits which include a refurbished floppy drive. Feel free to leave that out and donate it to us.

Q: What will it cost?
A: DevKits are always more expensive than mass-produced machines, but they also get strong support from the makers. The MEGA65 DevKit comes with a price tag of EUR 999 which is in fact very low considering the cost of the components, support and general preparations required for this initial production run of machines, as well as our costs of getting the final machines ready for release. The final machines will benefit from these things, and improved economies of scale, which will allow it to be released at a lower price.

Q: Can you build it for me?
A: It's really easy to build, usually under an hour, maybe a bit more if you are clumsy. If you do not dare to build it yourself please ask in our support forum or via the other communication channels you will get access to. There are many nice people around!

Q: I am more interested in the final MEGA65 and not really a developer, how can I support and improve the development of the final machine?
A: Please buy a DevKit and lend/donate it to a talented developer!

Q: When can I buy it?
A: From tomorrow on, but do not wait too long!

Q: I am a blessed developer and want to sacrifice all my time but I do not have any money!
A: Please talk to us about support!

Q: If I buy a DevKit, can I transfer the PCB and keyboard into a MEGA65 case later? Can I even 3D-print my own MEGA65 case?
A: Most probably yes! But we can’t guarantee it.

Friday, 12 June 2020

Fit testing the injection moulded case sample

Another post where I promise to be short on the words, so that you can just enjoy seeing the MEGA65 hardware become a reality:  This time we have the components all being put together to make sure that everything fits.

These photos are hot off the press -- as you can read in the labels on the case parts, they were only manufactured on the 8th, and it's now early on the 12th.  I'll wait to hear from the team if any problems have been spotted, but so far it looks fantastic to me.  The keyboard has enough clearance around all the keys so that none should jam (a problem on the real C65 prototypes, where the cursor keys and return were rather problematic).

But as promised, I'll now shut up, and let you enjoy the eye candy.

Saturday, 6 June 2020

Fixing some floppy bugs

Among everything else, we have been looking at some bugs with the MEGA65's internal floppy controller.  It was working most of the time, but would hang in various situations.

The first problem was that it would hang during loading files a long way from the directory track.  I was worried at first, that it was some problem with the MFM decoding not being good enough.  So I wrote a nice test programme that  reads some MFM decode debug info, and shows a histogram of the gap sizes.  This should result in 3 very clear peaks corresponding to the different bit gap lengths that MFM produces. As the test disk I have here is empty, it's quite heavily skewed, but it is still clear that the peaks are there, are well spaced, and nice and narrow:

The third peak here is really just a little blip, because of the disk being empty. But watching multiple frames, I could see that it is there and real. The colours really just indicate the height of the lines.  The left edge of the chart is shorter intervals, and the right side longer intervals.
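The idea behind the histogram can be sketched in C along these lines. In MFM, the gaps between flux transitions come in three nominal lengths: 2, 3 or 4 half-bit-cells. The sketch below bins raw gap measurements and classifies each one. The function names, the 8-bit gap values and the tick rate are all assumptions for illustration, and the sketch does not show how the real programme gets its data from the controller.

```c
#include <assert.h>
#include <stddef.h>

#define HISTOGRAM_BINS 256

/* Count how often each gap length (in timer ticks) occurs. */
void gap_histogram(const unsigned char *gaps, size_t n,
                   unsigned int bins[HISTOGRAM_BINS])
{
    size_t i;

    for (i = 0; i < HISTOGRAM_BINS; i++)
        bins[i] = 0;
    for (i = 0; i < n; i++)
        bins[gaps[i]]++;
}

/*
 * Classify a gap as 2, 3 or 4 half-bit-cells, rounding to the nearest.
 * Returns 0 for anything outside the range that MFM can produce.
 * half_cell_ticks is the (assumed) number of timer ticks per half-bit-cell.
 */
int classify_gap(unsigned int gap_ticks, unsigned int half_cell_ticks)
{
    unsigned int units;

    assert(half_cell_ticks != 0);
    units = (gap_ticks + half_cell_ticks / 2) / half_cell_ticks;
    return (units >= 2 && units <= 4) ? (int)units : 0;
}
```

Narrow, well separated peaks in the histogram mean that almost every gap rounds cleanly to one of the three classes, which is what makes the decoding reliable.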

I'm actually really happy with this nice little tool, as it runs continuously, and you can swap disks etc, and see the content change.  With a formatted disk, it does several frames per second.  This is of course running natively on the MEGA65.

The video mode is 640x200 using a combination of normal text and 16-colour text mode, where each nybl of a character byte encodes one pixel.  This means the whole screen fits in 640x200x0.5 bytes = ~64KB.  Being able to mix normal chars in makes it much easier (and faster) to draw text over the display.  This all contributes to the quite fast performance, even though I wrote it all in CC65, which while quite handy, doesn't really produce particularly fast compiled C code.  One day we will teach it some of the 4510 and 45GS02's tricks to produce MUCH faster output, but that will have to wait for another day.
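To make the nybl arithmetic concrete, here is a small C sketch of two-pixels-per-byte packing. Be warned that it is a simplification: it assumes a plain linear frame buffer, with the left pixel of each pair in the low nybl, whereas the real 16-colour text mode arranges its data per character, and I haven't shown that layout here. The names are made up for illustration.

```c
#include <assert.h>

#define SCREEN_WIDTH  640UL
#define SCREEN_HEIGHT 200UL

/* Two 4-bit pixels per byte: 640 x 200 x 0.5 = 64,000 bytes. */
unsigned long frame_bytes(void)
{
    return SCREEN_WIDTH * SCREEN_HEIGHT / 2;
}

/*
 * Set one pixel in a linear, nybl-packed frame buffer.
 * ASSUMPTION: even x lives in the low nybl, odd x in the high nybl.
 */
void set_pixel(unsigned char *frame, unsigned long x, unsigned long y,
               unsigned char colour)
{
    unsigned long index;

    assert(x < SCREEN_WIDTH && y < SCREEN_HEIGHT && colour < 16);
    index = (y * SCREEN_WIDTH + x) / 2;
    if (x & 1)
        frame[index] = (unsigned char)((frame[index] & 0x0F) | (colour << 4));
    else
        frame[index] = (unsigned char)((frame[index] & 0xF0) | colour);
}
```

The sizes are unsigned long rather than size_t on purpose, since 640 x 200 doesn't fit in the 16 bits that size_t has under CC65.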

Meanwhile, if you are curious what the distribution of an unformatted track looks like, here is an example:

We still see indication of the first and second peaks, perhaps because of some factory formatting artefact or something, or from whatever else was on these disks previously.  But we see the distribution is continuous, and thus it isn't really possible to classify any given sample with certainty. The drop-off on the left edge is presumably due to the limit of the magnetic medium.

I find the whole low-level signal processing side of floppies is quite fascinating.  One day when I have time, I want to see just how much data I can cram on a 720K or 1.44MB floppy using modern RLL(2,7) encoding, a single really long sector, variable write speed per track to match the varying linear velocity of the tracks, and using modern error correcting codes to enable us to tolerate some errors.  My gut feeling is that at least double the capacity should be possible.  But that, also, will have to wait for another day.

Anyway, having confirmed that the floppy was being read reliably, I started implementing a random track seek function, so that I could see if it was the seeking that was the problem.  And indeed it was: Sometimes the drive would seek either one track too few or one track too many.

I thought about a few different ways to solve this problem. In the end, I opted to include a feature that makes it easier to use the controller:  If the MFM decoder spots a sector on the track under the head, and it doesn't match the track we are expecting, the controller will step the head one track in the correct direction. It's a bit like an auto-tuner for a celebrity who can't reliably stay on the correct notes, but for floppy drives.

This is nice from a programmer's perspective, because you don't have to step the drive to the right track before scheduling a read or a write. It can still be turned off, if you don't want it, but for most use-cases, it's probably a good idea.
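The decision the controller makes each time it sees a sector header is simple enough to model in a few lines of C. The real thing lives in the controller's VHDL, so this is just a sketch of the logic with made-up names, and the mapping of +1 and -1 to physical step directions is an assumption.

```c
#include <assert.h>

/*
 * Which way to step the head, given the track number seen in a sector
 * header and the track we actually want: -1, 0 or +1.
 * Returns 0 if auto-tune is disabled or we are already on the right track.
 */
int autotune_step(unsigned int seen_track, unsigned int target_track,
                  int enabled)
{
    if (!enabled || seen_track == target_track)
        return 0;
    return (seen_track < target_track) ? 1 : -1;
}

/* Model repeated single steps until the head settles; returns steps taken. */
unsigned int autotune_settle(unsigned int head_track, unsigned int target_track)
{
    unsigned int steps = 0;
    int step;

    while ((step = autotune_step(head_track, target_track, 1)) != 0) {
        head_track = (unsigned int)((int)head_track + step);
        steps++;
    }
    assert(head_track == target_track);
    return steps;
}
```

Because each correction is a single step in the right direction, an off-by-one seek error costs just one extra step before the read or write can proceed.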

With auto-tune implemented, the tracking was now quite reliable.  That fixed the problem of loading files that were a long way from the directory track.  However, loading big files would sometimes hang, and Falk working on the MEGA65 GEOS port was also having drive lock-up problems.  So I enhanced the floppy test utility to include a looping read test.  This reproduced the problem, with the test locking up after random amounts of time.  It would also hang completely if the drive was on an unformatted track, or no disk was inserted.

So I went through the read timeout logic with a fine-tooth comb, and found some corner cases and fixed them.  That got it working nicely. Here is the read test working:

The two-tone green is just so that you can more easily work out which track is involved. Track 0 is on the left, and it will try up to track 85, just because I felt like it.

In the process of this, I also discovered that you can't really trust the side byte in the sector header of disks formatted in a 1581, so I modified the controller so that it only checks the track and sector match.

There will probably be a few more wrinkles to sort out in all this, but it's a nice step forward.

Injection Moulding Tooling Update

Just a quick post to give you all an update on the tooling for the MEGA65 case.  As most of you will probably already know, we are having real injection-moulded cases for the MEGA65, thanks to the generosity of the community who collectively donated the cost of producing the moulds -- a total of some 66,000€!

So, we are now at the point where they can do test-runs with the tools.  This will be done to produce a few pieces for fit-testing with the motherboard and keyboard. It will also be used to look for problems in the injection moulding process, e.g., if the plastic doesn't flow to every corner properly, or there are visible artefacts of the plastic flow slowing or doing other strange things. 

There will be more testing, and then finishing touches, like sand-blasting the texture into the mould cavities.  The coloured plastic for the production parts will also be used at the end. In the meantime it is easier for the factory to just use the naturally coloured plastic during testing, so that they don't have to store or throw away plastic they can't use for other things in the meantime.

Anyway, enough blabber from me -- enjoy the photos and short videos showing the tools in action!

Sunday, 24 May 2020

Improving the MEGA65 Audio Mixer

This past week we have been working on improving some of the cross-development tooling that we use, which I will write about soon. But for now there is nothing really visible to show.  So I will instead share with you the work I did today on getting the MEGA65 Audio Mixer that is accessed from the Freeze Menu more user friendly.

We now have some nice decibel logarithmic volume scales, and a bunch of friendly controls for controlling overall volume level, as well as selecting stereo/mono, and swapping left/right channels.  You can also use the up/down cursor keys to pick a particular element and left/right cursor keys to change the values.    You can also press T and it will play a few musical notes through the four SIDs, alternating left and right, so that you can more easily verify that the settings are sensible, and not left/right swapped.

All up, it now feels like an interface that the average user could use, especially compared with the previous proof-of-concept audio mixer. Just in case you can't remember or haven't seen just how "user friendly" the previous one was, here is a reminder:

While it is technically more powerful (as you can set every coefficient in the full-cross-bar audio mixer separately), you really need to know what you are doing to avoid instant and prolonged confusion.

Wednesday, 20 May 2020

Injection Moulding Tooling Progress

A quick post just to share the exciting progress with the injection-moulding tool manufacture for the MEGA65 case.  The major pieces of the tool have now been produced, as you can see below!

First, here we see the inside of the lower case. The tools are all in negative profile. The big round holes are for the pushers for pushing the cases out of the tool. The smaller round holes are all for screw bosses -- so in addition to the bosses for the motherboard, there are plenty of extra ones for expansion boards, optional internal speaker etc. You can also see the MEGA65 team and major donors for the tooling costs here, so that they will appear as part of every case we produce. We are very proud to recognise some of the key contributors to the creation of the MEGA65 in this way, and wish we had limitless space to be able to recognise everyone who has been involved in the project.

Then we have the outside of the underside, where we can see the hole for the trap-door, and the logos for M.E.G.A. and Hintsteiner.  Note that it seems to be missing 2 of the sides. Those are the sides where all the ports go.  Those sides come in as separate pieces that move in before the plastic is loaded, and then pull out of the way before the pushers push the finished part out of the mould. 

Then for the upper half of the case we have the same inner and outer pieces, although this time quite a bit simpler. The funny sword looking shape in the middle is where the plastic comes in to distribute around, and to hold the piece straight as it comes out warm.
And of course the opposite side of that.  Here we also see the really big diameter holes where the rods go through on which the mould will slide together and apart each time a piece is moulded.  The lower half of the case will have something similar, but of course more complicated because of the pushers that come in from the sides.

Then we have the relatively simple trapdoor slot and eject button on a single little family mould. These can go on a single mould because they are similar enough in volume that they can be balanced within the mould, to obtain good results.

 Here is the other part for this last mould still being prepared:

While I haven't had the chance to discuss it with them directly, my understanding is that the next stage is the assembling of the tools with all the big rods that hold them together etc, and making sure everything fits together properly.   Then it will be time to load it into a moulding machine at the tooling factory and test it out.  

Once they are happy at the tool factory, they will ship it to the factory where it will actually be used in the production injection moulding machines. They will then go through a commissioning process, where they will fiddle with pressures, gate settings etc to make sure the plastic flow fills every little bit reliably (a bit of a black art), and try to minimise the time per part, so that we optimise the price per case (you pay by the second that the machine is busy). All that will take another month or three, depending on how COVID19 affects everything.

We'll let you know more as soon as we know it, but it's already very exciting :)

Sunday, 10 May 2020

More work on the RAM Expansion and Timing Closure

I recently got the MEGA65's 8MB built-in expansion RAM mostly working.  I say mostly working, because it was still doing various strange things sometimes.  So I have been spending all my spare time trying to get this sorted.
While in theory the RAM controller is simple, in order for it to offer decent performance, it ends up a bit more complex.

Let's start with the simple HyperRAM itself: This can be driven by a clock of up to 100MHz, allowing for DDR throughput of up to 200MB/sec.  Except that it can't.

First, we don't really have the means to use a 100MHz clock, because it's not a clean multiple of any of the clocks we are using. So we are instead using 81.5MHz, which is 2x the MEGA65 CPU clock and 3x the VIC-IV's pixel clock.

Second, the HyperRAM is accessed by first sending it a command to read or write something.  Regardless of the clock rate, there is a minimum of 40ns plus a bit between transactions.  In reality, it takes about 100ns -- 200ns to actually perform a random read or write. 

Third, in order to not drag the maximum clock speed of the MEGA65's CPU down, the expansion RAM is on the "slow devices" bus.  They don't really have to be slow, but the CPU requires at least 2 clock cycles to access such a device.  That bus then asks the HyperRAM for data, which takes at least another 2 clock cycles.  Thus there is something like 100ns extra latency from the memory architecture of the MEGA65.

The result is that a naive implementation will have about 300ns latency, i.e., about what a real C64 in 1982 had with its ancient DRAM chips.  This is enough for 2 - 3MB/sec -- which with a 40MHz CPU is a bit disappointing.

Thus the HyperRAM controller and related bits and pieces in the MEGA65 does a number of tricks to hide this latency, and increase the throughput for linear accesses. We are concentrating on linear access, as we figure it is most likely how the expansion RAM will be used: You will either be copying things in or out of it.  Some of those tricks are:

1. Two eight-byte write buffers, that allow most writes to happen with just 2 cycle latency, unless the buffers are already busy. Each buffer covers an 8-byte aligned region of 8-bytes, and multiple writes to the region in quick succession will result in a merged 64-bit write to the expansion RAM.  This speeds up writing a lot.

2. A similar setup of two eight-byte read buffers that you could call a tiny cache.  Again, this helps avoid the 100 -- 200ns latency of the HyperRAM, but not the latency of the slow device bus and related bits.

3. A block-read mechanism that reads and buffers up to 32 consecutive bytes from the HyperRAM, and feeds them pre-emptively to the slow device bus as an 8-byte buffer. This really helps reduce read latency.

4. A pre-emptive pre-fetch mechanism in the slow device bus that pushes the byte after the previous read one to the CPU, so that if it is doing a linear read, it can avoid all but one cycle of latency in the slow device bus, and theoretically allow reads at up to 20MB/sec.

I have also added a priority access mechanism to the HyperRAM, so that we can have automatic chipset-driven DMA-like access, a little like the Amiga, so that we could, for example, feed the digital audio channels directly, without any CPU intervention.  I have yet to work out exactly what that will look like, but it is important to get the interface in now, so that it can be worked on later.

So all of this mostly works, but there are various little corner cases.  Also, the timing closure on the MEGA65 as a whole was getting a bit out of hand again, so I have spent quite a lot of time trying to flatten logic, so that everything that needs to happen in a single clock tick can in fact happen. And reliably.  This also can help synthesis to run a bit faster, because it doesn't have to try so hard to make everything fit with acceptable timing.  But it's also important when debugging weird behaviours, so that I can be confident it isn't just because everything is violating timing by a bit.

I now have it mostly meeting timing, and all of the above optimisations and features at least mostly working.  Where I am now up to is that there are some weird bugs with the chipset DMA, and chaining of the 32-byte fetches and the like.  Also the optimisation in (4) is not completely implemented.

First, I'll tackle the weird bugs with the DMA. Basically the HyperRAM controller can get confused and freeze up, messing up both the chipset DMA, as well as regular CPU access.  I think much of the problem is how it handles the situation where a request comes in from the slow_device bus at around the same time as a chipset DMA. The chipset DMA is supposed to take priority, since it might be quite timing sensitive.  This means that the incoming request from the slow device bus has to be remembered for later.

This is further complicated by the fact that the HyperRAM controller has an 81.5MHz and a 163MHz section, and we have to be careful how we pass information between the two.  Normally the 81.5MHz side registers and processes the requests from the slow device bus. But when the HyperRAM is actually being accessed, it is the 163MHz part of the circuit that comes into play.  The trouble comes when the 163MHz side realises that it can service a newly arriving request, which it should do to reduce latency.

It should also tell the 81.5MHz side that it has processed the request, so that it doesn't get processed twice. And this seems to be where it messes up at the moment: The request does in fact get processed twice.  This probably means I need to have a signal that tells the 81.5MHz side when a latched request has in fact already been processed.  Ok, implemented that, and it looks to simulate properly.

Next is to finish implementing the pre-emptive delivery of data to the CPU, so that it saves time every time it can, instead of just sometimes. Basically now it will save time at most 1/2 the time, because it supplies the next byte whenever the CPU requests a next byte, and it can.  For a linear read, that's only 3 out of every 8.  As we have the cache to work with, we have an 8-byte aligned block, and we should be able to pre-fetch 7 out of every 8 bytes in a linear read. We just need to watch for when the CPU consumes the previous pre-fetched byte (which it already tells the slow device bus) and get the next byte ready in that case. This all happens in slow_devices.vhdl.

After that, it was time to look at when we flush out the 8-byte write buffers. On the one hand, we want to not abort early if the last few byte positions don't need writing, in case a new write comes along that immediately follows in memory, and thus could be chained. But on the other hand, if all our write buffers are currently busy, such that we can't accept any new writes, then we really want to abort as quickly as possible, so that we can start processing any new requests.

I have a simple sequence of memory accesses in a test harness that I use to verify that each of these changes has not broken anything, i.e., we don't lose any writes or read stale data etc.  As I have been working on these various optimisations, the average transaction time has dropped from around 120ns to around 68ns -- so all up, this is quite a significant performance boost.  This should allow a hyperram access to, on average, occur in around 3 clock cycles, with linear reads potentially only requiring 2 clock cycles.  Thus, I'm hoping that when I run my speed test programme, it will show "Copy Slow RAM to Chip RAM" speeds of ~40/3 = ~13,000KB/sec. 

In general, looking at the test harness output, things look pretty good now for the most part. The only exception is that a few memory accesses seem to take a REALLY long time, longer than should really be possible, up to around 450ns.  Something pathological must be happening there, such as an aborted 32-byte block read that would have got the data, but is aborted, and then a new fetch is started that does get it.

Looking into it, it might just be when there are a lot of writes banked up that need flushing.  But I did notice a weird thing, where data was being spread into both of the 8-byte read caches, instead of all going into a single one.  While that can't cause the current problem, it would still be good to squash it, as otherwise it will cause havoc with the performance of the cache when reading, as one of the cache lines will just get ignored in all likelihood.

Fixing timing closure is often about simplifying IF statements in the VHDL, so that they can be processed more quickly by the hardware.  Sometimes this can be through reducing the number of nested IF statements. Other times it is about pre-calculating expressions that appear in the IF statements, and then using the simple pre-calculated true or false result from that.

Vivado does give nice reports about signals that don't meet timing closure, like this:

Slack (VIOLATED) :        -1.245ns  (required time - arrival time)
  Source:                 hyperram0/block_valid_reg/C
                            (rising edge-triggered cell FDRE clocked by clock162  {rise@0.000ns fall@3.077ns period=6.154ns})
  Destination:            hyperram0/current_cache_line_drive_reg[3][5]/D
                            (rising edge-triggered cell FDRE clocked by clock162  {rise@0.000ns fall@3.077ns period=6.154ns})
  Path Group:             clock162
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            6.154ns  (clock162 rise@6.154ns - clock162 rise@0.000ns)
  Data Path Delay:        7.383ns  (logic 1.200ns (16.253%)  route 6.183ns (83.747%))
  Logic Levels:           6  (LUT4=1 LUT6=5)
  Clock Path Skew:        0.027ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    6.300ns = ( 12.454 - 6.154 )
    Source Clock Delay      (SCD):    6.600ns
    Clock Pessimism Removal (CPR):    0.327ns
  Clock Uncertainty:      0.073ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Discrete Jitter          (DJ):    0.128ns
    Phase Error              (PE):    0.000ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clock162 rise edge)
                                                      0.000     0.000 r 
    V13                                               0.000     0.000 r  CLK_IN (IN)
                         net (fo=0)                   0.000     0.000    CLK_IN
    V13                                                               r  CLK_IN_IBUF_inst/I
    V13                  IBUF (Prop_ibuf_I_O)         1.521     1.521 r  CLK_IN_IBUF_inst/O
                         net (fo=2, routed)           1.233     2.755    clocks1/clk_in
    MMCME2_ADV_X0Y0                                                   r  clocks1/mmcm_adv0/CLKIN1
    MMCME2_ADV_X0Y0      MMCME2_ADV (Prop_mmcme2_adv_CLKIN1_CLKOUT5)
                                                      0.088     2.843 r  clocks1/mmcm_adv0/CLKOUT5
                         net (fo=1, routed)           2.018     4.861    clock162
    BUFGCTRL_X0Y4                                                     r  clock162_BUFG_inst/I
                        net (fo=1126, routed)        1.643     6.600    hyperram0/clock162_BUFG
    SLICE_X15Y57         FDRE                                         r  hyperram0/block_valid_reg/C
  -------------------------------------------------------------------    -------------------
    SLICE_X15Y57         FDRE (Prop_fdre_C_Q)         0.456     7.056 f  hyperram0/block_valid_reg/Q
                         net (fo=43, routed)          1.406     8.462    hyperram0/block_valid_reg_0
    SLICE_X21Y49                                                      f  hyperram0/current_cache_line_drive[0][7]_i_5/I3
    SLICE_X21Y49         LUT4 (Prop_lut4_I3_O)        0.124     8.586 r  hyperram0/current_cache_line_drive[0][7]_i_5/O
                         net (fo=135, routed)         1.242     9.828    hyperram0/current_cache_line_drive[0][7]_i_5_n_0
    SLICE_X11Y39                                                      r  hyperram0/current_cache_line_drive[3][5]_i_7/I5
    SLICE_X11Y39         LUT6 (Prop_lut6_I5_O)        0.124     9.952 f  hyperram0/current_cache_line_drive[3][5]_i_7/O
                         net (fo=1, routed)           1.245    11.197    hyperram0/current_cache_line_drive[3][5]_i_7_n_0
    SLICE_X15Y55                                                      f  hyperram0/current_cache_line_drive[3][5]_i_5/I0
    SLICE_X15Y55         LUT6 (Prop_lut6_I0_O)        0.124    11.321 f  hyperram0/current_cache_line_drive[3][5]_i_5/O
                         net (fo=1, routed)           0.803    12.124    hyperram0/current_cache_line_drive[3][5]_i_5_n_0
    SLICE_X9Y63                                                       f  hyperram0/current_cache_line_drive[3][5]_i_3/I1
    SLICE_X9Y63          LUT6 (Prop_lut6_I1_O)        0.124    12.248 f  hyperram0/current_cache_line_drive[3][5]_i_3/O
                         net (fo=2, routed)           0.908    13.156    hyperram0/current_cache_line_drive[3][5]_i_3_n_0
    SLICE_X5Y71                                                       f  hyperram0/current_cache_line_drive[3][5]_i_4/I4
    SLICE_X5Y71          LUT6 (Prop_lut6_I4_O)        0.124    13.280 r  hyperram0/current_cache_line_drive[3][5]_i_4/O
                         net (fo=1, routed)           0.579    13.859    hyperram0/current_cache_line_drive[3][5]_i_4_n_0
    SLICE_X3Y71                                                       r  hyperram0/current_cache_line_drive[3][5]_i_1/I4
    SLICE_X3Y71          LUT6 (Prop_lut6_I4_O)        0.124    13.983 r  hyperram0/current_cache_line_drive[3][5]_i_1/O
                         net (fo=1, routed)           0.000    13.983    hyperram0/current_cache_line_drive[3][5]_i_1_n_0
    SLICE_X3Y71          FDRE                                         r  hyperram0/current_cache_line_drive_reg[3][5]/D
  -------------------------------------------------------------------    -------------------

                         (clock clock162 rise edge)
                                                      6.154     6.154 r 
    V13                                               0.000     6.154 r  CLK_IN (IN)
                         net (fo=0)                   0.000     6.154    CLK_IN
    V13                                                               r  CLK_IN_IBUF_inst/I
    V13                  IBUF (Prop_ibuf_I_O)         1.450     7.604 r  CLK_IN_IBUF_inst/O
                         net (fo=2, routed)           1.162     8.766    clocks1/clk_in
    MMCME2_ADV_X0Y0                                                   r  clocks1/mmcm_adv0/CLKIN1
                                                     0.083     8.849 r  clocks1/mmcm_adv0/CLKOUT5
                         net (fo=1, routed)           1.923    10.772    clock162
    BUFGCTRL_X0Y4                                                     r  clock162_BUFG_inst/I
    BUFGCTRL_X0Y4        BUFG (Prop_bufg_I_O)         0.091    10.863 r  clock162_BUFG_inst/O
                         net (fo=1126, routed)        1.590    12.454    hyperram0/clock162_BUFG
    SLICE_X3Y71          FDRE                                         r  hyperram0/current_cache_line_drive_reg[3][5]/C
                         clock pessimism              0.327    12.781   
                         clock uncertainty           -0.073    12.708   
    SLICE_X3Y71          FDRE (Setup_fdre_C_D)        0.031    12.739    hyperram0/current_cache_line_drive_reg[3][5]
                         required time                         12.739   
                         arrival time                         -13.983   
                         slack                                 -1.245   

These are great in that they tell me from which signal to which signal the problem is, and how bad it is. But they are also a pain, because they don't actually tell you which lines of code are causing the path. So you have to hunt through the source trying to find the sequence of nested IF statements that apply, but which might be spread over thousands of lines of code.  So I made a really quick-and-dirty tool that lets me see all the IF statements that lead up to a given signal in the source.

So in the case above, we can see:

 387        if rising_edge(pixelclock) then
 622          if (read_request or read_request_latch)='1'
 635            if (block_valid='1') and (address(26 downto 5) = block_address) then
 650              if (address(4 downto 3) = "11") and (flag_prefetch='1')
 658 >>>             ram_address <= tempaddr;
 387        if rising_edge(pixelclock) then
 622          if (read_request or read_request_latch)='1'
 826            elsif address(23 downto 4) = x"FFFFF" and address(25 downto 24) = "11" then
 883            elsif request_accepted = request_toggle then
 887 >>>           ram_address <= address;
 387        if rising_edge(pixelclock) then
 899          elsif queued_write='1' and write_collect0_dispatchable='0' and write_collect0_flushed='0'
 922          elsif (write_request or write_request_latch)='1' and busy_internal='0' then
 929            if address(23 downto 4) = x"FFFFF" and address(25 downto 24) = "11" then
 972              if cache_enabled = false then
 980 >>>             ram_address <= address;

I now have line numbers, and can see exactly what is going on.  Because it also shows me the rising_edge() statement, I can also tell under which clock it is.  So in this case, it's likely line 635 that is the problem.  It's also showing me that while this IF does depend on the block_valid signal, it also compares a 22 bit address -- that's the thing that's really going to be slow.  Thus I get more useful information than Vivado otherwise gives me.  I'm sure there are expensive professional tools that can do some of this, but it was faster to cook this up than to even find out if such tools exist.  It would be nice if the VHDL mode in emacs could display something like this, though.

Anyway, in this particular case, I can't do a great deal about it, unless I add an extra stage of latency to all hyperram reads, as pre-calculating the result of the comparison would add a cycle of delay.  This is the whole trade-off of clock speed and pipeline depth, which the Pentium 4 famously did really badly.  In this instance, I have fixed a whole lot of other timing violations, in the hope that Vivado can do better with this particular signal.  I have hope, because while the total path takes too long, only 1.2ns of that is logic -- the other ~6ns is routing overhead.  By making it easier for Vivado to route other signals, it can often enable it to get better closure on critical paths like this one.  Anyway, we will know in about half an hour when it finishes synthesising.

So, that's helped me get quite close to timing closure, but there are a few signals that are not quite meeting timing closure. My next step was to fix the clock frequencies we are using. They are approximately right, but not exact. They are in fact a little too fast. This can also cause us trouble with the HDMI output, so it makes sense to get these fixed. I recently worked out how to better control clocks. Basically the problem is that each clock generator in the FPGA can only take the input clock and multiply and divide it by a fixed range of values. Thus you can only generate a limited set of frequencies.

But you can chain the generators to take the output of another one as its input, and thus generate many more frequencies.  The trouble then, is that you need to consider a LOT of possible combinations. Billions and billions of combinations. So I made a little program that tries all the possible values, and looks for the one that can get us closest to the target frequency. Here's the important part of it:

  // factors[], uniq_factors[] and uniq_factor_count are built earlier in the program
  float best_freq=100;
  float best_diff=100-27;
  int best_factor_count=0;
  int best_factors[8]={0,0,0,0,0,0,0,0};

  int this_factors[8]={0,0,0,0,0,0,0,0};
  // Start with as few factors as possible, and then progressively search the space
  for (int max_factors=1;max_factors<4;max_factors++) {
    printf("Trying %d factors...\n",max_factors);
    for(int i=0;i<max_factors;i++) this_factors[i]=0;
    while(this_factors[0]<uniq_factor_count) {
      // Apply each selected factor to the existing 27.083333 MHz clock
      float this_freq=27.0833333;
      for(int j=0;j<max_factors;j++) {
        //            printf(" %.3f",factors[this_factors[j]]);
        this_freq*=uniq_factors[this_factors[j]];
      }
      //        printf(" = %.3f MHz\n",this_freq);
      float diff=this_freq-27.00; if (diff<0) diff=-diff;
      if (0&&diff<1) {
        // Debug output, disabled by the leading 0
        printf("Close freq: ");
        for(int j=0;j<max_factors;j++) printf(" %.3f",uniq_factors[this_factors[j]]);
        printf(" = %.3f MHz\n",this_freq);
      }
      if (diff<best_diff) {
        // Remember the new best combination and report it
        best_diff=diff;
        best_freq=this_freq;
        best_factor_count=max_factors;
        for(int k=0;k<8;k++) best_factors[k]=this_factors[k];
        printf("New best: ");
        for(int j=0;j<max_factors;j++) printf(" %.3f",uniq_factors[this_factors[j]]);
        printf(" = %.6f MHz\n",this_freq);
        // Show each factor as the multiplier/divider pair that produces it
        printf("          ");
        for(int j=0;j<max_factors;j++) {
          float uf=uniq_factors[this_factors[j]];
          int n=0;
          for(n=0;n<(1<<14);n++) if (uf==factors[n]) break;
          if (j) printf(" x");
          printf(" %.3f/%.3f",(1.0+((n>>0)&0x7f)/8),(1.0+((n>>7)&0x7f)/8));
        }
        printf("\n\n");
      }

      // Now try next possible value: bump the last index, carrying into earlier ones
      this_factors[max_factors-1]++;
      int j=max_factors-1;
      while((j>=1)&&(this_factors[j]>=uniq_factor_count)) {
        this_factors[j]=0;
        this_factors[j-1]++;
        j--;
      }
    }
  }

Basically it tries all possible values, and reports each time it finds a better one.  After optimising it to only consider the unique frequency multipliers that each generator can produce, it takes less than a minute to find an exact match:

Calculating set of adjustment factors...
There are 159 unique factors.
Trying 1 factors...
New best:  1.000 = 27.083334 MHz

Trying 2 factors...
New best:  2.333 0.429 = 27.083332 MHz
           7.000/3.000 x 3.000/7.000

New best:  0.923 1.077 = 26.923079 MHz
           12.000/13.000 x 14.000/13.000

New best:  0.929 1.071 = 26.945152 MHz
           13.000/14.000 x 15.000/14.000

New best:  0.933 1.067 = 26.962965 MHz
           14.000/15.000 x 16.000/15.000

Trying 3 factors...
New best:  7.000 0.133 1.067 = 26.962967 MHz
           7.000/1.000 x 2.000/15.000 x 16.000/15.000

New best:  1.500 0.818 0.812 = 27.006392 MHz
           3.000/2.000 x 9.000/11.000 x 13.000/16.000

New best:  1.500 1.182 0.562 = 27.006390 MHz
           3.000/2.000 x 13.000/11.000 x 9.000/16.000

New best:  1.667 0.556 1.077 = 27.006176 MHz
           5.000/3.000 x 5.000/9.000 x 14.000/13.000

New best:  1.667 0.778 0.769 = 27.006174 MHz
           5.000/3.000 x 7.000/9.000 x 10.000/13.000

New best:  1.667 0.385 1.556 = 27.006172 MHz
           5.000/3.000 x 5.000/13.000 x 14.000/9.000

New best:  0.600 1.800 0.923 = 27.000002 MHz
           3.000/5.000 x 9.000/5.000 x 12.000/13.000

New best:  0.800 1.800 0.692 = 27.000000 MHz
           4.000/5.000 x 9.000/5.000 x 9.000/13.000

These factors are in addition to the existing 100MHz x 8.125 / 30 that I am already doing, which results in 27.083333 MHz.  I was a little bit surprised that it was able to find an exact match using only three factors. It isn't even just that it is correct to 6 decimal places. It really is exact:

100 x 8.125/30 x 4/5 x 9/5 x 9/13  MHz
= 100 x 65/240 x 36/25 x 9/13 MHz
= 100 x 21060/78000 MHz
= 100 x 0.27 MHz
= 27 MHz

Well, that's all sounding good. So I modified clocking.vhdl to create the chain of MMCM instances to generate the exact clock frequency.  First go at synthesis, Vivado threw an error, because it is not possible to directly assemble such a long chain of MMCM units.  To solve this, I added a Global Buffer unit to take one of the intermediate MMCM outputs, so that it would be available at the next MMCM with minimal distortion or skew.  Hopefully that will solve the problem.

Next problem is that the last step multiplies the intermediate clock by 9, and ends up out of the valid range of 600 - 1200MHz.  This is because:

100 MHz x 8/10 = 800MHz(ok)/10 = 80MHz
80MHz x 9/5 = 720MHz(ok)/5 = 144MHz
144MHz x 9/13 = 1296MHz(not ok)/13

So I need to reorder these factors, so that the intermediate frequencies remain in the range of 600 - 1200MHz.

100 MHz x 9/13 = 900MHz(ok)/13 = 69.23MHz
69.23MHz x 9/5 = 623MHz(ok)/5 = 124.61MHz
124.61MHz x 8/10 = 996.92MHz(ok)/10 = 99.692MHz
(which then gets used to generate the 27MHz clock via x 8.125/30)

So, I'll just rearrange into that order and try to synthesise again...
Ok, that worked, and the bitstream runs, with HDMI output (still) working.  So it looks like my clock magic has worked just fine.

There are now just a very few timing violations of <0.2ns in the hyperram left to resolve.  That said, they are now a small percentage compared to the clock interval (~6.2ns), so SHOULDN'T be a problem, unless the FPGA voltage dipped, or the FPGA got really hot. I am seeing some weird problems still, though, so I will likely still work to completely eliminate them, so that I don't have to suspect that they aren't a problem, but rather that I can KNOW that they aren't a problem.

Digging through, I found one last sneaky slow address comparison that is likely the cause of that, so time to resynthesise again.  Hopefully that will be the last timing violation there, and I will be able to focus on the remaining bugs I am seeing.

Finally! I now have a synthesis run with full timing closure on the HyperRAM, and even the persistent timing closure problem with the CPU is less than 1ns late on a ~24ns clock, and it is related to the SP register, not the HyperRAM. Thus we can move forward confident that any problems with the HyperRAM are not related to lack of timing closure.

So, speaking of problems, we certainly still have some right now:
1. The prefetch logic has some significant problems, resulting in mis-read data.  The first two or three bytes read ok, and then the next byte read rubbish. The pattern is a bit longer, but clearly something is a bit bonkers.
2. Linear reads crossing a 32-byte boundary read rubbish after crossing that boundary, presumably because there is something wonky with the pre-fetch chaining.
3. The external hyperram seems to be even less reliable, but it has been since the start, presumably because the leads from the FPGA to it are longer.
4. When cache rows are read, the bytes are being spread between the two cache rows, instead of all going into the same row.
5. Copying from chip to hyperram is slow, and a bit variable in speed. It's less than half the speed of filling hyperram, so there is presumably something funny going on with the write scheduling.

First step: Make the pre-fetching run-time switchable, so that I can more easily assess the rest of the system.  That worked fine. My next step was to find and fix a bunch of the bugs in the prefetch logic and related areas. This looks like it might fix (1), (2) and (4) above.

I'm synthesising the fixed version now, to see if it really fixes those problems, but am currently fighting with multiple drivers.  Multiple drivers is when signals are trying to be set from multiple (unrelated) places, and the FPGA synthesis software can't figure out how to reconcile them, for example, if they are set from completely separate pieces of hardware.  They aren't too hard to fix, just a bit fiddly.  Fortunately improving the timing closure has dropped synthesis time back down to about 30 minutes instead of 45 minutes or so, so the going is a bit quicker now.

What I have done as a first step, is to get a working configuration for the HyperRAM.  It seems that using 80MHz clock for 160MB/sec throughput (via DDR) is just too much for the current PCB layout (or for how I have implemented it in the FPGA).  By leaving all stages of the transaction at 40MHz, that fixes a lot of problems, but at a modest cost to read performance, since 80MHz was only being used during reads, anyway.  I also disabled all of the CPU prefetch logic by default, and also the whole read-cache machinery.  The write cache is already rock solid, it seems, so that has been left.

The external HyperRAM still has some problems with writes, which I need to look into. Basically some writes get lost. I am suspecting that the chip select line to the HyperRAM might have slightly faster response than the data lines, and thus the last write gets ignored.  It could be something more sinister, of course, like writes being ignored in the write cache, but the fact that the internal HyperRAM is now rock solid makes me think that that is very unlikely.

The result is a known working configuration, for the internal HyperRAM at least, from which to try to debug everything else.  Everything else now largely means the read cache machinery.  The downside, in the interim at least, is that the read performance is, without the caching, quite poor, while linear writing is a lot faster, due to the functioning write caches.  So we get a result like the following (remember the HyperRAM is what is used as Slow RAM):

Fill Slow RAM: 9,287KB/sec
Copy ChipRAM to Slow RAM: 4,334KB/sec
Copy Slow to Chip RAM: 2,500KB/sec
Copy Slow RAM to Slow RAM: 1,547KB/sec

So, let's go through and find and fix the problems with the read cache.  First, we try to use simulation again, as this is MUCH faster and more informative.  With the reading set to 80MHz, simulation showed no problems. But when I drop the read speed to 40MHz, we do now get some errors.  That's a good thing, as I can investigate their causes and deal with them.

We also have another clue: The read cache works fine when there aren't also writes happening.  This means it is most likely something with the cache coherency logic. Either the read caches are taking too long to get updated when a new value is written that would hit them, or they are simply not getting updated, or getting updated incorrectly.

So, into the simulation to see what is going on there.  It looks suspiciously like chipset DMA reads are being used to reply to CPU memory requests.  That can't be the whole problem, because I have run tests with that disabled, and there are still cache consistency problems. However, I have seen evidence that the chipset accesses were stuffing things up, so it's nice to have captured this in simulation.  So, that was an easy fix. Simulation then showed errors caused by a weird fix I had in place from when we were running at 80MHz, which was breaking things at 40MHz.  Having fixed that, simulation reveals no more errors. So time to synthesise that, and then see how it behaves on the real hardware.

Ok. Synthesis is now complete.  Having the cache enabled still causes problems when reading recently written locations. It's just that I can't reproduce it in simulation anymore, which means I need to use other approaches to track the problem down.  I have fortunately made a mechanism to examine the state of the cache in real-time, so that I can tell if a byte is being mistakenly put into the wrong cache row.

To investigate this further, I have modified the hyperram test programme cache tests to reproduce this problem, and then to show me some information about what is going on. It looks like when we write $18 into $8000808, it gets erroneously written into the first byte of the cache line for $8000800-$8000807.  Knowing that this is the case, I have been able to create a sequence of transactions that can reproduce the problem in simulation - yay! Well, except it seems to be some other error that I am seeing in simulation. No bother. It's still an error to be squashed, and that I can reproduce in simulation.  Actually, it DOES look to be the same type of error, just another instance of it.
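
The address arithmetic behind that observation can be sketched in a few lines of Python. The helper names row_base and row_offset are made up for this illustration; the only thing taken from the text above is that cache rows are 8 bytes wide.

```python
ROW_BYTES = 8  # cache rows cover 8 consecutive bytes


def row_base(addr):
    """Address of the first byte of the 8-byte row containing addr."""
    return addr & ~(ROW_BYTES - 1)


def row_offset(addr):
    """Position of addr within its row (0..7)."""
    return addr & (ROW_BYTES - 1)


if __name__ == "__main__":
    addr = 0x8000808
    print(hex(row_base(addr)), row_offset(addr))   # 0x8000808 0
    print(hex(row_base(0x8000807)))                # 0x8000800
```

So $8000808 is byte 0 of the row starting at $8000808, not of the row $8000800-$8000807. The byte offset in the faulty write is right; it is the choice of row that is wrong, which points at the row-matching logic rather than the offset logic.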

The problem is that the cache_row0_address_matches_cache_row_update_address signal, which indicates when a cache row matches the update address for the cache, is delayed by a cycle (remember I mentioned this kind of thing, when I was talking about flattening logic by pipelining/pre-calculating signal values).  Probably the easiest solution for this is to clear the signals whenever I update the cache update address, so that they can never be wrong.  It does mean that on the occasions when they would be right, we don't get the benefit of the cache, but that's livable.

Except life is never that simple. In this case, the cache_row_update_address signal and these precalculated comparison signals are generated in separate design parts, and will generate more of those annoying "multiple driver" problems if I just try to set it in both places.  Instead I'll just create a signal that blocks the use of these signals for one clock after the update address has been changed. That fixes the problem in simulation, so now to see if it fixes it when synthesised.
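
As a rough illustration of why a one-cycle-late comparison bites, and why blocking it for one clock helps, here is a small cycle-by-cycle model in Python. It is a sketch under assumptions, not a transcription of the VHDL: the function name simulate, the use_block switch and the exact timing of the block signal are invented for the example.

```python
def simulate(update_addrs, row_addr, use_block):
    """Step one clock per entry in update_addrs.

    matches_reg models a pre-calculated 'row matches update address' signal
    that only reflects the comparison made on the previous clock.
    Returns the cycles on which the logic would believe the row matches
    although the current update address is actually a different row.
    """
    matches_reg = False   # registered comparison result, one cycle stale
    prev_update = None    # update address seen on the previous clock
    false_hits = []
    for cycle, update_addr in enumerate(update_addrs):
        # Assumed fix: ignore the registered match for the one clock in
        # which the update address has just changed.
        block = use_block and update_addr != prev_update
        believed = matches_reg and not block
        actual = (row_addr == update_addr)
        if believed and not actual:
            false_hits.append(cycle)
        # Clock edge: the registers catch up with this cycle's values.
        matches_reg = actual
        prev_update = update_addr
    return false_hits


if __name__ == "__main__":
    # The update address sits on row $8000800 for two clocks, then moves on.
    seq = [0x8000800, 0x8000800, 0x8000808, 0x8000808]
    print(simulate(seq, row_addr=0x8000800, use_block=False))  # [2]
    print(simulate(seq, row_addr=0x8000800, use_block=True))   # []
```

The cost is the same as for clearing the signals: on the clock after an address change, a match that would have been genuine is ignored as well, so the cache gives up a little benefit in exchange for never acting on a stale comparison.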

Well, the problem still shows up after synthesis -- but no longer in the little test-case I made, only in the larger test.  Now to try to figure out how to reproduce it with the simplest test case, so that I can hopefully reproduce it under simulation, where running thousands of memory accesses is too cumbersome and slow.

On initial inspection, it isn't actually the same problem, just a similar one: Now it reads an 8-byte row as all $00, instead of the correct values.  My suspicion is that the problem is similar: The cache row address is being compared with something a cycle after it is being updated, and thus there is an internal inconsistency. More precisely, there is a cache row being loaded with entirely incorrect data.  It looks like it might actually be the current_cache_line that gets exported to the slow_devices controller to reduce read latency. So I'll make that inspectable at run-time, too.

While I wait for that, I'll have another think about the chipset DMA stuff.  This is still being quite weird.  If these DMAs are running, the CPU can still (but less often) read wrong data. This only happens when the cache is enabled, so that's a clue that we still have some cache consistency problem.  Also, the chipset DMAs only seem to return data (or at least, interesting data) when the CPU is accessing the hyperram. That is, the CPU and chipset seem able to get data intended for each other.

I went through with a fine-tooth comb, and made sure that Under No Circumstances Whatever could a chipset DMA fetch contaminate any of the cache machinery. That has the CPU now safely reading only its own data. But the chipset is still only receiving data when the CPU is accessing the HyperRAM.  This is still really annoying me, because I can't reproduce it under simulation, even when I let the HyperRAM bus otherwise remain idle.

I ran simulation again, to confirm that lots of HyperRAM requests can get sequenced one after the other, and that the data gets received, and that all looks fine. I also synthesised a debug mode that asserts the data ready strobe to the chipset the whole time, and that DOES cause the dummy data to be delivered. Thus I am confident that the plumbing between the two is correct. It just leaves the major mystery of why the HyperRAM seemingly gets stuck handling these chipset DMA requests.

To try to figure out what is going on, I have added a couple of extra run-time switches that will allow me to make the chipset DMA take absolute priority on the HyperRAM bus, in case the priority management is the issue. I also found a couple of places that, although unlikely, could be contributing to the problem, and tweaked those.  Basically they are places where it might have been possible for one of these data requests to get cancelled before it was complete.
However, if those were the problem, then I'd expect the whole thing to freeze up, due to the end of the transaction not being properly acknowledged. So we'll see how that goes.

If this doesn't reveal any clues, then I am starting to suspect that the HyperRAM might be doing something funny with taking a VERY long time to respond. We are talking about ~1 milli-second here instead of the tens of nano-seconds that it should take to service a request.

That's improved things a bit: Some data from somewhere is now getting read for the chipset DMA, but it seems to come from the wrong address, and some CPU memory accesses to the HyperRAM end up being passed out as chipset DMA data.  Also, it still seems that sometimes the HyperRAM takes too long to supply the data.

Finding the address calculation bug was fairly easy.  Also, confirming that sometimes CPU access data gets output is helpful to know, as I can go through carefully to make sure that this can't happen.  This was happening when writing data, which I was easily able to verify by filling a large block of the HyperRAM and monitoring the chipset DMA data.  It probably also happens when reading, but I'll have to make a little programme that does reads in a big loop to make sure.  But back to the writes, this is quite odd, as the data path for feeding data to the chipset via DMA is quite specific, and should only be activated when such a request is running.

Otherwise, we still have the problem of some writes to the trap-door HyperRAM failing, mostly when crossing an 8-byte boundary, but also often enough that detection of the trap-door HyperRAM never properly succeeds. 

Anyway, this post has rambled on long enough, and despite the remaining problems with setting up support for chipset DMA and read cache optimisations, much has in fact been achieved: We now have a perfect 27MHz pixel clock and HyperRAM works sufficiently well to be useful.  Speed optimisations can come later.  Here's the current speed performance of the HyperRAM, which we are calling "Slow RAM" from here on, to avoid confusion with the Hypervisor part of the OS.

So while I am not yet satisfied with this sub-system, and frankly am frustrated at how annoying and drawn-out it has been to get to this point, it is now in good enough shape for me to move on to other more pressing issues, like getting HDMI sound working on all monitors...