Friday 28 December 2018

Super simple Protovision-Compatible Joystick Expander for MEGA65

This post isn't a feature that I had originally planned for the MEGA65, but came about from playing games on the MEGA65 with the kids.  Two kids plus one Dad = 3 players, but of course only two joystick ports, a recipe for problems.  Fortunately, there are some fun games that support 3 or 4 players, mostly using the excellent Protovision 4-port joystick adapter, but it requires a user port, which the MEGA65 doesn't have.

We thought about including a user port on the MEGA65, but the extra cost of making the PCB quite a lot bigger was prohibitive, especially since we don't expect that many people to need the user port -- with the possible exception of playing four-player games.

However, rather than say there shall forever be no user port, we decided instead that we would create a cartridge that turns the expansion port into a user port, in a way that is totally transparent to software.  Thus you can use either the expansion port or the user port at any point in time, which probably covers the needs of almost all users, especially since the MEGA65 has a built-in freezer, RAM expansion, ethernet, microSD card and so on -- in short, there shouldn't be too much that you need to plug into the expansion port, unless you want to use a game cartridge, and none of those (that I am aware of) require the user port at the same time.  The only device I can think of that would be a problem is the combined cartridge + user port EEPROM burner I have stashed somewhere, but even then the cartridge ROM could be copied and run from RAM.  Let's just say that for the very few cases where you might want both ports there are probably work-arounds, or you can pull out your real C64.

Anyway, since I had no user port, and wanted extra joystick ports, I thought I might solve both problems at the same time by creating a simplified version of what will eventually become the MEGA65 user port cartridge, that just provides two extra joystick ports.  I also want it to be as simple and cheap to build as possible, ideally using only passive components, perhaps even only wires to the joystick connectors and between pins on the expansion port.   I would also like the cartridge to not damage a C64 or C128 if it is inserted, which means it has to play nicely with the existing use of the expansion port.

What I came up with was to make a cartridge that ties the /DMA line to ground, and then directly connects the joystick lines to the lower data and address bits.  This requires no active components at all.  Also, because pulling /DMA low causes the CPU of the C64 or C128 to pause, and thus release the address and data lines, it should be safe from that side.  The VIC-II side, when connected to a real C64/C128, isn't quite as simple, because the VIC-II can cause memory accesses, and can drive the address lines itself. This is only a problem if you actually use a joystick, because if the joysticks are idle, then none of the lines are pulled low.  Thus, while the cartridge should never be inserted into a C64 on purpose (because it won't work), it shouldn't break anything.

The safety of this could probably be improved by tying R/_W to GND, so that even if the VIC-II did ask for a memory access, the RAM will stay off the bus. Better yet, the cartridge should first detect the presence of a MEGA65 in some way.  The trick is how to do this passively.  Perhaps the most sensible approach is to find some signal we can drive to GND from the MEGA65 side, and that is normally high on the C64 or C128's expansion port.  This signal can then be used as the source for the GND line on the joysticks. The /RESET line is probably a good choice for this, as it is only very transiently low, and when it is low, the contents of the C64/C128 it is connected to should, by definition, all be silent and buses in tri-state conditions. Indeed, with this approach the only opportunity to cause grief would be if you tried to use the joystick while /RESET was active, which would require some degree of intentionality, and would be limited to a few microseconds, which is probably much too short to cause any damage to anything. 

In short, this sounds to me like quite a good and simple solution -- provided that the MEGA65's /RESET line on the expansion port can sink enough current.  If this proves to be a problem, the addition of a single drive transistor on the cartridge would certainly solve the problem. Thus, I think we are good to go.
EDIT: For my joysticks with lights, the available current proved too low: the computer thinks fire is being pressed whenever a direction is selected, because the current draw of the LED already reduces the voltage enough to be marginal.  But as mentioned, this is easily fixed with the addition of a driving transistor.

All that was left was to implement the VHDL that detects the /DMA line being pulled low from start-up.  If /DMA goes high (which it would do immediately with any normal cartridge), the expansion port goes into normal operation mode. However, if /DMA remains held low, then the expansion port disables itself, and instead sets the data and address lines to input, pulls /RESET low, reads data from the data and address lines, and feeds it to the CIA that normally handles the user port. It thus acts exactly like the Protovision joystick expander, so that software does not need to be modified at all.
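Just to make the behaviour concrete, here is a tiny C model of that start-up detection (a sketch only -- the real logic is VHDL in the MEGA65 core, and all the names here are purely illustrative):

```c
#include <stdbool.h>

/* Sketch of the power-on detection described above (illustrative names;
   the real thing is VHDL).  The cartridge ties /DMA to GND, so if /DMA
   is still low once the detection window expires, the joystick-expander
   cartridge must be present. */

typedef enum { NORMAL_CARTRIDGE, JOY_EXPANDER } exp_port_mode;

exp_port_mode expander_detect(bool dma_n_samples[], int n)
{
  /* If /DMA ever goes high during the window, it's a normal cartridge. */
  for (int i = 0; i < n; i++)
    if (dma_n_samples[i]) return NORMAL_CARTRIDGE;
  /* Held low for the whole window: disable the expansion port, set the
     data/address lines to input, and read the joysticks from them. */
  return JOY_EXPANDER;
}
```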

I had an old broken MACH5 cartridge floating around, whose ROM was lost long ago, which I decided to use as the PCB, since all I needed to do was tack on the wires for the joystick, and connect /DMA to GND.  Here is the result:

The eagle-eyed among you will notice that: (1) this is from before I came up with using /RESET as the GND for the joystick.  I'll be fixing that momentarily. (2) There is only one joystick port connected. That's because I have only two children, and thus only needed 3 joysticks at the same time.

Here it is plugged into the MEGA65 prototype on my desk:

Now, as mentioned, I needed three joysticks. I had already made two arcade style joysticks some time back, but still needed an extra one. The two with red buttons were the ones I had made already.  So I went to Jaycar Electronics and bought parts for a third, although I accidentally got a larger box than I used for the other two.  Not a problem, we now have the two kid-sized joysticks and one "papa joystick", which is also identified by the green light.  So now I had three joysticks, and we could try some multiplayer games:

That's Shotgun that we are playing there, which is indeed MUCH more fun with more than two players.

We also had a go at Frogs, which like Shotgun is a free game from Dr. Wuro.  It is very nice that he has made some fun free games. We can now confirm that both work perfectly on the MEGA65, including with the extra joystick ports.

So now the kids and I can play games together on the MEGA65 all at once. We might even make a simple free multiplayer game of our own to celebrate.

M.E.G.A 6502 benchmark for checking cycles per instruction

Recently I was tweaking the 1MHz instruction timing and debugging some related glitching I was seeing, and realised that I didn't have any really good way to check whether the MEGA65 was really running every instruction with the correct number of cycles.  The existing well-known C64 benchmarks, SynthMark64 and BoulderMark, don't really give this kind of direct break-down.  So I wrote one that does:

Basically it works by going through all the legal 6502 opcodes, and timing how many cycles they take.  Actually, it runs each instruction 256 times, so that if the CPU is fast, we can accurately measure fractional cycles taken.  This is of course quite important when running on a MEGA65 at 40MHz!

To achieve this, the test program works fairly similarly to SynthMark64 (and is also written primarily using the CC65 C compiler):  It constructs a test routine for each instruction that sets up a CIA counter, runs the instruction, stops the counter, subtracts the overhead of fiddling with the timer, and then divides the result by 256 to get the number of cycles per instruction.
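The per-opcode arithmetic boils down to something like the following (a C sketch with illustrative names -- the real program reads a CIA timer around the 256 executions; reporting tenths of a cycle here is just one way to let fractions survive integer maths):

```c
/* Sketch of the per-opcode measurement arithmetic described above.
   ticks_with_overhead is the CIA timer ticks consumed by 256 executions
   of the instruction plus the timer-fiddling overhead; overhead_ticks
   is that overhead measured separately.  Returns cycles-per-instruction
   in tenths of a cycle, so that fractional results (important at 40MHz)
   are not lost to integer division. */

#define ITERATIONS 256u

unsigned long cycles_per_instr_tenths(unsigned long ticks_with_overhead,
                                      unsigned long overhead_ticks)
{
  return (ticks_with_overhead - overhead_ticks) * 10u / ITERATIONS;
}
```

For example, 256 runs of a 2-cycle instruction at 1MHz consume 512 ticks plus the overhead, yielding 2.0 cycles per instruction; at 40MHz the same instruction might consume only a handful of ticks in total, yielding a fraction well below one.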

In reality, it is a bit more complex, as there are a bunch of instructions where we have to get a bit clever-pants. For example, for PHA, we have to pop something off the stack after stopping the timer, while for PLA we have to first put something onto the stack.

Some instructions like RTS require that we push a return address that exactly matches the next instruction in the stream to run.  This turns out to not be that hard to implement. First, insert the instructions to push something to the stack.

 if (!strcmp(instruction_descriptions[opcode],"RTS")) {
    // Push two bytes to the stack for the return address
    // (the actual return address will be re-written later)
    test_routine[offset++]=0xA9; // LDA #$nn
    rts_hi_operand=offset;       // remember where to patch the high byte
    test_routine[offset++]=0x00; //   (operand, patched later)
    test_routine[offset++]=0x48; // PHA
    test_routine[offset++]=0xA9; // LDA #$nn
    rts_lo_operand=offset;       // remember where to patch the low byte
    test_routine[offset++]=0x00; //   (operand, patched later)
    test_routine[offset++]=0x48; // PHA
 }

Then once we know the address we should return to, replace that:

 // If instruction was RTS or RTI, rewrite the pushed PC to correctly point here
 if (!strcmp(instruction_descriptions[opcode],"RTS")) {
    addr=(unsigned short)(&test_routine[offset-1]);
    // Patch the two LDA #$nn operands pushed earlier (the high byte was
    // pushed first, since RTS pops the low byte first)
    test_routine[rts_hi_operand]=addr>>8;
    test_routine[rts_lo_operand]=addr&0xff;
 }

RTS is also one of the few instructions where we can't actually count 256 executions in one loop, because we would need a bigger stack than exists. Basically any instruction that touches the stack falls in this category. As a result, those instructions will seem a bit faster when running at full speed than they really are.  However, the effect is somewhat less than what SynthMark64 suffers in similar circumstances, where it doesn't correctly adjust for the reduced cost of the overhead of fiddling with the timers.  It is that problem in SynthMark64 that results in impossibly fast reported speeds for NOP instructions on the MEGA65.

Anyway, as can be seen above, the MEGA65 now uses exactly the correct number of cycles per instruction when simulating 1MHz mode.

The three speed figures at the bottom report how fast the machine seems to be, compared with a stock PAL C64.  These figures are based on different weightings for each opcode.  FLAT uses an equal weighting, i.e., it assumes that LDA and RTI both are executed equally frequently, which is clearly not realistic. It does, however, give a good general idea of the machine's speed.  For the other two work-loads, C64 BASIC and BLDERDASH, I used the real-time CPU instruction capture program, ethermon, that I wrote about in the last post to gather statistics.  In fact, I modified it, so that if you give it the -f option for instruction frequencies, it outputs the table in exactly the correct format for me to include in the source for this benchmark program :)

If we then enable full speed on the MEGA65 and run it again, we get something like the following:

First, ignore the information claiming that BRK runs at 255 cycles per iteration. BRK is the single instruction that I haven't implemented testing for, although it should be possible. I got a little lazy, because BRK is not an instruction that tends to get used a lot.  While the test is running you can use the cursor keys to move the cursor around the test field, and it will show you information for the selected opcode. Thus, if you see an instruction that is green or red, you can fairly easily select it, and see exactly the detected difference in speed.  There are still a few wrinkles in this (including that it sometimes displays the line twice, as above), but it mostly works.

A bigger problem at the moment is that the reported speeds can sometimes read lower than they should when running on very fast CPUs. This is because the 1MHz ticks of the CIA do not occur every CPU cycle.  Thus it is possible for the number of consumed 1MHz ticks to sometimes be one more than it should be, if the phase difference means that a 1MHz cycle boundary is crossed.

Similarly, the overhead calculation can jitter between zero and one cycle.  I can't think of a really good machine-independent solution to this; it really is just a side-effect of trying to measure something that can be only a tiny fraction of a cycle with a clock that counts only whole cycles.  I could average the overhead over many tests and use that figure, but that would also require some care in case you changed the CPU speed while the test is running.  At the end of the day, though, the contribution of the jitter to the overhead is relatively small, and should be mostly self-cancelling, since it will be fairly random for each opcode, and thus should end up near the correct average value overall. This seems to be borne out in practice.

A bigger problem was that, in implementing this, I accidentally introduced a systematic bias where the jitter was only being counted in the direction of slower measured speeds, combined with running the stack-based instructions only once.  Previously I calculated the overhead for each iteration and discarded the cases where the overhead was one cycle but the instruction run was measured at zero cycles, i.e., an apparent run time of -1 cycles.  The fix was to sum the overhead across all 256 iterations (of the stack-based instructions as well), together with the time taken for each instruction, and deduct the entire overhead at the end.  Summed this way, the -1 cases cancel against the occasions where a full 1MHz cycle was claimed to be consumed, approximating the correct time on average.
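To illustrate the bias with made-up numbers (this is not the benchmark's actual code, just the shape of the problem): clamping each per-iteration difference at zero loses the -1 cases, while summing first lets the +1 and -1 jitter cancel.

```c
/* Per-iteration subtraction with a clamp at zero: the jitter only ever
   rounds one way, producing the systematic bias described above. */
long biased_total(const int ticks[], const int overhead[], int n)
{
  long total = 0;
  for (int i = 0; i < n; i++) {
    long d = ticks[i] - overhead[i];
    if (d < 0) d = 0;          /* the clamp that caused the bias */
    total += d;
  }
  return total;
}

/* Summing ticks and overhead separately, deducting at the end: the
   occasional -1 and +1 jitter cases cancel on average. */
long unbiased_total(const int ticks[], const int overhead[], int n)
{
  long ticks_sum = 0, overhead_sum = 0;
  for (int i = 0; i < n; i++) {
    ticks_sum += ticks[i];
    overhead_sum += overhead[i];
  }
  return ticks_sum - overhead_sum;
}
```

With two iterations whose jitter is equal and opposite, the clamped version reports one spurious cycle while the summed version correctly reports zero.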

We can now see all the cycle times are fractional, and for the faster instructions are reported as .0 cycles (I kept the output fields only 2 chars wide so that it can all fit on the screen at the same time. As a result it is a bit squashed.  I could also make an 80-column C65/MEGA65 only version, but I wanted it to also run on a stock C64).  They also show up in green because they are faster than on a stock C64. If they were slower, they would show up in yellow (because red on blue on a stock C64 is a known disaster colour combination).

By using the cursor keys you can even select an instruction to give more details on, which is indicated by the reverse-video cursor, and the info given below the instruction table:

So we see that the MEGA65 is close to 40x stock C64 speed, depending on the work load, but with considerable variation among the instructions. This makes sense, since its CPU is 40x faster, and has broadly similar native instruction cycle counts, with some faster and some slower.  The reason some take more native cycles than on a 6502 is because the MEGA65's CPU cannot read from an address immediately after computing it from instruction arguments, as 25ns isn't enough time for the data from one memory cell to make its way into the CPU, be computed on (eg, adding an index register), and then be sent out to the RAM address lines.  I keep meaning to use these latent cycles to pre-fetch the opcode of the next instruction, but given the CPU is already so fast, it has never really become a priority.

The careful observer will also see that the 4th workload in the mix is rather unexpected:  The folks on the #c-64 IRC channel pointed me to the C64 bitcoin mining program (yes, someone actually wrote one!).  It is of course so slow as to be useless, even when running at 40x original speed.  But I guess I now have the record for the fastest bitcoin-mining C64, at a rate of 100 nonces in a mere 6 seconds!

Talking earlier about the real-time ethernet-based CPU instruction logger, I mentioned that the 100Mbit ethernet isn't fast enough to log instructions at full speed at 40MHz.  But we can now easily check just how fast we can do it:

Not bad! We can even run slightly faster than a stock C65, while logging every single instruction and every single video badline / raster line advance event!

As this is a new benchmark, it would be great if folks were able to test it on some different systems.  (It will likely evolve a bit: the reported speeds are still a bit jittery, and it runs a bit slowly, testing only one instruction per frame, so as to ensure the test runs in the vertical borders and avoids the effects of badlines.)  Groepaz has already kindly run it on his Turbo Chameleon, which reported speeds of FLAT = 16.4x, C64 BASIC = 17.7x, Boulder Dash = 17.54x and Bitcoin 64 = 15.47x (although these were produced before I fixed the jitter, so actual results might vary somewhat).

EDIT: I have now uploaded it to CSDB so you can try it out on your own hardware if you wish.  I'd love to receive speed reports for different systems.  Note for now you will need to manually activate acceleration, as the program itself does not do this for you.

Real-time CPU tracing

We already have the means to perform low-speed tracing of the MEGA65's CPU via the serial monitor interface.  This has been invaluable for all sorts of things so far.  However, I have been investigating a number of gritty little bugs that only show up when the CPU is running at full speed.  A good example is when Ghosts'n'Goblins sometimes does this:

The underlying problem is that the C65/MEGA65 register at $D031 gets written to.  Actually, it looks like $D030 - $D03F (at least) get written with the pattern $78 $50.  However, when single-stepping, the problem never shows up, perhaps because it requires an IRQ at a critical time, or is due to some other dynamic effect.

Thus we need a way to debug the machine at full speed. The serial monitor interface is limited to 4Mbit/sec in practice, which even with an efficient encoding for instructions would limit us to well below 1 million instructions per second, probably significantly less in practice. Also, the ROM for the serial monitor is already rather full, so there isn't really space to implement such an encoding, and its memory access interface to the CPU would change the timing of the instructions, which might well hide the problems we are trying to track down.  It would also not be effective at all if we want to debug bus accesses instead of instructions, as those happen at 40MHz (the current speed of the MEGA65's CPU).  So we need a higher bandwidth option.

The logical choice is to use the 100mbit/sec ethernet adapter.  Not only does it have 25x the bandwidth, but the video streaming code already has most of the infrastructure needed -- all that is required is to add the code to pull the CPU instructions in, pack them in a buffer, and then push them out in packets.

The first challenge is that the MEGA65's CPU can issue instructions on back-to-back cycles, as implied-mode instructions typically take only one cycle. This means we have to be able to write the entire instruction debug information vector in a single cycle.  Fortunately the BRAMs on Xilinx FPGAs support asymmetric arrangements where you can write 64 bits wide, and read 8 bits wide.  There is a bit of magic to making this work, as you have to get Vivado to infer the correct BRAM mode.  Xilinx have some good documentation with example VHDL templates that achieve this, which I was able to adapt quite easily.  So now I could write 8 bytes at a time into a FIFO, and read it out a byte at a time when sending ethernet frames.
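A software analogue of that arrangement looks something like this (a model of the idea only -- the real FIFO is a Xilinx BRAM inferred from one of their VHDL templates, and the names here are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Model of the asymmetric-port BRAM FIFO: the CPU side writes whole
   8-byte instruction records in one go, while the ethernet side drains
   it a byte at a time into frames. */

#define FIFO_BYTES 4096u   /* matches the 4KB BRAM mentioned below */

typedef struct {
  uint8_t buf[FIFO_BYTES];
  unsigned waddr;          /* byte address of next 8-byte write */
  unsigned raddr;          /* byte address of next 1-byte read */
} record_fifo;

void fifo_write_record(record_fifo *f, uint64_t record)
{
  /* 64-bit write port: one whole record per CPU cycle */
  memcpy(&f->buf[f->waddr], &record, 8);
  f->waddr = (f->waddr + 8) % FIFO_BYTES;
}

uint8_t fifo_read_byte(record_fifo *f)
{
  /* 8-bit read port: one byte per ethernet-side cycle */
  uint8_t b = f->buf[f->raddr];
  f->raddr = (f->raddr + 1) % FIFO_BYTES;
  return b;
}

/* Round-trip helper: write one record, drain 8 bytes, reassemble. */
uint64_t fifo_roundtrip(uint64_t rec)
{
  static record_fifo f;
  uint64_t out = 0;
  f.waddr = f.raddr = 0;
  fifo_write_record(&f, rec);
  for (int i = 0; i < 8; i++)
    ((uint8_t *)&out)[i] = fifo_read_byte(&f);
  return out;
}
```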

It is a bit annoying that we are limited to 64 bits, as it would be nice to have the program counter (16 bits), instruction (up to 24 bits), flags (8 bits), and stack pointer (16 bits), as well as the A, B, X, Y and Z registers (40 bits more).  That makes a total of 104 bits.  So in the end I have had to leave the B, X, Y and Z registers out.  I could get all clever-pants later and provide less frequent updates of those registers by providing them in cycles between instructions, however, for now I don't really have a need for that.

So now I could produce 2KB ethernet frames packed with 512 x 8 byte instruction records.  Next, I had to get the logic right to pause the CPU when the buffer was filling, but ideally not have to pause the CPU every time we have a packet.  Using 2KB packets and a 4KB BRAM, this means I can have one packet being sent, while preparing the next one, and only pausing the CPU if we approach the end of the packet we are preparing, and the ethernet is still sending the other.  The logic turned out to be almost elegant in its simplicity here:

        if dumpram_waddr(7 downto 0) = "11110000" then
          -- Check if we are about to run out of buffer space
          if eth_tx_idle_cpuside = '0' then
            cpu_arrest_internal <= '1';
            report "ETHERDUMP: Arresting CPU";
          end if;
        end if; 

I just reused the existing compressed video sending logic in the ethernet controller, so there is (currently) no way to tell the packet types apart, other than looking closely at the content. However, as these are tools designed for debugging, this is a reasonable compromise.

I then started writing a new program, ethermon, that listens for these packets and decodes them for display.  The result is really quite nice:

--E---ZC($23), SP=$xxF6, A=$17 : $084C : AD 52 D0   LDA $D052 
--E----C($21), SP=$xxF6, A=$17 : $084F : 85 FD      STA $FD   
--E----C($21), SP=$xxF6, A=$02 : $0851 : AD 53 D0   LDA $D053 
--E----C($21), SP=$xxF6, A=$02 : $0854 : 29 03      AND #$03  
--E----C($21), SP=$xxF6, A=$02 : $0856 : 18         CLC       
--E-----($20), SP=$xxF6, A=$07 : $0857 : 69 05      ADC #$05  
--E-----($20), SP=$xxF6, A=$07 : $0859 : 85 FE      STA $FE   
--E-----($20), SP=$xxF6, A=$07 : $085B : A0 00      LDY #$00  
--E---Z-($22), SP=$xxF6, A=$2A : $085D : A9 2A      LDA #$2A  
--E-----($20), SP=$xxF6, A=$2A : $085F : 91 FD      STA ($FD),Y
--E-----($20), SP=$xxF4, A=$2A : $0861 : 20 E4 FF   JSR $FFE4 
--E-----($20), SP=$xxF4, A=$2A : $FFE4 : 6C 2A 03   JMP ($032A)
--E-----($20), SP=$xxF4, A=$00 : $F13E : A5 99      LDA $99   
--E---Z-($22), SP=$xxF4, A=$00 : $F140 : D0 08      BNE $F14A 
--E---Z-($22), SP=$xxF4, A=$00 : $F143 : A5 C6      LDA $C6   
--E---Z-($22), SP=$xxF4, A=$00 : $F144 : F0 0F      BEQ $F155 
--E---Z-($22), SP=$xxF4, A=$00 : $F155 : 18         CLC       
--E---Z-($22), SP=$xxF6, A=$00 : $F156 : 60         RTS       
--E---Z-($22), SP=$xxF6, A=$00 : $0864 : C9 00      CMP #$00  
--E---ZC($23), SP=$xxF6, A=$00 : $0866 : F0 E4      BEQ $084C 
--E---ZC($23), SP=$xxF6, A=$1D : $084C : AD 52 D0   LDA $D052 
--E----C($21), SP=$xxF6, A=$1D : $084F : 85 FD      STA $FD   
--E----C($21), SP=$xxF6, A=$02 : $0851 : AD 53 D0   LDA $D053 
--E----C($21), SP=$xxF6, A=$02 : $0854 : 29 03      AND #$03 

We have the CPU flags on the left, followed by the hex version of the same. Then comes the bottom 8-bits of the stack pointer (remember on the 4510 you can make the stack pointer 16-bit), then the contents of the accumulator, and finally, the program counter, instruction bytes, and the disassembled version of the instruction.
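For anyone wanting to parse these packets themselves, here is a sketch of decoding one 8-byte record into a line like the above.  Note that the field layout below is my reconstruction from what the display shows (flags, SP low byte, A, PC, and up to three instruction bytes) -- treat it as an assumption rather than the actual on-the-wire format, and it omits the disassembly column:

```c
#include <stdio.h>
#include <stdint.h>

/* Render the 4510 flags byte as ethermon does: NVEBDIZC, with '-' for
   clear bits (bit 5 is E, the stack-extend flag, on the 4510). */
void format_flags(uint8_t p, char out[9])
{
  const char names[8] = {'N','V','E','B','D','I','Z','C'};
  for (int i = 0; i < 8; i++)
    out[i] = (p & (0x80 >> i)) ? names[i] : '-';
  out[8] = 0;
}

/* Decode one record using the ASSUMED layout:
   r[0]=flags, r[1]=SP low byte, r[2]=A, r[3..4]=PC (little-endian),
   r[5..7]=instruction bytes. */
const char *decode_record(const uint8_t r[8])
{
  static char line[64];
  char flags[9];
  format_flags(r[0], flags);
  uint16_t pc = (uint16_t)(r[3] | (r[4] << 8));
  snprintf(line, sizeof line,
           "%s($%02X), SP=$xx%02X, A=$%02X : $%04X : %02X %02X %02X",
           flags, r[0], r[1], r[2], pc, r[5], r[6], r[7]);
  return line;
}
```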

There was a bit of fun to get the program counter correct, as some instructions pre-increment the program counter.  This just meant a bit of fiddling to get the out-by-one errors squashed in the display.

What is really nice is that at 8 bytes per instruction on 100Mbit ethernet, we can log perhaps a little over one million instructions per second.  That's enough to log a C65 at its full 3.5MHz, unless you are running almost exclusively single-byte instructions, which is implausible in practice.  Certainly there is no noticeable slow-down of the machine when logging to ethernet like this at 1MHz, 2MHz or 3.5MHz. It is only when running the CPU at full speed that it limits things.
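The back-of-envelope arithmetic behind that claim:

```c
/* Raw record rate: bits per second divided by bits per record.
   100Mbit/sec of 8-byte records gives 1,562,500 records/sec before
   ethernet framing overhead (preamble, headers, CRC and inter-frame
   gap per 2KB frame), which eats the margin back towards "a little
   over one million". */
unsigned long records_per_sec(unsigned long bits_per_sec,
                              unsigned record_bytes)
{
  return bits_per_sec / (record_bytes * 8ul);
}
```

At a rough average of three cycles per instruction, a 3.5MHz C65 executes on the order of 1.1 million instructions per second, which fits under that ceiling.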

Again, it depends a bit on the instruction mix, but typically it can log at an effective CPU speed of about 5MHz -- certainly fast enough for live use.  Also, for timing-sensitive bugs, the fact that it runs at full speed for 512 instructions, and then pauses for a while before running the next 512 instructions at full speed, means that it is still very likely to catch bugs, as most instructions are executed completely unaffected in terms of timing or bus pausing.

In fact, the real problem is that logging millions of instructions per second results in hundreds of megabytes of log for every few seconds.  This means that I need to think about careful instruction filtering strategies, so that it will be easier to debug problems around particular pieces of code.  I have already added options to capture a fixed number of instructions, and to capture all instructions for exactly one frame.

Anyway, back to trying to figure out what is going wrong with Ghosts'n'Goblins using this, I have been rather frustrated!  Even after adding a mode that logs every single bus access, no access to $D031 is visible -- and yet, instrumenting the VIC-IV to toggle a line every time $D031 is written to, I can see that it is indeed getting written.  I pulled my hair out for quite some time on this, because I could see the glitch happening within a few seconds.  Then finally the penny dropped: the problem happens exactly when the screen swaps from the high-scores to the credits screen, with the Ghosts'n'Goblins sprite display near the top.

Moreover, using the ethernet CPU trace, I have seen that the values $78 and $50 that erroneously end up in $D031 - $D03F are the values written to $D00E and $D00F when setting the sprite locations for that screen.  So, now I have a really interesting clue, and I just have to figure out where the code is for this, and then get a trace of it.

In the process I also discovered that the glitch only occurs at 1MHz. If the CPU is set to 4MHz, then it doesn't happen.  Also at 2MHz ("C128 fast" mode) it doesn't happen, nor at full CPU speed. Thus we have some kind of dynamic bus bug.

Anyway, having a clue as to the cause, I used monitor_save to dump out memory from the running machine, and then searched for routines that wrote to $D000 and $D001.  There are two such routines, one at around $3332, and the other at around $33B0.  The routine at $33B0 looks like:

,000033B0  BD 62 34  LDA   $3462,X
,000033B3  99 01 D0  STA   $D001,Y
,000033B6  BD 52 34  LDA   $3452,X
,000033B9  99 00 D0  STA   $D000,Y

If I replace that with all NOPs, then suddenly the problem goes away.  In fact, it is the instruction at $33B9 that causes the problem, it seems.

Changing the STA $D000,Y to just STA $D000 or even STA $D000,X also stops it.

So is the problem with the STA $xxxx, Y instruction?

I also wondered about the addresses involved.  The correct write address will be $FFD000x, while the erroneous writes are happening to $FFD303x.  Thus there is a bit of a pattern there, where 0 nybls are becoming 3s.  But one is a low nybl and the other a high nybl, so it still doesn't really make any sense. So let's look at the STA $xxxx,Y instruction in the CPU, and see if there are any clues there.  Nope.  The $xxxx,X and $xxxx,Y addressing modes are carbon copies of each other, just with a different index register supplied.

This left me with glitching due to not making timing closure as the most likely cause.  Basically I will have to investigate it further when I get the chance; for now, I am just driving myself insane, even if I have made a nice new tool for tracing CPU activity.

Fast forward a couple of weeks, and I have fixed the problem, although I have also forgotten exactly what it was that I fixed!  I have a recollection that it was something in the way the bus is held when running at 1MHz, and making sure that I clean out any old accesses that were sitting on the IO bus.  The main thing is that it is now conclusively fixed.

Saturday 22 December 2018

The last few days I have been trying to track down a few regressions, as I have got other parts of the VIC-IV behaving more compatibly.  A couple of weeks back, I fixed the colour selection for multi-colour mode and variants, as it was doing some strange things.  

In the process, I thought it might also be useful to make the illegal video modes that combine the EBC (Extended Background Colour)  and MCM (Multi-Colour Mode) of the VIC-II do something useful instead of just showing a black screen.  
So I made this combination allow the selection of foreground and background of each character using the upper and lower nybls of the colour RAM (since it is 8-bit on the C65/M65).  It seemed like a good idea.  So I made the little change, and went off and did some other things.

More recently, I went and tested a few games, and found that some things that used to be fine were now broken.  In particular, Ghosts'n'Goblins was displaying a line of rubbish on certain screens:


I had been working on some improvements to the CPU/VIC timing, e.g., adding emulation for badlines, adjusting the relationship between when the screen is being drawn and when raster interrupts are actually triggered, and making CPU interrupts take the correct number of cycles.  Thus I had expected things to become better, not worse. And indeed some games were a lot better; for example, almost all of the timing jitter glitching around the text notices in Gyruss suddenly disappeared.  But Ghosts'n'Goblins and a few other programs were noticeably worse.

As I had been fiddling around with timing, I initially suspected that it might have been something like a badline triggering before $D018 was written to in a raster interrupt, and thus causing the wrong row of stuff to be displayed.  I chased my tail around on this for quite a while, before realising that that could not be the problem, and that it must instead be something else.  

After being faced with this, I eventually remembered that I had messed with the multi-colour rendering stuff.  Then using the ethernet real-time instruction logging program I wrote recently, I captured a frame's worth of instructions, and sure enough, it turns out that Ghosts'n'Goblins blanks out that row of stuff by selecting one of the illegal VIC-II modes, as can be seen from the instruction logs below:

 $31FE : AD 11 D0   LDA $D011  
 $3201 : 09 60      ORA #$60  
VIC-II raster $09a (VIC-IV raster $c05) [NEW RASTER]
 $3203 : 8D 11 D0   STA $D011  
 $3206 : AD 16 D0   LDA $D016  
 $3209 : 29 F8      AND #$F8   
 $320B : 8D 16 D0   STA $D016 
 $3252 : AD 11 D0   LDA $D011  
 $3255 : 29 9F      AND #$9F  
VIC-II raster $09c (VIC-IV raster $c0e) [NEW RASTER]
 $3257 : 8D 11 D0   STA $D011 

The instruction logger also gets VIC-IV raster line and badline notifications in the stream, which makes it very handy for debugging this kind of thing, as we can see that these events are happening at about the right place on the screen, although I need to make those values actually reflect the raster numbering that we are all used to.

So, I reversed out my "clever" re-use of the illegal video mode, and made it correctly display black when in one of these illegal modes, and the result was:

Substantially better, although we can still see that the timing is not yet exact.  However, it is good enough to work for lots of things.

What this whole little adventure has reminded me is how complex it is to achieve a high level of compatibility with even a simple legacy system.  You basically have to assume that any strange behaviour will have been abused at some point, and needs to be reproduced.  Also that good debug tools are a life saver.

Now I just need to work out what has gone wrong with sprite rendering (see the thin gaps between the sprites in the GHOSTS'N'GOBLINS banners).  This seems to be a problem from adapting the sprite pipeline to the new (and otherwise much better and more accurate) video pixel driver rework I did recently.  I think the pixel strobes are not being delayed correctly for the sprites and so the pixel edges are not falling where they should, resulting in stretched or skinny pixels, which if they occur in the last pixel of a sprite on a given raster, can leave these tiny half-pixel wide gaps.  But that will have to wait for another post...

Friday 14 December 2018

MEGAphone - connecting the cellular modem and audio interface

I started writing this post a few months back, and only got around to finishing it off now, as I tried to hook this test rig up, and realised I didn't have a good summary of the required connections anywhere.

The cellular modems for the MEGA65 phone are in the standard miniPCIe form-factor. Note that this does not mean that they use the miniPCIe signalling -- but rather that they use the connector and physical dimensions of miniPCIe.

By using this standard format, the MEGA65 can use a wide variety of cellular modems -- and can even be upgraded in the field if you want to go from 3G to 4G (or later, to 5G when it becomes available), or just switch cellular modems if you are in a country that uses a different frequency band or communications standard.

This also makes the interfacing much easier, and means that we can revise the PCB as we go along, without having to throw away expensive components like the cellular modem. This is also why the FPGA is on a socketed module (the TE0725).

So, to connect the cellular modem we need to supply it with power (the mechanism to cut power for airplane mode will come later), connect the UART serial interface, so that we can talk AT commands to it, and also the PCM or I2S audio interface.  The Quectel EC25AU module we are using on the bench uses PCM instead of I2S audio, but the two are very similar in practice.  The pins that matter are:

PCM_CLK - A 2MHz square wave clock.
PCM_SYNC - Pulses high for one clock tick every time a new sample is being clocked.  This should be run at 8KHz, to match cellular audio standards.
PCM_DOUT - Clocks the data from the FPGA to the modem, starting with the MSB immediately following a PCM_SYNC pulse.
PCM_DIN - The same as PCM_DOUT, but in the opposite direction.
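To make the framing concrete, here is a small Python sketch of how one sample is serialised in this short-sync PCM scheme: with a 2MHz bit clock and an 8KHz sync, each frame is 250 bit clocks long, and the 16 data bits follow the sync pulse MSB-first (the function name here is mine, not from the Quectel data sheet):

```python
BITS_PER_FRAME = 2_000_000 // 8_000   # 250 bit clocks per 8KHz frame

def serialise_sample(sample):
    """Return the 250 bit-slots of one PCM frame: the 16 data bits MSB-first
    immediately after the sync pulse, then idle (0) for the rest of the frame."""
    data = [(sample >> (15 - i)) & 1 for i in range(16)]
    return data + [0] * (BITS_PER_FRAME - 16)

frame = serialise_sample(0x8001)
print(frame[0], frame[15], len(frame))  # -> 1 1 250
```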

For simplicity on the bench, we are using a nice little miniPCIe breakout board that is made for these modules, to the point where it even has a SIM card receptacle:

So, we need to know the pins that matter on the EC25AU:

11 - UART_RX  = REFCLK- on miniPCIe
13 - UART_TX  = REFCLK+ on miniPCIe
45 - PCM_CLK = reserved on miniPCIe
47 - PCM_DOUT = reserved on miniPCIe
49 - PCM_DIN = reserved on miniPCIe
51 - PCM_SYNC = reserved on miniPCIe

(Ground and VCC we will ignore for now)

The miniPCIe breakout board has a 24-pin connector, where we find:

1 - GND
5 - REFCLK- (UART RX on the EC25AU)
8 - REFCLK+ (UART TX on the EC25AU)
18 - Pin 45 = PCM_CLK
19 - Pin 51 = PCM_SYNC
17 - Pin 49 = PCM_DIN
20 - Pin 47 = PCM_DOUT

So, it looks like everything we need is there.

But there is a wrinkle: the EC25AU says that all of these signals should be 1.8V, not 3.3V. I made a level converter board to solve this, but it didn't work, so I got some pre-made ones, which worked just fine.  The result is a bit of a rat's nest:

The only problem is that the miniPCIe breakout board doesn't actually provide 1.8V anywhere convenient.  This is because the miniPCIe standard used for these modules doesn't provide such a 1.8V reference, so I need to get it from somewhere else.  This is also good to know before we get further with the PCB design, as we will need to supply our own 1.8V reference there, too.

(Actually, we are supposed to be able to run FPGA IO pins at 1.8V, but only if all pins on a given IO bank are 1.8V.  This isn't possible on the TE0725 boards, so we are resorting to the level conversion board.  It isn't a big deal, but it is just another thing.)

Fortunately, the Nexys4 FPGA boards we are using have a 1.8V output we can use, but they don't populate the pin header by default, so I need to solder a header on, which is a bit annoying since it is now kind of embedded in the whole rat's nest of wires that form the MEGA65 phone bench prototype.  So I'll try just sitting a male jumper lead in the through-hole, and hope it works.  It is only used to power a 74LVC245 buffer IC that takes the 3.3V side and makes it 1.8V, so if it drops out sometimes, it shouldn't be a big problem.  That worked, but I did relent in the end, and populated the 1.8V header by soldering in a simple 2x1 0.1" header:

Voltage conversion in the other direction uses a voltage divider made from two resistors to create a ~2V level for the 3.3V side to read as a logical 1, and a diode to allow that to be pulled to ground when the 1.8V side pulls it low. It's not ideal, but it is what I could build using parts we had on-hand today.
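As a rough sanity check on that arrangement, the expected levels can be computed. The resistor values below are assumptions for illustration (a ~1:2 divider, giving the roughly 2/3 point), not the exact parts used on the bench:

```python
def divider(vin, r_top, r_bottom):
    """Output voltage of a simple two-resistor divider."""
    return vin * r_bottom / (r_top + r_bottom)

V33 = 3.3
DIODE_DROP = 0.45  # measured drop of the diode on the bench

# Assumed ~1:2 divider, giving roughly the 2/3 point between 0V and 3.3V.
v_divider = divider(V33, 1000, 2000)    # ~2.2V: reads as logic 1 on the 3.3V side
v_low_side = v_divider - DIODE_DROP     # level seen when the 1.8V side pulls low-ish
print(round(v_divider, 2), round(v_low_side, 2))  # -> 2.2 1.75
```

The point of the diode is that when the 1.8V side drives the line low, it can pull the 3.3V side down through the diode to well below the logic threshold, while the divider alone keeps the idle level safely in the logic-1 region.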

From the Nexys4DDR board, we have to map out the pins we are using on the PMOD connectors for the modem UART and PCM audio interfaces.  The PCM audio is on JD, and the UART is on JC:

JC1 - RX
JC2 - TX

So, to simplify all that, we need the following connections:
JC1 - miniPCIe header 8
JC2 - miniPCIe header 5
JD7 - miniPCIe header 18 via level converter
JD8 - miniPCIe header 19 via level converter
JD9 - miniPCIe header 17 via level converter
JD10 - miniPCIe header 20 via level converter
GND - miniPCIe header 1

All wired up, it looks like this:

(But don't forget the level converters in between!)

So now the process is to validate that the level converter works, so that we don't let the magic smoke out of the 4G module.

First step: 1.8V from my dodgy pin-in-hole to the 74LVC245. Check.

Second: Check output voltage on 74LVC245 outputs under both high and low conditions. Check.

Third: Check input voltage from the diode lifter under both high and low conditions. It swings from about 2.3V to 0.6V from high to low, while staying at 1.8V on the low-voltage side.  That should be okay. In fact, it is a wonderful accident that it sits so close to 1.8V, because the other end of the diode is sitting at about 2.25V, as the roughly 2/3 point between 0V and 3.3V from the resistor voltage divider.  It seems that the diode I am using has a voltage drop of almost exactly 2.25V - 1.8V = 0.45V.  This is a bit lower than the data sheet suggests (about 0.7V), though still above the realm of germanium diodes (0.3V).  However, this coincidence struck me as suspicious, so I removed the 1.8V supply to the circuit, in case it was somehow finding its way over to the pin, without seeing any change.  Thus for now, I will assume I managed to fluke matching the voltages exactly.  The main benefit is that with matched voltages there will be no wasted power from the mismatch.

So, it seems that everything looks okay in terms of function.

Next step, make sure we can get the cellular modem talking PCM, and that it really does show up on the pins we expect.

The AT+QDAI command controls the audio path for this module:


io=1 for digital PCM
mode=0 for master, i.e., where the modem generates the PCM_CLK and PCM_SYNC signals.  We don't want this once we are hooked up, because the FPGA will generate these signals for both modem slots. But for initial testing, it will be nice to see the modem produce some PCM audio, so we will set it to 0.
fsync=0 for normal short-synchronisation
clock=4 for 2MHz
format=0 for 16-bit linear samples
sample=0 for 8K (or 1 for 16K)

We will ignore num_slots and slot_mapping for now.

So the command we want is:

AT+QDAI=1,0,0,4,0,0
In theory, before running this, we should see no PCM audio signals from the module, but after, we should see clock and sync, and indeed, the audio data.

Hooking the oscilloscope up, I am indeed seeing nothing before running the command.  After running the command, I can see the PCM sync and clock pulses. I can also see what I think is the audio from the phone call, except that it doesn't seem to react at all when I talk into the phone at the other end of the call.

I did try hooking the audio lines together, so that it would try to play exactly what it heard back over the line, but I still hear nothing in the phone call. But it might be that the echo cancellation detects this perfect image of the audio and eliminates it, thus leaving the call silent.

But back to basics, let's confirm that the pins on the breakout do what we expect:

19 = PCM_SYNC - confirmed. Pin JD8 on FPGA board.
18 = PCM_CLK - confirmed. Pin JD7 on FPGA board.
17 = PCM_OUT (from modem). Pin JD9 on FPGA board.
20 = PCM_IN (to modem). Pin JD10 on FPGA board.

The UART for talking to the modem is JC3,4, connecting to pins 8 and 5 respectively on the modem.

Super. So I tried a little trick, and connected PCM_OUT to PCM_IN, so that the modem would basically act as a loop-back of the caller's audio.  This works quite nicely, and you get to discover just how much lag the mobile network introduces! Close to half a second in these tests.

So the one thing I had to do to make this work was to disable in-call mute with at+cmut=0, as the call was muted by default.

Well, now it looks like everything is ready to hook up the audio path all the way through.

Hooking everything up, I noticed that the PCM_CLK line was being damped. After a bit of poking around, I discovered that this is because the modem still thinks it is in PCM master mode, i.e., it is trying to generate the clock and sync pulses itself.  This is confusing, because I have told it explicitly via AT+QDAI=1,1,0,4,0 to become a PCM audio slave.

So, some quick changes to the VHDL to allow it to take the PCM clock and sync from the external source, building a new level converter with three inputs on the FPGA side (PCM_IN, PCM_CLK and PCM_SYNC), and a lot of pulling my hair out at random problems with the test setup, and I started getting sound out of the phone that was related to what the MEGA65 was producing.

One of the weird things is that the sound output from the modem, i.e., for sound coming from the caller, stopped working all of a sudden.  My new level converter is floating at about 2V, and I might have applied some slightly higher voltages to the PCM lines while trying to figure out what was going on, so it is possible I have fried it.  I'm hoping not.

Also there was a lot of fiddling with the audio mixer gain controls in the MEGA65 to get it sounding sensible on the output side.  Initially the audio was over-driving and distorting.  Then it was too quiet. I might need to adjust the scale on the master volume control to allow scaling up quieter inputs more, so that the distortion can be avoided with smaller coefficients in the mixing stage, but still end out loud enough on the output. This might even need to be different for the different audio output types.

To produce sound from the MEGA65 for testing in call, I loaded Impossible Mission and used both the music from the crack intro, as well as the sounds in the game itself.  Hence, while the first words on a telephone were "come here Watson", the first words communicated over the MEGA65 phone were, "Another visitor. Stay a while. Stay forever!"  This seems quite fitting to me.

After all that, I realised I hadn't read the documentation for the modem closely enough: the AT+QDAI command only takes effect when the modem is reset.  Grrrr...  So back to the first level converter board for me.  Now I am having trouble getting the modem to even talk on its serial line to me.  The UART TX line on the modem seems to cycle between 0 and 3.3V every few seconds. Now I am really hoping I haven't fried it.  After a bit of fiddling, it looks like the USB power lead I was using was not connected properly. It is one of those leads with two heads for supplying more current, and it is a bit temperamental about which one you plug in, if you only plug in one of them.

After switching that, the modem looks like it boots again, but I am still not getting any serial output from it.  Well, the oscilloscope says that it is indeed producing serial output again, but I am not seeing it on the FPGA. After a bit of fiddling, I have the signal on the FPGA pin again, but still not communicating with my little terminal program.  Restart FPGA in desperation. Now it is there again.  Okay, now I am finally back at the starting line, to try using the modem as a PCM slave.

I can now see the audio in both directions again, which is very nice :)

However, for some reason, the audio isn't synchronised to the sync pulses I am giving the modem.  This manifests itself in the audio coming out being quite garbled, because the modem is sampling at random intervals.  I can also see the audio from the modem is quite unsynchronised with the sync pulses, which is also very wrong.

A little more careful reading of the PCM specification for this module reveals that the bit order is the opposite of I2S, i.e., the least significant bit should come first.  That probably explains the garbled audio output.  So, I'll make that change and resynthesise.
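The fix amounts to reversing the bit order of each 16-bit word before it goes through the existing MSB-first shifter. A quick sketch of the transformation (in Python rather than the actual VHDL):

```python
def reverse16(word):
    """Reverse the bit order of a 16-bit word, so that an MSB-first
    shifter ends up sending the LSB first, as this PCM format expects."""
    out = 0
    for i in range(16):
        out = (out << 1) | ((word >> i) & 1)
    return out

print(hex(reverse16(0x8000)))  # -> 0x1
```

Applying the same reversal to the incoming words recovers the samples received from the modem, since the transformation is its own inverse.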

However, the bit order doesn't really explain why the incoming audio samples were not synchronised to the sync pulse, but seemed to be arriving at any old time they liked.  It is possible that they are still being clocked out from the module's internal 2MHz clock. I'll have to examine the relative timing of those a little more to figure out what is going on. If they are clocked from the internal clock, I might be able to do something clever to infer and latch the clock.

Hmm... Now trying another call, I can see the audio going from the FPGA to the modem, but I can't hear anything at all from the phone.  Yet the PCM_CLK and PCM_SYNC lines are still being fed correctly. Also, I am now not seeing the audio from the modem.  Most weird. And rather frustrating that it doesn't behave the same every time.

Synthesis with bit order reversed has now completed, so tried that.  Now I have audio in the call (in both directions again), but still not at all synchronised.

Also, I can see that the audio samples from the modem seem to be at 4KHz, but that could be the bottom bit alternating.

However, I can also see that the sync pulses I am generating are occurring every ~60usec, which is 16.7KHz, not 8KHz, which might be why the modem is refusing to synchronise correctly, although it does claim to be able to operate at 16KHz.  Anyway, it makes sense for me to fix the SYNC clock rate in the FPGA, so that we are not obviously out of spec. Looking closer, I can see that my sync pulses are also only half the width they should be -- which might well explain why they only get detected some of the time by the modem, and also why the sample rate is double what it should be.

Why it is 16.7KHz instead of 16KHz, I am not sure. This could be the clock frequency on the FPGA being a bit off, but ~4% sounds like quite a lot. Ah, it could be that my clock divider is to blame here: it should flip the clock every 25MHz/2MHz = 12.5 clock cycles, but the counter is an integer, thus it will flip every 12 cycles.  12.5/12 x 16KHz = 16.667KHz -- exactly what I am seeing on the oscilloscope.  So mystery solved there.  The solution is to allow for a not-exactly-50% duty cycle on the sample clock, so that the overall timing can be preserved.
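One standard way to implement that fix is an accumulator-style divider that alternates between 12- and 13-cycle half-periods, so that the long-run average interval is exactly 12.5 cycles. A little Python model of the idea (a sketch of the approach, not the actual VHDL):

```python
def toggle_times(divisor_x2, n_toggles):
    """Model a clock divider that toggles every divisor_x2/2 input cycles
    on average, by carrying the fractional remainder in an accumulator."""
    acc = 0
    t = 0
    times = []
    for _ in range(n_toggles):
        acc += divisor_x2          # e.g. 25 for a /12.5 toggle interval
        step = acc // 2            # whole input cycles in this half-period
        acc -= step * 2            # keep the leftover half-cycle for next time
        t += step
        times.append(t)
    return times

# Toggling every 12.5 cycles of the 25MHz clock, on average:
print(toggle_times(25, 4))  # -> [12, 25, 37, 50]
```

The half-periods alternate 12, 13, 12, 13..., so no individual edge is more than half an input cycle off, and the average frequency is exact.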

Okay, so now the clock and sync pulses are correct as required in the documentation.  However, when I make a call, I am still seeing rather strange effects.

The strangest thing is that there was no audio at all in the call until around 7 minutes in -- at which point, suddenly, I started getting audio in both directions, whereas previously there was none.  The audio started as I was probing various things with the oscilloscope, which might suggest that there is still some subtle problem with the signalling I am producing.  However, what that problem might be, I am at a loss to speculate.

The problem with the lack of synchronisation of the audio coming from the modem to the sync pulses I am providing continues, and the apparent sample rate continues to be 4KHz, instead of 8KHz. As an experiment, I am now trying to make my PCM transceiver synchronise to the leading edge of the incoming audio sample.  If that works, then I will have some reasonable evidence that the EC25AU is being naughty when in PCM slave mode. Unfortunately that didn't work. I have contacted Quectel to see if they have a solution for this, or indeed, if I am doing something incorrectly.

In the meantime, I am returning my focus to using the modem as PCM master, and making the FPGA be the slave.  My previous level converter had a number of problems, so I gave in and bought a couple of two-line bidirectional level converters from Jaycar, and after a number of false starts and general jiggery-pokery to work out what was going on, I think I now have it all together.

However, when I tried to power it up and make a call, I noticed that the sync pulse on the FPGA side was <1.8V, instead of 3.3V.  A bit of exploring revealed that I was accidentally driving that FPGA line low, as I hadn't correctly switched things around for the FPGA to be PCM slave.  The fix for that is now synthesising, and hopefully it will get things going.

After a bit more fiddling, I realised that the PCM audio feed is 16-bit signed, not unsigned.  While this isn't such a problem for the output, it is a problem for the audio from the modem, as the sample values frequently cross the zero line.  So, another quick synthesis run to sort that out.  Discovering that this was a problem was a bit complicated by the fact that the modem mixes some of the outbound audio onto the inbound audio to provide "side tone", so that the user of a phone doesn't think that the line is dead.
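The signed/unsigned conversion itself is trivial -- just an offset of 32768, which for 16-bit values is the same as flipping the top bit. A minimal sketch:

```python
def signed_to_unsigned16(sample):
    """Map a signed 16-bit PCM sample (-32768..32767) to unsigned 0..65535.
    Equivalent to XORing the top bit of the 16-bit representation."""
    return (sample + 32768) & 0xFFFF

def unsigned_to_signed16(value):
    """Inverse mapping: unsigned 0..65535 back to signed -32768..32767."""
    return value - 32768

print(signed_to_unsigned16(0), signed_to_unsigned16(-32768))  # -> 32768 0
```

Without this offset, samples near silence oscillate between values like 0x0001 and 0xFFFF, which an unsigned pipeline treats as full-scale jumps -- hence the noise on the incoming audio.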

Finally, after all of this mucking about, I managed to get the audio path working in both directions.  The process was longer than I hoped, and yet, looking back, it took less than two weeks to do, which is not an unreasonable amount of time for a first attempt at creating the complete audio path for a mobile phone.