Sunday 31 December 2017

Connecting a C64 disk drive to the MEGA65 - part 2

The other day I wrote about getting the first communications between the MEGA65 prototype PCB and a real C64 disk drive (my old sick 1581).

At that point, I had it working to be able to read the DOS status exactly once, but then it would hang.  Similarly, trying to LOAD anything (even a disk directory) would cause strange lock ups.  Tracing through where it got stuck using the MEGA65 serial monitor (which I am always so glad I went to the effort of implementing), I found that it was waiting for timer B on the CIA to time out, to determine when it was an EOI condition.  However, timer B never triggered.  This set me on a journey of implementing a pile of missing functionality in our CIA implementation.

First, I tracked down and fixed the bugs in timer B, and resynthesised the bitstream, to confirm that this indeed fixed the problem I was seeing in C64 mode.

I couldn't tell if it would now work for loading files, as I can't even load a directory on my sick old 1581. Of course, I only got to this point after having borrowed a 1541-II from a local to confirm that I could read the status message only once from that as well -- so even now, I don't know if I would be able to load something or not from C64 mode. However, C65 mode is a different question...

C65 mode, like C128 mode on a C128, uses fast serial when it can.  Long, long ago, I had bodged up the status register on the CIA shift register, that is used for fast serial. This was to work around the start-up code on the C65 that looks for a disk drive, and if present, tries to boot from it.  My bodge was just to always set the shift-register complete flag.  This worked back then, and stopped the boot process hanging like this:

But now that we are trying to talk to real disk drives, it makes the C65 think that every connected disk-drive can do the fast serial mode, which results in hanging.  (Note that while a 1581 can do fast serial, we don't currently have the SRQ line connected on the MEGA65's IEC serial port. This will change in the production versions, so that fast serial mode will be available.)

So, as well as fixing timer B, I spent part of the day implementing the shift-register, at least for sending, so that I could get rid of the bodge, and have, hopefully, both the C65 booting correctly, and working with external drives in C65 mode. This is still a work in progress, but I hope to have it all working in the next couple of days.

There are also some residual bugs I need to track down, which might just be poor timing closure on the FPGA, that cause device not present errors and the occasional corruption of bytes received, such as in the following image. It could also be that my 1581 is sick (it sometimes flashes three blinks on start up, to indicate that it has a problem with its RAM, so this is not out of the realm of possibility).

The next stages on this will probably have to wait now, until I collect the 1541 that is sitting in the lab at work sometime this coming week.  But I am happy that significant progress has now been made.

Saturday 30 December 2017

Connecting a C64 disk drive to the MEGA65 - first success

I have spent the last couple of days working on getting the IEC serial bus working on the MEGA65.  This has been possible, because we now have the CPU able to run at instruction-level accurate timing, which is generally sufficient for serial disk access. There are two purposes behind this: (1) we just want it working, and (2) we need to make sure that the circuitry on the MEGA65 mother board is correct for the IEC serial port.

I started out with a simple VHDL program that would let me control the CLK and DATA lines via joystick, and show their status on the two LEDs on the mother board.  That is, without a CPU, and without a video controller. The purpose here is to keep the VHDL super-simple, and thus super-fast to synthesise, so that I could iterate quickly, and avoid the 2 - 10 hour synthesis runs that have become normal.

We already knew that we had forgotten to specify the inclusion of pull-up resistors on the IEC serial lines, and this was indeed confirmed.  So I built an adapter cable that takes 5V from a joystick port, and uses that to pull up the serial lines via 1K resistors.  Ideally, it should also have diodes to prevent back-flow of current from the bus powering parts of the computer when it is off, as well. This I have noted in the errata for the next revision of the board, but didn't need to implement on my adapter cable for testing.  I'll probably adapt the 5V take-off on the joystick port into a joystick pass-through cable, so that I don't lose a joystick in order to have the IEC bus working. (Unfortunately, it isn't just a case of not using the cable when I don't want to use an external drive, because without it, it looks to the C65 ROM like there is a drive on the bus from which to try to boot, so the computer just hangs on start without it).  Here is the little cable I made up:

A side-benefit of this little cable with exposed pull-up resistors is that it gave me somewhere convenient to attach the oscilloscope probes, so that I could watch what was going on:

After that, it was a bit of trial and error with my poor old 1581, that doesn't seem to be able to read disks any more.

Then finally, once I had everything properly connected, and the device ID set correctly on the 1581, it was success!

And here is the video of it:

Because I am using the MEGA65 in headless mode like this, I also extended the monitor_load program that can load a bitstream, and load a program into memory and do other useful stuff, to also allow typing input into the computer, as though it were being typed on the real physical keyboard. This uses the keyboard virtualisation layer that already exists in the MEGA65 to allow use of C65/C64/USB and other keyboard types, without having to mess with things.  So all I had to do was tell monitor_load to put the correct keyboard matrix events into the queue, and then I could type like magic.  One of the very nice aspects of this is that one can run otherwise interactive tests completely automatically.

Here is the command line that I used to type this little disk status check in and run it. Note that I call monitor_load twice, once to boot the machine with a specific bitstream, and then again to do all the typing:

monitor_load -b bin/mega65r1.bit -4 ; monitor_load -T '10open15,11,15~M20get#15,a$:?a$;".";:ifa$>chr$(13)goto20~Mrun'
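
Since the typed text is just a string, it is easy to build it from a script when automating tests. The following is a minimal Python sketch, not part of monitor_load itself: it assumes that "~M" in the -T argument stands for pressing RETURN (inferred from the command line above), and the helper name typed_text is made up for illustration. It only prints the command rather than running it.

```python
import shlex

# Assumption: "~M" in the -T argument stands for pressing RETURN,
# inferred from the command line shown in the post.
RETURN = "~M"


def typed_text(lines):
    """Join the lines to be typed, with a RETURN between each (hypothetical helper)."""
    return RETURN.join(lines)


basic_lines = [
    '10open15,11,15',
    '20get#15,a$:?a$;".";:ifa$>chr$(13)goto20',
    'run',
]

# Quote the text so that the shell passes $ and " through untouched.
print("monitor_load -T " + shlex.quote(typed_text(basic_lines)))
```

Keeping the BASIC lines in a list makes it simple to generate a different little test program for each automated run.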

Now the next step is to figure out why it gets stuck after reading the disk status message once, instead of reading it repeatedly.  After displaying the disk status message once, the CPU gets stuck looping around in the input byte from serial bus routine at $EE13.  Clock and data are high, suggesting that the 1581 is not trying to send anything. What I can't work out is whether this is just because my 1581 is on its death bed (LOAD"$" yields a 74,DRIVE NOT READY,00,00 message), or whether there is still some problem with the MEGA65 side of things.  Unfortunately I don't have another drive here to compare with, and won't be able to get another one for a few days. Hopefully some of the MEGA guys in Germany will be able to test this there, where they do have some drives on hand. If there are remaining problems, I suspect that they will be CPU or CIA implementation related, since the electrical side of things clearly works, if we can instruct a device to come to attention, and then send a number of bytes.

In any case, I have plenty more to get on with in the meantime, for example, testing the ethernet and cartridge ports, and trying to work out why I can't get any audio out of the left channel.

Thursday 28 December 2017

Building Joysticks

Now that the bitstream is working fairly well at 800x600 (at least well enough to cause our children to squabble about whose turn it is to play Boulder Dash next), I am beginning to turn my attention back to finishing testing the physical ports on the first revision of the MEGA65 PCB, and then get as many of those ports working as possible.

The keyboard is already working, thanks to the overhaul of the keyboard input code.  The joystick ports also used to work. The past tense being employed there, because the overhaul of the keyboard input broke it.  I have since fixed that (it was just that I hadn't fed the real joystick pins into the combined joystick input code), and am synthesising the resulting bitstream for testing later today. 

For those with an interest in VHDL, here is the logic that is used to update the joystick state:

      for n in 0 to 4 loop
        joya(n) <= '1' and (joya_physkey(n) or joykey_disable)
                   and (joya_widget(n) or widget_disable)
                   and (joya_real(n) or joyreal_disable)
                   and (joya_ps2(n) or ps2_disable);
        joyb(n) <= '1' and (joyb_physkey(n) or joykey_disable)
                   and (joyb_widget(n) or widget_disable)
                   and (joyb_real(n) or joyreal_disable)
                   and (joyb_ps2(n) or ps2_disable);
      end loop;

Basically, for each line on both joysticks, we consider all the possible input sources and merge them together. These are: the USB/PS2 keyboard using dedicated keys mapped to the joystick; the MEGA65 Widget Board, which can be connected to a Nexys4 FPGA development board to allow use of a real keyboard and joysticks on that; and apparently the USB/PS2 keyboard a second time (I just noticed this, and will have to investigate why on earth I have it twice).
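
For those who don't read VHDL, the same merge can be modelled in a few lines of Python. This is only a sketch of the idea, not the real implementation: the function name and the (level, disabled) pair representation are invented for illustration, and it assumes the joystick lines are active low (0 means pressed), which matches the way a C64 joystick pulls a line to ground.

```python
def merge_joystick_line(sources):
    """Merge one joystick line from several input sources.

    sources is a list of (level, disabled) pairs, where level is 0 when
    that source is pressing the line (active low) and 1 when it is idle.
    """
    line = 1  # idle unless some enabled source pulls the line low
    for level, disabled in sources:
        # A disabled source is forced to 1, so it can never pull the line low.
        line &= level | int(disabled)
    return line


# Fire pressed on the real joystick only, all sources enabled:
print(merge_joystick_line([(1, False), (1, False), (0, False), (1, False)]))  # 0
```

ORing each source with its disable flag and then ANDing everything together means any enabled source can press a direction, while a disabled source is simply ignored.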

However, getting to the point of being able to test joysticks has led me to a long term problem I have had: I don't have any properly working C64 joysticks left. I have a Quick Fire 2, with its lead in very poor condition, with loops of wires on the outside holding it in strange kinks that cause it to sometimes work.  I could have fixed that, but the quality of the switches and everything would mean that it would soon die again.  Also, it wouldn't solve the above-mentioned problem of both of our children wanting to play games at the same time.

Thus I finally got around to doing what I had intended to do for the last 20 years, and build up a couple of proper arcade joysticks.  There is something about the feel of a fully free and well constructed arcade joystick controller, as well as the fact that they are much, much more durable than commodity joysticks, and also much more serviceable. The TAC2 is about the closest C64 joystick I have found, and even those are not as durable, and to me at least, don't quite have the same feel.

Anyway, this adventure in making arcade joysticks was made easier by the fact that Jaycar now stock the joystick assemblies and buttons, as well as cheap plastic boxes in which to put everything.  I had previously bodged one up about 20 years ago with parts from a pinball rental company, but Jaycar is much, much cheaper, and the quality of the mechanisms seems to be pretty good.

Wiring up a C64 joystick is really simple: There are dedicated lines for up, down, left, right and fire, and these get connected through the switches to ground.  If you want any electronics in there, e.g., an auto-fire function, there is also 5V available.  There are also two lines for the analog inputs, but we are not using those.

The arcade joysticks have spade connectors, so I got some nice black 7-core cable and put spade connectors on one end, and the correctly wired DB9 on the other end.  I did try to get all clever-pants, and bought the cable as DB9 serial null-modem cables, with the female DB9 on the end, however, they didn't have all nine pins connected, so I had to go for the hand-soldered and screw-box cover. Here is the joystick wired up with spade-connectors and a soldered ground-loop, so that if I have messed up the joystick directions in the wiring, I can easily fix it.

If I were building an arcade game for public use, I would have used spades for the ground-loop as well, so that the micro-switches could be more easily replaced as they wear out. However, for home use, that is unlikely to ever happen, and a bit of remedial soldering doesn't scare me. The spade connectors also mean I can switch one of them to being "left-handed" (button on right), if my daughter's fears about not being able to use a "right-handed" joystick prove to be genuine.  I should say that I am also left-handed, but don't have trouble using a right-handed joystick, so we will see how we go.

Anyway, here is one of the finished units. Total cost per unit, about AU$45 (ca. 30€).

They are quite big, which is partly necessitated by the rather large joystick mechanisms, and also because I like a big solid joystick, so that you can whack the stick around as you need, and thump the button with impunity.

The arcade buttons from Jaycar have an LED in them, so you can make the button illuminate, which I decided to do on one of them. The other was already sealed up, so I didn't make the modification to that one, but I might another day. For now, it means that I have an easy way to tell them apart.

I'm pretty happy with the result. They are unlikely to ever appear in an Ikea catalog or anywhere else where styling is important, however, they are solid, should last the next 20 years, and have that nice arcade game feel to them.  Now I just need to wait for the FPGA synthesis to finally finish, so I can give it a try...

Tuesday 26 December 2017

A day of squashing bugs

Part way through today I was feeling that I hadn't really achieved very much, but then as I started thinking about it, and making a list, I realised that I have done more than I first thought.

The list of bugs squashed since the 24th now includes:

24DEC17 - 800x600 video modes work
24DEC17 - Joystick input not working
24DEC17 - CPU bug fixed (BoulderMark etc runs fine)
24DEC17 - b0 command in UART monitor stops CPU on BRK instruction
25DEC17 - Fix $DC00 always reading as zeros
26DEC17 - Fix sprite fine horizontal placement problem
26DEC17 - PDM/Sigma-Delta audio output working (audio was broken)
26DEC17 - Kickstart looks for file "NTSC", if not present, switches to PAL
26DEC17 - CIA clock speed is always 1MHz, except in C128 2MHz mode.
26DEC17 - Fix CIA clock halving bug
26DEC17 - $D016 smooth scroll in 320H mode fixed
26DEC17 - CIA is 1MHz even in 2MHz mode (turns out to be the correct behaviour)
26DEC17 - NumLock on PS/2 / USB keyboard is now "joystick lock" (WASD+shift, cursors+space)

This has largely been a process of getting all the stuff working again that has broken during the change of video mode, and the incorporation of bitplanes and other missing features.

While there are still a lot of fixes to be made, it is now back to the point where a number of games load and run just fine, and with the improved joystick emulation, these are both easier and more fun to test.  This of course necessitated some play-testing.  There are still quite obvious display glitches, but many of these are due to a relatively few bugs, including some raster position bugs that are on the list to work through.

For now, I will leave you with a set of images following the process of startup, to playing a game: Kickstart loads C65 mode, from there to C64 mode, mount a disk image from SD card, load the menu from the disk, play the game.  Of course, this will all be much more streamlined for final release, but it is already quite usable.  My 7 year old son is just about on top of how to load and play Ghosts and Goblins, which is his current favourite.

Sunday 24 December 2017

Preparing to implement the VIC-III DAT

Bitplanes are now mostly working, although we are still tracking down some remaining bugs with V400 mode, and some bitplanes being shifted by 16 pixels.

The next challenge is to implement the VIC-III Display Address Translator (DAT). This is a little piece of logic in the VIC-III that makes it easier to modify pixels in the bitplanes.  Basically you provide it an X and Y coordinate, and it maps the corresponding bytes of the bitplanes to $D040-$D047, so that they can be manipulated, without having to do any crazy arithmetic to find them.  This makes the bitplanes on the C65 less cumbersome than they would have otherwise been. This is how the C65 specifications document describes the DAT:

                   Display Address Translator (DAT)

     The  C4567R6  contains  a special piece of hardware, known as the
Display  Address  Translator,  or DAT,  which allows the programmer to
access the bitplanes directly.  In  the  old  VIC  configuration,  the
bitmap  was  organized  as 25 rows of 40 stacks of 8 sequential bytes.
This  is  great  for  displaying  8  x 8 characters, but difficult for
displaying graphics.

     The  DAT overcomes the original burden by allowing the programmer
to  specify  the  (X,Y)  location of the byte of bitplane memory to be
read,  modified,  or  written.  This  is  done  by  writing  the (X,Y)
coordinates  to  the BPX and BPY register, respectively.  The user can
then  read,  modify,  or  write  the  specified  location  by reading,
modifying,  or  writing one of the eight Bitplane registers.  There is
one bitplane register for each bitplane.

     The  DAT automatically determines whether to use 320 or 640 pixel
mode,  and  whether to use 200 or 400 line mode.  It will also use the
areas   specified  for  the  bitplanes,  using  the  Bitplane  Address

Anyway, that is the purpose of the DAT, how it is implemented is rather interesting. The C65 specifications document tells us, somewhat cryptically:

DAT -- Display Address Translation

     Display  Address  Translation,  or  DAT fetches, are not actually
DMA-type accesses, but rather CPU address redirections to RAM. In this
case,  the  unmultiplexed  address  bus  is totally separated from the
multiplexed address bus.

That is, it suggests that the VIC-III somehow causes the target address to be re-written, when the CPU accesses $D040-$D047. To better understand what is going on here, one has to understand how the VIC-III is the gate-keeper of all RAM accesses on the C65.  While the 4502 is an 8-bit processor, the VIC-III is a partly 16-bit graphics chip.  More specifically, it has two 8-bit wide data buses, D and E, each of which is normally attached to 64KB of RAM.  In order to have enough bandwidth for the bitplane modes, it can simultaneously fetch from both the D and E buses. Because it uses DRAM, and also looks after the CAS/RAS selection, the whole process is somewhat complicated.

However, as the DAT fetch description above describes, it is possible for the VIC-III to completely substitute the address that the CPU provides, for one that it has provided internally.  This indeed works very nicely to allow the DAT to be a very simple piece of hardware in the VIC-III.

So, now the question is how to implement the DAT on the MEGA65.  We don't have this bus arrangement on the MEGA65: The CPU has direct access to memory.  Thus we can't implement the DAT solely in the VIC-III. The CPU itself does have memory address re-writing capability, however, so we could implement it there.

The question then becomes whether to calculate the bitplane addresses and offsets in the VIC-III, and export them to the CPU, or to calculate them in the CPU, which means the CPU has to sniff VIC-III/IV register writes, in order to know what mode we are in, and where the bitplanes are located in memory.

It is probably simpler and safer to export the bitplane start addresses and video mode flags to the CPU, and have it do the address computation on those.  This requires 8 bitplanes x 3 address bits per bitplane = 24 bits of address information, plus the H640 and V400 flags. In reality, we need 2x24 bits of address, because the 400 pixel high modes are interlaced, and pull data from two separate sources in that mode.  In both cases, the CPU then needs to calculate INT(Y/8)*(320/8*8)+ (Y and 7)+INT(X/8)*8 + bitplane start address for 320x200 bitplanes.  For the V400 modes with interlace, we need to use Y/2 instead of Y, and to use the bottom bit of Y to pick the odd or even bitplane address. For the H640 modes, we simply change the 320/8*8 to 640/8*8.

I could have written those as simply 320 or 640 instead, however, I wanted to make transparent that the addresses are based on C64-style bitmap addressing, where each 8x8 character-sized cell is formed from 8 bytes. Thus the pixel at (0,0) (measured from the top-left of screen) will be in byte 0, while the pixel at (8,0) will be in byte 8, because after every 8 pixels in the X direction, you have to skip 8 bytes, not one, because you are moving to the next character cell. If that sounds confusing, search for a tutorial on the C64 bitmap mode.

So, taking all of that into account, this means that the CPU needs to look at the DAT X and Y positions, H640 and V400, and from those, it can calculate the offset into a bitplane, and whether it is the odd or even bitplane as the source. We really want to avoid multiplication by big numbers in the CPU, so we can instead calculate the bitplane offset as:

offset = INT(Y/8)*256 + INT(Y/8)*64+ (Y and 7) + INT(X/8)*8

While we still have some divides and multiplies here, they are all powers of two, so can be done by simply shifting bits left and right as required.
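
To make the arithmetic concrete, here is a small Python model of the offset calculation, written both with the multiply from the earlier formula and with shifts and adds only. It is a sketch under the assumptions stated in the text (C64-style bitmap layout, Y halved and its bottom bit selecting the odd or even bitplane in V400 mode); the function names are made up, and the 640 = 512 + 128 decomposition for H640 is my extension of the 320 = 256 + 64 trick above.

```python
def dat_offset_reference(x, y, h640=False, v400=False):
    """Offset using the multiply form from the text (C64-style bitmap layout)."""
    width = 640 if h640 else 320
    odd_field = bool(y & 1) if v400 else False
    if v400:
        y //= 2  # interlaced: each field holds every second line
    return (y // 8) * (width // 8 * 8) + (y & 7) + (x // 8) * 8, odd_field


def dat_offset_shifts(x, y, h640=False, v400=False):
    """Same result using only shifts and adds, as hardware would prefer."""
    odd_field = bool(y & 1) if v400 else False
    if v400:
        y >>= 1
    row = y >> 3  # character row, INT(Y/8)
    if h640:
        row_start = (row << 9) + (row << 7)  # row * 640 = row*512 + row*128
    else:
        row_start = (row << 8) + (row << 6)  # row * 320 = row*256 + row*64
    return row_start + (y & 7) + ((x >> 3) << 3), odd_field


# Check that the two forms agree for every byte position in every mode.
for h640 in (False, True):
    for v400 in (False, True):
        for y in range(400 if v400 else 200):
            for x in range(0, 640 if h640 else 320, 8):
                assert (dat_offset_shifts(x, y, h640, v400)
                        == dat_offset_reference(x, y, h640, v400))

print(dat_offset_shifts(8, 0))  # (8, False): pixel (8,0) lives in byte 8
```

The returned offset still has to be added to the start address of the selected (odd or even) bitplane to get the final memory address.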

As is often the case, by explaining something, it is possible to find improvements.  Whereas I had previously figured I would need to sniff the DAT X and Y positions, I can instead calculate those in the VIC-IV, and export only the bitplane offset and odd/even flag to the CPU, together with the bitplane addresses.  Then the CPU will know all that it needs to, in order to redirect memory accesses to $D040-$D047 to the appropriate place in the bitplanes.

Now that I have thought out how I can implement the DAT, I have added it to my list of tasks.

Saturday 23 December 2017

Fixing IRQ following BRK / BoulderMark score = 102x stock C64

The BRK instruction on the 6502 is effectively just a software-triggered IRQ. The only difference is that a special bit gets set in the processor flags when pushed on the stack, and that BRK increments the program counter by two.
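
The difference can be sketched as a tiny Python model of what gets pushed on the stack. This is an illustration of the documented 6502 behaviour described above, not the MEGA65's VHDL; the function name is hypothetical, and for BRK the pc argument is taken to be the address of the BRK opcode, while for an IRQ it is the address of the next instruction to run.

```python
B_FLAG = 0x10  # the "break" bit in the pushed copy of the processor flags


def interrupt_push(pc, flags, is_brk):
    """Return the bytes pushed on the stack, in order: PC high, PC low, flags."""
    if is_brk:
        ret = (pc + 2) & 0xFFFF      # BRK pushes the address of the opcode plus two
        pushed_flags = flags | B_FLAG
    else:
        ret = pc                     # a hardware IRQ pushes the PC unchanged
        pushed_flags = flags & ~B_FLAG
    return [(ret >> 8) & 0xFF, ret & 0xFF, pushed_flags & 0xFF]


print([hex(b) for b in interrupt_push(0x1234, 0x00, is_brk=True)])   # ['0x12', '0x36', '0x10']
print([hex(b) for b in interrupt_push(0x1234, 0x00, is_brk=False)])  # ['0x12', '0x34', '0x0']
```

Because the two cases differ by exactly two bytes in the pushed return address, treating one as the other gives an address that is off by two.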

I had been seeing a problem for a while now, where some instructions would not work reliably. I thought that I had a bug in the branch instructions, but it turns out the problem was in BRK, which is used in the Lorenz test programs for branches, to find out the address a branch has gone to.  Basically these tests would sometimes fail. But only sometimes. Intermittent bugs can be a real pain, and this was no exception.

First, I instrumented the Lorenz test program for BNE to report the expected and actual branch address, and then to loop forever in the test.  I noticed that it was always 2 bytes later than expected.  My first guess was that the program counter was being mistakenly incremented during a branch instruction under some special condition.  By accident, I discovered that it was dependent on an IRQ occurring.

This really was a piece of luck, as I had disabled IRQs on a Nexys4 MEGA65, so that I could single-step without always ending up in the IRQ routine.  I found that I couldn't make the problem occur in that mode, even if I set the CPU free running at 50MHz overnight. 

This was a major clue to tracking down the problem, although I still thought the problem was in the branch instruction not setting some program counter control flag correctly. But then I remembered that BRK is a software interrupt, and if it got tangled up with a real IRQ, and I wasn't handling things properly, then things could go bad, in exactly the way that I was seeing.

So, after having bashed my head against this bug on and off for a few months, it took two lines of VHDL to make sure that an IRQ could never get confused with a BRK instruction.

What was particularly strange about this bug in the end, is that I had noticed that at some point along the line the BoulderMark benchmark for the C64 had stopped working. This is part of what made me suspect it was the branch instructions, as I could see that faulty branching could certainly make things crash -- although why it should only happen in some programs and not others was a mystery.

Now, the situation has been turned on its head: I have fixed the BRK instruction's interaction with IRQs, and not modified the operation of any other instruction, and suddenly BoulderMark is working again -- even though I have confirmed that it never uses BRK anywhere.  It is a bit dissatisfying, actually, to not know exactly why this has fixed BoulderMark, as it leaves a niggly fear that there might be some other subtle bug lurking still in all this.  But, for now I am just happy that BoulderMark runs again.  While the few changes to the CPU since it was last working stably have been relatively minor, it was still nice to see that the MEGA65 has now ticked over to yielding a BoulderMark score that is >100x that of a stock C64, as you can see below.

Friday 22 December 2017

Automatic VIC-II Compatibility Regression Testing - Part 1

One of the challenges of implementing the MEGA65 is to be sure that we have compatibility with the VIC-II and other custom chips in the C64.  This gets more complex as we fix bugs and add missing functions, as we might accidentally introduce a regression. 

This is of course why test cases are very useful, and there are a bunch of test cases as part of the excellent VICE emulator suite. However, from the ones I have looked at, they are rather specialised, and there is no trivial way to run them all, and get informed when one or more of them fails.  (It is possible that I have the wrong end of the stick here, and that this can be done with VICE, and I would welcome correction on this point.)

I have decided to fix this state of affairs, and write a single program that runs through a large number of function and compatibility tests, and provides meaningful failure messages, so that whenever we build a bitstream, we can check to make sure that nothing has gone backwards.

My expectation is that this program will grow in fits and starts according to our needs, and may end up being a set of automatically chaining test programs, similar to the 6502 test suite for the C64, that has programs that test a specific instruction, and then load the test for the next instruction when done.

Anyway, I have begun implementing the first tests to help track down a problem with sprite positioning in the new 800x600 video modes.  It starts by checking sprite-to-sprite collision and then sprite-to-character collision, in such a way that it can work out where sprites are being drawn, relative to the text display.

The program purposely is C64 compatible, so that I can make sure that the tests run the same on a real machine (or in VICE for the simpler stuff, until I get an SD-IEC or Ultimate 1541).  Here is what it looks like running in VICE so far:

The test that fails might actually be wrong -- I am currently researching to find out if a real C64 detects sprite collisions that happen behind the borders. My gut feeling is that it probably does, but I am not yet sure.

Anyway, it is a starting point, and I will post an update when it is somewhat more matured.

Thursday 21 December 2017

Instruction-Level Timing Accuracy is here

It is summer time here in Australia, which means that it is time for holidays for me.  For those of you reading from the Northern Hemisphere, the idea of Christmas in summer may sound crazy. However, I can assure you that this is not the case. It is in fact COMPLETELY INSANE. You have all the rush of Christmas, which should be happening when the weather outside is horrible, and the days short and dark, instead happening just when you really want to be going on holidays instead. So 1/3 of our summer (seasons start on the 1st down here rather than the 21st because of the lower thermal mass of the Southern Hemisphere due to the lack of large continents) disappears into the stress of Christmas, including several days of food-induced coma caused by trying to sustain our European traditions of having a big hot meal for Christmas, even though it might be 42C outside.  Then, come late June, when our weather is at its worst (but this is Australia, so it just means 8 hours of daylight, and daytime temperatures around 10C, and nights as cold as 3C. We do however get some pretty nasty wind courtesy of the same lack of large land masses in this half of the world), we have not even a single public/bank holiday for almost four months.

Anyway, we nonetheless survive, and it means I have some time to potter away on finishing off the core functionality of the MEGA65.  Thus, there will hopefully be a number of posts over the next few weeks, before I have to dive head-long back into work.

What I have been working on the last few days is getting the new video modes sorted out once and for all (this has held things up for about a year now), and getting accurate CPU timing, at least at the instruction level (cycle-accurate timing within instructions will come a bit later).

So, today's screen shots are of SynthMark64 running on the latest version of the MEGA65 with the CPU at 1MHz, 2MHz (C128-style "fast" mode), C65 native 3.5MHz mode, and full-speed at 50MHz.

I had added a framework for assigning instruction cycle counts some time ago, but hadn't had the opportunity to tune it for some time.  So, while it was close-ish, it was still out by up to 25% for some instruction types. However, after spending a few hours tracking down the problems, including realising I was mistakenly adding the one cycle penalty to relative branches only when the branch did not cross a memory page, instead of when it did cross a memory page, I had it pretty much right, as you can see below:
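
The corrected branch rule can be written down as a short Python sketch. This uses the standard documented 6502 timings (2 cycles when not taken, 3 when taken, 4 when taken across a page boundary) rather than anything copied from the MEGA65 source, and the function name is made up for illustration.

```python
def branch_cycles(pc, offset, taken):
    """Cycle count for a 6502 relative branch.

    pc is the address of the branch opcode, offset is the signed 8-bit
    displacement (-128..127), and taken says whether the branch is taken.
    """
    if not taken:
        return 2
    next_pc = (pc + 2) & 0xFFFF            # address of the following instruction
    target = (next_pc + offset) & 0xFFFF
    # The extra cycle applies only when the target is in a different page.
    crossed_page = (next_pc & 0xFF00) != (target & 0xFF00)
    return 4 if crossed_page else 3


print(branch_cycles(0x1000, 5, taken=True))   # 3: stays within page $10
print(branch_cycles(0x10FD, 5, taken=True))   # 4: crosses from $10FF to $1104
```

The bug described above amounts to having the page-crossing test inverted, so that the common same-page case paid the penalty and the rare crossing case did not.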

For comparison, here is the MEGA65 running at full speed, approximately 51x faster, which makes sense, since the CPU clock speed is almost exactly that much faster than a PAL C64.  However, as you can see, there is quite a bit of irregularity. 

First, NOPs are listed as 124x faster. This is plainly impossible, as at 50MHz, this would require NOPs to take about 0.8 cycles each, which would take a very special CPU to do.  What I think is going on is that the adjustment in the calculation in SynthMark that subtracts the time it takes to set up the timer for the test is based on that code running at 1MHz, not 50MHz, so it makes the end result look faster.
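
The 0.8 cycle figure can be checked with a few lines of Python. This is just the arithmetic spelled out, assuming a PAL C64 clock of 985,248 Hz and the usual 2 cycles for a NOP on the 6502; the names are made up for illustration.

```python
PAL_C64_HZ = 985_248      # PAL C64 CPU clock in Hz
MEGA65_HZ = 50_000_000    # full-speed MEGA65 clock quoted in the text
NOP_CYCLES_6502 = 2       # a NOP takes 2 cycles on a 6502


def implied_cycles(speedup):
    """How many 50MHz cycles a NOP would take if it really ran `speedup` times faster."""
    c64_seconds = NOP_CYCLES_6502 / PAL_C64_HZ
    mega65_seconds = c64_seconds / speedup
    return mega65_seconds * MEGA65_HZ


print(round(MEGA65_HZ / PAL_C64_HZ, 1))   # 50.7: the raw clock ratio
print(round(implied_cycles(124), 2))      # 0.82: less than one cycle per NOP
```

A speed-up of about 51x corresponds to a NOP taking the same 2 cycles at the faster clock, so anything far beyond that has to come from fewer cycles per instruction or from a measurement artefact.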

Second, there are some instructions which are simply faster on a clock-for-clock basis on the MEGA65. In particular, function calls are faster because JMP and RTS are a cycle faster each, and register ops are quite a bit faster, because things like INX are single cycle on the 4502.

Third, there are some instructions that are slower, in particular loads, and especially things that load from colour RAM. This is because all load instructions currently take one cycle more on the MEGA65 than on a real 6502 or 4502.  While rather annoying, it isn't a super high priority for me to fix right now.  The MEGA65 is already very very fast.

Back to testing the C64 compatibility modes, we have 2MHz mode. This doesn't exist on the C65, but since enough software for the C64 tries to use C128 2MHz mode, it has long since been implemented in the MEGA65. Basically, if the M65 is in C64 mode, then it emulates $D030.  In C65 mode, $D030 is replaced by the VIC-III memory banking register. Anyway, as for 1MHz mode, we see that the result is pretty much spot on.

So then I tested 3.5MHz mode. Here, things are a bit different, because the C65 tries to match 6502 cycle timing when running at 1MHz for compatibility. This means in practice that it adds a dummy cycle to a bunch of instructions that are otherwise a bit quicker on the 4502. This means that I have to have two separate tables of instruction cycle counts, one for "pseudo 6502 mode" and one for "native 4502 mode".  Note that this is independent of the true 6502 vs 4502 mode select that the MEGA65 will soon be getting. Instead, it is just about adjusting the run time of the legal 6502 instructions that are a sub-set of the 4502 instruction set.

I hadn't touched all this for ages, so I wasn't particularly surprised to see some odd things in the result. First of all, NOPs were still only 3.5x faster, not 7x faster, as we would expect if NOPs were really now single cycle, as they are on the 4502.

What we do see is that all the read-modify-write instructions are now a bit faster, because the 4502 doesn't do the dummy write of the original value, as happens on the 6510. As I have discussed previously, this was one of the big sources of incompatibility on the C65, because you could not do INC $D019, ASL $D019 or any of the other read-modify-write instructions to reset a raster interrupt.  The MEGA65 has long since treated $D019 as a special case for these instructions, and then, and only then, does it spend the extra cycle doing the dummy write.
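
To see why the dummy write matters, here is a toy model of a write-1-to-clear interrupt latch like $D019. It is a simplified sketch, not the VIC-II or MEGA65 logic: it assumes the unused bits read back as 1, that bit 7 is set whenever any source is pending (ignoring the enable mask in $D01A), and all names are invented for illustration.

```python
class IrqLatch:
    """Toy write-1-to-clear interrupt latch, loosely modelled on $D019."""

    def __init__(self, pending):
        self.pending = pending & 0x0F       # bit 0 = raster interrupt

    def read(self):
        irq = 0x80 if self.pending else 0x00
        return irq | 0x70 | self.pending    # unused bits assumed to read as 1

    def write(self, value):
        # Writing a 1 to a bit acknowledges (clears) that interrupt source.
        self.pending &= ~value & 0x0F

def inc_with_dummy_write(reg):
    """6502/6510 style INC: read, write the old value back, write the new value."""
    old = reg.read()
    reg.write(old)                 # dummy write of the unmodified value
    reg.write((old + 1) & 0xFF)

def inc_without_dummy_write(reg):
    """4502 style INC: read, then write only the new value."""
    old = reg.read()
    reg.write((old + 1) & 0xFF)

a = IrqLatch(0x01)
inc_with_dummy_write(a)
print(a.pending)   # 0: the dummy write of $F1 acknowledged the raster IRQ

b = IrqLatch(0x01)
inc_without_dummy_write(b)
print(b.pending)   # 1: only $F2 was written, so bit 0 is still pending
```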

I then took a look at the instruction cycle count table in the MEGA65's CPU, and saw that I had basically copy-pasted the 6502 one, and changed only a few instructions.  So I spent a half-hour or so updating my table.  With that done, magically the instruction timings were now much healthier looking:

This reminds me of the claims that were made about the 4502, including in the link above, that code could run "up to 25% faster" because of the instruction timing improvements.  What we see in SynthMark64 more or less confirms this: We get 4.25x instead of 3.5x, which equates to about a 21% improvement. Of course, this depends on the exact instruction mix you might be running. In any case, it seems that a claim of 25% speed-up is not unreasonable.
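
The percentage is just the ratio of the two SynthMark64 results, as this small sketch shows (the function name is only for illustration):

```python
def relative_improvement(measured, baseline):
    """Percentage gain of `measured` over `baseline`."""
    return (measured / baseline - 1.0) * 100.0

# 4.25x measured versus the 3.5x expected from clock speed alone.
print(round(relative_improvement(4.25, 3.5), 1))  # 21.4
```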

Anyway, I am happy that the CPU speeds are now accurate enough for initial use, e.g., for working with the IEC serial bus, including most fast-loaders.

The other piece that the pictures show, but is not immediately apparent, is that we have the new 800x600 based video modes working, in both 50Hz and 60Hz. This is changeable at run time via a register, so you can select PAL or NTSC operation, and get the correct frame rate, and thus the correct interrupt rate for music and game speed.  There are still a few remaining wrinkles to work through on this, but it is mostly working.  Once we have the modes completely settled, I will post about it.

Sunday 10 December 2017

Automatic 4502 / 6502 Instruction Set Switching

We have known for a long time that we need to support 6502 illegal opcodes on the MEGA65.  Initially, we thought that this would affect only a very small percentage of C64 software. However, it seems that a reasonable fraction of software has trouble with illegal opcodes. Perhaps it is one or more of the common decrunch routines.  Or it could just be that we have some subtle bug in our 4502 implementation that means some 6502 instructions sometimes go astray -- even though we pass the runnable 6502 instruction test suite for all official op-codes.

In any case, we know we need to add "6502 mode" to our CPU.  In fact, the bulk of the work was done quite some time ago, but I am only just now getting around to testing it.

Give or take getting the precise behaviour of the illegal op-codes correct, it wasn't too hard to add a second personality to the CPU: It is just some of the instruction fetch and decode logic that needed to be duplicated. Then the CPU needed a flag to indicate which mode to be in.

Then it should just be a case of working out when we are in C64 mode versus C65 mode, and setting the CPU mode accordingly, right? Unfortunately not.

This is because the C65's C64-mode KERNAL ROM uses 4502 opcodes to work out whether it should talk to the internal 1581/1565 drive, or to a drive on the IEC bus.

There are a few key parts of the routine we need to worry about (this is from the 910111 version of the C65 ROM):

$F72C - Context switch to C65 DOS
$F83E - Context switch back from C65 DOS on return from DOS call

The context switch to C65 DOS and back are fairly similar, and worth a quick look. First, context switching to the C65 DOS:

F72C   78         SEI
F72D   48         PHA
; C65 IO / VIC-III mode enable sequence
F72E   A9 A5      LDA #$A5      
F730   8D 2F D0   STA $D02F
F733   A9 96      LDA #$96
F735   8D 2F D0   STA $D02F     
; set bit 6 in $D031 to put CPU at 3.5MHz
F738   A9 40      LDA #$40
F73A   0C 31 D0   TSB $D031     
; bank in $C000 interface ROM and remove CIAs from IO map
; so that 2KB of colour RAM is visible $D800-$DFFF
F73D   A9 21      LDA #$21
F73F   0C 30 D0   TSB $D030
; Save registers from C64 mode, so that they can be restored
; on return
F742   68         PLA
F743   8D F6 DF   STA $DFF6
F746   8E F7 DF   STX $DFF7
F749   8C F8 DF   STY $DFF8
F74C   9C F9 DF   STZ $DFF9
; Now pull the return address from the stack, and save that, too.
F74F   68         PLA
F750   8D FB DF   STA $DFFB
F753   68         PLA
F754   8D FC DF   STA $DFFC
; Remember what the stack pointer was
F757   BA         TSX
F758   8E FF DF   STX $DFFF
; Rearrange memory map:
; Map $0000-$1FFF to $10000-$11FFF 
; Map $8000-$BFFF to $20000-$23FFF 
; (C64 KERNAL stays visible at $E000-$FFFF)
F75B   A9 00      LDA #$00
F75D   A2 11      LDX #$11   ($0000+$10000)
F75F   A0 80      LDY #$80
F761   A3 31      LDZ #$31   ($8000+$18000)
F763   5C         MAP        ; activate new map
; We are now in C65 DOS memory map
; Set stack pointer to $1FF
F764   A2 FF      LDX #$FF
F766   9A         TXS
; Load the saved return address, and put it into the C65 DOS
; stack
F76A   48         PHA
F76E   48         PHA
; Restore all the saved registers
F76F   AD F6 DF   LDA $DFF6
F772   AE F7 DF   LDX $DFF7
F775   AC F8 DF   LDY $DFF8
; Return to the return address that had just been copied to this stack
F77B   60         RTS

The general procedure of this routine is quite interesting. The last few bytes of colour RAM (it is 2KB long on the C65, and is visible at $D800-$DFFF when the CIAs are banked out of the way) are used as a scratch transfer area.  The contents of the registers are saved, so that they are available to the DOS function that has been called. The only piece of mild gymnastics going on is the way that the return address of the caller is copied from the C64 stack to the C65 DOS stack.  The C64 KERNAL is kept visible, so that the RTS will continue in the C64 KERNAL routine at that point, which then makes the indirect jump into the C65 DOS to do the necessary work.  There is no risk of interrupts happening while in this mode, as IRQs and NMIs are both disabled by the MAP instruction, until a NOP instruction is executed.
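
The register values loaded before the MAP instruction can be decoded with a short sketch like the one below. It reflects my understanding of the MAP register layout (A and Y hold offset bits 8-15, the low nibbles of X and Z hold offset bits 16-19, and the high nibbles of X and Z enable mapping for each 8KB block); treat that layout, and the function name, as assumptions rather than a reference.

```python
def decode_map(a, x, y, z):
    """Return {cpu_block_start: mapped_start} for each 8KB block MAP enables."""
    lower_offset = ((x & 0x0F) << 16) | (a << 8)   # applies to $0000-$7FFF
    upper_offset = ((z & 0x0F) << 16) | (y << 8)   # applies to $8000-$FFFF
    mapping = {}
    for block in range(8):
        if block < 4:
            enabled = (x >> (4 + block)) & 1
            offset = lower_offset
        else:
            enabled = (z >> block) & 1             # bits 4-7 of Z, blocks 4-7
            offset = upper_offset
        if enabled:
            base = block * 0x2000
            mapping[base] = (base + offset) & 0xFFFFF
    return mapping

# Values from the routine above: A=$00, X=$11, Y=$80, Z=$31
for cpu, mapped in decode_map(0x00, 0x11, 0x80, 0x31).items():
    print(f"${cpu:04X} -> ${mapped:05X}")
```

With those values, only the blocks at $0000, $8000 and $A000 are remapped, which leaves the C64 KERNAL at $E000-$FFFF untouched, as the comments in the listing say.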

This routine takes 21 ~1MHz clock cycles (26 including the JSR) before the CPU is switched to 3.5MHz, and then 85 more clock cycles at 3.5MHz. The total cost of switching to the C65 DOS context is thus about 50 microseconds.

The return routine is similar, essentially reversing the process, although the handling of the two different return addresses at $DFFB/C versus $DFFD/E is still not entirely clear to me.

; Pop return address of caller and save, ready for copying to C64
; stack.
F83E  68          PLA
F842  68          PLA
F843  8D FE DF    STA $DFFE
; restore C64 memory map and stack pointer for the original
; C64-context caller (61 cycles at 3.5MHz)
F846  20 7C F7    JSR $F77C
; clear bit 7 in $C0, indicating not a C65 internal drive
F849  77 C0       RMB7 $C0  
F84B  6B          TZA
F84C  10 0C       BPL $F85A
F84E  A9 00       LDA #$00
; set bit 7 in $C0, indicating a C65 internal drive
F850  F7 C0       SMB7 $C0 
; Push return address back onto C64 stack 
F855  DA          PHX
F859  DA          PHX
; Set bits in $90 (status) if required from A
F85A  04 90       TSB $90   
; Restore CPU registers
F85C  AE F7 DF    LDX $DFF7
F85F  AC F8 DF    LDY $DFF8
F862  AB F9 DF    LDZ $DFF9
F865  AD F6 DF    LDA $DFF6
; bank out $C000 ROM and bank CIAs back in.
F868  48          PHA
F869  A9 21       LDA #$21
F86B  1C 30 D0    TRB $D030
; return CPU to 1MHz. 
F86E  A9 40       LDA #$40
F870  1C 31 D0    TRB $D031
; return to VIC-II mode
F873  8D 2F D0    STA $D02F
; Re-restore Accumulator
F876  68          PLA
; Re-enable IRQ & NMI after MAP change made in call to $F846
F877  EA          EOM       
F878  58          CLI
F879  18          CLC
F87A  60          RTS

Because the C65 IO mode and 3.5MHz CPU mode are already active, the cost is 56 cycles at ~3.5MHz (= ~16 microseconds) for the call to $F77C, plus 70 cycles at 3.5MHz (= ~20 microseconds) and 15 cycles at ~1MHz for the routine itself.  The total cost for switching from C64 mode to C65 DOS and back again -- without actually doing any work -- is thus 50 microseconds for the switch to the C65 DOS context, plus 16 + 20 + 15 = 51 microseconds to switch back, i.e., ~0.1 millisecond.  (I have counted both code paths for the stack restoration, as I am not entirely sure what is going on there.  However, that only makes about 5.5 microseconds difference.)
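
The arithmetic can be reproduced with a few lines, using the cycle counts quoted above and nominal 1MHz and 3.5MHz clocks (the real clocks differ slightly, so this is a sketch only). The last figure is derived here, not measured: it is the ceiling that the context switches alone would impose if one round trip is paid per byte.

```python
SLOW_HZ = 1_000_000     # nominal "1MHz" clock, an assumption
FAST_HZ = 3_500_000     # nominal "3.5MHz" clock, an assumption

def micros(cycles, hz):
    """Convert a cycle count at a given clock rate to microseconds."""
    return cycles * 1e6 / hz

# Switch to the C65 DOS context: 26 slow cycles (including JSR) + 85 fast.
enter = micros(26, SLOW_HZ) + micros(85, FAST_HZ)

# Switch back: call to $F77C (56 fast), routine itself (70 fast + 15 slow).
leave = micros(56, FAST_HZ) + micros(70, FAST_HZ) + micros(15, SLOW_HZ)

round_trip = enter + leave
print(round(enter), round(leave), round(round_trip))   # 50 51 101

# Upper bound on throughput if every byte costs one full round trip.
print(round(1e6 / round_trip))                          # 9873
```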

This convoluted process explains why the C65 DOS is so incredibly slow, achieving less than 2KB/second even though the internal floppy drive can theoretically read at 30KB/sec: there is a complete and time-consuming context switch whenever a byte is read from the internal floppy drive. I've been tempted to patch the C65 DOS to either support an efficient LOAD system call, or to implement some sort of buffering for sequential reads.  I've even thought about teaching the CPU about these context switch routines, and basically having a special case for them, that does the context switch in just one or a few cycles. However, that is a job for another day. In the meantime, on an M65, running the CPU at 50MHz is an easy interim solution, as loading can happen at some tens of KB/second, even with this inefficiency.

Meanwhile, back on the topic of CPU personality selection, those two routines are not particularly troublesome, as they only use 4502 opcodes after using $D02F to enable C65 / VIC-III IO mode, which we can use as a reliable clue to switch the CPU to 4502 mode.  This gives us our first rule:

(1) If the C65 / VIC-III (or M65 / VIC-IV) IO mode is enabled, the CPU should always be in 4502 mode.

The routines to enter the various DOS calls are, however, a little troublesome.   They are all very similar, so I present just one of them here as an example:

F7E4 FF C0 09 BBS7 $C0,$F7F0 ; Is current device C65 DOS?
F7E7 20 2C F7 JSR $F72C ; Context switch to C65 DOS memory map
F7EA 22 0A 80 JSR ($800A) ; Call C65 DOS TALK routine
F7ED 20 3E F8 JSR $F83E ; Context switch back to C64 memory map
F7F0 4c C7 ED JMP $EDC7 ; Send TALK on IEC bus in all cases

The BBS7 (opcode $FF) and indirect JSR (opcode $22) instructions are 4502 instructions.

Opcode $22 is normally a KIL instruction on the 6502, so we could safely make that do the new indirect JSR at all times, without great risk. $FF is normally ISC $nnnn,X, where ISC is the combination of INC and SBC. I have no idea if people find uses for that opcode.  However, we don't need to worry about that, because these opcodes only exist in the C64-mode KERNAL on the C65.  Thus, we can simply switch the CPU to 4502 mode whenever executing code in the KERNAL in C64 mode.  This gives us our second rule:

(2) When executing code in the KERNAL (ROM at $E000-$FFFF), the CPU should always be in 4502 mode.

So, if the CPU were in 6502 mode in C64 mode, then these two rules would ensure that the C65 internal DOS would work.  So that's good.

Now the trick is we just need to work out when we are in C64 mode and when we are in C65 mode.

First, a C65 starts up in C64 mode, and then escalates to C65 mode if it doesn't see why it should stay in C64 mode. That is, the machine always starts in C64 mode.  Fortunately, the switch to C65 mode does enable C65 / VIC-III IO mode, so it is possible that we need no further rules.

Finally, when the Hypervisor is running, the CPU should always be in 4502 mode.

(3) When executing code in Hypervisor mode, the CPU should always be in 4502 mode.

So, in theory at least, that should be all we need to automatically set the CPU personality, in a way that maximises compatibility, i.e., where C64 programs get a 6502-compatible CPU, unless they ask for something different, but the C65's C64-mode KERNAL runs in 4502 mode, so that the rather ugly DOS inter-process communications can still happen.
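
The three rules can be summarised in a short sketch. This is only a model of the decision as described in this post, not the actual CPU logic, and the function and parameter names are invented for illustration.

```python
def cpu_personality(hypervisor_mode, c65_io_enabled, pc, kernal_rom_mapped=True):
    """Pick the CPU personality according to rules (1) to (3) above."""
    if hypervisor_mode:
        return "4502"                       # rule 3: Hypervisor code
    if c65_io_enabled:
        return "4502"                       # rule 1: C65 / VIC-III (or VIC-IV) IO mode
    if kernal_rom_mapped and 0xE000 <= pc <= 0xFFFF:
        return "4502"                       # rule 2: executing in the KERNAL ROM
    return "6502"                           # plain C64 program

# A C64 program at $0810 gets the 6502 personality...
print(cpu_personality(False, False, 0x0810))   # 6502
# ...but the KERNAL DOS entry stub at $F7E4 runs as a 4502.
print(cpu_personality(False, False, 0xF7E4))   # 4502
```

Checking the rules in this order means a C64 program only ever sees the 6502 personality while it is running its own code with VIC-II IO mode selected.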

The present state of play is that this is all implemented, but not yet enabled by default or tested on the MEGA65.  This is one of the jobs on my list for the coming week.