Showing posts with label GS4510. Show all posts

Saturday, 8 November 2014

Easily Accessing All of Memory

As I have begun work on writing a general-purpose Unicode string renderer for the C65GS, it has got me right at the coal face of developing software to run on this machine, as compared to just making the machine. Creator and user have rather different needs and experiences.
Until now, I had simply assumed that I would use DMA and memory banking to provide access to all of memory.  Technically it provided everything that one could need.  Once I started to write software for the machine, it became immediately apparent that these methods were fine for accessing slabs of memory, but would be rather inconvenient for the normal use case of reading or writing some random piece of memory somewhere.
What I found was that if I had a pointer to memory and wanted to PEEK or POKE through that pointer, it was going to be a herculean task, and one that would waste many bytes of code and cycles of CPU to accomplish -- not good for a task that is the mainstay of software.
I was also reminded that the ($nn),Y operations of the 6502 are essentially pointer-dereference operations.  So I thought, why don't I just allow the pointers to grow from 16-bits to 32-bits.  Then one could just use ($nn),Y or ($nn),Z operations to act directly on distant pieces of memory.
Slight problem with this is that the 4502 has all 256 opcodes occupied, so I couldn't just assign a new one.  I would need some sort of flag to indicate what size pointers should be.  This had to be done in a way that would not break existing 6502 or 4502 code.
The experience of the 65816 led me to think that a global flag was not a good idea, because it makes it really hard to work out what is going on just by looking at a piece of code, especially where instruction lengths change.
So I decided to go for a bit of an ugly hack: If an instruction that uses the ($nn),Z addressing mode immediately follows an EOM instruction (which is what NOP is called on the 4502), then the pointer would be 32-bits instead of 16-bits.
While ugly, it seems to me that it should be safe, because no 6502 code uses ($nn),Z, because it doesn't exist. Similarly, there is so little C65 software that it is unlikely that any even uses ($nn),Z, and even less of it should have an EOM just before such an instruction.  
In fact, in the process of implementing 32-bit pointers, I discovered that ($nn),Z on the 4510 was actually doing ($nn),Y, among other bugs.  So clearly the C65 ROM mustn't have even been using the addressing mode at all!
Here is the summary of how this new addressing mode works in practice.  The text below is as it appears in the C65GS System Notes, which are being developed with the help of the community.

32-bit Memory Addresses using 32-bit indirect zero-page indexed addressing


The ($nn),Z addressing mode is normally identical in behaviour to ($nn),Y other than that the indexing is by the Z register instead of the Y register.  That is, two bytes of zero-page memory are used to form a 16-bit pointer to the address to be accessed. However, if an instruction using the ($nn),Z addressing mode is immediately preceded by an EOM instruction, then it uses four bytes of zero-page address to form a 32-bit address.  So for example:

zppointer: .byte $11,$22,$33,$04

ldz #$05
eom
lda (zppointer),Z

Would load the contents of memory location $4332216 into the accumulator.

LDA, STA, EOR, AND, ORA, ADC and SBC are all available with this addressing mode.

Memory accesses made using 32-bit indirect zero-page indexed addressing require three extra cycles compared to 16-bit indirect zero-page indexed addressing: one for the EOM, and two for the extra pointer value fetches.

This makes it fairly easy to access any byte of memory in the full 28-bit address space.  The upper four bits should be zeroes for now, so that in future we can expand the C65GS to 4GB address space.
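As a rough illustration of the address calculation described above, here is a small Python model. It is only a sketch and not code from the C65GS sources: the function name, the zero-page location $10 and the assumption that Z is added to the full 32-bit pointer (with carry into the upper bytes) are all hypothetical. Zero-page wrap-around is ignored.

```python
def effective_address_32(zero_page, zp_addr, z):
    """Form a 32-bit little-endian pointer from four zero-page bytes, then add Z.

    Assumption: the index is added to the whole 32-bit pointer.
    """
    pointer = int.from_bytes(bytes(zero_page[zp_addr:zp_addr + 4]), "little")
    return (pointer + z) & 0xFFFFFFFF


zero_page = [0] * 256
zp_pointer = 0x10  # hypothetical zero-page location of the pointer
zero_page[zp_pointer:zp_pointer + 4] = [0x11, 0x22, 0x33, 0x04]

address = effective_address_32(zero_page, zp_pointer, 0x05)
print(f"${address:07X}")  # same bytes and index as the example above
```

With the pointer bytes $11,$22,$33,$04 and Z = $05, this gives the same $4332216 as the example in the notes.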

Wednesday, 15 October 2014

Confirmed that I have fixed the sneaky CPU bug

This morning, after synthesis of the fix for the sneaky CPU bug, I had the chance to test it out.

Rayne's interlace test program now works, and his MUIFLI program is also closer to working, although it isn't showing the right data. But that could be due to FLI not working on the VIC-IV -- yet to be confirmed.

However, what it did also fix is BoulderMark.  So I can now present the latest result for the C65GS with this benchmark:


Notice that now the sprite appears (and that the sprite sitting in the border is also visible because sprites currently sit in front of the border).  Otherwise the display is just about perfect.  This image was captured via the VNC server video streaming interface (search previous posts to find out more about this).

Anyway, this all equates to 94x NTSC C64 or almost exactly 100x PAL C64.  Of course as I have mentioned before, BoulderMark is non-linear with fast accelerators, and so the real performance is much more likely to be the roughly 44x that SynthMark64 reports.

Tuesday, 14 October 2014

Found a sneaky CPU bug

While trying to run some graphic test programmes supplied by Rayne, I found that the CPU was mis-behaving in a way that reminded me of the bug I was seeing with BoulderMark, and probably Lemmings as well.  Basically all was fine until a raster interrupt occurred, and then things would go odd or outright crash.  What was extra odd was that BoulderMark would still run on the FPGA at work, but not on the one here at home, which shouldn't happen -- FPGAs shouldn't be picky like that.

Anyway, Rayne's programmes are much simpler, and offered the prospect of easily debugging what was going on.

So after a bit of poking around I discovered that the C65GS would go to lala-land after INC $D019.

This got me thinking, because $D019 is special in my CPU: RMW instructions that touch $D019, but not any other address, get an added dummy write.  This is to avoid wasting a CPU cycle on the dummy write of the original value back to memory, except when required for C64 compatibility.

The lack of this dummy write on the C65, that acts to clear the VIC-II interrupt on a C64, is one of the major sources of incompatibility between the C65 and C64, and stops the majority of software from running on it.  Thus I had gone to special effort to make sure it wouldn't be a problem on the C65GS, but without the CPU speed penalty of doing it on every address.
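To make it clearer why the dummy write matters, here is a simplified Python model of the interrupt latch register. It is a sketch under stated assumptions, not the VHDL: only the four low latch bits are modelled, writing a 1 to a latch bit clears it, the unused bits read back as 1, and bit 7 is ignored. The class and function names are made up for illustration.

```python
class IrqLatchRegister:
    """Simplified $D019: four interrupt latches, cleared by writing a 1."""

    def __init__(self, latches):
        self.latches = latches & 0x0F

    def read(self):
        # Assumption: unused bits 4-6 read as 1; bit 7 is not modelled.
        return 0x70 | self.latches

    def write(self, value):
        # Every 1 bit written acknowledges (clears) the matching latch.
        self.latches &= ~value & 0x0F


def inc_register(reg, dummy_write):
    """Model INC on the register, with or without the 6502-style dummy write."""
    old = reg.read()
    if dummy_write:
        reg.write(old)              # original value written back first
    reg.write((old + 1) & 0xFF)     # then the incremented value
    return reg.latches


with_dummy = inc_register(IrqLatchRegister(0x01), dummy_write=True)
without_dummy = inc_register(IrqLatchRegister(0x01), dummy_write=False)
print(with_dummy, without_dummy)
```

In this model the pending raster latch (bit 0) is only acknowledged when the original value is written back, because the incremented value no longer has that bit set.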

However, I had messed up the dummy write state in the CPU: it was not setting the target address on the bus, and so instead was writing to the last accessed memory location, which was the first byte of the following instruction.  The net result is that the old contents of $D019, usually $F0 or $F1, would get written to the next byte in the instruction stream.  I confirmed this in simulation, where the dummy write and final write can be seen marked in bold.  Note that the dummy write is going to $F60F, not $D019!

gs4510.vhdl:1685:11:@700ns:(report note): MEMORY reading $FFFF60C = $EE
gs4510.vhdl:1004:7:@700ns:(report note): MEMORY long_address = $FFFF60D
gs4510.vhdl:1685:11:@780ns:(report note): MEMORY reading $FFFF60D = $19
gs4510.vhdl:1004:7:@780ns:(report note): MEMORY long_address = $FFFF60E
gs4510.vhdl:1685:11:@860ns:(report note): MEMORY reading $FFFF60E = $D0
gs4510.vhdl:1004:7:@860ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@940ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1004:7:@940ns:(report note): MEMORY long_address = $FFD3019
gs4510.vhdl:1685:11:@1020ns:(report note): MEMORY reading $FFD3019 = $70
gs4510.vhdl:1004:7:@1020ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1100ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1304:9:@1100ns:(report note): writing to shadow RAM via chipram shadowing. addr=$000F60F
gs4510.vhdl:1689:11:@1140ns:(report note): MEMORY writing $000F60F <= $70
gs4510.vhdl:1689:11:@1180ns:(report note): MEMORY writing $FFD3019 <= $71
gs4510.vhdl:1004:7:@1180ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1260ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:738:9:@1260ns:(report note): $F60C EE 19 D0  inc  $D019         A:11 X:22 Y:33 Z:00 SP:01FF P:24 $01=3F MAPLO:0000 MAPHI:8F00   ..E-.I..   

So a quick fix and re-run simulation and suddenly we can see that it is all fixed:

gs4510.vhdl:1685:11:@700ns:(report note): MEMORY reading $FFFF60C = $EE
gs4510.vhdl:1004:7:@700ns:(report note): MEMORY long_address = $FFFF60D
gs4510.vhdl:1685:11:@780ns:(report note): MEMORY reading $FFFF60D = $19
gs4510.vhdl:1004:7:@780ns:(report note): MEMORY long_address = $FFFF60E
gs4510.vhdl:1685:11:@860ns:(report note): MEMORY reading $FFFF60E = $D0
gs4510.vhdl:1004:7:@860ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@940ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1004:7:@940ns:(report note): MEMORY long_address = $FFD3019
gs4510.vhdl:1685:11:@1020ns:(report note): MEMORY reading $FFD3019 = $70
gs4510.vhdl:1004:7:@1020ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1100ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1304:9:@1100ns:(report note): writing to shadow RAM via chipram shadowing. addr=$000D019
gs4510.vhdl:1689:11:@1140ns:(report note): MEMORY writing $000D019 <= $70
gs4510.vhdl:1689:11:@1180ns:(report note): MEMORY writing $FFD3019 <= $71
gs4510.vhdl:1004:7:@1180ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1260ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:738:9:@1260ns:(report note): $F60C EE 19 D0  inc  $D019         A:11 X:22 Y:33 Z:00 SP:01FF P:24 $01=3F MAPLO:0000 MAPHI:8F00   ..E-.I..   

Oops.. not actually all fixed.  It is now writing to $D019 in RAM, not IO.  Lucky I decided to write this blog post, or I wouldn't have spotted that I still had the memory write flags slightly messed up.  Specifically memory_access_resolve_address wasn't asserted, so the 16-bit address was not being translated to the physical 28-bit address.  Fix that and try again:

gs4510.vhdl:1004:7:@1020ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1100ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1689:11:@1140ns:(report note): MEMORY writing $FFD3019 <= $70
gs4510.vhdl:1689:11:@1180ns:(report note): MEMORY writing $FFD3019 <= $71
gs4510.vhdl:1004:7:@1180ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1260ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:738:9:@1260ns:(report note): $F60C EE 19 D0  inc  $D019         A:11 X:22 Y:33 Z:00 SP:01FF P:24 $01=3F MAPLO:0000 MAPHI:8F00   ..E-.I..

Ah, that's better!

Now to resynthesise, and see if BoulderMark, Lemmings and Rayne's MUIFLI all work properly.

Thursday, 4 September 2014

Boulder Mark at 48MHz, with working sound

Boulder Mark seems to work now, even though I can't think what I might have fixed that would have addressed the problem it was having.  It does still crash occasionally, but it usually runs to completion.  

So I thought I would post a video of boulder mark running on it to give an idea of how fast the C65GS is.  That turns out to be about 500 frames per second in this test, which is very nearly 100x the score of a stock C64.  The vacuum-like sound is boulder mark making a noise every frame.


As I have mentioned before, boulder mark is a non-linear benchmark, because the first few hundred frames are much more work to draw.

So it is probably a better comparison between the C65GS and Chameleon, since both are fast enough that the slow frames are only a small fraction of the score.

On that basis, the C65GS is about double the speed of the Chameleon, rather than the 4x that SynthMark would suggest.  This makes sense, since SynthMark places a fairly heavy weighting on operations that read from IO and colour RAM, even though they are relatively rare operations.  So, depending on the workload, the speed difference is likely to be somewhere between 2x and 4x, if using normal 6502 opcodes.

Friday, 8 August 2014

Déjà vu: C65 startup banner and dead BASIC, again.

The title says it all: The 48MHz CPU is almost working, but there is clearly something fruity going on.  Sadly it isn't the same issue as when I got to this point last time, but I have some extra clues, like BASIC in C64 mode also behaves similarly. There are also some other nuisances, like IRQs don't seem to be getting masked properly.  But it is getting pleasingly close.

I am also interested to know whether people think the display is better with little side borders, or whether people would prefer no side borders in 80 column mode.  I personally think that having little ones adds to the authentic feel, but without sacrificing acres of screen.  Also, the real C65 has quite narrow side borders compared with the C64.  But I would like to hear what you think.


Oh, and if you got this far, you might be able to spot the display glitch that I have to fix.  If you can, and can accurately explain the cause (I already know it), I'll greet you on the next screen shot I post on the blog.

Sunday, 27 July 2014

First speed test of 48MHz CPU

I am continuing to fight with getting the reimplemented CPU and VIC-IV all settled down, however things are getting much closer.

As the following image implies, it can run the C64 ROM (not yet C65 ROM -- it gets stuck in the DOS somewhere).  It should be noted that the CPU performance here is not final, and some instructions might end up faster or slower than depicted here.  That said, the CPU is certainly quite a bit faster than the old 32MHz one.


44.36x is almost exactly 48MHz/32MHz = 150% of the speed of the old CPU at 28.93x.  Pleasingly this is before I do anything to optimise the performance.  Also, whereas the old CPU filled the FPGA to capacity, with the new CPU about two-thirds of the FPGA remains free -- space for implementing sprites, a 1541 and other goodies as I get the chance.
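The comparison can be checked with a couple of lines of Python, using only the scores and clock speeds quoted above:

```python
old_score, new_score = 28.93, 44.36      # SynthMark64 ratings quoted above
old_clock_mhz, new_clock_mhz = 32, 48    # CPU clock speeds quoted above

score_ratio = new_score / old_score
clock_ratio = new_clock_mhz / old_clock_mhz
print(f"score ratio {score_ratio:.3f}, clock ratio {clock_ratio:.3f}")
```

The score ratio comes out a little above the clock ratio of 1.5, so the new CPU is doing very slightly better than the clock increase alone would explain.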

Speaking of optimisations, one that I may attack in the not too distant future is a stack cache that allows RTS to execute in just 1 cycle.  While it will make some impact on the SynthMark64 score, it is more interesting for real-life work loads where JSR/RTS are very common instructions.  But otherwise there is no caching anywhere in this -- it is all raw, predictable cycle times, which helps make it feel like a simple 8-bit computer, albeit a very fast one.

What isn't entirely obvious here is that keyboard input has broken for some reason, with my PS/2 keyboard reader failing to detect key-release events.  Also, for some reason the CIA interrupts are not always happening as often as they should.  Combined with not being able to use the C65 ROM, this means I had to side-load SynthMark64 via the serial monitor, start it directly from the serial monitor, and then use the serial-monitor to stuff the ENTER key-press into the keyboard buffer.  So there is still a bit to go, but at least it feels like I am getting somewhere.

Saturday, 26 July 2014

Debugging the new CPU

Debugging of the re-implemented CPU continues, and is hopefully close to complete.  I thought I would describe some of the process I have followed.

The real secret to debugging anything is discoverability, that is having the means to work out what is happening so that you can tell not only whether the result is correct, but how the result is being calculated.

With a hardware design this can be rather annoying, because the time it takes to make a trivial change, resynthesise and test the design can be of the order of an hour.  Assuming that you correctly expose the thing you are trying to debug, that means you can examine (and hopefully fix) at most a dozen or so defects per full day of effort. Not Good.

Fortunately there are simulation tools for VHDL that let you debug without having to go through the whole synthesis process, and thus reduce the time to examine a defect from hours to minutes.  While this has limitations, for example, to debug the SD card interface in simulation I would need to write an SD card simulator, it is extremely useful, and I have made extensive use of the free and open-source simulation tool, ghdl.

The processor redesign basically consisted of gutting out the first implementation of the CPU and leaving just the shell that accesses the memory and interfaces with the serial monitor, which I have described in previous posts.  The serial monitor is extremely useful, because it allows reading and writing of all memory, as well as examining the processor state, and single-stepping the processor.

The first part was to re-do the serial monitor interface, because this needed an overhaul for the new processor architecture.  This was rather tricky, because simulating a serial connection feeding various commands in would take a fair bit of work, and the time scales of serial input mean that simulation would be rather slow anyway.  So as a result I used some of the LEDs on the FPGA board to provide some useful debugging output, and worked as carefully as I could to make sure that the code was likely to work.

The second and related step was getting the memory access stuff working again, and accessible via the reworked serial monitor interface.

These two steps took much longer than I had hoped, and were really frustrating.  In retrospect, it might well have been easier to make a simulator for serial input and use ghdl simulation to shorten the process a bit.

After this, I set about implementing a few simple instructions so that I could get single-stepping of the CPU through the serial monitor working.  This also turned out to take way longer than I would have liked, partly because the new CPU architecture uses 6502-style end-of-instruction pipelining which really complicated single-stepping.  I did get it working in the end.

Then it was on to implementing LDA, STA, JMP and a few other instructions to allow the writing of simple little test programs to confirm that the CPU was generally working.  At this point ghdl was useful to allow quick testing of the instructions and their interactions.

In the process of doing this, I realised that the debug output I was producing in ghdl was not as good as it could be.  Basically I was looking at hexadecimal instruction bytes and trying to decide if it was right or not.

It would be much easier to debug if I could get ghdl to show full instruction disassemblies as well, so instead of just seeing 8D 0D DC, it would also show STA $DC0D.  Also, it would help enormously to know what memory access was happening each cycle, so that I could get an idea of exactly where an instruction was going astray.

I finally had time to implement this during the week, and now I can easily get output like:

MEMORY reading $FFFF654 = $A9
MEMORY reading $FFFF655 = $00
MEMORY reading $FFFF656 = $85
$F654 A9 00     lda  #$00          A:00 X:22 Y:33 Z:00 SP:01FF P:26 $01=3F  ..E-.IZ.  
MEMORY reading $FFFF657 = $20
MEMORY reading $FFFF658 = $A9
MEMORY reading $FFFF658 = $A9
MEMORY writing $0000020 <= $00
$F656 85 20     sta  $20           A:00 X:22 Y:33 Z:00 SP:01FF P:26 $01=3F  ..E-.IZ.  
MEMORY reading $FFFF659 = $91
MEMORY reading $FFFF65A = $91
MEMORY reading $FFFF65A = $85
$F658 A9 91     lda  #$91          A:91 X:22 Y:33 Z:00 SP:01FF P:A4 $01=3F  N.E-.I..

Actually the output has a little more information in it, but the above gives you an idea.

We can see a few things from this output.

First, the instructions seem to work, as we see the right values end up in the accumulator, and the correct value being written to the write address.

Second, we can see that there is a dummy read in STA, which is part of the design that allows 48MHz operation.  So for some instructions at least, we don't expect 48x performance.  Some of these might get improved down the track, but some penalty cycles will have to remain.

Third, we can see the 6502-style pre-fetching of the next instruction while the previous instruction is finishing off.
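The flag column at the end of each trace line can be reproduced from the P value with a few lines of Python. This is only a sketch of the formatting as it appears in the trace shown here, not the VHDL that produces it; the letter order is read off the sample output, and it assumes the bit 4 position is always printed as "-".

```python
FLAG_LETTERS = "NVE-DIZC"  # bit 7 down to bit 0, as read off the trace above


def format_flags(p):
    """Render the processor status byte in the style of the trace output."""
    out = []
    for position, letter in enumerate(FLAG_LETTERS):
        bit = 7 - position
        if letter == "-":
            out.append("-")         # assumption: this column is always "-"
        elif p & (1 << bit):
            out.append(letter)      # flag set: show its letter
        else:
            out.append(".")         # flag clear
    return "".join(out)


for p in (0x26, 0xA4):
    print(f"P:{p:02X}  {format_flags(p)}")
```

For the two P values in the trace, $26 and $A4, this produces the same `..E-.IZ.` and `N.E-.I..` strings.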

Armed with the ability to produce this kind of trace, I used the TTL6502 test program for 6502 processors, and by examining the simulation output was able to quickly find and fix quite a number of bugs.

The TTL6502 program only tests the original 6502 instructions, not any of the 4502 extensions.  So I have followed a bit of an ad-hoc process of writing little programs that use each of the new instructions, and verifying from the memory trace, register and flag values that all is well.  This has also turned up a great many bugs.

This is more or less where I am at now, fixing bugs with PHW (push word, immediate or absolute) and a few other remaining instructions.  Once that is done, we should hopefully be back to being able to boot the C65 ROM into C64 mode, and then soon after running SynthMark64 to get an idea of the speed of the new CPU.

Thursday, 17 July 2014

More work on new CPU, and some very skinny raster stripes

It still isn't very exciting to look at right now, but the CPU is getting closer to working properly.

I have found and fixed a bug in the TRB (Test Reset Bit) instruction. This is a handy little instruction for clearing bits in a byte.  For example, LDA #$01 / TRB $D030 will clear bit 0 in $D030, which on a C65 will bank out the second kilo-byte of colour RAM from $DC00 - $DFFF so that you can see the CIAs again.  The correct calculation for the result is (memory and (not A)), but I had (memory and A), which has the effect of resetting all of the bits except the one(s) you wanted reset.  Needless to say that wasn't working too well.
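The difference between the two calculations is easy to see in a small Python sketch. The function names and the sample memory value $07 are just for illustration:

```python
def trb(memory_value, a):
    """Correct TRB result: clear in memory every bit that is set in A."""
    return memory_value & ~a & 0xFF


def trb_buggy(memory_value, a):
    """The faulty calculation: keeps only the bits that should be cleared."""
    return memory_value & a


memory_value = 0x07   # hypothetical current contents of the register
a = 0x01              # LDA #$01

print(f"correct: ${trb(memory_value, a):02X}")
print(f"buggy:   ${trb_buggy(memory_value, a):02X}")
```

With memory holding $07 and A holding $01, the correct result is $06 (only bit 0 cleared), while the faulty version leaves $01.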

I also fixed some bugs with IO mapping.  In particular, the SD card controller is visible to the CPU again, and Kickstart even gets as far as loading the master boot record from the SD card.  There does seem to be an out-by-one error with the buffer addresses, such that the whole sector is rotated by one.

Here is Kickstart finding the SD card at 48MHz:


That looked a bit boring, so I wrote a little loop to do some raster effects:

This is the little loop:

loop     LDA $D052   ; VIC-IV physical raster line low bits (range 0 -  1199)
         CMP $D052
         BEQ *-3
         INC $D020
         DEC $D020
         JMP loop

  The loop should increment and decrement $D020 just once at the start of each raster line. However it looks like the compare instructions are using a fixed value, instead of the operand, which is why there are a few rasters on which there are no bars, while the rest fail to properly compare the raster number with itself.

This is due to a bug in the compare instructions, which I have yet to get to the bottom of. My gut feeling is that it is some sort of timing bug, where the wrong value is read from the bus.  I have seen it in simulation once or twice, which suggests that I should be able to analyse it fairly easily to find and fix the cause.

Meanwhile, it is interesting to look at the pattern and how narrow the stripes are.  They are actually almost half the width that they seem at first when you look closer, because of the adjacency of the bars on successive rasters. The following image makes this a bit clearer:



The VIC-IV runs at 4x the CPU clock, so every four physical pixels corresponds to one CPU clock tick.  The logical pixels of the character generator are five physical pixels wide here, so one and a quarter CPU clocks wide.

INC and DEC take seven cycles on my CPU at the moment, due to the need to include wait-states to avoid back-to-back memory accesses at 48MHz.  This should equate to 7x4 = 28 pixels, or about five and a half logical pixels, just over half a character wide, which is pretty much what we are seeing.

On a real C64 the same bars would be almost 10x wider, at six characters or 48 pixels wide.  So even allowing for the massively higher pixel clock on the C65GS (192MHz versus 8MHz on the C64), there is certainly scope to do some pretty interesting tricks.  Vertical raster bars and split screens should both be quite possible, although there are probably easier ways to get the same effects.
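The bar-width arithmetic above can be written out as a short Python calculation. All of the numbers are the ones quoted in the text; only the variable names are invented here.

```python
CYCLES_PER_INC = 7                 # INC/DEC cycle count on the new CPU, as stated above
PHYSICAL_PIXELS_PER_CYCLE = 4      # VIC-IV pixel clock is 4x the CPU clock
PHYSICAL_PIXELS_PER_LOGICAL = 5    # one logical pixel is five physical pixels here
C64_BAR_PIXELS = 48                # bar width on a real C64, as stated above
C64_PIXELS_PER_CHAR = 8

bar_physical = CYCLES_PER_INC * PHYSICAL_PIXELS_PER_CYCLE
bar_logical = bar_physical / PHYSICAL_PIXELS_PER_LOGICAL
c64_bar_chars = C64_BAR_PIXELS // C64_PIXELS_PER_CHAR

print(f"C65GS bar: {bar_physical} physical pixels = {bar_logical:.1f} logical pixels")
print(f"C64 bar:   {C64_BAR_PIXELS} pixels = {c64_bar_chars} characters")
```

That gives 28 physical pixels, or 5.6 logical pixels, for the C65GS bar, against 48 pixels (six characters) on the C64.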


Wednesday, 16 July 2014

New CPU is working (sort of)

It's taken much longer than I would have liked, but I have the redesigned CPU mostly working now.

The CPU is running at 48MHz, and should be about 40x C64 speed, although the exact figure is likely to change.

The reason it isn't 48x is that I have had to put some wait-states in a few places to make the timing work.

Reading from anything other than fast RAM incurs one extra cycle, which means reading from IO currently has a two cycle penalty.  Fastram, as the name suggests, has no wait states.  Writing to IO also has no wait states.

Also, anywhere where the CPU makes a memory access for which the address or data is dependent on whatever has just been read from memory, this has had to be split into two cycles.  This mostly affects the Read-Modify-Write (RMW) class of instructions, like INC, DEC, ASL and ROR.  This basically means that we have a dummy cycle similar to what the real 6502 has.

Unfortunately, it isn't very practical to perform the dummy write that the 6502 does, so I will need to add an extra cycle for $D019 so that DEC $D019 and variations work for clearing interrupts.

I have some ideas for caching the top of the stack so that RTS can execute in a single cycle, which will provide a solid boost for many programs, but that's some way down the track, because I need to get the CPU working properly first.

The screen shot from simulation below shows that it can run the kickstart ROM and get as far as trying to find the SD card:


The astute observer will notice that the top line of the display is showing the wrong contents.  This is because the bad-line for that row of characters had already occurred.  If I leave the simulation long enough that it can draw a 2nd frame, then it should show the kickstart banner.  As it happens, the simulation managed to draw another frame while I was writing this post, so you can see the real version below:


I need to shake down the remaining bugs like this in the VIC-IV that have crept in with the substantial rework that it has suffered while I have been doing the CPU.  Both efforts, CPU and VIC-IV rework, are really targeted at making the whole thing use much less of the FPGA so that I have enough space to implement sprites and the other missing functionality.

In any case, the fact that it can set the video mode, clear the screen, and decide that it is looking for the SD card shows that an awful lot of the CPU is actually working.  There are some bugs, however.

First, I haven't finished implementing BRK or interrupts of any sort.

Second, I haven't finished implementing the PHW (push word, either immediate or absolute) instruction.  It won't be hard, but it just hasn't hit the priority queue yet, and it is a little weird, since in the CPU the two addressing modes will likely have very different implementations.

Third, there are some weird bugs with accessing IO.

The SD card controller and other IO functions provided by that module aren't mapping in the address space properly when run on the FPGA, even though they simulate fine.

Also, running the following little routine to draw a rough vertical raster bar locks up as soon as the accumulator has the value $F0.  Once that happens, the Z flag stays perpetually set, and so nothing more gets drawn.

loop    LDA $D012
        CMP $D012
        BEQ *-3
        INC $D020
        INC $D020
        JMP loop

It works fine, however, if I put a NOP between the CMP and the BEQ.  So there is something timing-dependent going on.  What is weird is that without the NOP the bug manifests, even if the CPU is in single-stepping mode.

This reworking of the VIC and CPU at the same time hasn't been the most fun, because it has gone backwards from working to a seething mess.  But it is now finally starting to draw back together, and should hopefully soon catch up with where the old excessively large design got to.  Then comes the fun part of adding sprites and other goodies, but that will still be a little while off.

Saturday, 21 June 2014

Another speed bump. Now close to 30x C64 speed

This evening I had a few minutes to implement the next IPC improvement I had in mind.  This one is just implementing the simple end of instruction pipeline for instructions where it is possible, the same as I have already done for single byte instructions, and the same as what the real 6502 has always done.

The result is a nice little speed up as the pictures show.



This is among the last of the speed ups that I will do before a substantial reimplementation of the CPU to make it table driven.  Using a table reduces the FPGA logic consumption a lot, and also has the potential to allow the CPU speed to be increased quite a bit, hopefully to 64MHz or even 96MHz all going well.  But it is really the logic reduction that matters, so that I have space to implement the missing features in the C65GS.

Friday, 20 June 2014

A speed improvement

Reflecting on the recent benchmark results, and especially that the revision 9 Chameleon is almost exactly the same speed as the C65GS in the bouldermark benchmark, I wondered if there was any low-hanging fruit I could tackle to increase the speed of the C65GS.

The main slow-down with the C65GS is the wait-state on reading chipram.  I had tried various ways to suppress the wait-state at its root cause in the FPGA dual-ported block RAM without luck.  Then it occurred to me this morning that I could make a single-port shadow RAM that shadows all of chipram.  So writing to chipram writes to both, and reads by the CPU would be sourced from the shadow RAM -- with no wait-state.
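The shadow RAM idea can be sketched as a tiny Python model. This is an illustration rather than the VHDL: the class and method names are hypothetical, it takes CPU reads to be served from the shadow copy, and it assumes that the other port of the dual-ported chipram is the one used by the video chip.

```python
class ShadowedChipRam:
    """Writes go to both RAMs; the CPU reads back from the shadow copy."""

    def __init__(self, size):
        self.chipram = bytearray(size)  # dual-ported RAM (assumed: also read by video)
        self.shadow = bytearray(size)   # single-port copy with no wait-state

    def cpu_write(self, addr, value):
        # Keep both copies identical on every write.
        self.chipram[addr] = value
        self.shadow[addr] = value

    def cpu_read(self, addr):
        # CPU reads never touch the slower dual-ported RAM.
        return self.shadow[addr]

    def video_read(self, addr):
        # The other consumer still sees the same data via chipram.
        return self.chipram[addr]


ram = ShadowedChipRam(0x20000)
ram.cpu_write(0x0400, 0x2A)
print(ram.cpu_read(0x0400), ram.video_read(0x0400))
```

The cost of this design is a second copy of the memory in block RAM, in exchange for removing the read wait-state.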

So as a reminder of the state of affairs before today's improvements:


Removing the wait state on chipram by implementing the shadow RAM had quite a nice impact:


Function calls are about 30% faster, and RAM operations in general are all moderately improved, as might be expected.  This also got bouldermark quite a bit faster.


In the process I realised what should have been obvious to me, that implied/accumulator mode single-byte instructions were still taking two cycles, and could be easily reduced to one cycle.  This makes NOPs run at an amazing 71x, and pushed the overall rating up a little to 26.9x:


BoulderMark now indicates just over 55x.  I am still at a loss why the machine is so much faster than a stock C64 for BoulderMark, but the same phenomenon is visible with the latest version of the Chameleon that gets a rating of around 14,000 (see http://wiki.icomp.de/wiki/C64_Benchmarks).  That's a mystery that will have to remain for now.


In the meantime, I have a couple more ideas to improve performance that I will try.

Sunday, 4 May 2014

BASIC now mostly working in C65 mode

After a bit of poking around the C65-mode BASIC interpreter I worked out that I had forgotten to implement PHW absolute mode (immediate mode was already working).  So after a quick fix, BASIC in C65 mode is now superficially working, as can be seen here. GO64 also works fine to drop to C64 mode.


However, problems still remain. When I tried to get the disk directory, it reads the first part, and then stops, and further attempts to read the directory fail.  The same thing happens if I use LOAD"$",8.  This is quite odd, because the same thing works just fine in C64 mode.  Curiously, if I load a program before trying to load the directory, that works okay.  But if I try it after, it fails.  It might be that there are still missing instructions that are used in the C65 DOS access routines.  I really need to finish my suite of 4510 instruction tests.


Also, as can be seen, once the screen scrolls, the copying of the colour memory goes a bit weird.  This uses the DMA controller in copy mode.  Hopefully a bit of poking around with that will reveal the nature of the problem.

But in the meantime, a happy piece of progress that BASIC is now responsive.

Saturday, 8 February 2014

Testing the 4510 CPU

As mentioned previously, I am now working on extending the C64 Emulator Test Suite to create a C65 emulator test suite.  In the first instance, this is all about testing coverage of the extra 4510 operations.

The existing tests in the C64 Emulator Test Suite are good, but the assembly code for them is poorly documented, and being written using Turbo Assembler on a real C64, they are 95% shared code.

I have tried to improve this situation by factoring out a lot of the common parts of the tests, and using the .include directive of Ophis, so that a complete test is <100 lines, and each of the includes and parts of the test itself are fairly well documented.

For example, the test for PLZ is:

  .include "test_top.a65"

         .byte 145,"PLZN"

  .include "test_prepare.a65"

         ; perform one test
next:     ; expect data byte to be in results
         lda db
         sta dr

         ; expect Z and data byte should be identical
  lda db
         sta zr

         ; expect A to be the same in the results
         lda ab
         sta ar

         ; expect x to be the same in the results
         lda xb
         sta xr

         ; expect Y to be the same in the results
         lda yb
         sta yr

         ; expect processor flags to have B flag set and E flag set
         lda pb
         ora #%00110000
         and #%01111101
         tax
         lda dr
         cmp #0
         bne nozero
         txa
         ora #%00000010
         tax
         lda dr
nozero:  asl
         bcc noneg
         txa
         ora #%10000000
         tax
noneg:   stx pr

         ; expect SP to be one more than it started
  ; but in practice, the value will be the same, because we will be pulling
  ; a byte off that we have pushed
         ldx sb
         txs
         stx sr

  ; push data byte onto stack to be pulled off by the instruction
  lda db
  pha  

  .include "test_setup.a65"

         ; test instruction
cmd:     plz

  .include "test_record.a65"

  ; for stack pull instructions, data value is the appropriate register
  lda za
         sta da

  .include "test_check.a65"

  ; name of next test
name:    .byte "PHWIL"

  .include "test_common.a65"

I am sure there will be further changes before it is all done and dusted, but I can now fairly rapidly write tests for the simpler instructions.
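For readers who find the flag handling in the test hard to follow, here is the same expected-flags calculation as a short Python sketch. It mirrors the assembly above rather than anything in the test suite itself. The function name is invented, and it assumes that the B and E flags are bits 4 and 5 of the status byte, that Z is bit 1 and that N is bit 7.

```python
def expected_flags_after_plz(p_before, data):
    """Expected processor status after PLZ pulls `data` into Z."""
    # Force B and E set (assumed bits 4 and 5), then clear N and Z.
    p = (p_before | 0b00110000) & 0b01111101
    if data == 0:
        p |= 0b00000010   # Z flag: the pulled value is zero
    if data & 0x80:
        p |= 0b10000000   # N flag: bit 7 of the pulled value is set
    return p


for p_before, data in ((0x00, 0x00), (0x00, 0x01), (0xFF, 0x80)):
    result = expected_flags_after_plz(p_before, data)
    print(f"P before ${p_before:02X}, data ${data:02X} -> P ${result:02X}")
```

A pulled value of zero sets Z, a value with bit 7 set gives N, and all the other flags carry over from before the instruction.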

Those interested can check out http://github.com/gardners/4510tests

If anyone with a real C65 is willing to run the tests on their machine in C64 mode, that would be a great help to verify that the behaviour I am implementing is correct.  I can provide a D81 file of the current set of tests if anyone is able to help out.

An assembler for 4502/4510 CPUs

As I have progressed, the need has arisen for an assembler that can assemble 4510 instructions.

I have extended the Ophis 6502/6510/65c02 assembler to include support for the 4510.  This is a nice assembler written in Python that is quite powerful and flexible.  The source code is at http://github.com/gardners/Ophis.  I have issued a pull request to the upstream Ophis distribution so that it will hopefully get included in the main Ophis distribution in time.

Now that I have a 4510 assembler, I can start extending the Commodore 64 Emulator Test Suite to include tests for all the 4510 opcodes, and prune out the ones for undocumented 6510 opcodes.

Once that is in place, and the tests pass on the FPGA, I will know that I have a working 4510, which will remove a significant unknown when testing booting with C65 ROMs.