Monday 15 December 2014

RAM Upgrade

Someone has just pointed out to me that he Nexys4 board that the C65GS project is based on is about to receive a modest upgrade in the form of switching from the 16MB cellular RAM to a 128MB DDR2 RAM module:,1301,1319&Prod=NEXYS4DDR

Reading through the new reference manual for the Nexys4 DDR, I discover the following:

The Nexys4 DDR is an incremental update to the Nexys4 board. The major improvement is the replacement of the 16 MiB CellularRAM with a 128 MiB DDR2 SDRAM memory. Digilent will provide a VHDL reference module that wraps the complexity of a DDR2 controller and is backwards compatible with the asynchronous SRAM interface of the CellularRAM, with certain limitations. See the Nexys4 DDR page at for updates.
Furthermore, to accommodate the new memory, the pin-out of the FPGA banks has changed as well. The constraints file of existing projects will need to be updated. 
The audio output (AUD_PWM) needs to be driven open-drain as opposed to push-pull on the Nexys4.
So overall the changes look relatively small, and it shouldn't be hard to port the design to the new board.

The actual DDR2 memory part is a MT47H64M16HR-25:H, which is capable of 667MB/sec, but according to the reference manual the practical limit will be 650MHz.  Xilinx have a memory interface generator which should make it fairly easy to interface to this RAM, although DDR memory is inherently more complex.

I will likely implement a very small cache, perhaps 8 or 16 bytes, because although the DDR2 RAM has a bandwidth of 650MB/sec, the random-access latency of DDR memory is typically fairly high, and it will be important to avoid incurring this latency on every slowram memory access.

Taking a quick look at the data sheet for the DDR RAM, it looks like the actual speed is 650MT/sec, i.e., 650 mega-transactions per second. I have done some reading on DDR and DDR2 RAM lately, but will need to sit down with the actual memory specifications to work out what this all means in practice.

What is almost certain is that the DDR2 RAM could drive a linear frame buffer if I wanted to, but as I have discussed on the mailing list, a linear frame buffer doesn't fit my definition of an 8-bit style computer.  However, there is already 128MB of address space already reserved for slowram expansion, so it will be quite straight-forward to incorporate it.

Friday 12 December 2014

LCD Screen Test

I had a little time this evening so I unpacked the LCD panel and driver and hooked it all up to see if it works.  The beginning looked something like:

 The VGA lead is connected to the C65GS FPGA board. And again from above:

So then I tried to turn it on, which, pleasingly resulted in the back-light illuminating, but sadly not in any picture.

So flip the panel over and check that everything is properly connected with the data cable:

The panel end of the cable wasn't quite in properly, but once it was then things were happy:

The next steps for laptopification are now likely to be in the direction of designing a 3D CNC millable case, that can hold the screen, FPGA board, keyboard and other components.  This is where I step outside my current expertise.  If anyone is interested in helping to design the case, I would welcome their assistance.

Some parts for laptop prototyping have arrived

I have been away on holidays with the family for a couple of weeks, so haven't really touched the C65GS during that time.  

However, just before we flew out I ordered a 15.4" 1920x1200 LCD panel and driver circuit from China to use for laptop prototyping. China air-mail is outrageously cheap to Australia if you don't mind waiting 2 - 6 weeks for delivery.  Since I was going to be away, I figured I could wait.  

So after letting our local post office know that they might need to hold the parcels a bit longer than usual in the unlikely chance that they would turn up early.  

And turn up early they did -- at around 2 weeks including the dispatch time from the suppliers.  Of course, in theory air mail (and sea mail for that matter) from China to Australia should be fast, since we are in the same general part of the world, and their are multiple direct China-Australia flights and sea freight routes.

Here are the two boxes before un-packing:

First, I opened the little one which contains the panel driver:

Then the big one with the 15.4" panel:

The panel is a bit smaller than I expected, being only just larger than the dimensions of my 13" MacBook Pro.  The difference is mostly due to the really fat bezel on the MacBook Pro.

This all makes me happy, because I really want the C65GS laptop to be as compact as possible, especially since I know that it will have to be about 4cm - 5cm thick to fit the full-travel keys, FPGA board and other components in.

I got my genuine C65 keyboard out to see how it compares in width: The screen is only a couple of centimetres wider than the keyboard, so just about perfect.

The next step is to get a 12v UBEC or other power supply so that I can hook the panel and driver up and test that they work.

Saturday 22 November 2014

Working on the anti-aliaser

I finally got around to plumbing the alpha-blender together to begin testing it for use with anti-aliasing text.

Not surprisingly, there are some things that aren't quite right. Primarily, Xilinx's reference implementation of an alpha blender doesn't quite seem up to the 192MHz pixel clock.  In particular, the blue channel is failing to meet timing closure, and seems to be delayed by a few pixels, as can be seen in the screen shots here:  

The displacement is a fixed number of physical pixels, as can be seen if I change the horizontal scaling of the text generator:

The timing failure isn't too surprising, given that the alpha blender uses a double-clock rate to multiplex the DSP operations to save a bit of silicon.  However, it means that the blender is effectively running at 192MHz x 2 = 384MHz. 

I think I will have to modify the blender so that it doesn't need to multiplex DSP blocks, and can hopefully then meet timing, and avoid the colour planes separating like this.

Wednesday 19 November 2014

Proportional font glitches fixed

It is almost a week since I fixed some underlying faults in the hardware support for drawing proportional fonts, together with some bugs in my font file generator that was trimming the right-most pixels from some glyphs.

However, it is only tonight that I have managed to get the display working, because the character generator logic in the VIC-IV would seize up when trying to display variable-width full-colour characters. I eventually tracked the bug down to the paint_ready flag never being cleared if the VIC-IV was asked to draw a one-pixel wide 256-colour character.

This came about because the logic previously assumed that there would always be a 2nd-to-last pixel in a character, which is clearly not true when a character is trimmed to be just a single pixel wide.

Now that I have that fixed, I can finally draw variable-width anti-aliased* fonts.  The following two screen shots are of the same Unicode string being rendered using 10 point and 20 point versions of the same font. You can get an idea of the size of characters from the grid of junk at the bottom of each.

Again, the matrix-like tint is due to the colour-cube approximation of the VNC viewer I used to capture the image.  The characters look pure white on the VGA display.

Obviously the 20 point version (top) looks nicer than the 10 point version.  That said, to my wooden eye, they both look pretty decent.  The kerning looks acceptable, and both ascenders and descenders on characters are drawing fine.

I haven't optimised the unicode printing routine code at all yet, so it is still fairly slow to draw. Nonetheless, I can draw those several words in well under a frame, as can be seen in the following screenshot where I set the border to white while the drawing happens:

Earlier I said that I can now draw anti-aliased characters.  This is currently true only if I want to consume the entire 256-colour palette on fake alpha-values.  As discussed in earlier posts, I am part-way through implementing a real alpha-blender that will allow each character to be a different colour, and appear on a different background.

I also need to get around to implementing variable height characters, so that there are no large blank regions between lines when the upper rows of pixels in the top-most row of characters of a font are unused.

The other thing I hope to work on soon is to update the little programme I wrote to display the unicode text so that it can access fonts located outside of the first 64KB of RAM.  This will use the new 32-bit indirect zero-page addressing modes.  Then I will be able to display text from different fonts on the same display.

Sunday 16 November 2014

First steps towards a C65GS laptop

From the outset I had thought about what form-factor the C65GS would have as a computer.

Basically it was a question of whether to make it something that looks and feels like the original C65, or whether to make it into a laptop.

Nostalgia said that something that looked like the original would be nice.  However, the realisation that being able to create 3D-printed or injection moulded replicas of the C65 case would be extremely expensive put a dampner on that idea.

Also, as I got working on the C65GS I found myself increasingly desiring a portable computer - both for taking places to show and so that it could be used for some light general productivity.

So I have been thinking about how to do this, and taking a look at what some other very clever people have done in the past.  The most interesting one is this laptop commodore 64, made using original parts.  This is interesting to me, because it was created using a real C64 keyboard, and he fabricated the case parts himself.  While I don't plan to use a real C64 keyboard for mine, I do intend to use the real C65 keyboard I bought on ebay recently.  Fortunately, the C65 keyboard already has the function keys moved to the top, and so is narrow enough to be used unmodified in a 15" laptop.

More of how Ben fabricates custom cases like this is covered in another post of his.

However, before I get to that point, I need to work out exactly what I intend to cram into the thing.

Obviously the Nexys4 FPGA board needs to go in.

I have already sourced 4 x 30Wh LiFePO4 rechargeable batteries, sufficient to skirt the air-line limits in several countries, but also enough to run a bright screen for a few hours at least.  I'll need to find a charger circuit for them, perhaps something like this that should allow me to put 15.6V in and charge the batteries.  The output from the batteries then can be fed into one of these to provide enough 12V for a 15" LCD panel, and a very similar one of these to provide ample 5V for the FPGA board and related circuitry.

I had originally looked for a 13" 1920x1200 panel, but those are rare as hens teeth, so have settled for a 15.4" one.  After a bit of hunting it looks like this panel and this LCD driver kit should work together, and will accept the 1920x1200 60Hz VGA signal from the FPGA board.

This leaves the joystick/keyboard/IEC serial and other ports, plus speaker driver PCB as the great unknown in terms of internal components. However, even without that PCB, I am starting to have a think about how things would be arranged in the eventual case.

I will probably use a similar approach to Ben's portable C64, and have an external lead that connects the panel in the top half of the case to the lower half.  Because I am using VGA connections, I am contemplating making the cable easily removable, at least from one end, so that it is easy to connect to an external monitor, or indeed use the 15" panel as a VGA display for something else when required.

I might buy a panel and LCD driver kit just to get my head around getting those working together and with the FPGA board, and then incrementally work on everything else as I get the opportunity.

Friday 14 November 2014

Migrating to Vivado (Part 3)

Vivado is still refusing to synthesise the various BRAM blocks correctly.

I have, however, had a response to my post on the Xilinx Community Forums asking for more information about the problem, which I have provided.

Let's hope that they can suggest a solution.

Thursday 13 November 2014

Migrating to Vivado (Part 2)

The title says it all.

I have managed to massage the C65GS source so that it synthesis in Vivado.

This involved some silly things that Vivado should really support, like casting ports in a module. Don't worry about what that means. Just understand that it is something really simple, and Vivado chokes on it.

However, that dwarfs into insignificance with the real problem: Vivado fails to detect when to use BRAM (the little RAM blocks in the FPGA), and instead tries to implement them using logic.  This means that Vivado thinks I need an FPGA 5x bigger than I have, even though ISE can synthesise it to fit in less than 1/2 the current one.

Oddly, it does realise that two small memories should be BRAMs, as the following output from Vivado shows (you might need to make your display really wide to see the gory detail).

Start RAM, DSP and Shift Register Reporting

Block RAM:
|Module Name | RTL Object              | PORT A (depth X width) | W | R | PORT B (depth X width) | W | R | OUT_REG      | RAMB18 | RAMB36 | Hierarchical Name      | 
|framepacker | thumnailbuffer0/ram_reg | 4 K X 8(READ_FIRST)    | W |   | 4 K X 8(WRITE_FIRST)   |   | R | Port A and B | 0      | 1      | framepacker/extram__26 | 
|viciv       | rasterbuffer1/ram_reg   | 2 K X 9(READ_FIRST)    | W |   | 2 K X 9(WRITE_FIRST)   |   | R | Port A and B | 1      | 0      | viciv/extram__42       | 

Note: The table shows RAMs generated at current stage. Some RAM generation could be reversed due to later optimizations. Multiple instantiated RAMs are reported only once. "Hierarch
ical Name" reflects the hierarchical modules names of the RAM and only part of it is displayed.
Distributed RAM: 
|Module Name   | RTL Object               | Inference Criteria | Size (depth X width) | Primitives                     | Hierarchical Name    | 
|gs4510        | shadowram0/ram_reg       | Implied            | 128 K X 8            | RAM256X1S x 4096               | gs4510/ram__25       | 
|iomapper__GC0 | kickstartrom/ram_reg     | Implied            | 16 K X 8             | RAM256X1S x 512                | ram__26              | 
|framepacker   | videobuffer0/ram_reg     | Implied            | 4 K X 8              | RAM64X1D x 128  RAM64M x 128   | framepacker/ram__28  | 
|ethernet      | rxbuffer0/ram_reg        | Implied            | 4 K X 8              | RAM64X1D x 128  RAM64M x 128   | ethernet/ram__29     | 
|ethernet      | rrnet_rxbuffer/ram_reg   | Implied            | 4 K X 8              | RAM64X1D x 128  RAM64M x 128   | ethernet/ram__30     | 
|ethernet      | txbuffer0/ram_reg        | Implied            | 4 K X 8              | RAM64X1D x 128  RAM64M x 128   | ethernet/ram__31     | 
|sdcardio      | sdsectorbuffer0/ram_reg  | Implied            | 512 X 8              | RAM64X1D x 16  RAM64M x 16     | sdcardio/ram__32     | 
|sdcardio      | sdsectorbuffer1/ram_reg  | Implied            | 512 X 8              | RAM64X1D x 16  RAM64M x 16     | sdcardio/ram__33     | 
|sdcardio      | f011sectorbuffer/ram_reg | Implied            | 512 X 8              | RAM64X1D x 16  RAM64M x 16     | sdcardio/ram__34     | 
|uart_monitor  | membuf_reg               | Implied            | 16 X 8               | RAM16X1S x 8                   | uart_monitor/ram__35 | 
|uart_monitor  | historyram0/ram_reg      | Implied            | 1 K X 177            | RAM256X1S x 708                | uart_monitor/ram__37 | 
|viciv         | chipram0/ram_reg         | Implied            | 128 K X 9            | RAM128X1D x 9216               | viciv/ram__38        | 
|viciv         | colourram1/ram_reg       | Implied            | 64 K X 8             | RAM128X1D x 4096               | viciv/ram__39        | 
|viciv         | paletteram0/ram_reg      | Implied            | 1 K X 32             | RAM256X1S x 128                | viciv/ram__40        | 
|viciv         | charrom1/ram_reg         | Implied            | 4 K X 8              | RAM256X1S x 128                | viciv/ram__41        | 
|viciv         | sprite_x_reg             | Implied            | 8 X 8                | RAM16X1S x 8                   | viciv/ram__42        | 
|viciv         | sprite_data_offsets_reg  | Implied            | 8 X 6                | RAM32M x 1                     | viciv/ram__43        | 
|viciv         | sprite_colours_reg       | Implied            | 8 X 8                | RAM16X1S x 8                   | viciv/ram__44        | 
|viciv         | sprite_y_reg             | Implied            | 8 X 8                | RAM16X1S x 8                   | viciv/ram__45        | 
|viciv         | buffer1/ram_reg          | Implied            | 512 X 8              | RAM64X1D x 16  RAM64M x 16     | viciv/ram__46        | 

All the rest are being implemented as various types of fabric-based RAM.

I have asked for help on the Xilinx Community Forums to see if anyone knows whats going wrong, and how I can work around it.

But for now, it looks like I am using ISE for a bit longer still.

Monday 10 November 2014

Migrating from ISE to Vivado (Part 1)

I am in the process of attempting to migrate C65GS development from Xilinx's old ISE development environment to their new Vivado.  The promise is a much better synthesis tool chain, with shorter synthesis times, better timing, and it should be easier to describe timing constraints which are a black art in ISE.

Vivado is only a couple of years old, which means that it is still quite unstable as these things go.

Just getting it to install can be interesting.

After downloading the 4.8GB (!) tarball, extracting it and running the xsetup program, it got stuck showing just the splash startup window.

Fortunately I was able to find the solution in the following thread:

Basically it boils down to setting the following shell variable, ideally in your .bashrc or similar ( don't forget to log out and in again or source the file for it to take effect after you have changed it).


I am unable to comment on why this works, but simply confirm that it lets me at least see the installer.

Now to see if installation works ...

Even more progress on proportional fonts

Hot on the heals of the earlier fixes of the evening, descenders (the bits of characters that hang below the base-line) and bits poking out the top are now correctly draw as can be seen in the screen-shot below.

Along the way I have fixed bugs in my unicode printing programme, and also a couple of bugs in my true-type font rasteriser (the top pixel of characters could be chopped off, and character tiles were being written to the file in the wrong order).

As with the other screen-shots, the Matrix-like green tint is an artefact of the image capture process I am using, and the bad kerning is pending a patch to the FPGA so that it stops ignoring the least significant bit of the kerning field of each character.

Once the horizontal kerning has been fixed, that just leaves the vertical spacing to fix.  This will probably require both FPGA and more font-file format tweaking.

The above image is captured the same way as the one in the previous post. Then I realised that my VNC viewer was scaling the image, so below is the same image really without any scaling:

Just for further fun and to prove that it works for larger font sizes, I produced a 24 point version of the same font to try out:

This revealed a few new glitches, like why is the "t" in jupiter mostly missing.  There are also some other issues, like the width of the right vertical of "H", Alef (the right-most Hebrew character) that I will have to look into.

The disappearing "t" turned out to be a quick fix, so the final result for the night is as follows:

 Note also that with a big font the suppression of the spare blank raster lines becomes less important.  I still intend to fix it, however.

Sunday 9 November 2014

More work on proportional fonts

I have done a bit more work on the unicode text display programme.

The following screen-shot is completely generated by the programme, in contrast to the screen-shot in the previous post where I manually populated the screen memory with the tiles and kerning adjustments.

The Unicode string used to feed the program is:

       .word 'H,'e,'l,'l,'o,$20,'W,'o,'r,'l,'d
       .word $000d
       .word $05d4,$05d3,$05d2,$05d1,$05d0
       .word $000d
       .word 'g,'a,'r,'y,$20,'j,'u,'p,'i,'t,'e,'r
       .word $000d       
       .word 0

Character $20 is of course space, and the $05xx characters are some Hebrew glyphs I threw in just to show that we are not limited to latin characters.

So, how did it turn out?

The screen shot is actual size so that there is hopefully no strange things going on in terms of smoothing of the zoomed image.

The slight green sheen is an artefact of the C65GS VNC server which sends pixels using a 3x3x2 colour cube, and so suffers some colour distortion.  This doesn't need fixing since it is purely a VNC artefact.

We still have the kerning glitch, because the VIC-IV isn't honouring the least significant bit of the kerning field, so H and W are followed by one more blank pixel than they should.  This is on my list to fix in the FPGA programme.  I'll likely rearrange the bits so that the kerning bits are all adjacent, instead of spread between two different bytes (one of which is clearly not being read properly).

The programme has, however, faithfully rendered the first line of Latin and Hebrew Unicode characters, including H and W that are two tiles wide.

It also handles the carriage-returns, however, it doesn't correctly calculate how many rows of characters are needed, nor does it allow for reducing the height of character rows to provide correct line spacing.  This will require implementing the remaining hardware support for this in the VIC-IV.

Also, in the character rows that don't contain active non-blank glyphs, it isn't putting a blank tile there, nor is it adjusting the width of each to kern them to the correct width, so there is rubbish which is wider than the actual text on the 2nd and 3rd rows of each line of output.  This won't be too hard to fix, as it is just a software issue.

Finally, the third row is really messed up, because the line of characters that are drawing the under-hanging pixels are being written over the main row, instead of being written into the next row down.  This shouldn't be too hard to fix either, as it is just a software issue.

Also, it would be nice if the routine cleared the remainder of the screen, but that's really just icing on the cake.

So for now we have a nice bit of progress, and I might take a look at fixing the kerning and rows of junk bugs next.

Saturday 8 November 2014

Easily Accessing All of Memory

As I have begun work on writing a general-purpose Unicode string renderer for the C65GS, it has got me right at the coal face of developing software to run on this machine, as compared to just making the machine. Creator and user have rather different needs and experiences.
Until now, I had simply assumed that I would use DMA and memory banking to provide access to all of memory.  Technically it provided everything that one could need.  Once I started to write software for the machine, it became immediately apparent that these methods were fine for accessing slabs of memory, but would be rather inconvenient for the normal use case of reading or writing some random piece of memory somewhere.
What I found was that if I had a pointer to memory and wanted to PEEK or POKE through that pointer, it was going to be a herculean task, and one that would waste many bytes of code and cycles of CPU to accomplish -- not good for a task that is the mainstay of software.
I was also reminded that the ($nn),Y operations of the 6502 are essentially pointer-dereference operations.  So I thought, why don't I just allow the pointers to grow from 16-bits to 32-bits.  Then one could just use ($nn),Y or ($nn),Z operations to act directly on distant pieces of memory.
Slight problem with this is that the 4502 has all 256 opcodes occupied, so I couldn't just assign a new one.  I would need some sort of flag to indicate what size pointers should be.  This had to be done in a way that would not break existing 6502 or 4502 code.
The experience of the 65816 led me to think that a global flag was not a good idea, because it makes it really hard to work out what is going on just by looking at a piece of code, especially where instruction lengths change.
So I decided to go for a bit of an ugly hack: If an instruction that uses the ($nn),Z addressing mode immediately follows and EOM instruction (which is what NOP is called on the 4502), then the pointer would be 32-bits instead of 16-bits.
While ugly, it seems to me that it should be safe, because no 6502 code uses ($nn),Z, because it doesn't exist. Similarly, there is so little C65 software that it is unlikely that any even uses ($nn),Z, and even less of it should have an EOM just before such an instruction.  
In fact, in the process of implementing 32-bit pointers, I discovered that ($nn),Z on the 4510 was actually doing ($nn),Y, among other bugs.  So clearly the C65 ROM mustn't have even been using the addressing mode at all!
Here is the summary of how this new addressing mode works in practise.  The text below is as it appears in the C65GS System Notes which is being developed with the help of the community.

32-bit Memory Addresses using 32-bit indirect zero-page indexed addressing

The ($nn),Z addressing mode is normally identical in behaviour to ($nn),Y other than that the indexing is by the Z register instead of the Y register.  That is, two bytes of zero-page memory are used to form a 16-bit pointer to the address to be accessed. However, if an instruction using the ($nn),Z addressing mode is immediately preceded by an EOM instruction, then it uses four bytes of zero-page address to form a 32-bit address.  So for example:

zppointer: .byte $11,$22,$33,$04

ldz #$05
lda ($nn),Z

Would load the contents of memory location $4332216 into the accumulator.

LDA, STA, EOR, AND, ORA, ADC and SBC are all available with this addressing mode.

Memory accesses made using 32-bit indirect zero-page indexed addressing require three extra cycles compared to 16-bit indirect zero-page indexed addressing: one for the EOM, and two for the extra pointer value fetches.

This makes it fairly easy to access any byte of memory in the full 28-bit address space.  The upper four bits should be zeroes for now, so that in future we can expand the C65GS to 4GB address space.

Wednesday 5 November 2014

Starting to write tests for 16-bit text mode, including proportional fonts

I have been bashing away at a little test program that demonstrates printing unicode strings using a proportional font.

This involves looking up the glyph in the font, then getting its tile map, and then building the screen lines to draw and several other steps.  While none of these steps are too complicated, it does make for about 1KB of code, including switching screen modes and setting the palette to a grey gradient since I don't have the alpha-blender working yet.

I could have settled for just a simple hard-coded test, but I think it is worth exercising the whole idea of how I intend to draw proportional fonts to make sure that it is feasible.

Anyway, it's got late, and I haven't got the code working yet, but I did spend a few minutes hand picking the necessary tiles and setting the kerning values to narrow the characters down so that it looks half-decent.

The main visual glitch is that single pixel resolution of the kerning is not working for some reason, so the gap between "H" and "e" is one pixel too wide.

Also, the colour cube that the VNC viewer uses butchers the gradient, and so some of the shading looks weird.  On the real screen this looks quite a bit better. Nonetheless, the result is pretty reasonable for a first go at it, and will certainly look nicer once I figure out why it is ignoring the least significant kerning bit so that there is no big gap between "H" and "e".

Tuesday 4 November 2014

More work on proportional fonts

I am edging my way towards getting hardware proportional fonts working.

To recap progress to date, text mode can be switched to 16-bit mode, where two screen RAM bytes and two colour RAM bytes describe each character instead of one of each.  Some of those extra bits can be used to specify the width of a character, between 1 and 8 pixels wide.  Thus proportional fonts can be constructed using some full-width characters and some narrowed characters.

I have now written a little programme that takes a true-type font, and produces a font file composed of 8x8 character blocks.  It isn't perfect, but it does create a simple file with a list of unicode points, tile maps that say which tiles go to make each glyph, and then the array of 8x8 tiles, 64 bytes each.  You can see the source at:

It doesn't do any compression of the 64-byte blocks, so the fonts are quite a bit bigger than they need to be.  However, it should be very easy to write some 6502 assembler that given a 16-bit unicode string, can setup a 16-bit text mode screen to display the text.

Simultaneously, I have been working on the anti-alias renderer for 8x8 tiles. It isn't done yet, but the alpha blender is in the design now.  Assuming that it works, it shouldn't be too hard to plumb it in, and start displaying alpha-blended 8x8 character tiles.

Sunday 2 November 2014

Loading data via ethernet

As I have been working on the proportional font support for the VIC-IV, I have come to the realisation that I need some fairly comprehensive and reproducible test programmes due to the complexity of the VIC-IV text mode.

I can load programmes onto D81 images and put those on the SD card, and mount them from the simple menu I made. However, it requires physical plugging and unplugging, and is a general nuisance.

The other approach that I have at the moment is that I can use the serial monitor to load data, but VERY slowly and unreliably.  This is basically because the serial monitor doesn't have a FIFO on the serial input, and so it looses characters quite easily.

I do have a working 100mbit ethernet adapter, however.   Ideally, I would just use the RR-NET emulation to enable me to run the RR-NET version of 64NET.  However, there are still enough bugs in the RR-NET emulation to stop that from working.  This also could be more easily debugged if I could quickly and easily load test programmes onto the C65GS.

So I have finally gotten around to writing a little program that looks for UDP packets coming in via ethernet, and then providing a way to load them into memory somewhere.

Because the C65GS ethernet buffer is direct memory mapped, I can use a nice trick, of having the main loading routine actually inside the packets.  This means that the ethernet load programme on the C65GS can be <128 bytes, and yet support very flexible features, since the sending side can send whatever code it likes.  It is only about 100 lines of 4502 assembler, so I'll just include the whole thing here.

.org $CF80
First, we need to turn on C65GS enhanced IO so that we can access the ethernet controller:
lda #$47
sta $d02f
lda #$53
sta $D02f
Then we need to map the ethernet controller's read buffer.  This lives at $FFDE800-$FFDEFFF.  We will map it at $6800-$6FFF.  The 4502 MAP instruction works on 8KB pieces, so we will actually map $6000-$7FFF to $FFDE000-$FFDFFFF.  Since this is above $00FFFFF, we need to set the C65GS mega-byte number to $FF for the memory mapper before mapping the memory.  We only need to do this for the bottom-half of memory, so we will leave Y and Z zeroed out so that we don't change that one.
lda #$ff
ldx #$0f
ldy #$00
ldz #$00
Now looking at the $DE800 address within the mega-byte, we use the normal 4502/C65 MAP instruction semantic.  A contains four bits to select whether mapping happens at $0000, $2000, $4000 and/or $6000. We want to map only at $6000, so we only set bit 7.  The bottom four bits of A are bits 8 to 11 of the mapping offset, which in this case is zero.  X has bits 12 to 19, which needs to be $D8, so that the offset all together is $D8000.  We use this value, and not $DE000, because it is an offset, and $D8000 + $6000 = $DE000, our target address.  It's all a bit weird until you get used to it.
lda #$80
ldx #$8d
ldy #$00
ldz #$00
Now we are ready to make sure that the ethernet controller is running:
lda #$01
sta $d6e1
Finally we get to the interesting part, where we loop waiting for packets.  Basically we wait until the packet RX flag is set

lda $d6e1
and #$20
beq waitingforpacket
So a packet has arrived.  Bit 2 has the buffer number that the packet was read into (0 or 1), and so we shift that down to bit 1, which selects which buffer is currently visible.  Then we write this to $D6E1, which also has the effect of clearing the ethernet IRQ if it is pending.
lda $d6e1
and #$04
ora #$01
sta $d6e1
Now we check that it is an IPv4 UDP packet addressed to port 4510
; is it IPv4?
lda $6810
cmp #$45
bne waitingforpacket
; is it UDP?
lda $6819
cmp #$11
bne waitingforpacket
; UDP port #4510
lda $6826
cmp #>4510
bne waitingforpacket
lda $6827
cmp #<4510
bne waitingforpacket
If it is, we give some visual indication that stuff is happening. I'll take this out once I have the whole thing debugged, because it wastes a lot of time to copy 512 bytes this way, since I am not even using the DMAgic to do it efficiently.  In fact, this takes more time than actually loading a 1KB packet of data.
; write ethernet status to $0427
lda $d6e1
sta $0427

; Let's copy 512 bytes of packet to the screen repeatedly
ldx #$00
loop1: lda $6800,x
sta $0428,x
lda $6900,x
sta $0528,x
bne loop1
The final check we do on the packet is to see that the first data byte is $A9 for LDA immediate mode.  If so, we assume it is a packet that contains code we can run, and we then JSR to it:
lda $682c
cmp #$a9
bne loop
jsr $682C
Then we just go looking for the next packet:
jmp loop

As you can see, the whole program is really simple, especially once it hits the loop.  This is really due to the hardware design, which with the combination of DMA and memory mapping avoids insane fiddling to move data around, particularly the ability to execute an ethernet frame as code while it sits in the buffer.

The code in the ethernet frame just executes a DMAgic job to copy the payload into the correct memory location.  Thus the complete processing of a 1024 byte ethernet frame takes somewhere between 2,048 and 4,096 clock cycles -- fast enough that the routine can load at least 12MB/sec, i.e., match the wire speed of 100mbit ethernet.

On the server side, I wrote a little server program that sends out the UDP packets as it reads through a .PRG file.  Due to a bug in the ethernet controller buffer selection on the C65GS it currently has to send every packet twice, effectively halving the maximum speed to a little under 6MB/sec.  That bug should be easy to fix, allowing the load speed to be restored to ~10MB/sec.  (Note that at the moment the protocol is completely unidirectional, but that this could be changed by sending packets that download code that is able to send packets.)

When the server reaches the end of the file, the server sends a packet with a little routine that pops the return address from the JSR to the packet from the stack, and then returns, thus effectively returning to BASIC -- although it does seem to mess up sometimes, which I need to look into.

It would be nice to test the actual speed of the resulting system, but I haven't really got a setup for this yet in a robust way.

However, one can get some sort of idea of the speed by timing the server program.

When loading SynthMark64, which is about 8KB, it takes between 0.01 and 0.02 seconds, which equates to 400kb/sec - 800kb/sec.  Loading a larger programme of about 43KB takes 0.03 seconds, giving a speed of about 1.5MB/sec, which is a bit more respectable. [EDIT: I have now fixed the ethernet buffer bug, and so loading speed is easily exceeding 2MB/sec, and will likely go higher with some tuning.]

All that remains to make this useful is to fix the bugs that stop it from returning to BASIC properly, and to add the little 128 byte programme into the kickstart ROM so that it is available on boot up, just like the disk image mounting menu.

Friday 31 October 2014

More work towards hardware proportional fonts

I have spent a bit more time tonight working on hardware support for proportional fonts.

For those coming in late, the VIC-IV already has the ability to draw skinny characters 2, 4 or 6 pixels wide as well as the usual 8.  This can be used to construct large characters from one or more 8x8 character blocks to make any even number of pixels in width on screen.  For a large type face, each character may be several 8x8 character blocks wide.

This means that a row of proportional text may have a variable number of characters, because if there are skinny characters, then more will fit on a line.  Conversely, if there is no text on the right of the display, then it doesn't make sense to waste RAM describing empty characters.  Thus I have followed Jeremy's idea of implementing a special end of line marker, so that each row can differ in length, and we can hopefully use RAM much more efficiently when faced with large high-resolution text displays.

In the previous post I describe the work on skinny characters.

Now I have just about finished implementing the end of line markers, although as I write there is one remaining bug which is quite obvious in the screenshot below in the form of the vertical bars that shouldn't be there:

At first glance, this looks mostly like a normal C64 text mode display.  However the entire screen is described using only about 80 bytes each of screen and colour RAM:

The screen RAM:

:0400 01 C0 02 00 03 00 04 00 05 00 FF FF 06 00 07 00
:0410 08 00 FF FF FF FF FF FF FF FF FF FF 09 00 0A 00
:0440 0B 00 FF FF 0C 00 FF FF 0D 00 FF FF

The colour RAM:

:D800 00 01 02 03 04 05 07 01 02 02 02 02 02 0E 0F 0E
:D810 0E 0E 0E 0E 00 00 00 00 02 03 04 05 02 08 01 02
:D820 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
:D830 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
:D840 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
:D850 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E

To follow what is going on, remember that in this mode ($D054 = $01), two bytes are used to describe each character.  The first byte is the low 8 bits of the character number, and the low nybl of the second byte are extra character number bits.  The top two bits of the second byte set the width of the character as it is displayed on screen.

Thus $01 $C0 at $0400 draws only the left two pixels of the letter A, which shows up as the stumpy black line in the screen shot.  The rest of the row of text is now offset by 6 pixels compared to normal.

Normal characters are encoded from $0402 - $0409.  This is followed by $FF $FF which tells the VIC-IV that there is an early end of line.  Thus the letter F described by $06 $00 at $040C-$040D appears at the beginning of the next line, and no other characters appear to the right of the letter E.

Colour RAM is drawn in a somewhat strange way, that I will probably fix.  Within a row, the colour RAM bytes are read one per character, and so $D801 has $01 (white), and this is applied to the letter B encoded in $0402-$0403.  That seems quite reasonable.  But following an end of line, the colour RAM address catches up with the screen RAM.  What I intend to do is make the colour RAM address advance two bytes for each character, so that the extra byte of information can be used.  I might use this to allow skinny characters to be odd widths, and also to have a kind of super-extended-background-mode, where the other bits select the background colour.

However, before I do any of that, I need to fix the bug that happens when a character row consists only of a $FF $FF end of row marker. In that case the length of the character data is incorrectly set to the maximum value, instead of zero, and so whatever rubbish was hanging around in the character raster buffer gets redrawn.

Although, as I write that, I am not entirely convinced that this is the whole story.  Indeed, it seems that the three bars are the contents of $3894, $300C, and $80CB. Very strange.

Hopefully my little bug-fix will work, otherwise I will just specify that each row must have at least one character, so you would use something like $20 $00 $FF $FF to make a row empty with one space at the beginning.

Thursday 30 October 2014

Hardware support for proportional fonts and other text mode effects

I was talking with Jeremy in the lab today, and we got talking about hardware support for proportional fonts on the C65GS. I had thought about doing this before, but had put it on the back-burner for a while, because I hadn't really come up with an elegant solution.

While we were talking today, however, we got talking about how I had 2 spare bits in the character number in 16-bit text mode, where two screen RAM bytes are used for each character.  I don't remember exactly who came up with which part of it, but by the end of it, we had come up with a workable solution not only for hardware support for proportional fonts, but anti-aliased fonts, too (more on that in a future post).

Proportional fonts really just requires specifying the width of each character, so that some can be narrower than others.  With two spare bits, it was easy enough to allow characters to be 8, 6, 4 or 2 pixels wide. Characters more than 8 pixels wide can then be constructed with any number of 8 pixel wide characters, and one narrow character to make it easy to obtain any even number of pixels in width.  It would be nice to allow odd widths, but it feels like it is a reasonable trade-off to have to round to the nearest 2 pixels.

Similarly, tall fonts can be constructed with multiple rows of text.  For fonts not a multiple of 8 pixels high, a raster split could be used to skip one or more rows of pixels.

The result is conceptually very simple.  The main trade-off is that skinny characters still use the same amount of RAM as full-width ones, but that seems a reasonable trade-off.  In fact, it was so simple that I was able to implement it in about an hour, and the main functionality worked first time, as can be seen below:

Each row has a different width specified for the "A" at the beginning of the line.

To use this feature, you first have to enable 16-bit text mode, where two bytes of screen memory describe each character on screen.  This is done by setting bit 0 in $D054.

In terms of screen RAM, the memory for the rows here looks something like:

0400 01 00 02 00 03 00 04 00 05 00
0428 01 40 02 00 03 00 04 00 05 00
0450 01 80 02 00 03 00 04 00 05 00
0447 01 c0 02 00 03 00 04 00 05 00

The 01, 02 ... 05 is the character numbers for A through E, and these stay the same.  For all but the A's, the 2nd byte is 00 to indicate that no special attributes are set for the characters B through E.  However, the high-byte for the A characters is modified in each row to have all possible combinations in bits 6 and 7: the higher the value, the narrowerer the character.

While I am fiddling with text attributes, I should explain what the other bits in the high byte mean:

bits 0 - 3 = bits 11 - 8 of the character number.  i.e., there can now be 4,096 characters in a character set.
bit 4 = flip character horizontally
bit 5 = flip character vertically

The following screen shot shows the flip bits in action:

The contents of screen RAM is:

0400 01 00 02 10 03 20 04 30 05 00

The ability to flip characters is designed to be used with full-colour text mode, where (some or all) characters on the screen consist of 64 8-bit pixels, providing a graphics mode that can be quickly scrolled.  

Flipping characters in such a mode allows 64-byte characters to be reused in a graphical display without too much obvious repetition, e.g., for in textures in games.  

Combining this with variable width characters introduces even more opportunity to reuse characters, and thus allow more interesting and complex high-resolution graphics within the limits of the 128KB of chipram.

Wednesday 29 October 2014

My real C65 keyboard has arrived, very well wrapped

A while back I won an e-bay auction for a genuine C65 keyboard (without printing on the top of the keys).  Today it arrived, as you can see below.  It was nice to again touch a C65 keyboard after about four years of not having one.  The lack of printing on the top of the keys is not a big problem, as the key positions are fairly easy to figure out, and the printing on the front provides strong clues as well.

I also have to commend the seller for his meticulous packing.  Below you can see lots of soft packing material, which I could only get to once I had cut through quite a lot of tape that held the box very well closed.

Inside that was the following inner part, also completely mummified in tape.  This keyboard had no hope of escaping in transit.

Then once I de-mummified that, it was itself a further very large bag, sealed at the mouth with further tape:

Removing that tape, the quite large bag looked to also be heat-sealed.

Finally inside that the keyboard was sitting in its original light-blue Mitsumi packing bag, and in very good condition -- better than the original one I had, on which I had to repair a track on the cable when it first arrived.

So now I have a real C65 keyboard, a recreation C65 motherboard, lacking only the custom chips, and my FPGA design.  I am getting closer to being able to make a complete working unit with real hardware.  I am still leaning towards making a laptop form-factor C65GS, if only I can find a nice LCD panel that can do 1920x1200 or 1920x1080 and isn't too big.  About 13" would be ideal, I think.

Hardware thumbnail generation is now usable

Today I had a chance to fix a few bugs with the hardware thumbnail generation, and actually test it out by writing a small program that does a raster split, with the thumbnail being draw in 256-colour mode in the bottom corner of the screen every frame, as you can see below:

There are still a few glitches, but is is working pretty nicely.  In particular, you can see that it is being drawn in real-time, because the thumbnail contains an image of the thumbnail that contains an image of the thumbnail :)

Things you can't see, is that the thumbnail is a bit different every frame, I think because the counter that decides which raster to look at isn't reset at the start of each frame.  This causes some weird things to happen.  Also, the thumbnail just contains the value of chosen pixels.  I am considering changing this so that it shows the average of the pixels in the sample area of the raster line, but at the same time, the current scheme works fairly well.

This is a nice milestone for a few reasons.

First, the hardware thumbnail generator is clearly working to some reasonable degree.

Second, full-colour text mode is working fairly well as well, such that I could write this program.

And finally, I have actually written a programme that does something (slightly) useful, using C65GS special features, and it works :)

Next stop is to fix the counter problem, and see if I am then happy enough with it, and if so, to move on to some of the other interesting things in my queue, like enhanced sprites.

Thursday 23 October 2014

Hardware thumbnail generator for task-switcher

One of the main reasons for implementing the hypervisor is so that it will be possible to switch between different tasks running on the machine.  The tasks won't be running at the same time, but rather they will be suspended while another task is running.

For a task-switcher to be nice, it would be really handy to be able to show a low-res screen-shot of the last state of each task so that the user can visually select which one they want.  In other words, to have something that is not too unlike the Windows and OSX window/task switcher interfaces.

However, this is tricky on an 8-bit computer that has no frame buffer, and may be using all sorts of crazy raster effects.

Thus I need some way to have the VIC-IV update a little low-res screen shot, i.e., a thumbnail image, that the hypervisor can read out, and retain for later task-switching calls to show the user what was running in each task before they were suspended.

So I set about implementing a little 4KB thumbnail buffer which is automatically written to by the VIC-IV, and which can be read from the hypervisor.  This resolution allows for 80x50, which should be sufficient to get the idea of what is on a display.  Each pixel is an 8-bit RRRGGGBB colour byte.

Because the VIC-IV writes the thumbnail data directly from the pixel stream, it occurs after palette selection, sprites and all raster effects.  That is, the thumbnails it generates should be "true".

After a bit of fiddling around, it is mostly working.

To test it, I wrote a little BASIC programme that reads from the one-byte access to the 4KB buffer, copying it to $4000-$4FFF.  Then I used the serial monitor to grab that copy of the data, and wrote some UNIX shell scripts and a little C programme to munge it into an 80x50 Windows BMP file.

Here is how it looks, with the image rather enlarged to make it easier to see:

While not perfect, it is an improvement on the first capture, where I forgot to read from the start of the thumbnail buffer, so it was all out of whack:

In need to find out what is causing the "clouds", and also why it is writing only 77 pixels per line instead of 80 pixels per line.

But other than these problems, I am well on the way to being able to present a nice graphical display to allow for switching between tasks from the hypervisor.

A 3rd party operating system for the C65GS?

I was quite surprised today in a happy way to see that someone is considering making an operating system for the C65GS:

This would be a port of the yet-to-be-complete ACE128 operating system.

I understand that Miro could do with some contributors to help, whether for the 65GS or 64/128 version.

Sunday 19 October 2014

Virtualisation and Task Switching

One of the features I have wanted to include in the C65GS from early on is some sort of task switching and rudimentary multi-tasking.

Given the memory and processor constraints, I don't see the C65GS as running lots of independent processes at the same time.  Rather, I want it to be possible to easily switch between different tasks you have running.

For example, you might be using Turbo Assembler to write some code, and decide to take a break playing a game for a few minutes, but don't want to have to reload Turbo Assembler and your source code again.

Or better, with a patched version of Turbo Assembler you might want to edit in one task and have it assemble into a separate task, and switch back and forth between them as you see fit.

It would also be nice to be able to have certain types of background processing supported.  For example, being able to leave IRC or a download running in the background, with it waking up whenever a packet arrives or a timeout occurs.

For all these scenarios, it also makes sense to be able to quarantine one task from another, so that they cannot write to one another' memory or IO without permission.  This implies the need for some sort of memory protection, and supervisor mode that can run a small operating system to control the tasks (and their own operating systems) running under it.

Thus, what we really want is something like VirtualBox that can run a hypervisor to virtualise the C65GS, so that it can have C64 or C65 "guest operating systems" beneath, and keep them all separate from each other.

This doesn't actually need much extra hardware to do in a simplistic manner.

First, we need the supervisor/hypervisor CPU mode that maps some extra registers.  I have already implemented this with registers at $D640-$D67F.

Second, to make hypervisor calls fast, the CPU should save all CPU registers and automatically switch the memory map when entering and leaving the hypervisor.  I have already implemented this, so that a call-up into the hypervisor takes just one cycle, as does returning from the hypervisor.

Third, we need to make the hypervisor programme memory only visible from in hypervisor mode.  I have already implemented this.  The hypervisor program is mapped at $8000-$BFFF, with the last 1KB reserved as scratch space, relocated zero-page (using the 4510's B register), and relocated stack (again, using the 4510's SPH register).  I am in the process of modifying kickstart so that it works as a simple hypervisor.

Fourth, we need some registers that allow us to control which address lines on the 16MB RAM are available to a given task, and what the value of the other address lines should be.  This would allow us to allocate any power-of-two number of 64KB memory blocks to a task.  When a task is suspended, it's 128KB chipram and 64KB colour RAM and IO status can be saved into other 64KB memory blocks that are not addressable by the task when it is running.  This I have yet to do.

Fifth, we need to be able to control what events result in a hypervisor trap, so that background processes can run, and also so that the hypervisor can switch tasks.  The NMI line is one signal I definitely want to trap, so that pressing RESTORE can activate the hypervisor.

By finishing these things, and then writing the appropriate software for the hypervisor, it shouldn't be too hard to get task switching running on the C65GS.