In a recent post, I worked out how I am going to store SMS conversations, contacts, phone state and a pile of other stuff in D81 disk images, for convenient manipulation.
So now I need some tools that let me create those disk images and prepare their contents. Now, some of these functions will be needed on the MEGAphone, and others on the Linux build system. So I should try to make as many of them as common as possible to save re-work. This will also allow me to make unit tests for stuff that needs to run on the MEGA65.
So let's start with code that can take a 800KB blank disk image (or literally empty 800KB file on Linux) and populate it with a valid 1581 disk label and BAM sectors.
This is fairly straight-forward:
Disk header + first BAM sector on track 40 sector 0 (note I'm talking about physical 512 byte sectors here, because that's what the MEGA65 FDC sees).
Second BAM sector in first half of track 40 sector 1 + first 8 directory entries.
The wikipedia page for the 1581 has a reasonable run-down on what goes in the header, as well as the BAM sectors. There is an even better description here.
Now, for almost all these weird disk images for the telephony software, we also want to allocate all data blocks in a single sequential file, as far as the 1581 file system is concerned, so that if someone VALIDATEs such a disk image Nothing Bad Will Happen. Unlike, say, GEOS formatted disks.
To make the code runnable on Linux and on the MEGA65, I'll need to make a Hardware Abstraction Layer (HAL), just as we did with the MEGA65 FDISK program -- so this isn't the first time we've done this kind of thing. The correct HAL include will be selected at compile time, and the necessary files linked in:
#include <stdio.h>
extern unsigned char sector_buffer[512];
#define SECTOR_BUFFER_ADDRESS ((unsigned long long) §or_buffer[0])
void hal_init(void);
void lfill(unsigned long long addr, unsigned char val, unsigned int len);
void lcopy(unsigned long long src, unsigned long long dest, unsigned int len);
void write_sector(unsigned char drive_id, unsigned char track, unsigned char sector);
char mount_d81(char *filename, unsigned char drive_id);
char create_d81(char *filename);
One of the challenges for the actual code to handle the disk images etc, is that it needs to be as compact as possible for compilation with CC65. Also, we have to remember that an int on CC65 is 32 bits, not 16 bits. Also, function calls are not particularly cheap.
Using our HAL, we can produce fairly compact code to assemble and write the sectors:
void format_image_fully_allocated(char drive_id,char *header)
{
char i,j;
/* Header + BAM Sector 1
---------------------------------------- */
// Erase sector buffer
lfill(SECTOR_BUFFER_ADDRESS,0x00,512);
// Clear header name
lfill(&header_template[4],0xa0,16);
// Set header name
for(i=0;(i<16) && (header[i]);i++) {
header_template[4+i] = ascii_to_petscii(header[i]);
}
lcopy(header_template,SECTOR_BUFFER_ADDRESS, sizeof(header_template));
// BAM is easy: Just a valid header, and leave all bitmap bytes $00, to
// indicate that there is no free space on the disk.
lcopy(bam_template,SECTOR_BUFFER_ADDRESS + 0x100,sizeof(bam_template));
write_sector(drive_id,40,0);
Note that for CC65 compatibility, we also have to declare variables at the start of blocks, because it generally uses an older C dialect.
So now I should make a tool that tries to create the set of D81s and test it under Linux, and then on the MEGA65.
This will start with a CONTACT0.D81 and STATE.D81 files. On Linux we can also create the directory structure. But I don't think we have the MKDIR Hypervisor call on the MEGA65 implemented yet. The MKFILE call does seem to be implemented, though -- but only supports creating normal files with sizes that are multiples of 512KB (well, allocation happens in multiples of 512KB -- it will write whatever specific file size is requested into the length field). So in theory we could use that as the basis for MKDIR, just changing the attribute byte and adding the . and .. special directory entries.
But let's instead just pike out, and require that the directory structure be created elsewhere for now, and only create it if we are on Linux. We can insert the SD card into a reader on the Linux machine as an easy way to bootstrap things. Otherwise I'll be lost for half a week or more just getting MKDIR to work, especially since I don't know if MKFILE is working reliably.
In fact, there is probably an argument for fully populating the whole structure with empty D81s. 1,580 contacts, each with an 800KB D81 for contacts and some supporting directory structure is still only a little over 1GB. And having everything pre-allocated will only speed things up. The main issue with doing that is when we sort the contacts, the mapping to the contact directories would change.
We can deal with that by pre-populating a contact UID in each otherwise empty contact, so that when the contact lists get sorted the underlying physical contact directory doesn't change. We then just have to make sure we erase the message log for a contact, and any other stuff we have in the contact directory, whenever we delete a contact, so that the message thread doesn't get accidentally associated with some new contact later on.
But first, let's check if I'm generating valid D81 files.
I found a couple of entertaining bugs: My check for not writing over the directory track when creating the sector chain was checking for track 0x40, not track 40 (decimal). Whoops!
I also realised I hadn't set the sector count for the pseudo-file I create that links to the chained sectors on the whole disk, so it was showing as 0 blocks. It's only cosmetic, but still worth fixing.
The next thing is that I know I've screwed up the sector linking somehow.
It should show monotonically increasing sector numbers in the links every 256 (0x100) bytes). But it looks like this (in correct links in bold).
00000000 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000100 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000200 01 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000300 01 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000400 01 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000500 01 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000600 01 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000700 01 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000800 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000900 01 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000a00 01 0b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000b00 01 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
So the links in the start of each 512 byte physical sector look fine, but those of the 2nd logical sector in each physical sector are wrong: I had forgotten to put the x2 in that I had for the first.
That got it better, but the final sector of the disk pointed to track 81, instead of having an end-of-file marker. Fixed that, and now c1541 can extract the file without error
c1541 -attach ../PHONE/STATE.D81 -extract
$ ls -latr tmp/all\ sectors
-rw-rw-r-- 1 paul paul 802639 Jul 28 10:14 'tmp/all sectors'
This looks encouraging. There should be 79 tracks (80 minus 1 for the directory track) x 20 physical sectors x 2 logical sectors per physical sector x 254 bytes per sector = 802640 bytes. The file is only 1 byte short. So maybe the end of file marker should be 0x00 not 0xFE. Yup --- that got the extra byte.
So the question is whether the file contents correctly represents the contents of the disk, and the sectors are in order. To make this easy to check, I added some code that writes the track and sector into the first data bytes of each sector, so that when I extract the "file", I can check the content of that file for the monotonically increasing track and sector numbers.
// First half always links to 2nd half of physical sector
lpoke(SECTOR_BUFFER_ADDRESS+0x000,i);
lpoke(SECTOR_BUFFER_ADDRESS+0x001,j*2+1);
#define CONTENT_TEST
#ifdef CONTENT_TEST
lpoke(SECTOR_BUFFER_ADDRESS+0x000+2,i);
lpoke(SECTOR_BUFFER_ADDRESS+0x001+2,j*2+1);
#endif
// Second half points to next physical sector, or to first
// sector of next track (or track 41 if we are on track 39,
// so that we skip the directory).
if (j<(20-1)) {
lpoke(SECTOR_BUFFER_ADDRESS+0x100,i);
lpoke(SECTOR_BUFFER_ADDRESS+0x101,j*2+1+1);
#ifdef CONTENT_TEST
lpoke(SECTOR_BUFFER_ADDRESS+0x100+2,i);
lpoke(SECTOR_BUFFER_ADDRESS+0x101+2,j*2+1+1);
#endif
} else {
lpoke(SECTOR_BUFFER_ADDRESS+0x100, (i==39)?41:(i+1));
lpoke(SECTOR_BUFFER_ADDRESS+0x101,0);
#ifdef CONTENT_TEST
lpoke(SECTOR_BUFFER_ADDRESS+0x100+2, (i==39)?41:(i+1));
lpoke(SECTOR_BUFFER_ADDRESS+0x101+2,0);
#endif
}
if (i==80&&j==(20-1)) {
// Terminate chain at end of disk
lpoke(SECTOR_BUFFER_ADDRESS+0x100, 0x00);
lpoke(SECTOR_BUFFER_ADDRESS+0x101, 0xff);
#ifdef CONTENT_TEST
lpoke(SECTOR_BUFFER_ADDRESS+0x100+2, 0x00);
lpoke(SECTOR_BUFFER_ADDRESS+0x101+2, 0xff);
#endif
}
This is all great, and I can see the data markers in the file:
$ hexdump tmp/all\ sectors
0000000 0101 0000 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
*
00000f0 0000 0000 0000 0000 0000 0000 0000 0201
0000100 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0301 0000
0000200 0000 0000 0000 0000 0000 0000 0000 0000
*
00002f0 0000 0000 0000 0000 0000 0401 0000 0000
0000300 0000 0000 0000 0000 0000 0000 0000 0000
*
So that's all good. The markers are every 254 bytes, because the first two bytes of each sector are used for the link to the next track and sector. This is why the position of the markers looks like it moves backwards.
That's all great.
What has me confused is that while the markers are sector then track. I was getting quite worried, until I remembered that hexdump by default shows 16 bit values, i.e, with byte order swapped. Why? No idea. But the point is that it works: I'm creating D81 files that have all sectors allocated into one big chain, and it's all working.
So now I can work on some tools that take a fictional contact list and message stream of messages between contacts, and smoosh them into the correct D81 files --- and then make a tool that goes the other way, i.e., extracts a similar canonical list of contacts and messages from the disk images. With those tools, I'll then be in a position to be able to easily test the actual routines for receiving and logging messages etc.
So let's start by coming up with a simple canonical format for contacts with first and last name, fictional phone numbers, and then generating message traffic. This sounds like a good job for some vibe coding with an LLM. And indeed it's done a pretty fine job quickly for me, letting me generate files like this:
CONTACT:+9912208959:Donald:Hall:
CONTACT:+9992142067:Jeffrey:Jones:
CONTACT:+9991073060:Nicole:Mccarty:
CONTACT:+9911406450:Alyssa:Esparza:
CONTACT:+9961792880:Sophia:Sanchez:
CONTACT:+9971265517:Derek:Perry:
CONTACT:+9995659445:Jason:Williams:
CONTACT:+9912477184:Erica:Mack:
CONTACT:+9997053229:Rita:Lee:
CONTACT:+9975118477:Christopher:Tate:
MESSAGERX:+9971265517:1325413834:sed tempor et sit labore ipsum:
MESSAGERX:+9961792880:1325401304:π MEGA65 lorem elit sed:
MESSAGERX:+9997053229:1325377469:eiusmod amet MEGAphone consectetur tempor et sit sit:
MESSAGERX:+9912477184:1325340580:✨ππΆπΆπ sit eiusmod incididunt dolore sed:
MESSAGERX:+9991073060:1325349012:π¨π±ππΉπ³πΈ tempor amet ut aliqua amet ut sit elit consectetur sed:
MESSAGERX:+9911406450:1325396204:adipiscing elit lorem elit dolore labore labore sed magna elit π₯πππ€¬π±πΏ:
MESSAGETX:+9961792880:1325418258:ππ sed amet adipiscing amet labore ipsum incididunt:
MESSAGETX:+9912208959:1325369837:amet dolor labore π€―π³ππ΄π:
MESSAGETX:+9911406450:1325447029:magna adipiscing et ipsum dolor dolore incididunt sed ut:
MESSAGERX:+9911406450:1325514412:et amet sed do amet adipiscing ut ut et amet:
The only mistake it made was to use +99 as an unallocated prefix, instead of +999. But that was an easy fix.
So now I need to make the Linux tool that populates the CONTACT0.D81 with the contacts from such a file. Once I have that working, then I can work on the logging of messages.
We want to convert the CONTACT lines above into the binary format I defined in an earlier blog post. But before we can do that, we need to be able to encode and decode fields in records. We'll also want to allocate and free records, too. So I'll tackle all that first.
Let's start with building our 512 - 2x2 = 508 byte records (split by the track and sector marker for the 2nd 256-byte logical sector). We can easily go from packed to sectorised format:
void sectorise_record(unsigned char *record,
unsigned char *sector_buffer)
{
lcopy((unsigned long)&record[0],
(unsigned long)§or_buffer[2],254);
lcopy((unsigned long)&record[254],
(unsigned long)§or_buffer[256+2],254);
}
void desectorise_record(unsigned char *sector_buffer,
unsigned char *record)
{
lcopy((unsigned long)§or_buffer[2],
(unsigned long)&record[0],254);
lcopy((unsigned long)§or_buffer[256+2],
(unsigned long)&record[254],254);
}
This lets us then just treat the records as 508 contiguous blocks. We can then have very nice simple code to add, delete and find fields in these records:
char append_field(unsigned char *record, unsigned int *bytes_used, unsigned int length,
unsigned char type, unsigned char *value, unsigned int value_length)
{
// Insufficient space
if (((*bytes_used)+1+1+value_length)>=length) return 1;
// Field too long
if (value_length > 511) return 2;
// Type must be even
if (type&1) return 3;
record[(*bytes_used)++]=type|(value_length>>8);
record[(*bytes_used)++]=value_length&0xff;
lcopy((unsigned long)value,(unsigned long)&record[*bytes_used],value_length);
(*bytes_used)+=value_length;
return 0;
}
char delete_field(char *record, unsigned int *bytes_used, unsigned char type)
{
unsigned int ofs=2; // Skip record number indicator
unsigned char deleted=0;
while(ofs<=(*bytes_used)&&record[ofs]) {
unsigned int shuffle = 1 + 1 + ((record[ofs]&1)?256:0) + record[ofs+1];
if ((record[ofs]&0xfe) == type) {
// Found matching field
// Now to shuffle it down.
// On MEGA65, lcopy() has "special" behaviour with overlapping copies.
// Basically the first few bytes will be read before the first byte is written.
// The nature of the copy that we are doing, shuffling down, is safe in this context.
lcopy((unsigned long)&record[ofs+shuffle],(unsigned long)&record[ofs],(*bytes_used)-ofs-shuffle);
(*bytes_used) -= shuffle;
deleted++;
// There _shouldn't_ be multiple fields with the same type in a record, but who knows?
// So we'll just continue and look for any others.
} else ofs+=shuffle;
}
return deleted;
}
char *find_field(char *record, unsigned int bytes_used, unsigned char type, unsigned int *len)
{
unsigned int ofs=2; // Skip record number indicator
while(ofs<=(*bytes_used)&&record[ofs]) {
unsigned int shuffle = 1 + 1 + ((record[ofs]&1)?256:0) + record[ofs+1];
if ((record[ofs]&0xfe) == type) {
*len = ((record[ofs]&1)?256:0) + record[ofs+1];
return &record[ofs+2];
}
ofs+=shuffle;
}
return NULL;
}
Note that we have reserved the first two bytes of each record to have an immutable record number. We can't use the track and sector marker for this, because if the records get sorted, their unique identifier would change. For the main contacts database we want those to remain in order and unchanging: Contacts can only be added, deleted or edited there. It's only in the sorted lists that they will be rearranged, and where those record numbers will be used to point back to the original record, which is the one that will actually get edited.
We'll deal with those sorting and searching functions shortly. But first, lets actually build, store and retrieve actual contact records, and pre-set the record numbers on the disk image ready for use.
Lots of squashing the usual variety of bugs later, it looks like I can write contact records to the CONTACT0.D81 disk image:
00000000 01 01 3f 00 00 00 00 00 00 00 00 00 00 00 00 00 |..?.............|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000100 01 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000200 01 03 00 00 02 06 4a 65 72 72 79 00 04 09 57 69 |......Jerry...Wi|
00000210 6c 6c 69 61 6d 73 00 06 0d 57 69 6c 6c 69 61 6d |lliams...William|
00000220 73 00 01 c7 de 70 00 00 00 00 00 00 00 00 00 00 |s....p..........|
00000230 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000300 01 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000310 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000400 01 05 00 00 02 08 59 6f 6c 61 6e 64 61 00 04 09 |......Yolanda...|
00000410 41 6e 64 65 72 73 6f 6e 00 06 0d 41 6e 64 65 72 |Anderson...Ander|
00000420 73 6f 6e 00 01 c7 de 70 00 00 00 00 00 00 00 00 |son....p........|
00000430 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000500 01 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000510 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
... well, the names are written, but we're writing the last name twice, instead of the phone number. But that's an easy fix.
Now that I have all the machinery for contacts, I should be able to do much the same for messages, so that I have the import direction complete.
Importing messages is more complex, because we have to look up the contact from the incoming phone number. And to do that, we need our sorted versions of the CONTACT0.D81 file. So I guess I need to make the sorting routines next.
We want the sort routine to be (a) reasonably fast/efficient; and (b) still able to run on a MEGA65 without attic RAM. This is a bit annoying, because we don't have enough RAM to load a whole D81 into RAM to sort the records. We can probably read 64KB or 128KB at a time, sort those sectors and write them back, and then step forward through the disk image sorting each portion. If we make those portions overlap, and check if we changed the order of any sector in any portion, we can repeat the process until the list is sorted. Basically at the large-scale it's a bubble sort. At the fine scale, we could get fancier than a bubble sort.
Okay, after poking about and using ChatGPT, I've settled on a quick sort for the inner sort, and then a merge sort of the resulting 10 sorted 80KB slabs. This requires only one full read of the input disk, a full write to a scratch disk image, then reading that scratch disk image track-at-a-time, and then writing linearly to the sorted disk image.
That's much less bad than my naive approach. It does increase the complexity a bit, but hopefully not too much. I'm going to write some tests for this, because I can just smell that it's going to be a great breeding ground for esoteric bugs.
Once it is working, it will be great, though, because it exercises all sorts of things in this system, including sectorisation and desectorisation of data structures, searching for fields in records and goodness-knows-what else.
Making progress there. I do get records written into the sorted D81 -- but every record is the same. It looks like the records in the scratch D81 are sorted correctly for each slab. So it must be the merge that is wonky. Ok, found the problem: I wasn't fetching the record to actually be written, so it was just stamping out the old contents of the sector buffer each time. But with that fixed, the darn thing seems to work! Here's the contacts sorted by first and last names:
$ hexdump -C PHONE/SORT02-0.D81 | tail -42
*
000c7600 01 07 00 00 02 07 43 61 72 6d 65 6e 00 04 06 53 |......Carmen...S|
000c7610 6d 69 74 68 00 06 0d 2b 39 39 39 38 36 33 30 32 |mith...+99986302|
000c7620 39 34 35 00 00 00 00 00 00 00 00 00 00 00 00 00 |945.............|
000c7630 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7700 01 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7710 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7800 01 09 00 00 02 05 45 6d 6d 61 00 04 08 4a 6f 68 |......Emma...Joh|
000c7810 6e 73 6f 6e 00 06 0d 2b 39 39 39 37 33 30 31 34 |nson...+99973014|
000c7820 35 31 32 00 00 00 00 00 00 00 00 00 00 00 00 00 |512.............|
000c7830 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7900 01 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7910 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7a00 01 03 00 00 02 06 4a 65 72 72 79 00 04 09 57 69 |......Jerry...Wi|
000c7a10 6c 6c 69 61 6d 73 00 06 0d 2b 39 39 39 32 30 39 |lliams...+999209|
000c7a20 31 35 30 36 30 00 00 00 00 00 00 00 00 00 00 00 |15060...........|
000c7a30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7b00 01 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7b10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7c00 01 0b 00 00 02 07 4e 69 63 6f 6c 65 00 04 07 48 |......Nicole...H|
000c7c10 6f 6c 6d 65 73 00 06 0d 2b 39 39 39 36 36 30 34 |olmes...+9996604|
000c7c20 39 33 37 32 00 00 00 00 00 00 00 00 00 00 00 00 |9372............|
000c7c30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7d00 01 0c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7d10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7e00 01 05 00 00 02 08 59 6f 6c 61 6e 64 61 00 04 09 |......Yolanda...|
000c7e10 41 6e 64 65 72 73 6f 6e 00 06 0d 2b 39 39 39 38 |Anderson...+9998|
000c7e20 35 34 32 33 30 39 38 00 00 00 00 00 00 00 00 00 |5423098.........|
000c7e30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7f00 01 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7f10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c8000
$ hexdump -C PHONE/SORT04-0.D81 | tail -42
*
000c7600 01 05 00 00 02 08 59 6f 6c 61 6e 64 61 00 04 09 |......Yolanda...|
000c7610 41 6e 64 65 72 73 6f 6e 00 06 0d 2b 39 39 39 38 |Anderson...+9998|
000c7620 35 34 32 33 30 39 38 00 00 00 00 00 00 00 00 00 |5423098.........|
000c7630 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7700 01 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7710 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7800 01 0b 00 00 02 07 4e 69 63 6f 6c 65 00 04 07 48 |......Nicole...H|
000c7810 6f 6c 6d 65 73 00 06 0d 2b 39 39 39 36 36 30 34 |olmes...+9996604|
000c7820 39 33 37 32 00 00 00 00 00 00 00 00 00 00 00 00 |9372............|
000c7830 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7900 01 0c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7910 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7a00 01 09 00 00 02 05 45 6d 6d 61 00 04 08 4a 6f 68 |......Emma...Joh|
000c7a10 6e 73 6f 6e 00 06 0d 2b 39 39 39 37 33 30 31 34 |nson...+99973014|
000c7a20 35 31 32 00 00 00 00 00 00 00 00 00 00 00 00 00 |512.............|
000c7a30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7b00 01 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7b10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7c00 01 07 00 00 02 07 43 61 72 6d 65 6e 00 04 06 53 |......Carmen...S|
000c7c10 6d 69 74 68 00 06 0d 2b 39 39 39 38 36 33 30 32 |mith...+99986302|
000c7c20 39 34 35 00 00 00 00 00 00 00 00 00 00 00 00 00 |945.............|
000c7c30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7d00 01 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7d10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7e00 01 03 00 00 02 06 4a 65 72 72 79 00 04 09 57 69 |......Jerry...Wi|
000c7e10 6c 6c 69 61 6d 73 00 06 0d 2b 39 39 39 32 30 39 |lliams...+999209|
000c7e20 31 35 30 36 30 00 00 00 00 00 00 00 00 00 00 00 |15060...........|
000c7e30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c7f00 01 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000c7f10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
I know I shouldn't be amazed, but it's still a really nice feeling, because while this code is running under Linux, it's been written using the faked MEGA65 C API, and should be able to be easily compiled for the MEGA65 when I get to that point.
But first, let's make our export utility, that will let us extract the contacts we've loaded in, so that we can make unit tests that populate and then check.
Okay, that's now working -- and thanks to all the foundation work to date, wasn't particularly hard to do. So now I can run a command like this:
$ python3 src/telephony/sms-stim.py 5 10 ; \
make src/telephony/linux/{import,export,provision} \
&& src/telephony/linux/provision \
&& src/telephony/linux/import stim.txt \
&& src/telephony/linux/export export.txt
And then export.txt will end up containing something like this:
$ cat export.txt
CONTACT:+99920915060:Jerry:Williams:0:
CONTACT:+99985423098:Yolanda:Anderson:0:
CONTACT:+99986302945:Carmen:Smith:0:
CONTACT:+99973014512:Emma:Johnson:0:
CONTACT:+99966049372:Nicole:Holmes:0:
And they match! The only difference is that exporting contacts includes the number of unread messages for each contact. I've since updated the import program to allow setting those values, too.
So the next step is to import and export messages. Exporting will be fairly easy. Importing is a bit fiddlier. We need to have our contact index by phone number working, so that the messages can be efficiently inserted. So let's work on making the indices.
As a reminder, the indices will index every two-byte pair in the field being indexed. But since we have only 1,580 sectors available in an index D81, each of which can hold two index pages, we can't index all 256x256 combinations. Our practical limit is 56x56 = 3,136 halves = 1,568 full sectors.
Now, it would be nice if we could have had 64x64, as we could just mask bytes into the 0x20 -- 0x7F range, which would make indexing unicode UTF-8 sequences a bit nicer easier. We might just be able to do it, if we accept the trade-off of having index pages spanning sectors. This is because each index is a 1-bit bitmap of 1,580 bits = 198 bytes. 790K / 198 bytes = 4,085, which is just under 64x64. Drat! But 799KB (less 8 bytes per KB for CBM DOS compatible sector links) is just over 4096x198 bytes. So if we limit ourselves to a single directory sector we can make it work. It's a bit of a hack.
The question is whether it's worth the hack? I'd have to modify the disk image provisioning stuff to handle it. Index searches would require up to twice as many sector accesses, too. The more I think about it, I don't think it is worth it. Instead we can just use repeated subtraction to wrap all byte values into the modulo 56 space. For 8-bit values we only need to subtract at most 4 types, which feels reasonable. The only niggling concern I have is if this results in unfortunate aliasing. Both problems could be mitigated by using a custom byte mapping with a 256-byte lookup table. Probably requires barely more RAM than implementing the repeated subtraction.
An related topic is selecting between case sensitive and case-insensitive indexing. The norm is case-insensitive, so we'll just make our 256-byte mapping table handle that.
I think the mapping table can just be 1:1 except for the upper/lower case chars. Forming sieves of each diphthong's bitmap should let us filter very efficiently, I think. I could get all fancy pants and add the bitmaps together so that we can get a sense of quality of match via number of matching diphthongs, that would also allow for ranking and thus accepting partial matches where no exact matches exist. That's outside of our core requirements, and would require a 1,580 byte structure when searching, which would be a similarity score of each message slot to the query text, which will be really nice.
And maybe map space, tab, carriage-return and line-feed all to the same value, so that the type of white-space doesn't matter so much? But I'm not even sure that will be necessary if I have this nice similarity score thing happening.
So let's make that mapping table with just case folding, make the index builder, and see how it works. For efficiency, we don't want to do random reads and writes as we go over each diphthong in a message. So we should probably build the bitmap that represents the diphthongs present in the message being indexed (requires only 1,580 bits), and then go linearly through the index reading and updating each index -- not just setting the bits for the present diphthongs, but making sure we clear those not present in the record. This will make creating or modifying a contact or message take perhaps a second or two. But that feels acceptable. Otherwise we risk indices getting out of sync, which will make the search function mess up. So we'll start with that, and optimise it if we need to.
I've refactored a bunch of code for this, in part to handle if the work area we have is no the current 128KB. This is because if a message or contact is being added or edited, we will want to have the display active. But we need space in the precious chip RAM for the unicode font glyphs. It might be that I can still keep enough space for those and a large enough work area for the indexes and sorting. If so, great. But I suspect I'll have to reduce it from 128KB to perhaps 40KB or even smaller. We'll see. But the main thing is that we're prepared for it now.
We will want to take considerable care to keep the indices up to date, because to build the index for all contacts or all messages in a thread requires a complete pass over the index D81 for each contact or message. That can take a long time! Later, I should make this much faster by using a substantial work area to generate the index for the contacts and messages in batches. This should happen in a separate helper program with no messages on the screen, so that we can build the index bitmaps for perhaps a few hundred records in one go, and then only need a few passes, instead of currently potentially requiring 1,680 passes if a thread or contact D81 were full... which could take a couple of hours.
I'll get it working in this inefficient way first, and then work on optimising it.
But first I've gone through and refactored out all large buffers used by the different modules (eg sort vs indexing) that cannot call each other. By "large" I mean anything at all, since our code + data size is limited to about 40KB with CC65. But my initial target was the 508 record buffers that I need for sorting and indexing. I also made all the header files reentrent, as the dependencies between them is starting to get a bit more complex.
It really would be nice if the TE0725's with HyperRAM were in stock, so that we could have the extra 8MB as Attic RAM like on the desktop MEGA65. This would let us do all index generation in a single pass, and deal with this whole efficiency problem that way. I'll implement it all to work without Attic RAM for now, but I'll then probably just make some #ifdefs to switch to much more efficient algorithms (and not using precious chip RAM for the 128KB work area) when Attic RAM is available.
But let's keep on task for now, to actually produce the contact/message thread indexes.
And now I've broken something. Sorting the contacts D81 is failing. Found it: Just a stupid error in the slab_read() refactor, where I had it reading the slab from the wrong disk image.
Okay, so something is being written to the indexes now when I create them. But something is broken. Only very few index pages get set. And none of the index page bytes have more than a single bit set. All highly suspicious. Sure, I have only 5 contacts I'm indexing, but I still really expect more than 10 bits to be set in the index.
Yup, it looks like only diphthongs 0 and 0x780 are being set, and are being set for each contact. Let's figure out what's happening there. But I'm also not convinced that even those are being written out correctly, since the 10 bits that are being set are not localised to the two index pages, but a single bit each in 10 index pages.
Okay, so first problem there was that I wasn't actually using the table to map characters to the modulo 56 number space. As a result totally invalid diphthongs were being generated. I've fixed that now, but they aren't all being written. Something is still going wonky, possibly in writing them out, rather than the code that is generating the set of diphthongs now. This is what I see:
DEBUG: Diphthong 0x00b is in the indexed text.
DEBUG: Diphthong 0x26d is in the indexed text.
DEBUG: Diphthong 0x12a is in the indexed text.
DEBUG: Diphthong 0x402 is in the indexed text.
DEBUG: Diphthong 0x409 is in the indexed text.
DEBUG: Setting bit for diphthong 0x00b from record 1 at slab offset 0x00b00, slab 0
DEBUG: Setting bit for diphthong 0x12a from record 1 at slab offset 0x12a00, slab 0
DEBUG: Setting bit for diphthong 0x80b from record 1 at slab offset 0x08b00, slab 3
DEBUG: Setting bit for diphthong 0xa6d from record 1 at slab offset 0x06d00, slab 4
DEBUG: Setting bit for diphthong 0x100b from record 1 at slab offset 0x10b00, slab 6
DEBUG: Setting bit for diphthong 0x126d from record 1 at slab offset 0x0ed00, slab 7
DEBUG: Setting bit for diphthong 0x1402 from record 1 at slab offset 0x00200, slab 8
DEBUG: Setting bit for diphthong 0x1409 from record 1 at slab offset 0x00900, slab 8
Something funny is going on with the diphthong numbers it thinks are relevant for each index page. Found that bug. Now it's less bad looking, but it's still messed up somewhere:
DEBUG: Diphthong 0x00b is in the indexed text.
DEBUG: Diphthong 0x26d is in the indexed text.
DEBUG: Diphthong 0x12a is in the indexed text.
DEBUG: Diphthong 0x402 is in the indexed text.
DEBUG: Diphthong 0x409 is in the indexed text.
DEBUG: Setting bit for diphthong 0x000b from record 1 at slab offset 0x00b00, slab 0
DEBUG: Setting bit for diphthong 0x012a from record 1 at slab offset 0x12a00, slab 0
DEBUG: Setting bit for diphthong 0x026d from record 1 at slab offset 0x12d00, slab 1
DEBUG: Setting bit for diphthong 0x0402 from record 1 at slab offset 0x04200, slab 3
DEBUG: Setting bit for diphthong 0x0409 from record 1 at slab offset 0x04900, slab 3
DEBUG: Setting bit for diphthong 0x080b from record 1 at slab offset 0x08b00, slab 6
DEBUG: Setting bit for diphthong 0x092a from record 1 at slab offset 0x06a00, slab 7
DEBUG: Setting bit for diphthong 0x0a6d from record 1 at slab offset 0x06d00, slab 8
DEBUG: Setting bit for diphthong 0x0c02 from record 1 at slab offset 0x0c200, slab 9
DEBUG: Setting bit for diphthong 0x0c09 from record 1 at slab offset 0x0c900, slab 9
We're now writing all 5 of the dipthongs, but we're duplicating them, with 2048 added to them, by the look of things. Okay, that problem was using a char instead of a 16-bit type for the index into the diphthong bitmap.
The indexes written to disk now look plausible! But I won't believe it until I've verified it. I'll make a little search utility that will use the index to find matching records, and calculate their similarity scores, and then rank them.
Searching using the index is pretty simple, we just have to count up the number of hits against each diphthong against each record. Then we'll have a nice score fore each record. Sort those by descending score, and we'll have a ranked set of results. Alternatively, leave them unsorted for it to act as a filter. We'll probably have an option to toggle that sort order. But first, let's calculate those scores.
We also want search queries to be able to be quickly edited and have the results update in more or less real-time. This includes both inserting and deleting characters. We might also want to add or delete multiple at a time, e.g., for UTF-8 multi-byte sequences, so we should separate updating the query and query state from providing the optionally sorted search results.
The API so far is looking like this:
char search_query_init(void);
char search_query_release(void);
char search_query_append(unsigned char c);
I'll add functions for deleting characters at end or ranges of characters at specific positions within the query -- or inserting them -- as well as the function to ask it to update the query results, and select whether to sort the results, or leave them in their natural order.
Okay, I've added these now, too:
char search_query_delete_char(void);
char search_query_delete_range(unsigned int first, unsigned int last);
I still don't have the code to filter and sort, but let's start by testing what we have first, by making a linux command line tool for testing search. First step done: I can run a query from the command line utility and get the scores of the contacts (I'm searching by first name here):
$ make src/telephony/linux/search \
&& src/telephony/linux/search PHONE/CONTACT0.D81 PHONE/IDX02-0.D81 "Nicole"
make: 'src/telephony/linux/search' is up to date.
INFO: Disk image 'PHONE/CONTACT0.D81' mounted as drive 0
INFO: Disk image 'PHONE/IDX02-0.D81' mounted as drive 1
1:1:
2:2:
3:1:
4:1:
5:6:
INFO: Completed.
$ cat export2.txt
CONTACT:+99920915060:Jerry:Williams:0:
CONTACT:+99985423098:Yolanda:Anderson:0:
CONTACT:+99986302945:Carmen:Smith:0:
CONTACT:+99973014512:Emma:Johnson:0:
CONTACT:+99966049372:Nicole:Holmes:0:
And it works! Contact 5 gets a score of 6, because it is a strong match to my query. Record 2 also gets a hit, because "Yolanda" contains the diphthong "ol" from our query.
However, it is still being case-sensitive for some reason. I can only assume I messed up my lookup table. Yup -- out by one error in position of the lower-case alphabet. Now it works just fine with any case.
So now we can work on getting the filtered list of results, and then having them optionally sorted by score. I've implemented these functions now:
char search_query_rerun(void);
char search_collate(void);
char search_sort_results_by_score(void);
So let's update our search program to show the matching records, and to ignore those with score == 1, since those don't actually match the query -- they're just valid. Okay, all done, and it works :)
$ make src/telephony/linux/search \
&& src/telephony/linux/search PHONE/CONTACT0.D81 PHONE/IDX02-0.D81 "Jerry"
make: 'src/telephony/linux/search' is up to date.
INFO: Disk image 'PHONE/CONTACT0.D81' mounted as drive 0
INFO: Disk image 'PHONE/IDX02-0.D81' mounted as drive 1
1:5:Jerry:Williams:+99920915060:0
INFO: Completed.
$ make src/telephony/linux/search \
&& src/telephony/linux/search PHONE/CONTACT0.D81 PHONE/IDX02-0.D81 "Nicole"
make: 'src/telephony/linux/search' is up to date.
INFO: Disk image 'PHONE/CONTACT0.D81' mounted as drive 0
INFO: Disk image 'PHONE/IDX02-0.D81' mounted as drive 1
5:6:Nicole:Holmes:+99966049372:0
2:2:Yolanda:Anderson:+99985423098:0
INFO: Completed.
Okay, so exact and fuzzy matching are both working. So what's next? Probably importing the SMS messages, now that we have a way to find the corresponding contact by looking for the contact that corresponds to the originating phone number, and then updating the unread message count and the message thread index to support searching within (but not between) threads.
This really shouldn't be hard now that we've got all this foundation work done.
Messages from unknown numbers we should just map to a hard-wired "UNKNOWN Numbers" contact, that possibly we declare to always occupy contact record 1 for efficiency and simplicity. In that case, we'll record the orginating number with the messages, so that if the user creates a contact for them later, we can find all the relevant messages and pre-populate the message thread for that contact. Actually, we can just always store the originating number.
A couple of hours later I have message importing working fairly well -- except that the search for the contact by telephone number is completely failing. This is weird, because I can search for the phone number using my command line tool, and it works just fine. If I disable that, and just let it do an O(n) trawl through the contacts D81 to find the contact, then they do get stored (yet to confirm the thread MESSAGES.D81 really has the messages stored, but the unread message count goes up for SMS messages received :).
That problem turned out to be that the import program wasn't generating the contact index by phone number until the end --- so the index was legitimately empty when the search happened.
With that fixed, it now looks good.
Next step is to make a message thread search utility, that with no query string will show all messages in a thread with a contact (also selected by a search string), optionally sorted by relevance.
Okay, I have that working without the message thread search query, like this:
$ make src/telephony/linux/thread && src/telephony/linux/thread "Nicole"
make: 'src/telephony/linux/thread' is up to date.
CONTACT:Nicole:Holmes:+99966049372:2
MESSAGETX:+99966049372:398185456:π³πΏ Quasi nisi quidem, quis veniam sed numquam ipsam quo hic amet molestiae? Veritatis cupiditate ullam nihil et tenetur doloribus, accusantium:
MESSAGERX:+99966049372:398196160:Dolores excepturi facilis atque quas labore rerum, libero voluptates modi, dolorum delectus dolores praesentium rerum dignissimos ππ₯Ί:
MESSAGERX:+99966049372:398235394:Beatae consequuntur similique rerum et, est nostrum MEGAphone quo quidem facilis in corporis π¦:
MESSAGETX:+99966049372:398298378:Impedit dolorem ut neque quo placeat eaque necessitatibus culpa blanditiis, cupiditate velit minima ducimus deleniti esse.:
INFO: Returning conversation only for best matching contact.
The problem is that the message thread index seems to be bonkers. It has bits set in it, but there are bits for messages that don't exist. So I need to figure out what's going wrong there. The first part of the problem was the physical sector markers that we have in the CONTACTS disk images. But they were being set in all D81s that were being provisioned. With that fixed, we no longer have any spurious bits set that don't correspond to real messages.
So why is the query resulting in no matches? The search utility that is really made for contacts, rather than message threads does find hits. So I guess this means the problem is more likely in threads.c. Modifying search so that it can show the body text of messages that match, it looks like the index contents are all fine.
Found and fixed -- another simple PEBCAC: For some reason I had lost the line that actually added the query text into the search. With that fixed, it looks like it's working now:
$ src/telephony/linux/thread "Nicole" "MEGA"
CONTACT:Nicole:Holmes:+99966049372:2
MESSAGERX:+99966049372:398236805:Beatae consequuntur similique rerum et, est nostrum MEGAphone quo quidem facilis in corporis π¦:
INFO: Returning conversation only for best matching contact.
So, I think that's actually the last of the ingredients I need to actually start getting this all ready to run natively. I'll have to adapt the code, and start adding the UI stuff, like actually rendering messages and contacts. But with the stimulator, I can populate a fairly extensive set of contacts and messages for each, so that there's something to see in the various views.
Anyway, this post has dragged on more than long enough, so I'll stop it right now, and start work on getting this transitioned to native MEGA65 running to display contacts etc.
No comments:
Post a Comment