NEWS, EDITORIALS, REFERENCE

Subscribe to C64OS.com with your favorite RSS Reader
March 25, 2021#110 Programming Theory

Shared Libraries

Post Archive Icon

Before we get started on the topic of shared libraries, I wanted to draw attention to a recent update to the Commodore 8-Bit Buyer's Guide. For the last couple of months, I've been scouring the web and reaching out to creators and vendors, collecting links, prices and photos of new products that are commercially available for the Commodore 64 and other Commodore 8-bit machines.

I maintain the catalog in a spreadsheet, with product name, supplier, creator, category, price, notes and a URL to where the product can be ordered. I try to find the highest resolution images that I can, and then carefully remove any background so that each image has a consistent matted-on-white appearence. Often the images will have their light and dark, brightness, contrast, and sharpness adjusted to try to give them a consistent look. This is not always possible because the images come from such different sources. When necessary, I do color adjustments around the edges and through transparent areas to make the images look as though they were meant to be on white.

I'm particularly pleased with the results of this image:

SuperVideo AV Adapter (Original). SuperVideo AV Adapter (Catalog Prepared).

It is rather unintuitive just how much that pink background influences the internal contents of this image. The transparent case, the brass screws, the shiny chrome metal, they all show hints of pink that stick out like a sore thumb when you lasso the image, mask it, and matte it on white. The cable itself had to be desaturated, but in context you hardly notice, because your eyes are drawn to the main object, which still shows the colors on the RCA jacks, the blue adjuster, the wires and cable tie.

 

Suffice it to say, putting the Buyer's Guide together is a lot of work. This update involved the addition of 76 new items. Staggering. I never thought the guide would reach 76 items in total. In addition to adding new items, I also removed around 25 items for which I could no longer find a source. The net new addition of 50 items brings the total to over 450 products, projects and kits. Some of them are similar to others. e.g., There are eight variations of PLA replacement chips, and fifteen different styles of power supply. There is also a vast range in sophistication of items, from the most complex, such as the Ultimate 64 Elite, the Beam Racer, and ARMSID, all the way down to the lowly rubber foot replacement on the bottom of a C64c case, and everything in between. Something for every nook and cranny of a Commodore 8-bit user's hobby life.

Creating feature pages became an incredibly daunting task, as the sheer number of products grew. At the moment, with such a small percentage of items in the guide having a feature page, the guide is primarily a visual exposé of what is out there. You can use the name of the item to search for it, and what you can't find, you can always take a snap of the image and enquire about it on social media. When I get a chance, I will work on a way to dynamically link the entries in the guide—the ones that lack a feature page—to their source online, via a CSV export of the spreadsheet I maintain.

In the meantime, enjoy exploring the guide and seeing what cool stuff is out there in 2021 for our beloved Commodore 8-bit computers.


Code Relocation and Finding Where Code Is

It's interesting to me how technological developments stack, new ones on top of old ones. When I began working on C64 OS, I planned for Applications but didn't have any plans for Utilities. I didn't have any plan for being able to relocate 6502 code, but I didn't see much use for it at first. GEOS has drivers and desk accessories, but does not use any relocatable code, for example.

As soon as I discovered a painless and simple way to relocate small chunks (256 bytes to ~2 kilobytes) of 6502 code, (Drivers and Relocatable 6502 Code,) it became immediately useful for writing a variety of loadable and unloadable drivers. Several issues soon popped up. How many drivers can there be? How do you tell if a driver is already installed? Where has the driver's code been relocated to?

First, while there may be many drivers, there doesn't seem to be an endless set of categories of drivers. Although GEOS has drivers, it has only two driver types: Input and Printer drivers. There are scores of printer drivers and handfuls of input drivers, but after all this, there are still only two types, two slots in memory into which one each of a wide assortment of drivers can be plugged.

I extended this out to around 8 possible driver types, but if necessary there could be more:

  • Input
    • Pointer Input
      (mouse, koala pad, light pen, etc. to control the mouse pointer)
    • Game Controllers
      (2 and 4 player, extra buttons on controllers, etc. that map to a standard 4-player software interface)
    • Keyboard
      (Maybe USB input, maybe PS/2 input via modified C=Key adapter, maybe C128 extended keyboard.)
  • Output
    • Printer
      (User port parallel, IEC bus adapter, printer language, etc.)
    • Digital Audio
      (DigiMax, Ultimate Audio Module, SUX 6400, etc.)
  • Input/Output
    • Network Hardware
      (User Port RS-232, Turbo 232, SwiftLink, etc. on various I/O pages.)
    • Network Commands
      (Abstracting the command set supported by various WiFi modem firmwares.)

That's 7 slots, plus I've left one or two reserved for something I can't think of. A driver slot covers an I/O category. But what driver actually gets plugged into each slot is up to the user and the hardware they have to occupy that slot.

Second, the relocation is always page-aligned. Let's take a sidetrack to talk about page alignment for a moment.


Aside: Memory Page Alignment

What does page-aligned mean? The C64's main memory is a based upon a 16-bit address bus. 256 * 256 = 65,536 bytes or 64K of memory. All absolute addresses are composed of two bytes, the high byte and the low byte. The high byte specifies one of 256 pages, and the low byte specifies one of 256 bytes within the page.

If a chunk of code or data is page-aligned, that means it starts somewhere where the low byte is zero. E.g., at $4100, $6F00, or $8900, etc. The advantages to page alignment are manifold. The most obvious and documented advantage is that 6502 indexing addressing modes are faster if they don't cross a page boundary. For example:

ldy #$f0
lda $8000,y

…is 1 cycle faster than…

ldy #$f0
lda $8050,y

The explanation is straightforward. To compose the address whence to load the accumulator, the Y register has to be added to the low byte. If this overflows, then an extra step must be taken, the high byte must be incremented. An ordinary increment instruction takes 2 cycles, 1 cycle to read in and decode the instruction and 1 cycle to perform its incrementing behavior. If the indexing mode overflows the low byte, the information that the high byte needs to be incremented is already present, so only 1 extra cycle is required, not 2. (This may be a folk explanation, but it feels intuitively right.)

There are other advantages to page alignment that can make your code faster and smaller. If a chunk of code gets relocated, but you know it is page-aligned, you only need to store one byte to locate the code, the high byte, aka the page byte.

Setting pointers becomes easier and faster:

lda driverpg
sta ptr+1
lda #0
sta ptr+0

…is 2 bytes smaller and 2 cycles faster than…

lda driverpg+1
sta ptr+1
lda driverpg+0
sta ptr+0

How is this 2 bytes smaller? The immediate #0 is one byte smaller than the absolute driverpg+0, but an additional byte of memory is saved because driverpg+1 isn't involved at all. This can be optimized further if, say, you know that a register already holds 0, you can write it to the ptr's low byte without first explicitly loading a register with #0. Or in other circumstances you might know that ptr+0 is already 0, and you can skip setting it altogether.

If what is page-aligned is a data structure, other optimizations are possible. For example, the directory library (which we'll talk about below), needs 32 bytes per directory entry, which it fits into page-aligned 256-byte allocations. One memory page is enough to hold 8 directory entries. This makes the address of each directory entry:

$xx00
$xx20
$xx40
$xx60

$xx80
$xxA0
$xxC0
$xxE0

Now let's imagine we have a pointer to some arbitrary directory entry, and we need to get a pointer to its filename in the X and Y registers. (In all the C64 OS documentation I refer to a pointer in the X and Y registers, X being the low byte, as a RegPtr.) This can then be passed to any routine that needs a RegPtr to a string. Now let's suppose that the filename is at offset 3 in the directory entry struct. How do we pull a RegPtr to the filename from a zero page pointer to some directory entry?

We could do this:

clc
lda ptr+0
adc #fdname ;#3, filename offset
tax ;RegPtr low byte
lda ptr+1
adc #0
tay ;RegPtr high byte

The above code adds the filename offset to the low byte, and then adds #0 to the high byte in case the low byte addition caused the carry to go high. Both bytes have to pass through the accumulator in order to do the addition with carry.

But since we know we have 32-byte structs, and we know that they are aligned to boundaries within the page, we know that an overflow is impossible and the code can be reduced to this:

ldy ptr+1
ldx ptr+0
inx
inx
inx

It's 4 bytes smaller and a few cycles shorter, with the additional benefit that the accumulator and the carry are unaffected. In practice this is a surprisingly big win.


Now, Back to Drivers

So now back to drivers. C64 OS's code relocation mechanism is page-aligned. Drivers get relocated, therefore their code is page-aligned. And therefore we only need to store one byte to locate it. To support 8 drivers, then, a range of 8 bytes is reserved in workspace memory.

To install a driver, some page-aligned memory is allocated, the first page of that block is configured as a file reference to the driver to be loaded. And a KERNAL call is made to loadreloc. This reads the file reference, loads the referred file into the block of memory (overtop of/obliterating the file reference), and relocates the code to run from that block of memory. Finally, for convenience loadreloc returns the page byte.

To complete the installation of that relocated code as a driver, the page byte is written into the reserved workspace address for that driver type. Other code in other places can determine if a driver is installed simply by checking if the workspace address is zero or not.

Uninstalling a driver can be done by overwriting the workspace address with a zero. Although, depending on how memory was allocated, you may have to explicitly deallocate the memory in order to prevent a memory leak. Generally speaking it is up to the user to decide which driver ought to fill a given driver slot. Do you want to use a 1351 Mouse, or a Light Pen, or a Koala Pad? You pick the driver and it gets plugged into that slot.


What is a Library?

At first blush, you might wonder the difference is between a library and a driver. Both are loadable and unloadable at runtime. Both contain a small chunk of reusable code. Both get relocated when loaded, and both need some way for the code that calls into them to be able to locate them. These sound like the same thing!

The primary difference is conceptual, which leads to a different usage pattern. A driver abstracts the main software from the details of a piece of hardware, so that multiple pieces of different hardware can be substituted into the same role. Supply the right driver to bridge the gap between the hardware's details and the software's abstract interface, and the software has no clue what piece of hardware it's interfacing with. That's good. What's more is that the software fundamentally cannot know what driver ought to be loaded, because the whole point of a driver is that the software doesn't know, and doesn't want to know, what hardware device the user has hooked up.

A library, while similarly structured, is conceptually totally different. A chunk of code could easily be useful to more than one program. For example, maybe your program needs to calculate a CRC32, so you set about implementing or porting some CRC32 code to 6502. It works and that's great. But, what about the very next program that also needs to calculate CRC32's? What a waste it would be to have to implement it twice. It makes more sense to create a CRC32 library, put the routines there, and allow all programs that need to calculate a CRC32 to load in the same library.

To further differentiated these two, consider that when it comes to libraries, the program needs the specific functionality provided by the library. To extend the example, it specifically loads in the CRC32 library because it needs to generate a CRC32. Whereas with a driver, the OS has a generic concept of a moveable pointer/input cursor, but which driver is best suited to fill that role is wide open.

While there seems to be a limit to the number of types of drivers there are, with a variety of different drivers that substitute for one another to cover a single type in different ways, the number of libraries that there might be seems truly unlimited. Windows has thousands and thousands of DLL's (Dynamically Linked Libraries.) Basically any chunk of code that could be usefully reused by another application can be shaved off an application's main binary and turned into a library.

There is even some value in an application dividing its main code into a few libraries even if those libraries will never be used by anything else. The libraries could hold the code for functionality that doesn't get used all the time. You can speed up load time and decrease the memory footprint by deferring the loading of a library until the user tries to use a feature that needs it. You might even be able to unload a library while the application is running to free up some needed memory.

Here's a list of the libraries that currently exist in C64 OS: (This list is sure to grow.)

Library Size* Description
cmp.lib 1 Provides common set of sort comparators, including configurable natural string comparator
crash.lib 2 Handles unhandled exceptions, presenting a crash dialog box
dir.lib 4 Reads and parses directories
gfx.lib 3 Manages graphic data and handles split and fullscreen graphic modes
grab.lib 1 Takes screenshots and saves them by date in //os/screengrabs/
i2c.lib 3 Implements the I2C bus over the user port
iec.lib 2 Provides detection of device types and device numbers on the IEC bus
path.lib 2 Performs path management operations on file reference structures
qsort.lib 1 Implements the QuickSort routine
rec.lib 1 Performs presence and capacity detection of 17xx REU expanded memory
utils.lib 3 Processes the utilities.m file and rebuilds the Utilities menu
* Sizes are measured in 256-byte pages

The Need for Shared Libraries

Although C64 OS isn't multi-tasking proper, it allows for one main Application and one Utility to be running concurrently, plus system code. Plus, an Application or Utility could attach a TSR (terminate and stay resident) type program, such as for playing background SID music, or logging activity, etc.

The File Manager is an application that needs to load and process directories. But other applications need to process directories too, so it's very convenient to have that functionality in a library.

I first used the directory library in the Utilities Utility. It loads the contents of the system's utilities directory, sorts and displays them in a scrollable list. But because Applications and Utilities run at the same time, they both need the directory library at the same time. The initial implementation worked, but only because each loaded its own copy of the library.

At the time that I wrote the post, Anatomy of a Utility, the directory library was 7 pages. It embedded the functionality of loading/parsing/freeing directories, implementing and configuring directory-entry-specific sort comparators, and the QuickSort routines plus two pages used to hold the pointers that get sorted. All of this in one giant lump of 7 pages. And then all 7 pages have to be allocated a second time for the Utilities Utility.

The functionality is literally identical. It's 2 full copies of the exact same code that get loaded and relocated to 2 different consecutive 7-page blocks of memory. 7 / 256 = 0.0273 that's just shy of 3% of total system memory. It feels almost criminal to throw away 7 pages of memory while scrimping in other places to save a few bytes here and there.

Clearly, the answer is that the system needs support for shared libraries. The question is, how to implement shared libraries? What has to change to make a library shared, and what are the gotchas?

Loading Libraries

The C64's KERNAL ROM offers a range of routines, some lower level than others. For example, there are very low-level routines, (TALK/LISTEN, TALKSA/SECOND, UNTLK/UNLSN, etc.) that manipulate CIA 2 to implement the IEC bus protocol. These send and receive individual bits. On top of that, there are I/O routines, (CHKIN/CHKOUT, CHRIN/CHROUT,) for reading and writing whole bytes. Lastly, there are even higher level File routines, such as LOAD/SAVE that transfer whole chunks of memory.

Thus the KERNAL is not a flat assortment of routines. Some depend on others, like this:


IEC bus routines ← single byte I/O routines ← File load/save routines

There is a similar hierarchy in the C64 OS KERNAL, which stack on top of the KERNAL ROM. There are routines that allocate memory. That memory can be configure as a file reference, and there are routines for loading and relocating data by file reference. C64 OS's low-level loading, in turn, is performed by the C64 KERNAL ROM's LOAD routine.

Let's see how a driver is loaded:

This code doesn't show changing from one driver to another. It doesn't include loading a driver name from a config file, or uninstalling/deallocating this driver. But it does show how a driver gets from the storage device, into available memory, code-relocated, and installed. The final "sta jdrvpg" puts the page # of the place to which this driver was loaded into the slot for this type of driver.

You can also see that it is fairly manual; allocate the memory, configure the file reference, call loadreloc (which can be used for any kind of relocatable, not just drivers.) This driver can then be found by other parts of the system via the entry at jdrvpg.

Now let's see how I used to load (non-shared) libraries.

This code is slightly modified from the version that appeared in Anatomy of a Utility, to reflect the fact that the directory library has been split out from sorting and comparators, but we'll come back to that later. And (to the watchful reader with the eagle eyes) the pathnames are no longer preceded by a "/". Once I started working with the File Manager and the the path library, I had to standardize path manipulation. There is an updated description of how paths are structured and managed in the post before this one, Introduction to File Manager.

This code is also strikingly similar to the code above for loading a driver, but I've introduced some standard macros that make the code easier to read. (#stradd and #rdxy.)

The main difference between loading the driver and loading this library is the installation of the driver to jdrvpg vs. the dynamic linking of the routines that we'll use from this library. Presumably the code to load this library is somewhere in the application's main binary. And in that same main binary it has a short jumptable, freedir jmp freedir_ and readdir jmp readdir_. After the library is loaded, we link the routines by overwriting the high bytes in the jump table. (Another advantage of the relocated library being page-aligned.)

Is the library loaded? How many times is it being used?

Let's suppose that the library loading code above were used by the Application, but now we load in a Utility that needs to make use of the same library. How does the Utility know whether the library is already loaded? If it is loaded, how does the Utility find the location of the library?

What we want is for each process that needs a library to make a call to load it. Some internal mechanism should be used to discover that the requested library has already been loaded, and return a reference to it. The process that needs the library doesn't have to know whether it got an existing reference or whether the it was freshly loaded just for it.

But now, of course, there is the problem of what to do when the library is no longer needed. Whenever a process is finished with a library it should make a call to unload it, but if it only got a reference in the first place that means something else is also using it. A process that merely gets a reference can't force the library to be removed from memory while other processes depend on it.

This problem was solved a million years ago on bigger computers, but is rarely employed on the Commodore 64. The solution is reference counting. A use case might look something like this:

  • A process calls to load a library
  • It gets loaded in from storage, and has its reference count initialized to 1
  • Another process calls to load the same library
  • Its reference count is incremented to 2, and the reference to it is returned
  • Either one of the processes is done with the libary and calls to unload it
  • Its reference count is decremented to 1, and it remains loaded in memory
  • The final process is done with it, and calls to unload it
  • Its reference count is decremented to 0, and it gets unloaded from memory

You might think, yeah yeah, but, C64 OS isn't multi-tasking, what's this talk of different processes needing the same library? Here's a contrived example, but it's well within the realm of possibility.

Typically grab.lib, (for taking screen grabs,) is not loaded into memory. When you use the global keyboard shortcut to take a screen grab, the system code loads in the library, relocates it, runs it, and immediately unloads it. Well now imagine that there is an Application that wants to work with screen grabbing, so it explicitly loads in grab.lib. Well, now what happens if you use the global keyboard shortcut to invoke the system code? The system code's call to load the library will increment its ref count once, take a screen grab, and then decrement the ref count. It will subsequently be left in memory because the Application that loaded it is still using it.

Here's another contrived example involving grab.lib from a slightly different but perhaps more interesting perspective.

Everytime you take a screengrab the system has to load in grab.lib, relocate it and run it, only to unload it moments later. Typically this is useful because taking screen grabs may not be important to you at all. When you do want to take a screen grab, though, it only momentarily takes up 1 page of memory, and then gets unloaded. A user could decide, however, that making screen grabs is very important and it's worth sacrificing one memory page to have grab.lib memory resident all the time. So they set up the system to preload grab.lib. It gets loaded at boot up time, and its reference count is set to 1.

When they use the global keyboard shortcut to take a screen grab, the system calls to load it, but it's already loaded, its ref count goes to 2 momentarily, and a screen grab is taken. The system immediately calls to unload it. Its ref count goes to 1 and it stays in memory for no other reason than to make it quick to invoke the next time. There will someday be other more practical examples of this kind of thing. By the way, this reminds me of how, on the Amiga, you can force specific command line programs to stay memory resident just for speed. Let the user decide which libraries to keep memory resident, balancing their priorities of persistent memory usage and the speed of accessing certain features.

Where is the library? And how to track reference counting?

Knowing where a driver is installed is easy. Each type of driver has a fixed memory address. This works because there are only a few types of drivers, and room for 8 types has been set aside.

What about for libraries though? As discussed earlier, there could be hundreds or even thousands of libraries. We cannot pre-allocate a fixed memory address to refer to every possible library. We need to know if a library is available by looking it up by some identifier. If the identifiers were arbitrarily assigned (such as my implementation for datatypes,) allocation of identifiers would have to be planned and then somehow identifiers would have to be mapped to filenames.

Alternatively we could use the filename itself as the identifier. However, a full filename is 16 characters and we have to support loading multiple libraries. Each additional library we allow to load in would require an extra 16 bytes, plus additional bytes for its memory page address, size and reference count. Let's call it 20 bytes per library. If we support 10 libraries, that's 200 bytes, plus we're going to need the code to do the library loading and unloading and reference counting, etc. Altogether would this even make sense? We're doing this to save memory, not just to do something cool.

All this considered, I settled on a hybrid between a library identifier and looking it up on its filename. The first 2 bytes of a library's filename must uniquely identify it, and its filename must end with ".lib.r" More about this below. I also set the maximum number of libraries simultaneously loadable at 10. This could be increased a bit, if it turns out to be necessary. We need to track 2 ID bytes, 1 page address byte, 1 size byte, plus 1 reference count byte. If we multiply that by 10 libraries, we'd need at least 50 bytes.

I made another couple of concessions. Realistically, no library will ever be reference counted >15 times. We've got the system code, the Application, a Utility, and maybe preloaded to stay resident, that's only a reference count of 4, we'd have to reference count it 12 more times to overflow 15! That will never happen.

Additionally, a library will never exceed 15 pages in size. Why do I say this? Because:

  1. 15 pages is already a huge percentage of total memory (~6%), and
  2. Because the relocation technique hits a limit around 7 or 8 pages (2KB) maximum.

When the combined directory library was 7 pages, it required herculean effort to jigger the bytes around (such as by putting in a sequence of 5 NOPs between two routines) such that X number of consecutive byte values were not used. (Where X is the number of pages to be relocated.) This is a requirement of the technique.

Thus, if size is limited to <16 pages, and reference count is limited to <16 references, these two values can be packed into high and low nybbles, at very little expense. For example, the size is fixed, so if the low nybble holds the reference count the whole thing can be incremented and decremented without affecting the size in the high nybble.

Thus we need, in workspace memory, 40 bytes:

  • A table of 10 first ID bytes
  • A table of 10 second ID bytes
  • A table of 10 page address bytes
  • A table of 10 size/refcount bytes

Now, what about that 2-byte identifier? I hear you asking if that's enough. Firstly, the total raw capacity. Filenames are case sensitive, so if we restrict to only letters and numbers it would be (26 + 26 + 10) * (26 + 26 + 10) = 62 values for the first character * 62 values for the second character = 3,844 library IDs. All installed libraries have to be in the library directory at the same time, so there will never be anywhere near that many. (The DOS would choke searching through a directory with that many files, if something like a CMD HD can even support that many files in a single directory.)

We're implementing this on a 46 year old CPU architecture, with only 64KB of main memory. If BASIC can get away with limiting variable names to just 2 bytes, we can get away with limiting library IDs to just two bytes. In fact, they share some advantages.

Looking up a two byte value is much easier than comparing two strings of different length. Rearranging entries in the tables is much easier. And performing the load and unload calls and their implementations are also easier. For example, to specify the library itself a single string can be used for the name pattern like this:

libname
	.null "XX*.lib.r"

When making any KERNAL calls (KERNAL ROM or C64 OS's KERNAL) you only have 3 bytes plus a few bit flags (carry, overflow, and zero flags.) With a 2-byte ID we can load a library like this:

	ldx #"d"  ;directory library
	ldy #"i"
	lda #4
	jsr loadlib

To preserve the 2-byte ID that was just passed into the loadlib call, it can stash them directly into the filename pattern, like this:

loadlib ;Load Library KERNAL routine

	stx libname+0
	sty libname+1

	...

	rts

The size of the library is stored so it can be used for automatic deallocation of the memory, when unloading reduces the library's reference count to zero. When the page allocator allocates a block of consecutive pages, the whole block is initialized as a memory pool from which malloc memory allocations can be made. This initialization includes the first byte of the block containing the number of consecutive pages in this block. This can (in theory) be used to look up the size of a block for automatic deallocation. However, in practice, a block of allocated pages can be used for anything, not just as a memory pool for malloc.

In the case of using an allocated block as space for a relocated binary, the relocated code overwrites everything in the block, including the memory pool initialization bytes. Thus, the loaded library tables need to track the size of each library.

Transitioning from Library to Shared Library

As I mentioned earlier, just because a shared library system exists, there is nothing stopping a program from loading a library manually, it just won't be shared. But it also won't be subject to the same restrictions. Here are some of the restrictions of a shared library that we've already talked about:

  • The first 2 bytes of filename must be unique
  • It must be found in the system's /library/ directory
  • Its filename must end with .lib.r

But what else must we do to transition a library that was not designed to be shared to be able to be shared? The short answer is, nothing inside the library can be used to store state for the process that has loaded and is making use of it. And now for the longer answer, I'll use the example of the directory library, as it was described in Anatomy of a Utility.

Originally I was trying to get the directory library to be as streamlined as possible to use, by offloading as much work as possible from the main binary and into the library. You pass it a file reference pointer, and it loads a directory into a chain of automatically allocated pages. Where does that chain start? It stores the root page byte internally.

You want to sort the directory? You call presort and confcmp to initialize the sorting and configure its parameters. What does presorting do? It reads through the directory chain (internally tracked) and organizes pointers to directory entries into internal tables. When you call sort, it uses the internally configured comparator with the internal quicksort routine to sort the internally stored directory entry pointers.

To get a record, you call record with an X index. But if you want to change the sort direction, a subsequent call to confcmp identifies from internal properties that it is already sorted by the specified parameters. It uses the sort direction bit, though, to rewrite the library's record jump table entry to point to either record_a or record_d. Subsequent calls to record get automatically routed to either the ascending or descending versions of the routine.1

Yes, it's very streamlined to use, because everything is managed internal to the library!

But what happens when two processes try to share that code? Everything goes straight to hell. The Application configures it to load a directory and sets sort parameters and sorts it. But then a Utility gets loaded, and it has the same code load a different directory, and sorts that. Because the Utility has no idea that the library was already holding pointers to allocated memory that memory gets orphaned and therefore leaked. Meanwhile, the Application has no idea that a Utility has changed things from under its feet, and the Application's directory listing becomes the same as the Utility's. Except that nothing makes sense because those files aren't actually found at the location specified by the Application's file reference.

Two things: First, how did the File Manager maintain tabs for 4 different directories? Easy, the File Manager knows that it maintains 4 tabs. When a tab changes, the outgoing tab backed up its own directory's root page by a call into the library to get that page. The incoming tab restored its own directory by writing its saved configuration back to the directory library, and having it reload and/or resort the entries. Second, how did the Utility and File Manager both make use of the directory library originally? Easy. Because it wasn't shared, all the internal state was internal to two different copies of the library.

To convert it to be a shared library, therefore, it is necessary to externalize all of the internally maintained state. This does put some more load and some more glue logic on the Application, without a doubt.

The directory library was made up of 4 main parts:

  • The directory loading/parsing part,
  • The configurable sort comparator (with support for natural string sorting),
  • The quicksort routines, and
  • The index tables (2 pages) that get populated with pointers to entries to sort.

The QuickSort Library (qsort.lib)

The first thing I decided is that QuickSort really should be its own library. It's a useful routine that can be used to sort anything—not just directory entries—and it's sufficiently complicated that you wouldn't want to reimplement it in 6502 if you didn't have to. I pulled it out and was able to optimize it down to a single page. This took some doing, but I literally got it down to 256 bytes on the nose.2

How does QuickSort know what to sort and how to sort them? It still requires two calls: presort and sort. Remember, you have 3 registers to pass parameters to a routine, so sometimes a routine requires a preparatory routine first. Calls to sort must be preceded by presort. Presort takes a RegPtr (X/Y) to a fetch routine. And the Accumulator passes the low page number of two consecutive pages of memory that will be used for the index. These used to be managed internally to the directory library, but they now have to be provided so that each process has its own index.

The fetch routine is called repeatedly. On each call you return a pointer to another record. The type of record doesn't matter. It could be a pointer to a string or a number, or it could be a pointer to a struct or an object. Presort adds the pointers to the index tables. Fetch may return with the zero flag set to indicate that there are no more records to fetch.

With the index tables now populated, call sort with a RegPtr to a comparator routine and the record count in the Accumulator. More on the comparator in a moment. When sort returns, the pointers in the index tables are sorted. Because the index tables are manually provided by this process, the index is cut free from the qsort library altogether. The library maintains no internal state about the comparator or the index, or what's in the index. No matter what else uses qsort.lib, your index will remain stable, whence you can lookup records at any time.

That is the general idea for making a library able to be shared.


Aside: About Data Structures

Let's take a side tour here for a minute to talk about data structures and memory. A lot of what gets done during development in 6502 assembly is thinking about how to do something, given the lack of standard structures and the tight memory constraints.

For example, if this were PHP or Javascript, a list of records would be in an array. How does memory get allocated for that array? How does it get deallocated? How much overhead is there? Who knows, who knows and who cares, seem to be the answers to those questions.

And a sort routine, what will it sort? It'll sort an array, of course. The array—as an available mechanism—becomes the thing that holds the data, the way the data get accessed, and the means by which the data get sorted and/or indexed. With a standard structure like an Array, many algorithms can be written that simply use arrays as primitives. But on a naked 6502 CPU, with its 56 instructions and 64KB of memory, you don't have arrays. However, I think there is a reason why, which I want to demonstrate with some examples.

When it comes to complex data types such as directory entries (as opposed to simple ones like strings and numbers), you first come up with a structure that can hold all the properties of a single entry. Next, in a C-style array, the structures get put in memory one after the next. Arrays in C are much more primitive than they are in PHP, Javascript or other highlevel scripting languages.

In C, you declare what type the array is, by specifying a struct as its datatype. The size of the array is calculated by multiplying the number of elements you want to store by the element size. E.g., If the struct is 22 bytes and you want an array of 100 of those, you would malloc a chunk of memory 2200 bytes big. On a computer with megabytes or gigabytes of memory, 2200 bytes is so small you scarcely think about where this will come from, and you definitely aren't thinking about such minutiae as: Is it page-aligned?

Now think about how that C array gets used. To reference a particular element, you might specify: myArray[15]. To get the base address of that element, it multiplies 15 times this array's datatype size (15 * 22 = 330) and adds that to the base address of myArray. Whether it's doing a 16-bit or a 32-bit multiplication for every lookup, or whether the compiler optimizes that inside a for-loop, etc., is none of your concern.

Let's go a step further, now, and think about how to manage the dynamic size of that C array. You started by allocating it for 100 elements. What happens if you need far fewer than 100? Well, then you just waste whatever space you didn't use, but no one really cares. What if you fill up 100 elements and discover at runtime that you need more? You can start by trying to realloc(), to see if you can grow the existing allocation to take up more consecutive memory. But if it can't do that, it'll allocate a completely new, bigger block somewhere else and copy all the data from the old block to the new block and deallocate the old block.

Frankly, you get to enjoy this sort of luxury when you have a lot of spare memory. Working with data structures and memory allocations on the C64 feels a lot more like this:

RUSH HOUR. Traffic Jam Logic Game.
RUSH HOUR. Traffic Jam Logic Game.

There is nowhere to maneuver. The structures you're dealing with feel huge compared to the total memory. A tiny little allocation of 2200 bytes is >3% of total memory and a much larger percentage of available memory. You wouldn't just allocate a few kilobytes if you actually only needed a fraction of that. But nor can you allocate 256 bytes, fill that up, allocate 512 bytes, copy the first 256 bytes to the new space, fill up the 512 bytes, allocate 768 bytes, copy the first 512 bytes to the new space, fill up the 768 bytes, allocate 1024 bytes... and so on. That is what you would have to do in order to keep a growing set of data, of unknown size, in a contiguous area of memory, like a C Array.

My understanding of best practice in C is that you should double the buffer size with each new allocation. Start by allocating 256 bytes. If you fill that, allocate 512 and copy the first 256 over. If you fill 512, double that, allocate 1024 and copy the first 512 over. If you fill 1024, allocate 2048, and so on. This is also a luxury of having lots of memory, though. If you need slightly more than one doubling, say you need 17 pages and you have 30 pages free. You double 1, to 2, 4, 8, then 16. You need one more page, but there are not enough free pages to double again. And if you did double again you'd waste nearly 50% of the total space allocated (not even taking into consideration fragmentation and the ability to find those allocations contiguously.)

This reminds me of the Martingale betting strategy in games of chance. If you lose a bet, double your bet the next time. Continue this until you win, and eventually you'll be back in the green. It doesn't work if you A) Run out of money before you hit a win, or B) The table has a maximum betting limit. Either of these two problems feels a lot like running out of a memory while doubling allocation sizes.

This is why I'm using what amounts to a linked list. It's kind of a half-baked linked list, though. It's not a C-style linked list.

Memory Page Linking
Memory Page Linking

In the diagram above we see 4 pages of memory. $35, $34, $33, and $32. Each shows 16x16 squares each, 256 squares for 256 bytes in the page. The purple and blue horizontal stripes occupy 32 bytes each, but the zeroth byte (in black) is unused. Each 32-byte stripe holds a single directory entry. Within a page, directory entries fall one after the next. When a page is filled, a new page is allocated, and the zeroth byte of the current page is set to the page # of the page being added to the chain.

This is similar to how blocks are allocated and linked in a chain in a PRG or SEQ file (or the directory entries!) in a CBM or CMD File System. Consider how it would work: You don't know how much data there will be. So you allocate 1 page. It's page #$33. You start to load in some directory entries, you read in 8 which fills the page, but there are still more. You allocate one more page, page #$35. You write $35 into $3300. You read in 8 more directory entries filling that page, and there are more. Repeat, allocate one more page, you get page #$32, which you write $32 into $3500 to link it to the end of the chain, and so on.

No more than 224 bytes are ever wasted (if the last page had 1 entry and space for 7 more, 7 * 32 = 224). As you add new pages, old pages don't need to be moved. The pages in the chain don't have to be contiguous, new pages can be allocated from spare pages between other used pages. The problem now, is that you don't have a standard structure. You don't just have a C Array of contiguous memory. You can't just use a simple multiplication to get the address of an element. Hence the reason why something like the presort routine in qsort.lib needs a fetch routine. The fetch routine moves from entry to entry through a page, and then adjusts the pointer to jump to the next page in the chain. Like this:

The ptr has to be initialized to the root page first. On each call it reads the ptr into X/Y, and adds #32 to the low byte of the pointer. If the addition rolls back around to zero, we know we've read the last entry of the end of the page. Conveniently, the ptr now points to the start of the page again. Read the zeroth byte through the pointer, that's the page address of the next page in the chain, so write that to the ptr's high byte.


The Comparator Library (cmp.lib)

Comparators can be really simple, and you don't need to include a comparator library just to compare two numbers. In fact, that's what the comparison operations in the 6502 instruction set do for you automatically.

However, a comparator can also get quite complicated. I wrote at length about natural string sorting in the post Directory Library, Natural Sorting. You probably don't want to reimplement this routine if you don't have to.

The comparator code has been extracted and made into its own library (cmp.lib). It only has one public routine, confcmp, configure comparator. It takes an index in the Y register, a comparison type in the X register, and options flags in the Accumulator.

When the sort routine sorts, it configures two zero page pointers from the index tables to compare against each other. If the things being compared are just strings, then the index into those zero page pointers would just be zero. The first character of a string is at offset zero. But if you're sorting structs, each zero page pointer points at a struct. The struct may be composed of strings and numbers, or even other structs, so the Y index is the offset into those structs to find the property actually being compared.

Originally the configurable sort types were for: Filename, File Size, File Type, and Date. That's fine for comparing directory entries, but what about comparing other things? I've translated these into more generic sort types: Natural String (configurable with options for ASCII to PETSCII conversion, and case sensitivity), 16-bit (unsigned, little endian), 8-bit (unsigned), and 40-bit (unsigned, big endian). A filename is just a string. A file size is just a 16-bit unsigned little endian int. And a file type is just an 8-bit unsigned int. The last one is applicable to sorting a date in the format: Y M D h m, as 5 bytes.

The sort routine in qsort.lib takes a RegPtr to a comparator. Call confcmp with the property index, the comparison type, and the string config options, and it returns you a pointer to the routine, which you immediately pass through to sort. You must call confcmp before sort each time, because, as a shared library, some other process may change the configuration of the comparator since you last used it.

The Directory Library (dir.lib)

Lastly, the directory library has been transitioned to be shared. Most of the changes have already been discussed. The quick sort and comparison portions have been externalized, as have the sort indexes. The directory library has been reduced in size to just 4 pages, with room in there to add some new features.

The only routines in the directory library now are freedir and readdir. Some new features include, sending dynamic messages to the application to free up memory if it runs out of memory while loading a directory. If memory cannot be freed, the current partially loaded directory gets freed and then it raises an exception.

There is also now support for non-exception-raising errors. For example, if a partition on a device is no longer available, it will return an error without having loaded anything. This could happen, for example, if you change SD Cards, or you change floppy disks in a CMD FD. It could also happen if a location was saved, then you left C64 OS and deleted a partition from a CMD HD or RamLink, then returned to C64 OS. Other error conditions include, a directory no longer being there, or a disk no longer being in a floppy drive, or a device no longer being available. These are all cleanly handled now.

I haven't implemented support for reading in timestamps from files yet. But I have at least half a page of memory left in the library to implement that.

In the old non-shared directory library, there was also the record routine to pull a record from the index tables. This routine has been completely removed from the libraries. It's not in qsort.lib, cmp.lib or dir.lib. You have to implement this in your own code now. It needs to point directly at your custom index table, and it just didn't make sense to have a routine that configured the record routine, which you had to call before using it. It ended up being simpler to just write a record routine directly against your process's index tables.

Similarly, you used to call confcmp with a sort direction bit in the options. This would rewrite the directory library's jump table entry for record to point either at record_a or record_d (for ascending or descending.) Changing the sort direction now has to be handled by your program. File Manager changes the direction of table columns, whereas the Utilities Utility always lists in ascending order.

Here's what the collection of all four routines looks like in the File Manager, just to give you an idea of how it works:

  • sortdir
  • record
  • record_d
  • record_a

There you have it. It's not a lot of code. And if you don't need to change sort direction3 then it gets reduced down to just the 5 lines of record_a. Record_a's two LDA addresses get modified by the init code that allocates the index pages when the File Manager gets launched.

Final Thoughts

It took a bit of figuring out to support loading and unloading shared libraries, doing the resource counting, and so on. And then I had to transition the libraries themselves to be shareable. And I had to update the existing Utilities and Applications that depended on them to load and unload them with the new mechanism, and to use them properly.

But now that that's done, I feel much better moving forward.

Here's a quick video showing memory usage of the Utilities Utility varying depending on whether it's being opened concurrently with App Launcher or File Manager. When opened with the File Manager, it consumes 12 pages of memory (3KB). But when opened with the App Launcher, it consumes 18 pages (4.5KB).

Why does it take 1.5KB less memory when run with the File Manager? Because those 6 pages are in libraries that the File Manager has already loaded in that the Utility can share.


Correction: At time 00:47 I say, "...uses 54 pages." That should be, "54 pages are free."
  1. Changing sort direction does not require resorting the pointers. Sorted pointers are always in ascending order. To retrieve a record in descending order the requested index only needs to be rewritten as: X = num_of_records - 1 - X. Thus, if there are 55 records, and you ask for X = 0, it rewrites X as 55 - 1 - X, and gives you record 54, the last one. If you ask for X = 1, it rewrites X as 55 - 1 - X, and gives you record 53, and so on.
  2. Nothing hurts more than getting a relocatable binary down to 257 bytes, which forces a full second page to be allocated, just for 254 of those bytes to be unused. If it took up something close to 2 full pages, say, 475 bytes, that's actually not that bad. In theory that would waste ~35 bytes but it also gives a bit of breathing room to fix a bug or two that will inevitably be discovered, or even to implement a couple more small features.
  3. I recently got in a shipment of old Macs. I've been cleaning them, testing them, pairing each machine with its correct keyboard, mouse, external drives, disk sets, etc. I've also been backing up and removing personal data and preparing them to be sold. One of the things I've noticed is that at least in System 7.1, Finder list views can sort by name, size, date, label, etc., but they lack the ability to sort descending. Considering how easy descending is to support, even on an 8-bit machine, it boggles the mind as to why they left that out.

Do you like what you see?

You've just read one of my high-quality, long-form, weblog posts, for free! First, thank you for your interest, it makes producing this content feel worthwhile. I love to hear your input and feedback in the forums below. And I do my best to answer every question.

I'm creating C64 OS and documenting my progress along the way, to give something to you and contribute to the Commodore community. Please consider purchasing one of the items I am currently offering or making a small donation, to help me continue to bring you updates, in-depth technical discussions and programming reference. Your generous support is greatly appreciated.

Greg Naçu — C64OS.com

Want to support my hard work? Here's how!