Written and Maintained by Gregory Nacu

Featured Posts

C64OS.com has grown to be more than just a blog about one developer's progress; it is becoming a resource to surround and support the type of C64 user who wants to benefit from the Commodore community and get the most out of modern hardware expansions for their beloved platform.

After writing many posts on the C64 OS weblog, the unfortunate reality is that some of my best work gets lost in the stream of news and developments. Be sure not to miss these full–length editorial reviews:

May 16, 2017 · Editorial · Review: FREEZE64 Fanzine
December 5, 2016 · Editorial · World of Commodore '16

Programming Reference

August 4, 2017 · Programming Reference · 6502 / 6510 Instruction Set
August 4, 2017 · Programming Reference · Commodore 64 PETSCII Codes
August 3, 2017 · Programming Reference · Commodore 64 Screen Codes


Recent Posts

November 15, 2017 · Technical Deep Dive · Anatomy of a Koala Viewer
October 31, 2017 · Programming Theory · Passing Inline Arguments
October 23, 2017 · Technical Deep Dive · How the C64 Keyboard Works
October 18, 2017 · Programming Theory · A C64 OS App's Start Of Life
September 26, 2017 · Hardware · New C64c Cases Available
September 18, 2017 · Programming Theory · Organizing a Big Module
September 11, 2017 · Programming Theory · Toolkit Introduction
August 15, 2017 · Programming Theory · Organizing Module Layout
August 4, 2017 · Programming Reference · 6502 / 6510 Instruction Set
August 4, 2017 · Programming Reference · Commodore 64 PETSCII Codes
August 3, 2017 · Programming Reference · Commodore 64 Screen Codes
August 1, 2017 · Programming Theory · Base Conversion in 6502 (2/2)
July 21, 2017 · Hardware · Commodore Logo Mark Patch
July 5, 2017 · Programming Theory · Object Orientation in 6502
June 30, 2017 · Programming Theory · Base Conversion in 6502 (1/2)
June 20, 2017 · Software · Huge Site Update
June 5, 2017 · Software · Recursive File Copier in BASIC
May 29, 2017 · Technical Deep Dive · Implementing Factorial in 6502
May 16, 2017 · Editorial · Review: FREEZE64 Fanzine
May 9, 2017 · Programming Theory · Pointers in Practice, Menus
May 1, 2017 · Programming Theory · Loading Sequential Files
April 27, 2017 · Programming Theory · HomeBase Applications
April 21, 2017 · Programming Theory · Application Loading
April 6, 2017 · Programming Theory · Memory Manager Development
March 27, 2017 · Software · Petscii Art Animation

Older Posts

Full Post Archive


News, Editorials, Progress and Reference

November 15, 2017 · Technical Deep Dive

Anatomy of a Koala Viewer

After 10 years away from the Commodore scene, the most shocking part of coming back was not the speed of a 1 MHz clock, or even a 320x200 16-color display. The most shocking part is the essential interaction model. And this was my inspiration to start working on C64 OS. GEOS, as I've mentioned in many of my posts, comes much closer to what you would expect from a modern computer. But even it has all kinds of oddities that show its age. And unfortunately, its fully bitmapped UI is just too damn slow to get used to. Using a C64's READY prompt to do ordinary computing tasks is truly a blast from my adolescent past. But returning after so many years away and relearning how it all works, it feels like a totally foreign world.

Nothing better exemplifies this than my most recent experience working with koala image files and a koala viewer program. So I'm going to deep dive on exactly what I mean by totally foreign and discuss how C64 OS will work to make the whole model more modern.

What is Koala?

Koala was a suite of drawing technologies for a number of computer platforms in the 80s. You can read all about it in this Wikipedia article. It was developed by a company called Audio Light. The suite consists of the KoalaPad, which is a drawing tablet, and KoalaPainter, an accompanying art/graphics program. The program, on a C64, works with Koala format image files. The KoalaPainter program can load existing files, which you can then edit using a wide range of tools like fills, boxes, lines, ellipses, and brushes with different patterns. Then you can save your work to disk, again in the Koala image file format.

That's great! It actually sounds pretty modern when described like that. It sounds like how Photoshop works. The quality of the images and the assortment of tools are, of course, scaled for the age of the machine, but it sounds like a pretty standard, modern way for a computer to work. You have an application, which you load and run; it has a graphical user interface; you pick a file from a directory listing on a hard drive; it loads the file in; you work on it and save it back to disk. You can later copy the image files around individually, back them up, share them with your friends, upload them to the internet (or to a BBS back in the day), and so on. Very modern.

The KoalaPad drawing tablet

As with any image file format, especially if the files are going to be distributed, the people who receive the image are most likely not artists with intentions of manipulating the image. They are normals like us who just want to look at the beautiful artwork that has been produced by others far more talented than we are. And so to view the image files it doesn't make sense to have to load the entire Koala graphics editing program. Not to mention the fact that the original full graphics editing software likely cost money, as well it should.

What you want then is a free viewer that is small and quick to load, which can display the image files created by the full editor to people who just want to look at them. Again, though, this is a very modern concept. You don't have to own Photoshop, nor launch Photoshop, to look at an image file that was produced by it.1

I like to use a Mac to convert images (JPEGs, PNGs, etc.) to Koala format (exactly what that format is, I'll get to below). And I also plan to have a network service which will fetch images from URLs and convert them to Koala format on–the–fly, so that C64s can browse the web, in a more meaningful way, via proxy. A viewer is therefore far more important to me than the original KoalaPainter program. And so I found a simple Koala viewer online. It's just 4 blocks (~1 KB) on disk. But… how do you use it? How does it work? Where does the modern end and the antiquity begin?

How is a Koala image formatted?

First, let's talk about the graphic formats we're all accustomed to. When you examine a JPEG, or a PNG, or a GIF, you actually find that the internal structure and layout of the data on the disk—even when they represent the same picture on screen—is radically, irreconcilably different from format to format. Why is that? Well, there are proximate reasons and ultimate reasons. I like thinking in terms of the latter. The ultimate reason is that the graphics capabilities of modern video hardware long ago outpaced the increases in storage and load speed from hard disk or network. I'll explain.

A MacBook Pro, today, has a pixel resolution of 2560x1600, and each pixel can show at least 24 bits of color: 8 bits of red, 8 bits of green and 8 bits of blue for every pixel. That's 3 bytes of data per pixel. 2560 times 1600 is 4,096,000 (4 MILLION) pixels, times 3 bytes each is 12,288,000 bytes. That's 12 megabytes of raw data for just a single image that fills the screen. And we all know that many images are in fact larger than the screen, and software allows the user to zoom in or out or pan around to see the whole thing. It is not at all practical or economical to actually store all those megabytes for one image.

Therefore, each of the common graphics formats, JPEG, PNG, GIF, etc. use different compression techniques (sometimes lossy) optimized for different general use cases. Each format sacrifices something, sometimes something that is difficult to perceive, in order to dramatically decrease the necessary storage requirements on disk. The task then of the viewer or the decoder, as they're more properly called now, is to uncompress the data on disk (or from a network) and reconstitute the full bitmap data in memory, whence the video hardware actually outputs the data to the screen.

And herein lies the first big difference. On a C64, 16 colors is 4 bits, and 320x200 is 64,000 pixels. That would mean at least 32 kilobytes of storage for a full screen of image, but storing and manipulating color data alternately in upper and lower nybbles is a pain. So, practically speaking, even though 16 colors can be represented with 4 bits, if each color value is given its own byte that doubles the memory storage requirement. With that in mind, 64,000 pixels times one byte per pixel takes up almost 64 kilobytes in memory. But the C64 only has 64K of memory.

For this reason, the design of the VIC-II chip relieves it from needing to assign one color to each and every individual pixel. The VIC-II can show a fullscreen image, and if the artist is clever, it can be hard to notice any limitations on colors per pixel, even though there very much are. But if you think about what that means, the VIC-II chip embeds a special type of compression directly into its native display modes. The VIC-II in fact, as we all know, has several display modes, which in a sense can be thought of as multiple native compression formats. Koala makes use of the format called "Multi-Color". This is not a post about the details of how multi-color is structured, but you can read all about it and see some examples here.

The point is, only around 10 kilobytes of memory is required for the VIC-II to show a fullscreen colored image. And what is saved to disk is also very close to 10K; a Koala image file on disk is only a few bytes bigger than the raw data in memory. It's not a bad compression format at all. If you convert a 10K Koala image to PNG, it becomes 20K. If you convert to GIF, it becomes 14K. 10K is pretty good. It becomes apparent, especially if you send the file over to a PC or Mac that doesn't have the same colors–per–pixel limitations, that the on–disk format is essentially just a full bitmap with a rather unique compression scheme and corresponding set of limitations. In fact, the Mac has no difficulty at all viewing Koala (and other) C64 image formats. You just need the right decoder. You can download and install a Quick Look plugin with support for various C64 formats here. This plugin decompresses a Koala image essentially the same way another plugin decompresses a GIF, and converts it to raw bitmap data native to the Mac's video hardware.

And so the format of a Koala image is effectively just a dump of memory. The bitmap and color memory regions are not contiguous in a C64, however, so in a Koala image those three dumped regions are packed together, plus a few extra bytes. And that's it. On disk, a Koala image file is already in what amounts to a compressed format. So it's quick to load and small to store. But, it's better than that, because the viewer program doesn't actually need to do any work decompressing and converting the data to the native format of the video hardware. Because the video hardware interprets the on–disk compression scheme natively.2
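
For concreteness, here's that layout expressed as assembler-style equates. This is a sketch based on the commonly documented Koala format; the label names are mine, but the end address agrees with the $6000-$8710 range discussed below.

    ; On disk: a 2-byte load address ($00 $60), then the three dumped regions.
    ; Once loaded to its header address, the regions sit at:
    koala_bitmap = $6000        ; 8000 bytes of multi-color bitmap data
    koala_screen = $7f40        ; 1000 bytes of screen matrix data (two colors per cell)
    koala_color  = $8328        ; 1000 bytes of color RAM data (a third color per cell)
    koala_bg     = $8710        ; 1 byte, the shared background color
    ; 2 + 8000 + 1000 + 1000 + 1 = 10003 bytes on disk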

This is very different from modern graphics formats. There are other multi-color mode image formats for the C64; you can read all about them and their technical details here. What is so telling is that it only takes a couple of lines to describe the difference between each of these formats, because they are merely different arbitrary orders in which the various dumped regions of memory are appended together. Some put color memory before bitmap memory, some after; some put the background color at the beginning, some at the end, some between the bitmap and color memory, and so on. But large swaths of the files from two of these different formats will be exactly the same: the format of the VIC-II's multi-color mode, either bitmap or color data.

Where else do things differ?

Let's start with how you actually use the Koala viewer that I found. The viewer is so simple it almost boggles the modern mind. It has a one-button user interface, and I mean, a one keyboard button UI. You load it, then you run it, and by default a bunch of crap displays on the screen. Nothing is responsive; there is nothing else at all, except crap on the screen. At first, I would just reset the computer to get out of the viewer. There goes my uptime! (A concept that does not exist in the Commodore 8-bit world.) Power cycling or resetting the machine is a common user interaction model for exiting a program.

It wasn't until I disassembled the program to figure out how it worked that I discovered you can press the space bar to exit the program (or press fire on a joystick in port 1). It's really that simple. Run the program, press space to exit the program. The end. In hindsight I should have thought to slap the long one, since that is indeed such a common interaction in games and demos that it is more or less a C64 standard. Disassembling the program also helped me figure out how you actually use this viewer program to, you know, view something. But I'll return to analyze the code a bit later. First, let's just pretend we knew all along how to use it.

Here's what you do. You first have to use commands from the READY prompt to load the image data into memory. Then you load and run the viewer, and it knows where to look in memory to find that image data. This explains why you just see a bunch of crap if you load the viewer but you haven't first loaded in any image data. The viewer just happily shows you whatever leftover crap was in that place in memory from the last thing you ran. The VIC-II happily interprets anything, even executable code, as though it were image data. This is all so very, very different from anything you'd find on a modern computer. It's just so low level. But it is fun; it's fun to feel yourself so close to the bare metal.3

Loading Koala image data
Loading Koala viewer

You'll notice that even though a Koala image file is technically data, as opposed to executable code, it is stored on disk as a PRG type file. This means it can be loaded. But the C64 has two kinds of loads. You can load a PRG relocated to $0801, which is the default, or you can use a flag that prevents the relocation and instead loads the data to the address specified in the first two bytes of the file on disk. That two byte header is not loaded into memory; it is read from disk, used to figure out where the following data should go, and then discarded. Fortunately, JiffyDOS includes 4 different load commands: (/, ^, % and £). Forward slash, up-arrow, percent sign, and British pound sign. These commands alone feel incredibly ancient. They are completely arbitrary one-character BASIC commands. They're so unusual that modern North American keyboards don't even have 2 of these 4 symbols on their keys. They do the following, respectively:

  • Load relocated to $0801 but do not run,
  • Load relocated to $0801 and run automatically,4
  • Load to header–specified address but do not run, and
  • Load to header–specified address and run (by jumping to wherever it was loaded.)

For Koala images, we have to use the percent sign (%) load. This puts the data in memory, but does not attempt to jump to it (thank god, because it's not executable). Next, we load the viewer. In the screenshot above I loaded it with forward slash (/), so that I could LIST it to see what's there before it runs. What we see is the standard BASIC preamble, SYS 2061, a BASIC command to jump to the immediately following assembly code. That's a pretty standard way to make assembly code loadable and runnable with the "/" and "^" JiffyDOS commands, or ye ol' standard LOAD"PROGRAM",8. Finally, let's run the viewer!
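
In other words, the whole dance at the READY prompt looks something like this (the viewer's filename is just a placeholder, and I'm assuming the default drive):

    %"DS9.KOA"          (JiffyDOS: load to the file's header address, here $6000)
    /"KOALAVIEW"        (JiffyDOS: load relocated to $0801, but don't run)
    LIST                (shows the one-line BASIC preamble: SYS 2061)
    RUN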

Viewing a Koala Image of DS9
After exiting the Koala Viewer to the READY prompt

Oh. My. Gawd. It's Deep Space Nine!! I'm so excited. But, seriously, it's pretty cool right? It worked. After we enjoy the image for a while we slap the long one to exit the viewer. Strangely though, we're taken back to the READY prompt but something is not quite fully restored. The majority of text on the screen is white rather than light blue. Only the final READY prompt is light blue, and that's because it was drawn to screen by the KERNAL after the program exited.

It should start to feel more and more as though this entire experience is straight out of a computer era that ended decades ago. How is it even possible that we could use the JiffyDOS un-relocated load command to get the image data into memory? For that, we need to examine the file itself. It also takes time to find equivalents on the C64 for the tools you might be used to having on a Mac or PC. I'm not picking on the C64 here, of course; once you've been living in the Commodore world for some time, you do find these tools, and you start to know intuitively which tools you need to use to accomplish common simple tasks.

DraCopy is a great little utility released by Draco in 2009. You can download it here. Besides being a useful 2-panel file system navigator and file copier, it has a built–in HEX viewer. When you don't have a HEX viewer, and you need a HEX viewer, finally finding one is like a breath of fresh air.

Using DraCopy to find a Koala image file
Using DraCopy to HEX dump a Koala image file

And there it is. We examine ds9.koa with the HEX viewer and find that the first two bytes are $00 and $60. That's little endian, least significant byte first, for the memory address $6000. When we do the JiffyDOS un-relocated load, it is reading that address from the image file, and using it to know that it should put the image data into the fixed memory address $6000... and up.

Please, take a moment to stop and think about what this means. The image file itself has hardcoded into it a fixed memory address whither the data should be loaded. Unthinkably ancient. Can you even imagine a JPEG file embedding a memory address that tells the computer where in memory the JPEG data ought to be loaded?! How presumptuous. How parochial and shortsighted. Why should it ever be the decision of the data file where it itself should go in memory? That data file has no idea what else the computer might be running or using that memory for.

Why it worked this way actually makes sense, from the original KoalaPainter program's point of view. The KoalaPainter program put itself into memory, and intentionally left a space in memory where the graphics data should be loaded to. $6000 to $8710. (More on this particular location when we examine the code of the viewer.) Next, everyone who has coded anything with files on a C64 knows that the KERNAL can do a LOAD much faster than if you loop over repeated calls to CHRIN, reading one byte at a time. It was an eminently reasonable decision for KoalaPainter to save the image data as a PRG with a header address of exactly where the program wants to load the data, for itself.

But from a viewer's perspective it makes no sense. The viewer program is not the original KoalaPainter program. Who knows how big it may be or what areas of memory it may occupy; and inside the context of an OS with a memory manager it is even more obscene. But it is what it is, and this viewer program is clearly hardcoded to look for the image data starting at $6000. Here's the thing, though: if JiffyDOS handles loading the data into memory, and the data is already formatted as expected by a native VIC-II display mode, then what is it that a Koala viewer actually has to do?

Digging into the code of a Koala Viewer

I manually transcribed this code to a Gist, from photographs of an ML monitor's disassembly. All the numbers are in HEX. This isn't the style I would typically use when writing code; labels are not used, but rather hard memory addresses, because that's how the disassembler produced the output.

Obviously I added the comments. I'm not super familiar with VIC-II programming, so why it has to mess with the VIC's raster interrupt, I truly do not know. Here's the code, let's dig in.

The code consists of three main parts. The first part, lines 1 to 44, is the "main" program. The second part is a subroutine called by the main program to configure the VIC and CIA2, and to move data from where it was loaded in memory to places the VIC can use it. And the last part is a short routine wedged into the KERNAL's IRQ service routine, which scans the keyboard and allows the main program to progress past viewing the image, to the clean up part, and exit.

So, the main program, in detail. First it masks CPU interrupts so it doesn't get disturbed while setting up memory and the VIC. Then it also masks CIA1's interrupt generator; I'm not 100% sure why this is necessary, since the CPU is already ignoring interrupts.

Lines 8 to 11 are what wedge this program's IRQ routine into the KERNAL's IRQ service routine. The KERNAL hops through a vector at $0314/$0315.
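
Wedging into that vector is a standard trick. A minimal sketch of the general technique (not the viewer's exact code) looks like this:

    wedge:  sei               ; don't let an IRQ fire while the vector is half-written
            lda #<myirq
            sta $0314         ; the KERNAL IRQ service routine jumps through $0314/$0315
            lda #>myirq
            sta $0315
            cli
            rts

    myirq:  ; ...do our per-interrupt work here...
            jmp $ea31         ; then continue into the stock KERNAL IRQ handler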

Next, it turns off the VIC's display. This causes the screen to blank out. I believe that's because it will look cleaner to move all the graphics and color data into place while the screen is off, and the screen can be turned back on when everything is ready to go. At this time the border color is also set to black. The border color must not be specified by the Koala image format. That's a shame, if you ask me; it's an important part of the image, especially if you wanted the edges of the image to blend seamlessly into the border.
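
Blanking the display and blacking the border is just a couple of register writes. A typical way to do it (not necessarily byte-for-byte what this viewer does) is:

    lda $d011
    and #%11101111    ; clear the display-enable bit; the VIC blanks to the border color
    sta $d011
    lda #$00
    sta $d020         ; border color = black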

At line 17, the main program calls the only proper subroutine. So let's go check out what that subroutine does.

The first thing it does, from lines 50 to 72, is copy 2 kilobytes of data. The way it does this is very cool. It actually loops just 256 times, one complete cycle of the X register from $00 to $00 counting backwards. On each pass through the loop, it copies 8 regions in parallel. The first 4 regions copy 1K (256 * 4 = 1024 bytes) of Koala data into Color Memory. The second 4 regions copy another 1K of Koala data into "Screen" Matrix Memory. In Multi-Color mode, Screen Matrix Memory is used to hold extended color data: both upper and lower nybbles are used to hold 2 additional colors per 8x8 pixel cell.
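
The copy loop follows a very common pattern. Here's a sketch of the idea, using the equates from the layout sketch above; the screen matrix location the viewer actually uses is my assumption.

    scrmem = $5c00                  ; where the screen matrix lives (an assumption)

            ldx #$00
    copy:   lda koala_color,x       ; 4 pages of color data -> color RAM at $d800
            sta $d800,x
            lda koala_color+$100,x
            sta $d900,x
            lda koala_color+$200,x
            sta $da00,x
            lda koala_color+$300,x
            sta $db00,x
            lda koala_screen,x      ; 4 pages of screen data -> screen matrix memory
            sta scrmem,x
            lda koala_screen+$100,x
            sta scrmem+$100,x
            lda koala_screen+$200,x
            sta scrmem+$200,x
            lda koala_screen+$300,x
            sta scrmem+$300,x
            dex                     ; $00 -> $ff -> ... -> $01 -> $00: 256 passes
            bne copy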

Interestingly, that's all the memory that is ever copied by this program. What about the 8 kilobytes of bitmap data? Now it becomes clear to us why the Koala image format, as insane as it seems, specifies where it should be loaded in memory. It's loaded to $6000 you'll recall. Let's dig into the VIC for a second.

The VIC-II chip has 14 address lines, not 16. 14 bits can address from $0000 to $3FFF, or 0 to 16383. So, the VIC can see 16K of memory at a time. The most significant 2 bits of addressing are supplied by the CIA2's Port A bit0 and bit1. This means that configuring CIA2 allows you to choose which of 4 blocks of 16K the VIC II "sees." In the C64's main addressing space those ranges are:

  • $0000 - $3FFF (CIA2's $11)
  • $4000 - $7FFF (CIA2's $10)
  • $8000 - $BFFF (CIA2's $01)
  • $C000 - $FFFF (CIA2's $00)5

Within a 16K block, there are two 8K chunks; from the VIC's perspective, an upper 8K and a lower 8K. Bitmap data is just shy of 8K (8 * 1000 = 8000, but 8K is 8 * 1024 = 8192). The VIC can be configured to read bitmap data out of either of these 8K regions, upper or lower, but always aligned to these two regions. It cannot, for example, arbitrarily read an 8K bitmap that's shifted by just 2K or something like that. If you divide the 2nd bank in half, the lower 8K goes from $4000 to $5FFF, and the upper 8K goes from $6000 to $7FFF. Bingo. $6000 is the start of an 8K region whence the VIC-II can directly read a bitmap.

So, what's neat about loading a Koala is that the JiffyDOS % command (load to the header address), without any other viewer or decoder program's involvement, literally loads data directly off the disk and straight into video memory, exactly where it needs to be for the VIC to display it. Crazy low level. But, hey, it's hard to imagine how it could be more efficient.

Moving on now.

Lines 74 to 94. The VIC's background color is set, which is just one byte copied from the Koala data into the VIC's background color register. Next, in short order, the VIC's raster interrupt is masked (or unmasked?); I'm not sure what this is for. The byte inside the infinite loop code in the main program is set so the loop will loop; I'm not sure why this had to be done here and not in the main part of the program itself. CIA1's interrupts are... started? Or touched; again, not sure why this has to be done. Then the VIC's raster counter is set. Next the VIC's memory pointers are configured. This is what tells the VIC to get the bitmap from the upper 8K, and where in the lower 8K the screen memory (1K, used in this mode for extended color info) should be found. And the write to $DD00 goes to CIA2, which configures which of the four 16K banks the VIC should see. Lastly, multi-color mode is turned on and the subroutine ends, returning to the main program.
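
Put together, the register setup being described amounts to something like this. It's a sketch with my assumed screen matrix location carried over from above, and the raster interrupt fiddling left out.

    lda koala_bg
    sta $d021         ; background color, straight from the Koala data

    lda $dd00
    and #%11111100
    ora #%00000010    ; select the $4000-$7FFF bank (the bits are inverted, see footnote 5)
    sta $dd00

    lda #$78          ; %0111 1000: screen matrix at $5c00, bitmap in the upper 8K ($6000)
    sta $d018

    lda $d011
    ora #%00100000    ; bitmap mode on
    sta $d011
    lda $d016
    ora #%00010000    ; multi-color mode on
    sta $d016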

Back in the main program, now that all the color data has been moved into place and the VIC's registers have been configured, the VIC's display is reenabled. And lo and behold the image appears.

CPU interrupts are unmasked, and at lines 24 and 25 the main program just goes into a hard infinite loop. This, to me, is one of the things that makes this program (and the C64, with its lack of a modern OS) feel the absolute oldest. Literally, the program goes into an infinite loop that is just 4 bytes long: 2 opcodes and 2 data bytes. It loads an immediate value of #01 into the accumulator. Then, if the value loaded is not zero, which, duh, it's not, it branches right back into loading the accumulator again. Argh! While you look at the image, that beautiful 1 MHz CPU spins like an insane idiot doing absolutely nothing. Well, almost nothing. Modern computers don't do this, and C64 OS will never spin doing absolutely nothing like this.
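
Reconstructed, the loop is literally just this (the label is mine; per the disassembly, the operand byte lives at $0832):

    spin:   lda #$01      ; the IRQ wedge overwrites this #$01 operand with #$00
            bne spin      ; so the branch only falls through once space is pressed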

Almost, but not quite, nothing. Because there is still the interrupt service routine. So let's look at that. While the main program is in a tight infinite loop, the routine at lines 100 to 110 is being called 60 times a second. The first thing it does is clear the VIC's raster interrupt. I'm not super familiar with raster interrupts, but actually, it looks like CIA1 might not be the one generating the interrupts this time; it's the VIC's raster that generates the interrupts. Not sure why they did it this way. But in any case, reading from $DC01 checks the column that the space key (and control port 1's fire button) is in. You can read my recent post about How the C64 Keyboard Works for more detail on why this works.

If the space key is not held down, it skips over the next two lines and returns into the KERNAL's usual IRQ routine. However, if the space key is held down, it writes a #00 directly into $0832; that's the address that holds the immediate argument inside the main program's infinite loop! Thus, it breaks the loop and allows the main program to proceed into its second half.
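
Filling in the wedge from the earlier sketch, a minimal version might look like this (the real routine also acknowledges the VIC's raster interrupt first, and may not bother writing $DC00 at all):

    spinop = spin+1       ; the #$01 operand inside the spin loop ($0832 in the real viewer)

    myirq:  lda #$7f
            sta $dc00     ; select the keyboard row containing the space bar
            lda $dc01     ; read the columns; joystick 1's fire button also appears here
            and #%00010000
            bne done      ; bit 4 is active low: still set means nothing is pressed
            lda #$00
            sta spinop    ; break the main program's spin loop
    done:   jmp $ea31     ; carry on into the KERNAL IRQ handler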

The main program, following the infinite loop, is from lines 27 to 44. Everything is done between a pair of SEI/CLI, so that interrupts are masked at the start and unmasked at the end. Then it does three quick subroutine calls into the KERNAL. These initialize the SID, CIA, VIC and IRQ, and reinitialize the I/O vectors (which unwedges the IRQ subroutine previously discussed.)
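
Those three calls are almost certainly the standard KERNAL jump-table entries for this kind of cleanup; that's an assumption on my part, since the disassembly isn't reproduced here, but it matches the description:

    jsr $ff84     ; IOINIT: reinitialize the CIAs, SID and the system IRQ
    jsr $ff8a     ; RESTOR: restore the default vectors at $0314+ (unwedges our IRQ)
    jsr $ff81     ; CINT:   reinitialize the VIC and the screen editor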

The only problem with returning to BASIC immediately is that Multi-Color mode uses color memory for part of the image's color map. But when the VIC is put back into character mode, that color memory is used for the colors of the characters. So color memory needs to be reset. But what to reset it to? In this case, with a short loop (again working on 4 regions per pass), color memory is filled with #01... white. And that explains why all the characters on the screen are white after the program returns to BASIC: color memory is shared between character mode and bitmap mode, and when the Koala data is copied into color memory, it clobbers whatever was in there. Then the viewer picks white, arbitrarily, to put back in. It could put light blue back in and most people wouldn't notice, but what would be even better is if it made a backup of color memory, and then restored from that backup.
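
The fill itself is the same four-regions-per-pass trick as the copy loop, just writing a constant:

            ldx #$00
            lda #$01          ; white
    fill:   sta $d800,x       ; four pages per pass covers all 1000 bytes of color RAM
            sta $d900,x       ; (and harmlessly overshoots into the unused last few nybbles)
            sta $da00,x
            sta $db00,x
            dex
            bne fill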

So that's the whole process, in as deep detail as I care to go.

Screenshots of the disassembly in an ML Monitor

Some Thoughts About Modernization

Memory

Inside the context of an OS, memory cannot just be clobbered arbitrarily. It just can't be. What if the Ethernet driver has allocated some space to buffer some incoming data, and you don't even know where those allocations have come from? Memory in the middle of the computer, especially around $6000, cannot just be blindly filled up with data that is loaded straight from disk. Even if that space were available, you'd need to at least mark it as occupied, so that future allocations don't come out of it.

The VIC-II shares memory with main system memory. And a bitmap can only be in one of 8 possible locations. In my mind, the ideal place for bitmap data is under the KERNAL ROM. The 4th 16K block that the VIC can see goes from $C000 to $FFFF. That range, in C64 OS, is never allocated by the memory manager. It's too messy to use for arbitrary purposes. $C000 to $CFFF is used for C64 OS code, so that's never available. $D000 to $DFFF is under I/O, and is used in precision ways by C64 OS for storing system related data, which it accesses by patching out I/O when it needs to. And in C64 OS the KERNAL ROM is used and usually patched in, which covers the remaining 8K, $E000 to $FFFF.

However, the VIC doesn't always see what the CPU sees. Even when the KERNAL ROM is patched in for the CPU, when the CPU writes to $E000-$FFFF those writes go into the underlying RAM. And when the VIC reads from that range, it always reads from RAM. That's so great. It means that, with the KERNAL ROM patched in, code executing within that range can write data into that same range and affect what the VIC is showing in real time. As long as your code needs the KERNAL, and also wants to show bitmap data, storing the bitmap data under the KERNAL sounds like a no-brainer.

If bitmap data is in the 8K block under the KERNAL, then screen matrix memory can be configured to be found somewhere inside the 8K block from $C000 to $DFFF. In C64 OS, the layout of $D000 to $DFFF is managed manually, much the way the system workspace is managed manually from $0000 to $07FF. A thousand locations can simply be reserved under I/O space, just for extended color info for multi-color bitmaps. No need to dynamically allocate it; that's just where it goes.

What does this mean for Koala image files? Or all the other multi-color format variants, like Advanced Art Studio, Amica Paint, CDU-Paint, Draz Paint, etc.? When reading these from disk, they should be read more along the lines of how C structures are read into memory. The headers on a file have prescribed sizes. You declare a structure composed of a certain number of bytes, and you allocate space for that structure. Then you populate the structure by loading a fixed number of bytes from the file into the memory allocated for the structure. And then you can access all the properties of the struct to know more about the file contents.

A loader for Koala would start by loading 8000 bytes into $E000. Then another load operation would load 1000 bytes into where screen matrix memory will be assigned in $Dxxx. Then, if you want to have color memory preserved, perhaps another 1000 byte block under $Dxxx should be managed by C64 OS as the color memory buffer. But the point is that you have to load the data in chunks.

File Access

It is much faster to use the KERNAL's LOAD routine than to loop over CHRIN and put the bytes where you want them to go. It's also easier to use; it requires less code. However, LOAD is brutally uncontrolled. It must load the entire file, and it must all go to one contiguous range of memory. And that's just not going to cut it.
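
For contrast, here's what fully controlled but slow chunked reading looks like with the stock KERNAL. The filename, device and buffer location are illustrative choices, and this only grabs the first 256 bytes; a real loader would keep a 16-bit count and aim each chunk at its final destination.

    buffer = $c000                ; free RAM on a stock C64 (an arbitrary choice)

            lda #fnend-fname
            ldx #<fname
            ldy #>fname
            jsr $ffbd             ; SETNAM
            lda #$02              ; logical file number
            ldx #$08              ; device 8
            ldy #$02              ; secondary address 2: an ordinary data channel
            jsr $ffba             ; SETLFS
            jsr $ffc0             ; OPEN
            ldx #$02
            jsr $ffc6             ; CHKIN: CHRIN now pulls bytes from our file

            ldy #$00              ; note: the first two bytes read are the PRG header address
    rdloop: jsr $ffcf             ; CHRIN: one byte at a time
            sta buffer,y
            jsr $ffb7             ; READST: nonzero on EOF or error
            bne rddone
            iny
            bne rdloop            ; stop after 256 bytes (or wherever you like)

    rddone: jsr $ffcc             ; CLRCHN
            lda #$02
            jsr $ffc3             ; CLOSE
            rts

    fname:  .text "DS9.KOA,P,R"   ; ,P,R: read a PRG-type file over a data channel
    fnend: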

C64 OS has a File module, which not only works with C64 OS File References, but also offers the standard FOPEN, FREAD, FWRITE and FCLOSE. This makes it very easy to load in chunks. You call FOPEN with a pointer to a File Reference (which will get created for you using the C64 OS file manager), then you call FREAD with a pointer to the open File Reference, a pointer to where the data should go, and a 16-bit length to read. And that's all you have to do. FREAD will put the data in the buffer. It handles errors, it handles looping, it handles hiding the mouse cursor and other implementation details.

Will FREAD be slower than the KERNAL's LOAD? Yes, by definition it has to be, as it uses the KERNAL under the hood. But, on the other hand, we're not loading from a 1541 anymore, (hopefully.) We're loading from a CMD-HD or an SD2IEC, or better yet from an IDE64 or a 1541 Ultimate.

More Sophisticated Data Structures

All of these image formats (20 of them, hires and multi-color, not counting the interlaced and more advanced formats) just assume the image is exactly 320x200, and that nothing else whatsoever will appear on the screen at the same time as the image.

Storing data in a format native to the VIC-II is fine. But what about images that are bigger than 320x200? What about images that are smaller? Smaller images make a lot of sense to me, especially if you've got a background scene, say, already loaded, and you want to load in a small rectangular area to overlay on top of the scene. I can imagine tonnes of uses for this in games, or animation sequences, or presentation software. Or even just to overlay some UI controls on top of a bitmapped image. Instead of having no UI except slapping the space bar, in C64 OS the mouse will be active; what about being able to load in a "close" X, or back and forward arrows, and display them on the bitmap screen? They could be loaded directly from disk when you want to show them. The possibilities are endless.

A more modern C64 image format doesn't need to abandon the idea that the data is in a VIC-II friendly format. But what it should have is a standard header that can be loaded first into a struct somewhere. That could specify the dimensions of the image data in 8x8 cells, the data format (hires or multi-color), and the offsets into the file for the color and bitmap sections. It could even specify where on the screen it ought to display, if that made sense. But it wouldn't do that with hardcoded memory addresses; it would do it with offsets from a virtual top left corner.
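
Purely as a sketch of the idea, such a header might look like this; every field name and value here is hypothetical, sized for a 10x8-cell multi-color image with the data following immediately after the header.

    img_width:   .byte 10         ; width in 8x8 cells
    img_height:  .byte 8          ; height in 8x8 cells
    img_mode:    .byte 1          ; 0 = hires, 1 = multi-color
    img_cell:    .byte 2, 3       ; suggested placement, in cells from a virtual top left corner
    img_scroff:  .word 11         ; file offset of the screen matrix data (right after this header)
    img_coloff:  .word 91         ; file offset of the color data
    img_mapoff:  .word 171        ; file offset of the bitmap data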

Concluding thoughts

Going into a tight infinite loop as your program's main behavior can't be how it works. I understand why this viewer does that. I mean, what else is it supposed to do? Where else ought the CPU to execute? It's got to execute something, and with the image statically displaying on screen, and the IRQ routine checking for the space bar, there isn't really anything else to do.

But, even though C64 OS isn't "multi-tasking", it's still got system level code that is capable of processing events. After setting up the image to display, in C64 OS, your code would just return. And it's the OS's own main event loop that does the infinite looping, meanwhile maintaining the ability to respond to timers, pass mouse and key command events to your code, process incoming network data, and whatever else gets added in future versions of the OS.

Okay, that was long. But it's good to go into this stuff. Leave your thoughts in the comments section.

  1. It's not a perfect analogy because modern image editors store extra data besides the final image. Such as layers, groups, tags and selections. So they export a "flattened" version of the image. Whereas Koala edits the same flat file that is ultimately viewed. But, it doesn't affect my argument much. []
  2. That said, there are some formats that are compressed on disk further. Such as RLE compressed Koala format. These require the viewer to use a decompression routine. However, such schemes are very primitive. []
  3. I wouldn't even know where to begin to actually put data, directly, onto the screen on a Mac. Most nerds would ask, "Why would you ever want to do that?" Just out of curiosity. My guess is that you have to be some sort of super privileged low level OS process to be able to put things directly into video memory. It's so arcane it seems that no one really knows how to do it. []
  4. Note, of course, that these "relocations" don't modify the code. If the code is not intentionally designed to be runnable from $0801 it will crash. The relocate is a simple shift of position in memory. Real code relocation is much more complicated; although it is possible, it requires a special binary file format and a special assembler. See here, for example. []
  5. It tripped me up for a minute, because the program writes #$02 to $DD00; to me, that looks like it should be the 3rd 16K bank: 00, 01, 10, 11 is 0, 1, 2, 3, right? But I checked the C64 schematics, and indeed, the VA14 and VA15 address lines coming off the CIA2 have a bar above them. That means they're inverted, for whatever reason. So the correct order is 11, 10, 01, 00 for banks 0, 1, 2, 3. Thus #$02 really does give you the second bank, not the 3rd. []
October 31, 2017 · Programming Theory

Passing Inline Arguments

Happy Hallowe'en C64 enthusiasts. This post is about passing inline arguments to subroutines in 6502. Hopefully it won't be too spoooky.

When I first began learning 6502 ASM, many years ago, one of the issues that vexed and confused me the most was how to pass arguments to subroutines. Arguments are, basically, the data that the subroutine is supposed to work on. I'd been programming in JavaScript, and was learning C, and I knew that functions could take an arbitrary set of arguments, but at the time I wasn't clear on how those arguments were actually passed and accessed by the function. And when I started looking into the KERNAL, as you'll see if you read up on C64 KERNAL documentation,1 all of the calls make use of processor registers to pass data to the routines.

This has some advantages and some disadvantages which I'll talk about below. But the most obvious problem is that the 6502 only has 3 general purpose registers, .A, .X and .Y, plus some flags in the Status Register that can be used as single bit arguments. The carry is the flag most frequently used for passing data to and from a subroutine. Each of the registers is only 1 byte (8 bits), so using registers you're limited to a total of 3 bytes of arguments, plus maybe a couple of additional bits. It is common when writing C (or JavaScript) functions to have 5 or 6 arguments, some of which are pointers that need at least 2 bytes each. How can we handle this in 6502 ASM? How does C handle this? That's what this post is about.

The KERNAL and Register Arguments

As limiting as just 3 bytes of arguments might seem, there are a lot of routines one could write in this world that can get by with so few. As mentioned above, the KERNAL uses processor registers exclusively for its arguments. There are times when three bytes aren't enough. But to handle this the KERNAL simply requires you to call one or more preparatory routines first.

The chief example is opening a file for reading. In a C program this can be done with a single function, but in the KERNAL you have to supply several arguments: a logical file number (1 byte), a device number (1 byte), a secondary address or channel number (1 byte), a pointer to a filename string (2 bytes), plus the length of the string (1 byte). Count 'em up, that's 6 bytes. The KERNAL therefore requires you to call SETLFS with the first three arguments, and then SETNAM with the string pointer and length, before you can call OPEN. (And incidentally, OPEN doesn't take any input arguments.) A bit more complex than C, but not outrageous.
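
Spelled out in code, those 6 bytes of arguments reach OPEN in installments; the device and filename here are just examples.

            lda #$01              ; 1 byte: logical file number
            ldx #$08              ; 1 byte: device number
            ldy #$00              ; 1 byte: secondary address / channel
            jsr $ffba             ; SETLFS
            lda #fnend-fname      ; 1 byte: filename length
            ldx #<fname           ; 2 bytes: pointer to the filename string
            ldy #>fname
            jsr $ffbd             ; SETNAM
            jsr $ffc0             ; OPEN: itself takes no register arguments
            bcs openerr           ; carry set on failure; .A holds the error code
            ; ... use the file ...
            rts
    openerr:
            rts

    fname:  .text "DS9.KOA"
    fnend: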

Here's a neat thing about using registers for subroutine I/O: some KERNAL calls use more than one register for return data. In the case of OPEN, the carry bit is used to indicate success or failure, and if it failed then .A holds the error code. But some other routines, such as RDTIM (Read Time), return the 3-byte Jiffy Clock time in .A, .Y and .X. C and JavaScript are limited to a single return value (although it can be more than 1 byte).

The advantage of register arguments is that they're fast. Load the .A register with a value: that can be as short as 2 cycles. JSR to a routine that takes .A as its only argument: the JSR takes a standard 6 cycles, but .A is already loaded and ready to use by the code in that routine. I mean, there is virtually no overhead. Especially if .A was already set by a previous routine's return, there is sometimes literally zero overhead in passing that data through to the next subroutine. So, if you can get away with using just 3 bytes of arguments, it's fast, it's really damn fast.

But there are downsides. Registers are effectively little more than in–processor global variables: global variables which certain instructions operate directly on, and which those instructions require in order to work at all. An indirect indexed load, for example, can only use the .Y register, and must use the .Y register. So if you use the .Y register to pass in an argument, but the code in the routine needs to do an indirect indexed instruction, the .Y register has to be overwritten, and whatever argument was passed in on it has to be saved somewhere so as not to be lost. This makes matters somewhat more complicated.
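
Here's the kind of shuffle that forces on you. Suppose a routine takes an index argument in .Y but also needs an indirect indexed read through a zero-page pointer; the labels and addresses here are purely illustrative.

    ptr = $fb                 ; a zero-page pointer the routine already maintains (assumption)

    lookup: sty ytemp         ; the .Y argument has to move out of the way...
            ldy #$05          ; ...because (ptr),y is hard-wired to index with .Y
            lda (ptr),y       ; fetch, say, the sixth byte of whatever ptr points at
            ldy ytemp         ; then restore the caller's argument before using it
            rts

    ytemp:  .byte 0           ; a spare byte to stash the argument in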

The other downside is that if your routine is using .A, .X and .Y for holding on to temporary state, such as the index of a loop, but inside the loop you JSR to another routine, it is necessary to know which registers that other routine (and, more complicated still, any subroutines that it may call, and so on ad infinitum) will use. The KERNAL routines are all very self–contained. They generally don't go off and call other routines, so their register usage is predictable. In fact, the documentation tells you exactly which registers are used (and thus disrupted) by each KERNAL routine. For example, VECTOR takes .X and .Y as inputs, and returns a value in .X, but it uses .A in the process. So before calling VECTOR, if you care about what's currently in .A you'd have to back it up first.

For the accumulator, backing up and restoring is easily done by pushing and pulling from the stack, but pushing and pulling .X or .Y is much more complicated. On the NMOS 6502, and the 6510 (the NMOS 6502 variant used in the C64 and C128),2 there is no way to push .X and .Y directly to the stack, nor to pull them directly off the stack. You have to first transfer them to .A, then push that. And you similarly have to pull to .A and transfer to .X or .Y. This adds complication, because it involves disrupting .A. In the end, the most efficient way to write 6502 code that uses and worries about the registers for argument passing and local variables is to write code carefully by hand. Like an art form.
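
So preserving .X and .Y around a call ends up looking like this on a 6510 (someroutine stands for whatever you're calling; a 65C02 or 65816 would just use PHX/PHY):

            txa
            pha               ; save .X, by way of .A
            tya
            pha               ; save .Y, by way of .A
            jsr someroutine
            pla
            tay               ; restore .Y (note the reverse order)
            pla
            tax               ; restore .X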

C Language and the Stack

C is a super popular language, and without even considering how much code is still written in C today, most other modern languages are modeled syntactically on C. Don't believe me? The Wikipedia article, List of C-family programming languages, lists over 60 languages in the C family, one of the most recent being Apple's Swift, first released in 2014. The original C language began development in the late 1960s and first appeared in 1972.

Assembly is a step above machine code, abstracting absolute and relative addresses with labels, and instructions with mnemonic codes, but the assembly code maintains a strict one–to–one relationship between the code typed and the instructions output by the assembler. The assembly programmer still has complete control over the precise gyrations the CPU will go through. C is a compiled language, which means by definition the programmer is not the one writing the machine code; the compiler is. The compiler is not an artist. It needs simple, reliable rules that operate on a very local scale to produce predictable output that will work. There is no way that a compiler would go off looking through the code of routines that the current routine will call, to see which registers it will use and thus which are safe for its own use. It's just, too artsy. It's not rigorous enough.

Instead, the way C works is that all arguments are passed on the stack. Not only are all arguments on the stack, but all local variables of a function are also on the stack, and the return value is sent back via the stack too. The big advantage is that every instance of a function being called has its own unique place in memory, and that place is dynamically found and allocated. This makes recursive code really easy. A single function can call itself over and over, deeper and deeper, and each call instance has its own set of variables that don't interfere with the previous instances, because they are each stored further and further into the stack.

Passing arguments is easy too. The declaration of the function (or the function's signature) defines how many bytes of arguments it takes. When the function is called the caller puts that much data onto the stack and the function uses the stack pointer to reference those variables.

So, why doesn't the C64 programmer use C or a C–like solution?

I said it so poetically in my review of World of Commodore 2016, after listening to Bil Herd talk about the development of the C128 and where it was in the market at the time. People back in the 60s and 70s already knew what could be accomplished by a computer with an incredible amount of computing power. Computers in those decades already had 16 and 32 bit buses, megabytes of RAM and complex, timeslicing, multi–user Unix operating systems. The only problem was that these computers cost millions of dollars and were as big as modern cars. But, nonetheless, they had many of the modern computer amenities. What made the 6502 so special was not that it was a powerful and fully featured CPU, but that it was a complete central processing unit, all crammed together into a single IC package that could be sold for just a few dollars apiece. As I said in my article, early home computers really were just toys, in the eyes of the big business computers of the preceding 15 to 20 years.

Take the PDP-11, for example. The PDP-11 was commercially available starting in 1970, contemporaneous with the development of C. It was a very popular 16-bit minicomputer by DEC, that looks like an enormous stand–up freezer with tape reels on the front. Its main processor was in fact the design inspiration for both the Intel x86 family and the Motorola 68K family. And the C language explicitly took advantage of many of its low–level hardware features.

Just as a point of comparison, while the 6502 has 3 general purpose 8-bit registers, the PDP-11 has 8 general purpose 16-bit registers. But the most important advantage is its addressing modes. The PDP-11 has, among others, 6(!) stack relative addressing modes and efficient support for multiple software stacks in addition to the hardware stack. The 6502 has a single, fixed address hardware stack, a single 8-bit stack pointer, no dedicated features for software stacks, and NO stack addressing modes at all. In fact the 6502 has just two instructions for working with the stack pointer directly: TSX and TXS, transfer the Stack Pointer to the .X register, and vice versa, respectively. What this means is that performing operations directly on data in the stack is severely limited.

It is possible to write in C for the 6502; there is a 6502 C compiler. But my understanding, from conversations I've had on IRC, is that it works by emulating in software the features of the CPU that are not supported natively. This is a recipe for slow execution. The bottom line is, no matter how convenient C's stack–oriented variables and arguments are, C was not designed for the 6502. And the 6502 is simply not well suited to running C code.

Alternative Argument Passing

Okay, now we know how the KERNAL works and what the limitations are with its techniques. And we also know why it is that the C64 doesn't just standardize on C (or a derivative) like every other modern computer. But we are still left with the problem of how to pass more than 3 bytes worth of arguments to a subroutine.

I can think of at least two ways to work around the limitation and then I'll go into some detail on a third and fairly clever solution. I found these tricks and techniques by reading the Official GEOS Programmer's Reference Guide. It discusses various ways that the GEOS Kernal supports passing arguments to its subroutines.

Zero Page on the 6502 has been described as a suite of pseudo registers. Almost every instruction has a Zero Page addressing mode that can work with data there more quickly than in higher memory pages. But, there is a limited amount of ZP space, just 256 bytes. And $0000 and $0001 are actually taken by the 6510's processor port.

But one trick that GEOS uses is to reserve 32 bytes of Zero Page as a set of 16 16-bit registers, which it numbers R0 through R15. After that, it treats them very much the same way we would treat the real registers, with all the limitations described above. I said that one problem with the CPU registers is that they are effectively global variables. And so are the GEOS ZP registers. The advantage is that you've got 16 of them, and they're 16-bits wide. So various routines that need several 16-bit arguments are simply documented as: you put a 16-bit width into R5, you put an 8-bit height into R6, you set a fill pattern code into R3, and a buffer mode in R1, and then you jump to this subroutine and it draws a square and blah blah blah. I'm making up the specifics, but you get the point.

When the subroutine is called, it expects that you have populated the correct global ZP "registers" with suitable values, which it reads and acts upon. GEOS also comes with a suite of macros that help you set those registers. It's a bit slower than using real registers the way the KERNAL does, and has many of the same limitations, but it greatly expands the space, up to 32 bytes from just 3 bytes. You also have to dedicate a lot of Zero Page to this solution. This is something GEOS could do because it replaces the C64 KERNAL entirely, and can do whatever it wants with the entire computer's memory space. This is more or less a solution I was able to dream up on my own, but in a less structured way. You can always just shove a few bytes directly into known system work space addresses, and then call a routine that will use those work space addresses. But there is something that feels a bit dirty about this. What happens if you change the KERNAL and reassign some common work space addresses? Likely, old apps will stop working and will probably crash the system.
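
In code, the convention looks something like this. The register addresses and the drawbox routine are made up for illustration, just like the description above; they are stand-ins, not necessarily GEOS's actual assignments.

    r0 = $02              ; 16 two-byte pseudo-registers somewhere in zero page (illustrative)
    r1 = $04
    r3 = $08
    r5 = $0c
    r6 = $0e

            lda #<320         ; 16-bit width into r5
            sta r5
            lda #>320
            sta r5+1
            lda #$50          ; 8-bit height into r6
            sta r6
            lda #$03          ; fill pattern code into r3
            sta r3
            lda #$00          ; buffer mode into r1
            sta r1
            jsr drawbox       ; the routine reads the global "registers" it documents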

An alternative solution, one I have come up with on my own (I don't recall seeing this in the GEOS Kernal), is to pass a pointer to a structure. This is the essential behavior in C64 OS of the Toolkit and its view objects. And it is also the way File References work. When calling a subroutine that operates on a file, for example, rather than manually passing 5 different properties of varying 8- or 16-bit widths the way the KERNAL does, the properties are set on a local (or allocated) block of memory. Then a pointer to that block of memory is put in .X and .Y (lo byte, hi byte) and the routine is called. The routine generally needs to write this pointer to zero page, and then use the .Y register as an index in indirect indexed mode to work with those properties.

There is no way around needing to use some global zero page space, as indirect pointers can only be referenced through addresses in zero page. However, moving the responsibility for deciding what free space in ZP to use from the calling routine to the called routine feels much less dirty.
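
A sketch of that pattern follows; the zero-page location, the routine name and the struct offsets are assumptions for illustration, chosen by the called routine as its own business.

    fptr = $fb                    ; a zero-page word the called routine claims for itself

    caller: ldx #<fileref         ; lo byte of the structure pointer in .X
            ldy #>fileref         ; hi byte in .Y
            jsr withstruct
            rts

    withstruct:
            stx fptr              ; park the pointer in zero page...
            sty fptr+1
            ldy #$02              ; ...so fields can be read with indirect indexed mode
            lda (fptr),y          ; e.g. the third byte of the structure
            rts

    fileref: .byte 8, 1, 2        ; an illustrative three-byte structure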

Using structures and passing pointers to structures is a pretty good solution, and it solves a lot of problems. But it isn't the best solution for all problems. Which is why we'll now look in more detail at one more solution I found in the GEOS Programmer's Reference Guide.

Inline Argument Passing

Let's just use the concrete example of what I've run into in C64 OS to understand the issue. C64 OS has a file module, which knows how to work with File Reference structs, and is a light wrapper for working with streams of data. The essential functionality I want is: fopen, fread, fwrite and fclose. Fopen will use a file reference struct to identify the device, partition, path and filename, and it also dynamically stores the logical file number, which doubles as the status for whether the file is currently open. Using a file reference requires a minimum of a 2–byte pointer. The file can be opened either for read or for write, after which the pointer to the file reference can be used with fread or fwrite.

If the file was opened for read, for example, then we will want to call fread to actually read some data out of it. The arguments for fread will be, a pointer to the file reference, a pointer to a buffer of memory into which the data should be read, and a 16–bit length of data to read. That's 6 bytes of arguments.

If on the other hand the file was opened for write, then we want to call fwrite to write some data in memory to that file. The arguments will be, a pointer to the file reference, a pointer to a buffer in memory where some data is to be written, and a 16–bit length of data to write. That is also 6 bytes of arguments.

Closing a file reference is pretty straightforward. It only needs the 2 byte pointer to an open file reference. It just reads the logical file number out of the reference struct, closes the file and releases the LFN.

But now let's consider opening the file. If opening for read, not too hard. We need the pointer to the file reference, plus probably 1 byte to indicate the read/write direction. That's only 3 bytes, we could get away with putting the read/write flag in .A and the file ref pointer in .X and .Y.

When it comes time to open a file for writing, though, it becomes a bit more needy. You need the file ref pointer (2 bytes), the read/write flag (1 byte), plus a file type byte (1 byte, USR/SEQ/PRG), and you also need to indicate whether an overwrite should be permitted if the file already exists (1 byte), or if an append to an existing file should happen (1 byte). Naively, that's 6 bytes. Although, you could assign the read/write flag to 1 bit, the file type to 2 bits, and the overwrite and append to 1 bit each, and pack all that into a single byte argument. Dealing with bits makes your code fatter, though. And nothing gets around the 6–byte requirements of the fread and fwrite routines.

Some GEOS routines can take inline arguments. Here's how it works:

You do a JSR to a subroutine, and in the calling code, you follow the JSR line with a series of argument data bytes. Effectively, as many as the routine needs. They're called inline arguments because they follow inline with your code.

The routine that accepts inline arguments begins by transferring the Stack Pointer to the .X register. It then uses the stack pointer to find the return address that the JSR pushed on. This address points at where the inline arguments begin (technically, at the byte just before them, since JSR pushes the address of its own last byte). It puts this address into a zero page pointer, through which it can use indirect indexed reads to access the arguments (and optionally later overwrite them too, if that made sense). But it must also change the values on the stack to move the return address beyond the end of the inline arguments. When the routine does an RTS, the updated return address will be pulled off the stack and execution will continue just following the inline arguments. It's very clever. It's not always what you need, but it's a really great tool to have in the box if you need it.
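
Here's a minimal sketch of the whole technique in one place. This is not the sample code walked through below; the labels, the choice of zero-page word, and what the routine does with the arguments are all mine.

    argptr = $fb                  ; an arbitrary free zero-page word (an assumption)

    demo:   jsr inlinesub
            .byte $01,$02,$03     ; three inline argument bytes
            rts                   ; execution resumes here after inlinesub's RTS

    inlinesub:
            clc                   ; prepare for a 16-bit add
            tsx                   ; .X <- stack pointer
            lda $0101,x           ; low byte of the pushed return address
            sta argptr            ; keep a copy: it points at the byte before the args
            adc #$03              ; skip the three argument bytes...
            sta $0101,x           ; ...by bumping the address on the stack
            lda $0102,x           ; high byte of the pushed return address
            sta argptr+1
            adc #$00              ; propagate any carry from the low byte
            sta $0102,x

            ldy #$03              ; JSR pushed (first arg - 1), so args sit at offsets 1..3
    grab:   lda (argptr),y
            sta argbuf-1,y        ; stash them for the routine to use
            dey
            bne grab
            rts                   ; returns to just past the inline bytes

    argbuf: .byte 0,0,0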

Reading about it in the GEOS programmer's reference guide, it sounds a bit complicated to use. So I've written some sample code to show and explain how easy it really is. And how you can abstract the code to reuse it for different routines that take differing numbers of arguments.

Inline Arguments in Practice

Okay, so let's walk through this code and see how it works. We assemble to $0801, which is where a BASIC program should start. And the first include is the BASIC preamble. This will do a SYS 2061 to jump to the first line of our assembly program, which starts at line 6. The kernal.s include just defines the names of the KERNAL routines.

Getting right to the meat and potatoes of what we're showing, line 6 has a jsr getargs; this is an example routine that takes 3 bytes of inline arguments, and will print the three bytes in hexadecimal notation. On lines 7, 8 and 9 you can see that we've supplied three inline bytes of data. These are the arguments, immediately following the JSR instruction. At line 11 we call jsr getargs again, to show that in this case the 3 bytes of inline arguments are in the form of a .text string. It doesn't matter how we inline the arguments; all that matters is that the routine expects there to be 3 bytes, and that execution is going to continue just after those 3 bytes.

The rts on line 14 is the end of the program. Following this we include our handy inttohex.s. It has a routine tohex that takes a value in the accumulator and returns it as two hexadecimal (0 to F) PETSCII characters in the .X and .Y registers. prnthex is a simple little routine that will print the output of the tohex routine. First it CHROUTs a "$" followed by the two hexadecimal characters and lastly a new line.
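The sample program itself is shown in the screenshots; reconstructed roughly, it looks something like this (the include file names and the argument values here are placeholders, and the line numbers mentioned above refer to the original screenshot):

        *= $0801
        .include "basic.s"      ;BASIC preamble: 10 SYS 2061 (file name assumed)
        .include "kernal.s"     ;names of the KERNAL routines, e.g. chrout

        jsr getargs             ;first call, followed by 3 inline .byte arguments
        .byte $05
        .byte $0a
        .byte $ff

        jsr getargs             ;second call, the 3 inline bytes are a .text string
        .text "abc"

        rts                     ;end of the program

        .include "inttohex.s"   ;tohex: .A -> two PETSCII hex digits in .X/.Y
                                ;prnthex: prints "$", the two digits and a newline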

And now for the getargs routine and how it works with inline arguments:

I like to comment my routines showing the ins and outs with arrows. In this case I've labeled my inline argument inputs as a1, a2 and a3. Following the right pointing arrow is usually a comment on what the argument holds. I've put in here .byte to indicate the length of the argument. In your calling routine, you could state the inline argument, for example, as a .word, then in the comments mention it's a .word, and you'll have to read two bytes to grab the whole thing.

Lines 5 to 16 do all the magic. The JSR instruction automatically pushes the return address onto the stack. That address is effectively just a 16-bit number. We have to add a number to that address that is equal to the total size of the inline arguments. In this case our args are 3 bytes. So we have to add 3 to that 16-bit return address. But adding 3 to the low byte could overflow that byte, so we have to do a full 16-bit add, using the carry. To begin the 16-bit add we start by clearing the carry, on line 5.

Next, we need to grab the current stack pointer. To do this we use the only instruction the 6502 has for getting the stack pointer, TSX, which transfers it to the .X register. To understand what happens next you have to know how the stack works. The stack consists of the entire first page of memory $0100 to $01FF. But things are pushed onto the stack starting from the top working down. The Stack Pointer is an offset into $01xx and always points to where the next pushed byte will go. Thus, when the stack is empty, the stack pointer is $FF. When the stack is full the stack pointer is $00. With each push, the stack pointer gets decremented, with each pull the stack pointer gets incremented.

After a JSR, whatever the stack pointer is, the return address is at the stack pointer +1, and the stack pointer +2. Here's the thing though, you'd think that once .X holds the stack pointer, you would need to manipulate .X to find the bytes you're after. Instead though, to access bytes relative to the stack pointer you can merely change the absolute start address from $0100 to something bigger or smaller.

Let's say the stack pointer is $F5. If you pushed something onto the stack it would go to $01F5, which means the last two things pushed onto the stack are actually stored at $01F6 and $01F7. Do a TSX; now .X is $F5. You don't need to INX and then read from $0100,X; you can simply read $0101,X (which is $0101+$F5, or $01F6). Similarly, to get $01F7 you don't need to INX, just start the absolute address at $0102,X (which is $0102+$F5).

Okay, now we know how we're referencing the address on the stack. Here's the next thing to know. When a JSR happens, the return address is pushed onto the stack high byte first, low byte second. So $0101,X references the low byte, $0102,X references the high byte. Now we're ready to see what happens.

Line 8 grabs the low byte from the stack. Line 9 stores it to a Zero Page address of our choice. It has to be stored somewhere in zero page to be able to read through it as a vector. Once we've stored that low byte, we can add 3 to it and write it straight back into the stack whence it came. Lines 10 and 11.

Adding 3 may have overflowed the accumulator, but if it did, the carry is now set. Line 13 grabs the high byte and stores it at $fd, the high byte of the zero page vector. Then we add 0, which includes the carry and completes the full 16-bit add. And we write the new high byte straight back to the stack whence it came. And we're done.

The stack now contains a return address that is 3 bigger than it used to be, just past the end of the inline arguments. And the zeropage addresses $FC and $FD contain a vector that points at the block of inline arguments.

The only thing left to do is read those arguments. Here's one more trick of the JSR return address, it is actually the address of the next instruction... minus 1. But the RTS instruction expects that and returns you to the correct place. So, adding three to the return address was certainly the right thing to do. However, the vector at $FC/$FD actually points to one byte before the inline arguments. Again, no problem, we don't have to waste cycles and memory adding 1 to the vector, we just access the three arguments with offsets 1, 2 and 3, instead of 0, 1, and 2.

Lines 18 to 20, 22 to 24 and 26 to 28 show that we set .Y to the argument offset, then do an LDA indirect indexed through the zero page vector to grab that argument. The jsr prnthex simply prints that argument in hexadecimal which is what this example routine is supposed to do. Note that you don't need to get the arguments in any particular order. You can read any of them whenever it makes sense to in the routine. You can ignore some of them, or even modify and write a new value back to that inline argument memory location. The world's your oyster. You can have up to 255 bytes of inline arguments without needing to modify any of the initial stack manipulation logic.
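The getargs routine itself is also shown only in the screenshots; here is a sketch reconstructed from the walkthrough above, using $FC/$FD as the zero page vector exactly as described. Treat it as an approximation, not a byte-for-byte copy:

getargs ;a1 -> .byte  first inline argument
        ;a2 -> .byte  second inline argument
        ;a3 -> .byte  third inline argument

        clc               ;begin the 16-bit add
        tsx               ;stack pointer -> .X

        lda $0101,x       ;low byte of the return address
        sta $fc           ;low byte of our zero page vector
        adc #3            ;skip over the 3 inline argument bytes
        sta $0101,x       ;and write it back to the stack

        lda $0102,x       ;high byte of the return address
        sta $fd           ;high byte of the zero page vector
        adc #0            ;add in the carry from the low byte
        sta $0102,x       ;and write it back to the stack

        ;the vector points one byte before the args, so use offsets 1, 2 and 3

        ldy #1
        lda ($fc),y       ;first argument
        jsr prnthex

        ldy #2
        lda ($fc),y       ;second argument
        jsr prnthex

        ldy #3
        lda ($fc),y       ;third argument
        jsr prnthex

        rts               ;returns to just past the inline arguments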

Summing Up

So you might be asking, well that's neat, but why bother with all that? You could just put a few bytes at the end of your routine to hold the arguments, then load up .X and .Y with a pointer to those bytes and call the routine. The routine then only needs to write .X and .Y to the zero page addresses of its choice and boom you're done. Like this:
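That alternative is also shown only as a screenshot; a sketch of the idea, reusing prnthex and the $FC/$FD vector from above (the argument values are placeholders):

        ;calling side: 4 bytes of setup, plus a label
        ldx #<endargs
        ldy #>endargs
        jsr getargs
        rts

endargs .byte $05,$0a,$ff ;the argument block, parked after the code

        ;called side: 4 bytes to set up the zero page vector
getargs stx $fc
        sty $fd

        ldy #0            ;offsets start at 0 this time
        lda ($fc),y
        jsr prnthex       ;first argument
        iny
        lda ($fc),y
        jsr prnthex       ;second argument
        iny
        lda ($fc),y
        jsr prnthex       ;third argument
        rts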

And yeah, that works too. But there are downsides. When you put the args below the routine:

  1. You have to use a label to find them.
  2. They are separated from the routine call they apply to.
  3. You have to add LDX/LDY code above the routine call.
  4. If there are many blocks of args for different subroutines, it gets messy fast.

And all the converse are advantages for inline arguments. They don't need a label, because their address is known implicitly from the address of the JSR's return address. They sit together in your code with the JSR they apply to. If you have multiple JSRs each with their own arg blocks it doesn't get progressively messier. And you don't need any extra code on the calling side to set it up.

There is just one downside, in my opinion. The called routine has to have a fair chunk of code to manipulate the stack and set up the vector. Instead of 4 bytes of setup in the called routine for end arguments, you need 22 bytes for inline arguments. And if you have many routines that all use inline args, those 22 bytes start adding up. Read on for one last solution to that problem!

A Slightly More Advanced Trick

For end arguments, you need 4 bytes in the called routine to setup the vector. And you need 4 bytes in the calling code to setup the .X and .Y pointer to the arguments (plus a label). So you actually need 8 bytes to "pass" the arguments. That's 8 bytes on top of the byte count of the arguments themselves.

With inline arguments, you need 0 bytes in the calling code, but 22 bytes in the called routine. But, if you're going to use this trick for numerous routines with inline arguments, you can move those 22 bytes into a shared argument–pointer–configuration routine, like this:

Screenshots of sample code with inline arguments

Now we've got a new routine called setargptr. It will handle the work of manipulating the stack and setting up the zero page vector. There are some gotchas. First, if we're going to use this routine for multiple routines that accept inline arguments, we need a way to customize how many inline arguments to expect. It can't just be hardcoded at 3.

We pass to this routine, in the accumulator, the number of inline arguments there shall be. The first thing the routine does is write the inline argument count (self-modifying code here) into the operand of a new labeled instruction, at argc+1. This overwrites the value to add to the low byte of the return address. The rest of the 16-bit add works as it did before.

The second gotcha is that the real routine, getargs in this case, has to call setargptr. But this call pushes its own return address onto the stack. So the place on the stack where our arguments are is 2 bytes further back. That's easy to deal with though, the new stack absolute offsets are simply $0103 and $0104, instead of $0101 and $0102. That's it.

The only other gotcha is that the setargptr routine now bears the responsibility for which Zero Page addresses to use for the vector. And thus, every routine that uses inline arguments and uses setargptr to manipulate the stack, must all use the same zero page vector. But, depending on your situation, that might be just fine.

In the actual getargs routine now, instead of all that messy code to manipulate the stack, we just load the accumulator with the expected number of inline args and JSR to setargptr. Boom, done. Now it's 5 bytes per routine that uses inline arguments, instead of 8 for end arguments. We actually save 3 bytes per routine. But the setargptr routine is now 26 bytes. End arguments require 4 bytes per routine, plus 4 bytes per call. So if we write just a few routines with inline arguments, and call those routines a few times, we quickly end up using less total memory than if we used end arguments. But, to be honest, saving some memory is icing on the cake. Inline arguments are just cool.
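Here is a sketch of that arrangement as described above; the zero page vector is still $FC/$FD, and the label names are assumptions:

        ;shared by every routine that takes inline arguments
setargptr
        sta argc+1        ;self-modifying: store the arg count as the adc operand below
        clc
        tsx               ;there are now two return addresses on the stack, so the
                          ;caller's is at $0103/$0104 instead of $0101/$0102

        lda $0103,x       ;low byte of the caller's return address
        sta $fc
argc    adc #0            ;the operand here was just overwritten with the arg count
        sta $0103,x

        lda $0104,x       ;high byte
        sta $fd
        adc #0            ;add in the carry
        sta $0104,x
        rts               ;back to the routine that called us

        ;a routine that takes inline arguments now starts like this:
getargs lda #3            ;it expects 3 bytes of inline arguments
        jsr setargptr
        ;...then reads them through ($fc),y with offsets 1, 2 and 3, as before...
        rts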

Thanks for the tip GEOS! GEOS has got a few other tricks I'll probably end up exploring in future posts. Stay tuned.

  1. At some point in the future, I'd like to post my own version of the C64 KERNAL documentation as a Programming Reference. But, it's a lot of work. For now, you can read about the KERNAL in the C64 Programmer's Reference Guide. Chapter 5: Basic to Machine Language (PDF) []
  2. The CMOS version of the 6502, the 65c02, includes extra instructions for pushing and pulling the .X and .Y registers directly to and from the stack, and other handy things. Unfortunately, there is no such 65c10 processor. []
October 23, 2017Technical Deep Dive

How the C64 Keyboard Works

I'm almost ashamed to admit it, but just 1 year ago when I started working on C64 OS, I didn't have the first clue how the C64 keyboard worked. I knew there was a thing called the keyboard buffer, but I had cloudy thoughts of it being inside the keyboard itself. When I had to actually write the keyboard scanning routine for C64 OS and started looking at the 1982 full-spread, double-sided C64 schematics, I could see that the keyboard was wired up to the CIAs, but my first assumption was that when a key was pressed an IRQ would be generated. These ideas, in retrospect, seem embarrassingly ignorant of such a basic feature of the Commodore 64.

But how are we supposed to know until we learn, right? I've written on lots of technical topics, but this post will be a deep explanation of how the C64 reads data in from the keyboard.

The CIAs (MOS 6526)

The C64 has two Complex Interface Adapter chips. Better known as the CIAs, officially the MOS 6526. You can read all about this chip on Wikipedia. But I want to explain some of my most interesting discoveries.

MOS Technology, which was acquired by Commodore and rebranded the Commodore Semiconductor Group, made a whole series of chips which I like to think of as the 65xx family. If you look up MOS 65xx on Wikipedia it'll tell you that this is a series of 8-bit microprocessors. And it's true, the 6501, 6502, 6503, 6504, 6505, 6507, 6508 and 6510, (at least) are all microprocessors. But the 65xx numbered series includes a whole family of chips that are designed to work together on a logicboard, together with some ram, rom, and other glue chips, to implement a complete microcomputer.

The C64 has a 6510 (CPU), 6567 (VIC-II, Video Interface Chip), 6581 (SID, Sound Interface Device), plus two 6526 (CIA, Complex Interface Adapter) chips. Then it has three ROM chips (the shortboard reduced this to two), the KERNAL, BASIC and CHARACTER ROMs, a bunch of RAM chips totalling 64K, a handful of simple glue chips in the 74xx and 74xxx series, a 556, a pair of 4066's, and most importantly the PLA (Programmable Logic Array). Together, they work beautifully to provide all that the computer is able to do.

pinout of the MOS 6526 CIA chip

So, what does a 6526 provide? Lots of handy things. It's got 16 registers for configuration, it's got a Time–of–Day clock with programmable alarms, it's got two 16-bit programmable timers, it's got a serial I/O port, and, most importantly for today's discussion, it has two 8-bit bi-directional ports, which we'll talk about in a minute.

What is a port, anyway?

If you'd asked me what is a port, just a year ago, I'd have said that it's a connector on the back or side of a computer, to which you can connect a peripheral device. That's half right, but it misses a couple of critical points.

A port is a line, a wire, that the CPU can control. If it's an output port that means instructions executed by the CPU can force that wire either to output +5V or GND, a logical 1 or 0 respectively. An input port means instructions executed by the CPU can read the voltage level of that line, and if it's near +5V the CPU sees this as logical 1, or near GND which the CPU reads as logical 0. There is virtually always some controlling device that allows the CPU to manage the state of that port line. And in the case of the C64 that controlling device is the 6526 CIA chip itself.

The next important thing to understand is that in a digital computer, one line (1 wire) is one bit. The wire can be either high or low at any given time. So when we say that the CIA chip has two 8-bit bi-directional ports, this no longer has to sound like gobbledygook. An 8-bit port is a port with 8 parallel, simultaneously settable or readable lines. Bi-directional means that each of those lines can be configured either as an input or an output. And in the case of the 6526 each line can be configured for input or output individually; at the same time some of them can be inputs and others outputs. A serial port, by the way, which the 6526 has, may as well be called a 1-bit port.

If the 6526 has two 8-bit ports, and each bit requires a line, there should be 16 lines dedicated to those ports. And indeed, when you look at the pinout diagram of the CIA above, reading down the left hand side you see PA0 through PA7, followed by PB0 through PB7. PA and PB stand for Port A and Port B, and each has 8 lines numbered 0 through 7. Amazing. That means the computer, software instructions running on the CPU, can cause those 16 chip legs to be +5V or GND, or can read the +5V or GND status of those legs.

One more step to go, the legs of the chips are connected to traces on the logicboard which run directly to a physical connector which is exposed through the chassis of the computer, and boom, it's called a port. Those physical things we've colloquially called ports our whole lives are literal technical ports, wires the CPU can independently manipulate.

How does the CPU interact with the port?

The question of course is, how does the CPU actually interact with one of those 8-bit ports? The 6526 has to be addressable, and it needs to get data from the CPU. And as we'll recall it must have one line per bit. The C64 (the 6510 CPU) has a 16-bit address bus and an 8-bit data bus, so the CIA must have 16 address legs and 8 data legs, right? Almost.

If we go back to our pinout diagram and read down the right side we see 4 legs labeled A0 through A3, and 8 labeled D0 through D7, in amongst a bunch of other lines we don't need to worry about at the moment. The Ax legs are address lines and the Dx legs are data lines. There are 8 data lines, that's good, those are obviously connected directly to the 8 data bus lines from the CPU. But why are there only 4 address lines?

As mentioned above, the 6526 has 16 configuration registers. What that actually means is that there are 16 memory addresses where the CPU can read or write to interact with the CIA. 4 bits can address from 0 to 15, and we need one line per bit, so bingo the 16 registers need only 4 address lines. But in a real C64 there are two CIAs addressed from $DC00 to $DC0F for CIA 1, and from $DD00 to $DD0F for CIA 2. Somehow, only when the upper 8 bits of the 16-bit address bus are $DC (1101 1100) or $DD (1101 1101) should CIA 1 or 2 be active respectively.

In steps the glue logic. A combination of the PLA chip (Programmable Logic Array), which is a highly custom chip created just for the C64, and a 74LS139 (an off-the-shelf decoder) are used to monitor the upper 8 bits of the address bus and selectively turn on and off a variety of chips that are themselves connected only to the lower bits of the address bus. The 6526 is enabled or disabled via pin 23, labeled in the diagram above "/CE", for chip enable.

The fact that the chip only has 4 address lines plus a chip enable pin, rather than a full 16 address lines, means that custom glue logic can be used to map the CIA's small addressable range somewhere into a much larger address space. And, as is the case in the C64, multiple CIA chips can be mapped into different places in the main 16-bit address space.

The net result is that when the CPU sets $DC00 (1101 1100 0000 0000) onto the address bus, the PLA and 74LS139 together enable CIA 1, and disable every other chip on the bus. The CIA 1 chip, being enabled, sees 0000 on its 4 address lines, and the 8 data lines interact directly with whatever the 6526's register 0 actually does. We'll get to a description of the 6526's registers in a moment.

What are the CIAs hooked up to?

Now that we know what a port is, that the CIA offers two 8-bit ports, that the C64 has two CIAs, and how the C64's glue logic is set up to allow the CPU to address them, the next question is: what are those CIAs hooked up to?

Part of the C64 schematic, showing how the CIAs are hooked up

Above is a section of the schematics of the C64 logicboard. I've intentionally removed a number of extraneous bits around the edges to try to bring focus just on how the CIAs are wired up. In this diagram there are two main ICs shown, U1 and U2, both are labelled 6526 CIA. Their addresses are written in parentheses (DC00 - DCFF) and (DD00-DDFF)1. I added the labels in blue, #1 and #2, as they are usually referred to in documentation.

Along the left side we have four blocks. From top to bottom: Control Port 2, Control Port 1, Keyboard, and User Port. The two control ports are of course the joystick ports we all know and love on the right side of our C64. The user port is that wonderful geek port on the back of the computer at the far left. The keyboard port is a block of pins in the middle of the logic board, which the keyboard is connected to. It is as much a port as any other, but one that is only internally accessible as a result of the design of the chassis.

Let's start with the user port. You can see that pins on CIA 2 labeled PB0 through PB7, as well as PA2, run directly to pins on the user port's edge connector. Nothing could be more straightforward than that. When you plug something into the user port, you are literally connecting something directly to the Port B legs of CIA 2. Now we know why static electric discharge is so dangerous for the computer: when you touch that user port connector, you may as well be rubbing your fingers across the CIA's legs. There are no over-voltage or surge protections of any kind. But at least the connection is very easy to understand.

Here's a thought. When nothing is plugged into the user port, the Port B legs of CIA 2 are very evidently not connected to anything at all. They are neither connected to a +5V source nor to GND. We would say that the legs are in the third state of three-state logic. That is, they're hooked up to effectively infinite resistance. In a sense they are actually connected to ground, but they are connected via a massive bridge of open air, which is highly non-conductive and thus extremely high in resistance. The question is, what would the CPU read if it tried to read the values off those legs when they're hooked to nothing? The answer cannot be known by looking at the schematics alone. However, in the 6526's documentation, they are said to internally pull up. That's electronics terminology meaning that when the legs are connected to nothing, the computer will read them as logically high, 1s rather than 0s. This is important.

How is the keyboard hooked up?

Now let's look at how the keyboard is hooked up. You can see that all 8 bits of both Port A and Port B run straight across to the keyboard connector. Port A's bits are correspondingly labeled COL0 through COL7, and Port B's bits are ROW0 to ROW7. Some of those lines branch off and up to the control ports, but we can ignore those for the moment.

The keyboard connector has just 3 additional lines, plus a spacer (KEY) to make sure you orient the cable correctly. These are +5V, GND, and a special line labeled RESTORE, which we'll get to. I took apart a C64 keyboard, shown below, so we can see what its insides look like. It was already dead, so don't worry I didn't sacrifice it in the name of science.

Physical keyboard matrix

In the image above, I've faux-silk-screened the keycaps onto the circuit board so it's easy to see how the keys line up. Notice that the keyboard connector in the schematics labels the pins 1 through 20. On the keyboard's circuit board you can see the solder points where the wires connect, they aren't neatly in a row, but they are labeled, 0,1,2,3,4,5,6,7,8 and A,B,C,D,E,F,G,H,I. But 9 numbers and 9 letters make 18 not 20. One of those 20 is the orientation key, so it's not connected to anything, and interestingly, as we'll see, the +5V line is not needed for the keyboard, so it's not connected to anything.

Looking at the keyboard's PCB it is pretty clear that it has no electronic intelligence of any kind. It is merely a collection of switches. Each key connects two pads, and each pad is connected along a long snaking trace that joins several pads together and eventually leads to one of the 18 wires of the keyboard cable.

If we follow the traces, or better yet, use a continuity tester, we discover that the RESTORE key joins two pads which are alone on their own traces leading back to the keyboard connector. The image above doesn't show the full PCB, the function keys are missing, but you can see that from the RESTORE key's two pads the traces lead off the right side on their own. One pad is connected to the GND pin, the other to the /RESTORE pin in the schematics. Note the slash before that label, on the schematics it appears as a long bar above the word RESTORE. This means that the restore behavior is triggered by pulling the line low, or hooking it to ground. And indeed, when you press the restore key the key switch simply joins the GND line to the /RESTORE line.

Nothing else on the keyboard is connected to the /RESTORE or GND pins, the +5V pin connects to nothing, and pin 2 is just an orientation spacer, so that leaves us with 16 lines. The lines labeled COL0 through COL7 and ROW0 through ROW7, which connect to Port A and Port B of CIA 1. COL and ROW are for columns and rows, because the rest of the keys are arranged into an 8 by 8 matrix. Leaving out the RESTORE key, with its dedicated lines, if you count up all the other keys on the keyboard (and don't forget the four function keys) you get 65. Ah, but there is one other little thing to notice. The Shift Lock key is wired to exactly the same two lines as the Left Shift key. Therefore, the Shift Lock key is just a mechanically latching switch that to the computer is indistinguishable from the ordinary Left Shift key. Exclude this key, and we're left with 64 keys. And 8 times 8 is 64.

Each ROW line snakes around the board connecting to one half of the contact pads of 8 different keys. And each COL line snakes around connecting to the other half of the contact pads of 8 keys. Such that each key is a switch that connects one ROW line to one COL line. It can be a bit tricky to trace these all out visually, but a continuity tester comes in really handy. The result is that the keys are arranged in the following grid:

Commodore 64 keyboard matrix layout

Each key (other than RESTORE) sits at the intersection of one COL line (Port A) and one ROW line (Port B). Rows and columns use the same bit scheme: Bit 0 = $01,$FE, Bit 1 = $02,$FD, Bit 2 = $04,$FB, Bit 3 = $08,$F7, Bit 4 = $10,$EF, Bit 5 = $20,$DF, Bit 6 = $40,$BF, Bit 7 = $80,$7F (the bit's mask, and the value with only that bit low). Within each row below, the keys are listed in column order, COL0 through COL7.

ROW0 ($01,$FE): Insert/Delete, Return, Cursor Left/Right, F7, F1, F3, F5, Cursor Up/Down
ROW1 ($02,$FD): 3, W, A, 4, Z, S, E, Left Shift (and Shift Lock)
ROW2 ($04,$FB): 5, R, D, 6, C, F, T, X
ROW3 ($08,$F7): 7, Y, G, 8, B, H, U, V
ROW4 ($10,$EF): 9, I, J, 0, M, K, O, N
ROW5 ($20,$DF): + (plus), P, L, - (minus), . (period), : (colon), @ (at), , (comma)
ROW6 ($40,$BF): £ (pound), * (asterisk), ; (semicolon), Clear/Home, Right Shift, = (equal), ↑ (up arrow), / (slash)
ROW7 ($80,$7F): 1, ← (left arrow), Control, 2, Space, Commodore, Q, Run/Stop
Original Source: http://sta.c64.org/cbm64kbdlay.html. HTML reformatted, re-styled and one minor error corrected.

We can easily spot check some of these. It is easy to see on the keyboard circuit board that W and E share a trace on the bottom half of their pads. It is also easy to spot that R and T share a trace on one half. And sure enough when we look in the table above, we can see that W and E appear in the same row. And R and T appear together in another row. Similarly, it is easy to see that Left Shift, X, V and N all share a trace. When we look in the table, sure enough, they all share a column.

Keyboard PCB Wire Mapping

I used a continuity tester to map the Letter/Number pairs of each key, which you can see in the image above. You can see several distinct patterns in keys that are physically close to each other. This also lets us map the Keyboard PCB letter/number scheme to the C64 logicboard's 20-pin keyboard connector. If we just look at PCB trace "A", we see that Space, C=, Run/Stop, and Control all share it. Looking at the key matrix table we see that this is ROW 7. "B" is shared by Left Shift, Z, A, and S. Which, again from the key matrix table, we can see this is ROW 1. If we look at one of the Keyboard PCB's numbered traces, say "3", it is shared by Left Shift, X, V and N. In the key matrix table these are in COL7. Repeating this process, we can construct the following table:

Keyboard PCB to Keyboard Connector Map

KB PCB Label   KB Connector Pin   C64 Connection
A              9                  ROW7
B              11                 ROW1
C              10                 ROW2
D              5                  ROW3
E              8                  ROW4
F              7                  ROW5
G              6                  ROW6
H              12                 ROW0
I              1 (or 3)           GND (or NMI)
0              13                 COL0
1              19                 COL1
2              18                 COL2
3              20                 COL7
4              16                 COL4
5              15                 COL5
6              14                 COL6
7              17                 COL3
8              3 (or 1)           NMI (or GND)

I am not really sure why these numbering schemes seem so illogical. If I had to guess, it would be a result of the physical limitations of where the traces on the logicboard have to go so that they don't cross over each other. I'm also not sure why the letters and numbers on the PCB don't align better with the ROW and COL numbers. For example, why is A = ROW7 and H = ROW0? Or why is 3 = COL7 and 7 = COL3? Who knows. But if I had to guess about this, I notice that there are extra traces connecting some of the rows and columns on the keyboard's PCB, but they have had holes drilled through them in strategic places to sever them. This could have been a cost-saving technique for Commodore, allowing the same PCB to be produced for different keyboard layouts, where the only thing they needed to do in the manufacturing process was drill some precision holes. But I haven't tried (and probably won't bother trying) to figure out exactly which rows and columns would get swapped if some different configuration of holes were present.

Also, I can't clearly tell between "I" and "8" which goes to GND and which to NMI, because on my keyboard the actual connector was long ago cut away. And because there is only one key, RESTORE, that joins these two together, it actually doesn't make any difference how GND and NMI are assigned to these two traces. This may all seem to be trivial information, but if you ever want to re-wire a C64 keyboard PCB (which I may someday do with the one I've got), the above table will come in handy.

How the keyboard's PCB traces are labeled was a bit of a tangent. If we return now to what is really happening: when you press a key, say the H key, a wire coming off pin PA5 on CIA 1 goes into the keyboard, goes through the closed H-key switch, and comes back out of the keyboard cable into pin PB3 on CIA 1. Every key on the keyboard (except RESTORE) merely electrically joins one of the CIA's Port A bits to one of the CIA's Port B bits. But how does the computer know which keys are being pressed? For that, we need to turn to the software scanning routine.

How is the keyboard matrix scanned?

When describing how the CIAs are hooked up, I mentioned something that was an important behaviour. When one of the Port A or B pins is connected to nothing, it is internally pulled up such that the computer reads the value as a logical 1. When no keys are held down on the keyboard, you may have noticed that the CIA Port pins lead to the keyboard connector, and then into the keyboard, and then through traces across the keyboard's PCB but then they eventually just come to an end connecting to nothing. Therefore, no matter which CIA Port you choose to read from, as long as no keys are held down,2 those port pins are floating and they will read as 1's.

The only thing we can possibly do to change one of the port pins so that it reads as a 0 instead is to connect that pin to ground. But the only thing the pin can possibly connect to, by pressing keys and closing the circuits, is to pins on the other CIA Port. And so we are led back to the 6526 CIAs, their behaviors and how they can be configured via their registers.

Each CIA has 4 address lines, for a binary combination of 16 addressable registers. In the documentation they are referred to with their in–chip address, 0 through 15. Since we're talking about CIA #1 in the C64, and it's mapped to $DC00 to $DCFF, we use $DC00 through $DC0F to access those registers. See this article for full technical documentation of the CIA 6526.

Registers 0 and 1 are the read/write registers that correspond to Port A and Port B. However, the direction of the bits in those registers can be configured independently as either inputs or outputs. The directions of the ports' bits are configured with registers 2 and 3 respectively. A 0 sets the corresponding bit as an input, for the CPU to read the status of something outside the computer. And a 1 sets the bit as an output, for the CPU to send data or control something outside the computer. So, to set all the bits of Port A for input, you write $00 (%00000000) to $DC02.

    LDX #%00000000
    STX $DC02

Since the keyboard joins the bits of Port A to the bits of Port B, and we need to pull one of those bits low, the trick is to set one of the ports as all inputs and the other as all outputs. Then we set the value of the output port as all low. This makes all of those pins a source of GND. Press a key, and the input pin is connected electrically, through the key, to a source of GND and it changes from its internally pulled–up 1 to its ground connected state, 0. When I first realized that's how it works, electrically, it was a very satisfying discovery.

    LDX #%00000000 ;Inputs
    LDY #%11111111 ;Outputs
    LDA #%00000000 ;Low/GND Outputs

    STX $DC02 ;Port A direction config
    STY $DC03 ;Port B direction config
    STA $DC01 ;Set Port B's outputs low

The problem is the following. If you press A, the low PB1 will pull PA2 low, so we know something in COL2 was pressed. However, if you instead press D, then the low PB2 will pull PA2 low, and you would not be able to distinguish between whether it was A or D, or any other key in COL2, that is forcing PA2 low.

In order to distinguish the rows, Port A has to be read 8 separate times. On the first loop Port B has its outputs set such that bit 0 is low, but all the other bits are high. During that read, if a key in rows 1 through 7 (and column 2) is held down, those keys are just connecting PA2 to a high, so they have no effect on PA2; it's already internally pulled up. Only if the key in row 0, col 2 (Cursor Left/Right) is held down will PA2 get pulled low. After reading Port A and storing it, we loop, reset Port B so all the bits are high except bit 1, then read Port A again to see which keys in row 1 are down, store that, and continue until we've read 8 bytes, which are a "bitmap" of the up/down state of all 64 keys. Any set bit in the map means the corresponding key is up; any unset bit in the map is a key that is down.

And this is, more or less, what the keyboard scanning routine in the KERNAL rom does, 60 times per second. Although, it does a bunch of other stuff I'm not going to talk about today.
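As a minimal sketch of such a scan loop (this is not the KERNAL's actual routine; it assumes Port A and Port B have been configured as in the snippet above, and KEYMAP is 8 bytes of our own memory):

    LDA #%11111110   ;select ROW0: bit 0 low, all other rows high
    LDX #$00         ;row index, 0 through 7

SCAN
    STA $DC01        ;drive the selected row low on Port B
    PHA              ;remember the row-select value
    LDA $DC00        ;read Port A: any 0 bit is a key held down in this row
    STA KEYMAP,X     ;store this row's column bits in the map
    PLA
    SEC
    ROL              ;walk the low bit up: %11111110 -> %11111101 -> %11111011 ...
    INX
    CPX #$08
    BNE SCAN
    RTS

KEYMAP
    .byte 0,0,0,0,0,0,0,0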

Concluding Thoughts

There are interesting similarities and interesting differences between how a C64 and how a PC/Mac keyboard work. PS/2, ADB and USB keyboards, even today, still use a key matrix of rows and columns that need to be scanned in order to build a table of which keys are up and which are down. However, each of these above three connection types is a serial port. In PC and Mac keyboards, the matrix scanning logic is all implemented in smart electronics inside the keyboard. The keyboard itself then maintains a memory buffer of the sequence of key codes representing up and down keys.

On a keyboard with 101 keys, assigning one code for up and another code for down for each key, that's only 202 values, well within the 0-255 range of a single byte. The keyboard's electronics then also need to implement the serial protocol, and send to the computer, a byte at a time, the codes representing keys going down and keys going up. It's a bit more complicated than that, and varies from serial protocol to serial protocol, but that is the essence of it.

The C64's keyboard, by comparison, is absolutely dumb. It has no electronics at all. Just switches in an 8x8 matrix. The actual matrix switches themselves are fed down the keyboard cable and into the computer. And the C64's own CPU needs to spend precious time scanning the matrix. It is truly unfortunate that on a machine with only 1Mhz to spare, it needs to use some non–trivial number of those cycles, because whatever it takes to fully scan the matrix once, has to be repeated 60 times a second, just to input data from the keyboard.

But on PCs with thousands of times the number of cycles available, it doesn't need to spend any of those cycles worrying about scanning the keyboard. If the state of the keyboard isn't changing, the internal logic of the keyboard is figuring that out, and it doesn't have to send any data to the computer via its serial connection. And there you have it.

Feel free to leave questions, comments and corrections in the comments!

  1. The careful observer may notice that on the schematic the address range is DC00 to DCFF, instead of DC00 to DC0F. The PLA and 74LS139 map in the CIAs whenever the upper 8 bits are $DC (or $DD), even though the CIAs are only connected to the lowest 4 bits of the address bus.

    When a CIA is mapped in, bits 4 to 7 are completely ignored. The effect of this is that the CIAs appear to be mirrored 16 times across DCx0 to DCxF. It is not recommended to address them above DC0x (or DD0x) because future Commodore models could include additional glue logic for mapping other components into those higher address ranges. And, in fact, I believe the C128 does this very thing for accessing its extended keyboard. []
  2. Also assuming none of the joystick/control port lines are connected to anything, but I'm leaving aside the control ports in this post to focus on the keyboard. []
October 18, 2017Programming Theory

A C64 OS App's Start Of Life

I'm pretty close to closing the circuit on being able to launch and run an app in C64 OS. It might feel like it's taken a long time to get to this point, it's been almost a year. But when I think back on my progress, I'm not disappointed. I've had to learn a lot, and I feel like the whole process has really been pulling myself up by my bootstraps. Not only have I learned, realistically, to write code in 6502 ASM, but I've learned a lot about all the various parts of a C64 along the way.

I have my C64 Programmer's Reference Guide, the large, double-sided, fold-out C64 schematics from 1982, and the Advanced Machine Language for the C64 book from Abacus. I've also read extensively from the 1541-II user guide and the CMD HD manual, and the Complete Commodore Inner Space Anthology has been super useful. And of course, poring over the disassembled and fully commented BASIC and KERNAL ROMs and understanding how those routines work has been very important. Learning about software and hardware, and electronics theory at the same time, has been unexpectedly rewarding for me intellectually. I feel as though I've learned more about technology in the past year of indulging in my C64 retro computing hobby than I have in my full-time professional software developer career over the past 5 or 6 years.

And let's not forget to add, whenever my fingers get tired of typing, or my butt starts to ache from sitting in my chair for too many hours in the day, I've thoroughly enjoyed the countless hours of work and planning I've put into my C64 Luggable project. The work I've sunk into coding and designing C64 OS is nearing a turning point where I can actually launch and run applications. This seems like a good time to take a look at what it's like for the system to start up and launch the default home application, in this case, App Launcher. I've already written at least one post on my ideas for launching applications, so if you're curious you might want to read that too.


The Booter

As I've gone over in previous posts, the C64 OS KERNAL (for lack of a better word) is divided into approximately 10 modules. These are implemented as one assembly source code file per module, plus a header file (*.h) that declares the exported routines, and an includes file (*.s) that declares constants and macros. The booter is not part of the C64 OS kernal, and it isn't launched the way a proper application for C64 OS launches.

The booter launches more or less the way a standard C64 assembly program might launch. It's assembled to $0801 and starts off with the basic preamble that does a SYS 2061 to kick it into action. The booter must be in the system folder, AND the drive whence it is loaded must have its current partition and current path defaulted to the partition and path of the system folder. For example, you cannot load the booter with a command like this:

	LOAD "2//SYSTEM:BOOTER",8

If you have to specify the partition and path such as in the above example, because the current default partition and path are somewhere else, the booter will fail. I tested this out on the Wheels "starter", just to see if he did something more clever than me, and the Wheels starter has a similar limitation. It completely craps out if the default partition isn't where the starter is located. Although, I believe the starter must also be in the root directory of its partition. This is not the case with C64 OS.

Let's get this out of the way: C64 OS requires the system to be on a storage device that supports subdirectories. This could be an IDE64, or a CMD HD/RL/FD, or any of the wonderful SD2IEC variants out there. Please see The Commodore 8-Bit Buyer's Guide for what purchase options you have. Other files, documents, images, whatever, can be stored on any other media, like 1541/71/81 disks or their emulation partitions on CMD devices, or in .DXX images on SD2IEC devices, etc.

When the booter runs, it configures the system file reference.1 It uses $BA (186) to grab the current device number, then it uses a C-P command to read the current partition number from the boot device, and writes these into the system file reference. Next, it opens the sequential file, "config.t" and starts reading it in. It finds config.t in the current partition and path, and if it doesn't find it there, it craps out. The very first line of config.t (subject to change, of course), is the path within the current partition of the system folder. This path entry is essentially just a reflection of the very path where this file is stored. So, isn't there a way to just get this path (present working directory) dynamically? Actually, reliably getting present working directory is a severe pain in the butt, given the way the various file systems on Commodore compatible drives work. If you don't believe me, feel free to read this lengthy thread about the topic. I actually contribute to that thread, in which I mention exactly what I'm talking about here. C64 OS only needs to know where its system folder is, which it gets partially dynamically and partially by the config.t's path entry. After that, all file references are handled in a standard way internal to the OS. And the concept of a PWD simply becomes a file reference held and manipulated by an application.

Before doing any config, the booter loads all of the C64 OS KERNAL modules into memory. Then it reads the config file and builds the system file reference, and a few more config variables such as default system colors. It then runs a drive detection routine and sets up a table of addresses to device types. The booter then JMP's to "runhome" via the system jump table. This JMP, of course, never returns, and nothing from the booter is ever executed again. Any memory it occupied is already marked as available by the KERNAL's memory manager.

Screenshot of code from the booter
Sometimes I give you the source code beautifully formatted in a GitHub Gist. But today you're getting an authentic view of what it looks like to see this code the way I see it, sitting in front of my flat c128.

System Services

One of the C64 OS KERNAL modules is service. It is a bit meta, but provides a few critically important, well, services. It handles environment variables. It holds onto the aforementioned system file reference. It tracks which home application it should load when an application quits. And it has the runhome routine that is actually responsible for finding and launching the home application. Service is also the module that hosts the main IRQ service routine.

When first called, runhome checks to see if the home app's file reference has been initialized, and if not it allocates memory for it, and constructs it relatively from data in the system file reference and environment data about the current home app. That is very nearly all it does. It then loads the pointer to the home app's file reference into the .X and .Y registers and JMPs to loadapp via the system jump table. After this point, the home application is treated exactly the same as any other app. The only thing special about it is that there is a system service routine that always knows how to get back to the home app.

Screenshot of code from runhome in services 1 Screenshot of code from runhome in services 2 Screenshot of code from runhome in services 3

File Services

Another C64 OS KERNAL module is file. Ideally, it handles all file system accesses. It is actually very lightweight, however, because the C64 OS KERNAL does not replace the C64's KERNAL rom, it just augments it. If you have JiffyDOS then all the actual serial routines are handled by the JiffyDOS rom, including all the KERNAL rom's standard vectors that allow the IDE64 and RamLink to work.

So you might think, well, what does the file module actually do then? Mainly it knows how to work with C64 OS file reference structs, which are constructed and manipulated by other parts of C64 OS and its applications and utilities. These abstract the much more primitive KERNAL system of logical file numbers, and the KERNAL routines SETLFS, SETNAM etc. A C64 OS file reference contains a device address which is used to look up the device type ID from the table built by the drive detection routine that was run by the booter. From that ID file knows if the device supports partitions and subdirectories. The file reference also contains a 16-character filename, a partition number and path. As well as a dynamically managed Logical File Number.

A quick aside; the path can be up to about 230 characters long, which, if you have 16-character directory names plus a one-character delimiter (/), supports nested folders a minimum of 13 levels deep. These can go deeper if the directory names use fewer than 16 characters. 1-character folder names, plus the one-character delimiter, allow a maximum possible folder depth of 113. Or an average of about 25 nested folders.2

The file module manages Logical File Numbers (LFN), the ones used by the KERNAL rom, for you. You create a file reference, and open it for read or write. The file module automatically handles getting you a Logical File Number that's not in use, changing the device's default path and partition, and opening the file. Regardless of how you've navigated around the directory structure of the device. When you close a file, it doesn't destroy the file reference, it merely releases the LFN back to the pool, and sets the LFN on the file reference to indicate it's not open. Saving over a previously opened file in a text editor becomes effortless. The reference doesn't just remember the filename, but also where it came from, and will write it back to the correct place.

Screenshot of code from loadapp in file 1 Screenshot of code from loadapp in file 2 Screenshot of code from loadapp in file 3

The first thing loadapp does is initialize the Toolkit. Toolkit has two tables, one for allocating its built-in classes, one for initializing those classes. If an application wants to subclass and extend those classes, it can do so by copying the allocation and initialization tables, which it then extends, and changing the vectors that Toolkit uses to find those tables. Reinitializing Toolkit resets the vectors back to the built-in alloc and init tables. This is necessary to recover from any subclassing efforts of the previous application.

Next it calls prepdisk, which changes the default partition and path of the device so that subsequent disk accesses are relative to the application's bundle. Then it reads the "menu.m" file which every application must have in its bundle. In order to read menu.m, it uses the blksize routine to get the number of blocks the file takes up on disk. Then it allocates the same number of blocks in memory, and reads the file into that allocated space. In C64 OS, atomic reads, that is, file reads that will be opened, read, and closed in short sequence, can use the reserved "templfn" for a temporary logical file number, rather than allocating and releasing one from the pool.

Screenshot of code from loadapp in file 4 Screenshot of code from loadapp in file 5

With the menu data in memory, loadapp JMPs to mnuinit in the menu module. However, I'm not going to go down that rabbit hole in this article. Suffice to say it recursively builds a tree of menu nodes which the mnudraw routine knows how to draw.

The memory manager allocates from the top of memory down, remember, so allocations for the menu data and file references and so on are allocated out of memory starting up near where the C64 OS KERNAL is. This is important, because next loadapp loads "main.o", the application's main executable file, out of the app's bundle. The application's main.o must be assembled to $0800. The KERNAL rom's load routine returns the address up to which the load ended in .X and .Y, low byte, high byte respectively. The high byte is the page number and C64 OS uses a paged memory manager. So .Y is left holding the end page number, and .X is loaded with $08, the starting page number, before doing a JSR to pgmark in the memory module. This tells the memory manager the range of pages to flag as allocated.
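As a tiny sketch of just that page-marking step (loadmain here is a hypothetical stand-in for the KERNAL load of main.o; the real loadapp code differs around it):

        jsr loadmain      ;hypothetical: load main.o via the KERNAL rom's load routine
                          ;on return, .X/.Y hold the end address, low byte/high byte,
                          ;so .Y is already the last page that main.o occupies
        ldx #$08          ;main.o is always assembled to $0800, so its first page is $08
        jsr pgmark        ;memory module: flag pages .X through .Y as allocated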

Next loadapp initializes the mouse, by calling initmouse in the input module. The previous app has the ability to disable the mouse, this will re–enable it. This will also load it from disk, if it hasn't already been loaded. The mouse cursor's sprite data is held in its own file in the system folder and initmouse uses the system file reference to know where to find that.

In general C64 OS uses a mouse–based UI. It provides a rich mouse–based event system that is designed hand–in–hand with the Toolkit, and it also provides a hierarchical pull down menu system. Together these make building an application with a big feature set or a complex UI reasonably simple. However, there are some kinds of apps that might wish to opt out of some of these provisions. C64 OS reserves the top row of characters for itself. There is a CPU busy animation that spins in the top left corner if an app goes into a long loop and stops responding to user input. Then there is the menu strip itself, and in the top right corner is the clock.

There is already a call to disable the mouse pointer, which short circuits the mouse driver so it stops reading the mouse and stops producing mouse events. There is also a set of system draw flags, three bits of which represent the three top row services. The CPU busy, menubar and clock can be prevented from drawing by disabling these flags via the setflags routine in the service module. loadapp calls this to re–enable these features, in case they were disabled by the previous app.

Applications also have the ability to tap into the system IRQ for timed events. loadapp resets the two vectors for custom timers back to their defaults.

Lastly, loadapp makes two more calls, the second of which it never returns from. The first is a JSR to initapp, and when this returns it does a JMP to evtloop. Both of these calls demand an entire section to describe.

Inside the Application

The reason it's so essential that an application is assembled and loaded to $0800 is because C64 OS expects the application to have its primary jump table starting at $0800.

Screenshot of code from home app 1

The jump table consists of a series of export vectors, exactly the same way each C64 OS KERNAL module begins with a table of vectors. In an application, they are as follows:

  • .word init ;Called when the application first launches
  • .word mcmd ;Called when a menu command is triggered
  • .word quit ;Called when the application is about to quit
  • .word frze ;Called when the application will be frozen
  • .word thaw ;Called when the application becomes active after being frozen

The init routine is therefore the first entry point of the application's actual ability to run any code. But remember, in C64 OS, like in modern OSes, the application doesn't ever just go into a hard loop polling for something like a key to be pressed. Instead, it sets up some UI and returns and lets the OS handle the main event loop. Which we'll get to shortly.

In order for an application to interact with the system at all, to be informed of mouse and keyboard events and to be told when to draw, it must push a screen layer. Typically what it would do is first create the basic UI by instantiating and assembling Toolkit views, and wiring up the event callbacks from some of those views to routines. Then it puts the pointer to the screen layer struct into .X and .Y and calls layerpush.

Screenshot of code from home app 2

The screenlayer struct consists of 4 pointers to: a draw routine, mouse event, key command event and printable key event handlers. The layer is then pushed on to a stack that supports 4 layers. The topmost layer (4th is topmost) gets event priority. The menu system, cpu busy and clock are hardcoded to draw on layer 4. When the application is initialized and it pushes its first layer, it is pushing layer 1, the bottommost layer. At that time, layer 2 and 3 are unassigned.

When a mouse event is generated, layer 4 has the first shot at processing it. Layer 4 delegates this task to the menu system. If the menu system decides the mouse is not interacting with the menus then the event is allowed to be processed by the next layer. 3 and 2 are skipped because they're not assigned and the mouse routine from layer 1, is called. This is code the application implements. So technically the application can do anything it wants. With the caveat that it shouldn't go into a long loop, it should always try to return back to the main event loop as quickly as possible so as to remain responsive to the user.

While the app can do whatever, typically the app will just forward the call to the toolkit. That's what the screenshot of this app is doing. All three of the screen layer's event pointers are set to "forward", which simply does a JMP to tk_proc. Similarly, the draw pointer is set to tkdraw.
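Since the screenshots aren't reproduced in this text, here is a rough sketch of that setup; the field order (draw, mouse, key command, printable key) follows the description above, but the exact struct layout and label names are assumptions:

        ;the app's screen layer struct: four pointers, as described above
mylayer .word tkdraw      ;draw           -> Toolkit's draw routine
        .word forward     ;mouse events   -> forwarded to Toolkit
        .word forward     ;key commands   -> forwarded to Toolkit
        .word forward     ;printable keys -> forwarded to Toolkit

forward jmp tk_proc       ;hand the event straight to Toolkit

init    ;...instantiate and assemble the Toolkit views here...
        ldx #<mylayer     ;pointer to the screen layer struct
        ldy #>mylayer
        jsr layerpush     ;push it as the app's first (bottommost) layer
        rts               ;back to loadapp, which then JMPs to the event loop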

You might wonder, what's the point of this layer system if it ultimately just calls routines in the toolkit? Why doesn't the main event loop just call those toolkit routines directly? The short answer is flexibility. It is important that any OS strikes a good balance between easy and flexible. The Toolkit handles a tremendous amount of redundant work. And it's super easy to snap a few of the views together to get a useful, responsive and consistent UI. But if you were forced to use the Toolkit, there might be things you'd want to do that would be very difficult.

The way I've set up C64 OS, imagine you are fully onboard with using Toolkit, but you want to have some special keyboard command that isn't represented by a menu item in the menu system. No problem. Set up the first screen layer exactly as illustrated above, except instead of setting the kcmd pointer to "forward", set it instead to some other routine. That routine will be called every time a kcmd is generated (and not already handled by a higher screen layer). Your routine can check to see if it is your special command and do something if it is, and if it isn't it can then pass it on to Toolkit. Boom, the app can inject event handling outside the context of Toolkit that easily.
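A hedged sketch of that first example; how the key command is delivered to the handler (here, assumed to arrive in .A) and the label names are assumptions:

        ;installed on the screen layer's kcmd pointer instead of "forward"
mykcmd  cmp #"g"          ;is this our special key command?
        bne notmine
        jsr dospecial     ;hypothetical routine that does the special thing
        sec               ;carry set = event handled, stop it propagating
        rts
notmine jmp tk_proc       ;everything else goes to Toolkit as usual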

Here's a more dramatic example. Let's say you want to use Toolkit, but you want a dedicated area of the screen, say the bottom 5 rows, to be completely free-form and unmanaged by Toolkit. No problem. When the init routine configures Toolkit's root view, you set its draw bounds so that its height is just 19 rows (25, minus 1 for the menu bar, and minus 5 for the free-form area at the bottom) and aligned to the top. Toolkit will now never draw into that bottom area. Then you set the mouse routine of the screen layer to call your custom routine. That routine checks to see if the click is above the bottom 5 rows; if it is, it calls the Toolkit's mouse handler. Otherwise, it proceeds to analyze the mouse event however you please, to work with whatever you're doing in the free-form area.

A third and final example. A weird example, but just to show the possibilities. You want a button in the Toolkit UI to lock the screen for 5 minutes after you've clicked it. So you create that UI with Toolkit and create the button and wire the button to call a routine on click. That routine configures a system timer for 5 minutes (18000 Jiffys). Then, in the screen layer rather than merely forwarding to toolkit, you check to see if that timer countdown variable is zero. If it's not zero then you return without forwarding any events to Toolkit. If it is zero then you forward to Toolkit as usual.

What you want is the ease of being able to rely on the Operating System for all the things that it's good for, and you want the ability to do something custom whenever you want without having to fight or hack the system.

The Event Loop

After the app has been initialized, it has presumably pushed a screen layer onto the screen layer stack and configured whatever UI it wants to set up. The initializer returns to loadapp, which promptly JMPs to evtloop. The event loop will therefore never return to the loadapp routine.

The IRQ service routine is still firing every 60th of a second, the mouse has been enabled, so mouse clicks are producing mouse events, and the key presses are producing keyboard events. The main event loop loops infinitely performing the same steps over and over.

Screenshot of code from evtloop 1 Screenshot of code from evtloop 2 Screenshot of code from evtloop 3

When an app quits, somehow the event loop needs to exit, and not from somewhere deep inside a JSR, otherwise we'd have a stack leak. Instead, there is a loop break vector (loopbrkvec). When loadapp enters the event loop after loading an app, it enters at a small prelude that clears the loop break vector before proceeding to step 1. However, the loop loops back to step 1, skipping the prelude that clears that vector. Near the end of the event loop, step 5 checks to see if the loop break vector is set. If it's not set, it proceeds to step 6 and back to step 1 and continues on. If the loop break vector is set, however, it does an indirect JMP through that vector. Therefore, it truly leaves the event loop, never to return. This is what will be used for doing a proper application quit, however I haven't implemented that yet.
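Here's a sketch of that mechanism; the label names and the exact placement of the check within the loop are assumptions based on the description:

        ;entered here, once, from loadapp
evtloop lda #0
        sta loopbrkvec    ;clear the loop break vector
        sta loopbrkvec+1

mainlp  ;...handle mouse, key command and printable key events, check timers...

        lda loopbrkvec    ;has something set the loop break vector?
        ora loopbrkvec+1
        beq noquit
        jmp (loopbrkvec)  ;yes: leave the event loop, never to return

noquit  ;...redraw...
        jmp mainlp        ;loop, skipping the vector-clearing prelude

loopbrkvec .word 0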

The Event Loop then performs Step 1, Step 2 and Step 3. They have a parallel structure, so they're easy to explain. They handle the three primitive event types: Mouse, Key Command and Printable Key. First, a JSR to the appropriate routine in the input module checks for the presence of an event on the queue. If the carry is clear, there is an event on the queue. If the carry is set, it simply moves on to the next step.
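
In other words, each of these steps follows the same skeleton, something like the sketch below. The queue-check and dequeue routine names are placeholders I've made up; the carry convention and proclayers are as described here.

        ;step 1: mouse events (steps 2 and 3 are structurally the same)
step1   jsr getmouseq    ;assumed input module queue check
        bcs step2        ;carry set: no event queued, move on
        jsr proclayers   ;topmost layer gets the first shot
        jsr deqmouse     ;assumed: dequeue the event afterwards
step2   ;...key command events, handled the same way...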

If an event is on the queue it JSR's to proclayers. Proclayers is what ensures that the topmost layer gets a shot at the event first. If it's a mouse event, it calls the mouse callback for the topmost layer. This routine does whatever it wants. If it returns with the carry set, then the event has been handled and proclayers is finished and returns. If the carry comes back clear, proclayers loops to the next layer down, checks to see if its mouse routine is not null, and calls that. This allows higher layers to always get precedence. And higher layers have the ability to deny the lower layers from ever even having a shot at the event. This can be used for modal dialogs, and it is also what allows, say, the menu system to prevent a click event from passing clear through the menu and into the Toolkit UI below it.

After proclayers returns, the event is immediately dequeued. This process repeats in steps 2 and 3, handling keyboard events the same way. Key commands, for example, are routed to Screen Layer 4 first, which allows the menu system to pick them up. If the menu doesn't handle them, then they make their way to the lower screen layers, and the app can handle the key command outside the context of the menu system. If the key command matches a menu keyboard shortcut, the menu system calls the Application's mcmd routine, as we saw configured in the Inside the Application section above.

The handling of events, however, should never cause the application to try to redraw itself to screen. If an event causes the visual state of the application to become out of date, the application should call the markredraw routine of the screen module. This sets redraw flags for the current layer handling the event. More on redrawing in a moment.

Step 4 of the event loop handles checking for an expired timer. The timer system is two–fold: there is a timer update vector and a timer check vector. The timer update routine is called by the IRQ, and so it is called very reliably 60 times a second. When the timer elapses, however, that's not when the resultant code for the elapsed timer gets called. The Event Loop's execution time is less reliable, because event handlers could be poorly behaved and fail to return to the event loop in a timely fashion. Eventually though, when the event loop is executed, step 4 calls the check timer routine. If the timer is elapsed, the check timer routine is free to call whatever code should execute as a result, including possibly resetting the timer.

Step 5 of the event loop calls redraw. This redraw routine doesn't just immediately start trying to draw something. It checks to see if the mark-for-redraw flags are set. If they are, then it calls the draw routines on the layers, starting with the lowest layer first. So the application's first pushed screen layer has its draw routine called. In our example this was set to forward to the Toolkit's redraw.

If the toolkit does something it shouldn't do, such as draw into the top menubar, this will be overdrawn almost immediately, because the main redraw routine will call the draw routine on the next highest screen layer. This also allows the toolkit to trigger redraws as the result of a timer, even when the user has a menu open. The toolkit may then refresh some part of the screen that is hidden under the open menu. After it does its draw, the next highest layer gets to draw, and eventually layer 4 gets to draw, and the menu will be redrawn above the updated toolkit view.

There are some tricks I'm working on to make this fairly efficient. When the menu system is due to start drawing, for example, it will draw the lower three layers from scratch, and then buffer them to a memory area stashed under I/O. Then, unless the lower layers have their redraw bits set, it will refresh the screen by copying from the buffer, before redrawing the menus above that. This will allow the menus to be drawn above a complex Toolkit UI very quickly. There are some other tricks I'm working on as well. The Toolkit will keep track of which View triggered the screen layer to be dirty, and when it is asked to redraw, it can redraw just that one view. This will allow, for example, a user to type into a text field, and on each key stroke only that single text field will continually redraw itself, rather than the whole screen.

After the redrawing phase is complete, step 5 also handles drawing the CPU busy animation. It starts by resetting the animation character to the default state. Then it sets the jiffy time into a system variable indicating when the event loop last went through step 5. The IRQ service routine, which updates the Jiffy Clock, compares the current Jiffy Time to when the Event Loop last ran, and if more than a couple of seconds has passed, the IRQ service routine starts to replace the CPU busy character with the next frame of an animation set. Because the IRQ runs independently of the Event Loop, if some application code that handles an event decides it is going to do something for an extended time before returning to the event loop, the event loop will be unable to update the event time variable, and the CPU busy animation will automatically start ticking. This is, of course, subject to the system draw flags mentioned at the very beginning. An application is free to disable the CPU busy animation. But, when the next application is loaded, it will be automatically re–enabled.

Before step 5 finishes, it checks the loop break vector, as mentioned earlier.

Step 6 has one job. JMP to step 1! And that is pretty much all that the Event Loop does.

Last Comments

That's the basic start and life cycle of an application in C64 OS. It doesn't go into much detail about how drawing really works, or how to make a UI with Toolkit, or how an application will be quit and its memory deallocated. But, it does show how, starting with the Booter, the first application gets itself running and where it has the opportunity to set up its UI, and how it will respond to events.

Thanks for reading. As always, leave comments, questions and thoughts below.

  1. I talk a bit about file references in C64 OS here. I promise that I'll go into more detail on file references in the next post, unfortunately as of this writing, I haven't done that yet. []
  2. 230 characters maximum length for a path string. With 16 character folder names, plus delimiter, we get 230 / 17 = ~13.5 folders deep. If the folder names were all just 1 character, plus delimiter, we get 230 / 2 = 115 folders deep. Or an average of 8 characters, plus delimiter, for 230 / 9 = ~25.5 folders deep. I think this limitation is very acceptable for a C64. []
September 26, 2017Hardware

New C64c Cases Available

I am very excited! I began this blog almost a full year ago, and 42 posts ago. That's an average of about four a month, or one a week. I'm pretty pleased with my ability to consistently add new content to this site. But that's not why I'm so excited. My inaugural post was a mere 58 words. I posted about my eager anticipation of the new C64c cases in a variety of awesome retro color schemes. Here's my original, very first post in its entirety:

These new cases look amazing. I can't wait to get my hands on them.

This is a perfect example of what's great about the C64 in 2016. New hardware is still being manufactured, that looks stylish and new, but true to form at the same time.

I've heard from Jens Schönfeld, personally, that these cases are almost ready.

It took almost a year, but these cases have just recently become commercially available. I should probably add them to the Commodore 8-Bit Buyer's Guide. Feast your eyes on this gorgeous sight.

A beautiful box for your new C64c case

Close up, angled shot of new C64c case, SX-64 Style

Full frontal of the new C64c case, SX-64 Style

You can order them today from Pixelwizard Retro Shop: https://shop.pixelwizard.eu/en/commodore-c64/

These beautiful C64c cases come in 4 styles:

  • Breadbin Grey
  • Classic Beige
  • Retro Black
  • SX-64 Style

I've got no shame in confessing that, to my eyes, the SX-64 Style is the sharpest. Either for reasons of manufacturing difficulty or simply to shape demand, the SX-64 Style case is also 10 € more than the other three.

This brings me to the price. They're not cheap throw–aways! But, I'm a nerd, you're a nerd, and we both know these precious jewels are worth the dough.

They're 59 €, plus 5 € for an additional (recommended) shipping box. Plus shipping fees, which vary depending on how many you order and where they're being shipped to. The SX-64 Style's base price is 69 €. So, all told, for the SX-64 Style I'll be looking at maybe 89 €, which is about $132 CAD.

It's the perfect home for a C64 Reloaded MK2, which is currently available only as pre-order, for 184,95 €. I look forward to doing a write–up about this too, when it becomes available.

What a fantastic time to be a C64 user!

September 18, 2017Programming Theory

Organizing a Big Module

Let's start with a description of the Toolkit drawing system.

As should be evident from my last post, Toolkit Introduction, I've been hard at work (during whatever spare time I can manage) designing and building out the Toolkit module of C64 OS. I spent the weekend (re)implementing two of the biggest components of the view drawing system which I haven't even had a chance to discuss yet here. To be honest, I've been avoiding discussing it, because it's been so in flux. But, given that this is a blog about my progress, maybe I should be talking more regularly about the thoughts I've been having and the dead ends I've hit and the things I've tried. Drawing, in general, is a complex topic. So it's really hard to cover in a single post. I will eventually dedicate an entire post just to talking about how the drawing system works.

In brief, there is a global structure, maintained by the screen module, called the draw context. It defines the screen and color memory origins (2-byte pointers each) and the width and height (1 byte each, because they're measured in character cells); together these define a rectangular region on the screen. The draw rect is always an on–screen area measured in character cells, so width and height don't need to be 16-bit; 40 columns and 25 rows is well below needing 16 bits. Additionally, the draw context has two more values, offset top and offset left. These are both 16-bit and represent the scroll offsets of the draw context.
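
Laid out as assembler label offsets, the 10-byte draw context described above might look something like this. The sizes come from the description; the field order and label names are my own guesses for illustration, not the real C64 OS definitions.

        ;hypothetical layout of the 10-byte draw context
dc_sorg = 0              ;screen memory origin (2-byte pointer)
dc_corg = 2              ;color memory origin (2-byte pointer)
dc_wdth = 4              ;width, in character cells (1 byte)
dc_hght = 5              ;height, in character cells (1 byte)
dc_otop = 6              ;scroll offset top (16-bit)
dc_olft = 8              ;scroll offset left (16-bit)
dc_size = 10             ;total size of the draw context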

The scroll offsets are what make things particularly complex, but as far as I can tell they're necessary. They are what allow the hierarchy of nested views to be scrolled. And they have to be 16-bit otherwise nothing would be able to scroll more than just a few screens.

Every view has the ability to draw itself, and when it does it uses the details in the current draw context to know how much space is available, where it starts in memory, and how offset its origin is from the virtual origin of the view itself (those are the scroll offsets). Typically a view like a button, a label or a checkbox just has to draw itself. Views do have the ability to have children, of course, but the leaf views don't have children. A button, for example, even though it inherits node properties (including a child pointer), will never actually have children. In fact, if you instantiate a view and assign it as the child of a button, that child will never be drawn, because the button's draw routine does not attempt to pass drawing control to any children; it assumes it doesn't have any.

The View view, aka the root class from which all the other views descend, implements a draw routine which on its own doesn't need to do much except clear the rectangle defined by the draw context. But View is also the main container for laying out children. So, if you want to have a UI that puts two scroll views side by side (or one above the other) for a sort of two-up dual pane, those two scroll views are siblings of each other, and they share a parent. That parent is most likely View. View has a special feature of its draw routine that recursively calls draw on each of its child views. This logic is quite complex, so one wouldn't want to implement it twice, as you'll see very shortly.

There are two other container views, Scroll view and Split view. Scroll view is effectively just View but with scrollbars that can be interacted with to change the offset top and offset left properties of its own draw context. And Split view maintains two children and an interactable control for changing the height (if a horizontal split) or width (if a vertical split) of its two children's draw contexts. But these two container views do not reimplement View's draw logic. They just set the draw context in a custom way, set a couple of custom pointers, and then call View's own draw routines. And that is a big benefit of object oriented code.

When View is recursively walking through its children there are actually three steps it needs to go through. And two of these are what I was working on finishing up this weekend. They are:

  • Resize Node
  • Bounds Check
  • Recontext

A resize occurs when the width or height metrics of a view change. A global resize flag is set and a view's size is recomputed. This has downstream effects, because views can be anchored in such a way that their size changes as a result of their parent changing size. A resize is usually caused by interaction with a split view, but view metrics can also be changed programmatically. If they are, one simply has to set the global resize flag manually. There are some efficiencies baked into how this works so that not every view has to have its size recomputed, and one of the flags of the view_rsmask is used to help determine if the view needs to be resized.

Next a bounds check is performed. The containing view, before recursing to its next child, uses the metrics of the child and compares them against the draw context to determine if any part of the child will be visible on screen. If the scroll offsets are set in such a way that the child is scrolled out of view, then the parent simply skips over this child and moves on to the next.

Recontext is a step that will take me a moment to unpack. Each view, when it draws itself, is drawing itself into a global 10-byte draw context, as described earlier. However, that global draw context changes as the system recursively moves through the view hierarchy. Later on, let's say you mouse down on a button, or you type a character while an input view is in focus. In such a case only that one view is updated and needs to redraw itself (to highlight the button, insert the character, etc.). But the global draw context is no longer relevant to the view that needs to redraw. We could redraw the entire screen, but that would be much too slow. Instead, each view maintains a copy of the draw context, as it was when the view last drew itself.

Those contexts become out–of–date, however, if there is a scroll or resize. So another global flag indicates if a recontext needs to happen. The recontext routine takes the current draw context, for the parent or containing view, and modifies it (it always either stays the same or gets smaller, it cannot ever become bigger) according to the offset and size metrics of the child. The child then backs up that new context onto itself. And lastly it draws itself. When the recontext flag is not set, each time a view is to be drawn the global context is set from the copy of the context on the view, which is much less computationally expensive than the recontext logic. In the event that a single view has to be redrawn, the context is simply copied from that view to the global context and then its draw routine is called as normal. It has no idea it isn't being called as part of an entire screen refresh.

This is very cool stuff. I'm having lots of fun. This is easily my favorite type of code to work on.


The devil in the details, the problem of complex code

So the above was as quick an overview as I can give about how drawing works without getting into the nitty gritty technical details. But I feel it was necessary to give at least this level of detail on how it works to understand the level of complexity being worked with.

Last night, after midnight, as I was getting tired and ready to wrap up, I got the dreaded error message:

LABELNAME OVERFLOW!

The last time I went through this I asked about it on IRC, and everyone seemed to nod in agreement that the time is now right for me to switch to cross assembly. As opposed to the native coding on a C64 (actually a C128) that I've been doing up to this point. A C64's limited memory and computational capacity puts limitations not just on what can be run, but also on what can be coded. There is a limit to how many labels can be used. Assembly programming labels stand in for constants, and for memory addresses that will only be resolved at assemble–time.

When I first started learning 65021 (coming up on a year ago next month) I was so wet behind the ears that I just started coding everything into one big file. It didn't take me more than a couple of months before I encountered my first Labelname Overflow. I resisted the temptation, and the near universal advice, to switch to cross assembly, by coming up with a way to break the project into more manageable modules. This solution and its subsequent refinements to make it more workable are well documented across several previous posts.

Now I've hit this problem again. Except instead of the entire C64 OS project being too big and unwieldy, just the Toolkit module itself has become too big and unwieldy! It's a similar situation but I think it needs a different solution. And I don't yet know what that's going to be.

Here are some stats. The main source file, toolkit.a, is 56 blocks. A block is 256 bytes, minus the two byte link pointer. So a rough measurement in kilobytes is to divide the block count by 4. Toolkit's source is therefore about 14 kilobytes. But that's the source code, which is full of comments so I don't forget how all this stuff is supposed to work. Assembled, the last time I assembled it, it was about 7 blocks, or less than 2K. Less than 2 kilobytes, on a machine with 64 kilobytes of RAM. Toolkit's code is under 3% of total available memory and I'm already overflowing the labels! How does anyone write a game that fills or nearly fills the C64's memory?

I think the answer to that question is that they either cross assemble, or huge regions of memory are dedicated to sprites, graphics, music and level data, leaving a much smaller area of memory for the game's engine code in the first place. Or the parts of the game are divided up manually into areas of memory where the connections between the parts can be hardcoded.

As an operating system, C64 OS obviously should try to take up as little memory as it can, leaving as much free memory for the application and its data as possible. This should work in my favor for not running into label overflows. But I've also got at least two things working against me.

Toolkit is without a doubt the largest and most complex module of the bunch. But furthermore, because it is object oriented, it is also the most label heavy. I mean, the definition of the view class is effectively just a long list of all the labels that represent offsets to its properties. This problem is exacerbated by another interesting limitation.

Macros are super handy for not having to type everything out long form. They're particularly useful when you're doing lots of 16-bit math and pointer manipulation. But when you call a macro, all of its arguments have to fit on one 40 column line. The macro name automatically gets indented 8 spaces. So, after a macro name that's 8 or so characters, plus spaces, commas and the # symbol, you end up with only 22 or so characters for the names of the arguments. If any one of the names of the arguments exceeds 7 characters, there is suddenly not enough room on the line for just 3 arguments.

I've yet to write a macro that needed 4 arguments, but 3 is a common pattern. Many of the label names for properties are like this: view_draw, view_kcmd, view_kprnt, etc. Even short names like these are 9 or 10 characters! Now imagine a macro call like this (the preceding white space is part of the line length!):

        #setobj16 this,view_kprnt,view_kprnt_

That is a 45 character line. In other words, it's impossible to type out. And macros (in Turbo Macro Pro+reu) cannot have their arguments spill onto a second line. One way to get around this is to define some temporary labels: vkp = view_kprnt and vkp_ = view_kprnt_. If you do this on the line above the macro call it's close enough in the code that the call remains legible... but you've just blown two more labels on nothing but making a macro call possible.
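
Spelled out, the workaround looks like this (setobj16 being the same example macro as above); the aliases exist purely to get the call under 40 columns.

vkp     = view_kprnt     ;temporary aliases, burned only to make
vkp_    = view_kprnt_    ;the macro call below fit on one line

        #setobj16 this,vkp,vkp_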

Another problem that Toolkit faces is that it seems to be the one module that leans most heavily on the resources of almost every other module. It needs math, string, memory, screen, service and input. (Math for 16bit divides and multiplies plus 16-bit macros, String for character conversion and length measurements, Memory for allocating and freeing space for new objects, Screen for the draw context system, Service for environment variables for system colors, Input for reading event objects.) Toolkit is a monster that seems to need to depend on bits and pieces of at least 6 out of only 9 or 10 modules total. Reading in header files and constants files bumps up the label count some more.

It's a tough problem.

But I'm not giving up. And I'm not caving in and switching to cross assembly. The first time I hit this problem I worked out a great solution. And I'll find a solution to this problem too. I have a few ideas in mind that will ease the pressure:

  • Move some big and label–heavy routines (boundschk, resizenode, recontext) out of Toolkit and into a smaller but related module (like Screen).
  • Shorten object property labels: view_... to vw_... or even v_...
  • Reduce the number of property labels by joining some properties conceptually. width and height could be a 4-byte dimension property, accessed as v_dim+2 etc.
  • Split some long external includes into multiple include files so one can be included without including and incorporating the labels from the others.
  • Unmacro a few things that really don't need to be macro'd.
  • Use hardcoded offsets for short-distance branches, rather than highly localized labels like next, skip or loop.

If I do all of the above, I should be back in business for some time to come, and may even be able to get most of Toolkit completed within the remaining constraints. If the above is not enough and I encounter Labelname Overflow again, I could try to split the Toolkit classes into separate files. I would like to avoid that, however, because there is definitely overhead: memory, execution and organizational.

Thanks for reading. Until next post.

UPDATE: September 19, 2017

Last night I got to work implementing my remediation plan above. But first I decided to do a manual count of how many unique labels are in use by Toolkit. My rough counting came very close to 256, so I'm going to guess that's the key number, for obvious reasons.

On the one hand 256 feels like quite a few labels, but considering I'm able to use 30 or so in just one single routine, it feels like a cripplingly small number. Fortunately, to my discovery, recontext, resizenode and boundschk are all called, conditionally, by only one other routine: drawchildren. I am going to move all 4 routines, lock, stock and barrel, to the screen module. And I only have to expose one new jumptable entry: drawchildren. Draw Children was actually something I'd already factored out of View's main draw routine, so it can be called easily by other container views.

  1. I dabbled with 6502 ASM over the course of my long history as a Commodore user. But I never made anything of any substantial complexity. Whatever I knew I forgot and had to completely relearn. When I was programming apps for WiNGs, it was 99.9% in C. []
September 11, 2017Programming Theory

Toolkit Introduction

A few housekeeping issues to cover up front. I want to say, welcome back, because it has been an unusually long while since my last blog post. I absolutely do not intend for this to become a habit. A few things have conspired against me to make this post a while in the making.

It is late summer, so I had a bit of a holiday, which I thoroughly enjoyed. It took me and my kids and in-laws to a cottage and away from the internet for a few days. But on the whole it didn't keep me away from my C64, as we'll get into in this post. This website suffered some downtime during the past week. I believe it was down for approximately 4 days. I wish to extend my sincerest apologies for the outage. We were actually without internet service for the entire Labor Day weekend. This is annoying for me, no doubt, but it also happens to be annoying for you too, if you're trying to get to one of several sites that I host locally.1 And, as is inevitable when the kids go back to school, they brought home novel illnesses that have kept me away from work for a couple of days.

On the bright side, I've been hard at work on C64 Luggable, the documentation of which is coming along quite well. And I've taken many, many photos of more recent work on the project than I have yet had time to document. And I've been working away on new additions for the Commodore 8-Bit Buyer's Guide. I have well over 35 (!) new products, parts and components that I will be adding to the guide, including the catalog items from Poly.Play and Retro Innovations, and a number of independents from AmiBay.org, Lemon64, and ebay.

And let's not forget all the work I've been doing on C64 OS too. It's been a really fun adventure so far. I am continuing to learn about 6502 coding along the way. The Toolkit, as we'll see in this post, involves a lot of 16-bit math, much of it on values accessed through object pointers, as I began to discuss in an earlier post. I'm sure I'll get into more of that as I continue to discuss the Toolkit.


What is a toolkit?

An operating system is made up of many parts that take care of all sorts of tasks for the application developer. Memory management, abstract file access, string manipulation, input drivers, networking, etc. The toolkit is the part of the OS that helps an application build its user interface. The concept of a toolkit is probably the component that is the most absent from the C64's (and PET, Vic-20 and C128's) built-in operating system. GEOS, on the other hand, offers services for producing menus, both horizontal and vertical, buttons, dialog boxes, actionable icons, single line text input, and well, that's about where its toolkit ends. But that's a big step up from nothing. The toolkit-like features of GEOS are what help to give every GEOS application a standard look and feel.

On Linux, which runs on PCs with much more memory and processing power, toolkits can be dynamically loaded and different applications based on different toolkits can all be running side–by–side. This is actually detrimental to the Linux desktop experience because not all apps feel as though they belong to the same environment. On Windows or macOS, which have a dominant vendor–supplied toolkit, virtually all apps on those platforms use the standard toolkit and consequently feel much more like they belong together.

Because the C64's OS (Kernal and Basic ROM) does not provide services for producing a user interface, everyone produces their own. Amongst the chaos there are a few OSes, the applications for which usually (but not always) feel as though they belong to that OS. C64 OS will fit into that category.

What does the C64 OS toolkit do?

Unlike on Linux where multiple different toolkits can be and often are available to applications simultaneously (GTK, QT, See: Wikipedia Article on Cross-Platform Toolkits), the C64's small memory size means it is only practical for there to be one toolkit available at a time. And in most cases that toolkit is tightly integrated with the other features of the OS anyway. There is no way, for example, to swap out the UI drawing features of GEOS.

The Toolkit in C64 OS is a module, but its location in memory is fixed, and the objects it produces are designed to interact with other services of the OS. For example, it gets its events from the main event loop in the screen module. The structure of the events it expects are built by the updatemnk routine in the input module. It makes calls to allocate memory for itself to the memory module. All the while drawing itself with a drawing context system also provided by the screen module. The essential behavior of an application is to link its functionality to the actions of toolkit objects that interpret the flow of events being generated by the input devices. This is also what we mean by program flow being event–driven.

The Toolkit is object oriented. I began to discuss how one can go about writing code following an object oriented design pattern in 6502 in a previous post, Object Orientation in 6502. In that post I began to talk about the Toolkit in order to have examples. Object oriented code is, by definition, structured as a hierarchy of interrelated classes. Sub-classes descend from other classes and inherit and extend their functionality. In rich modern UI Toolkits, such as the Cocoa framework of macOS, or Cocoa Touch of iOS (its little brother), there are literally hundreds of classes. And each class has hundreds of methods (related functions which operate on the object's own properties). UIButton, in Cocoa Touch, for example, has 32 methods. But these do not include the 21 methods it inherits from UIControl, nor the ~164 (!!!) methods it inherits from UIView.2 And so on up the inheritance chain to UIResponder and NSObject. Even declaring this number of methods would overflow 64K of memory before we got around to implementing anything. And a toolkit is only one part of what it takes to make an operating system.

Needless to say, the C64 OS Toolkit is very trimmed down compared to what we might otherwise bemoan as the endless bloat of modern toolkits. But the principle is similar. Toolkit is a collection of classes, some of which descend from others. They work together allowing a programmer to create and assemble them to construct a flexible user interface that efficiently redraws itself, responds to user input, and calls back to application code when necessary to implement the specifics of the program. The Toolkit relieves the application developer from a huge burden of effort and results in more consistency across applications and enables rich functionality for free3 in the process.

As of the time of this writing, Toolkit consists of just 6 classes. These have already been briefly discussed in the earlier post on Object Orientation in 6502. I have several other classes planned: checkbox will likely descend from Button, radio will likely descend from checkbox, and a multi-line text view is an absolute must–have, but I haven't yet figured out where in the hierarchy it will fit.

Object hierarchy of views

This lean (by modern standards) class hierarchy may look unimaginably small, but I think you'd be surprised how much UI complexity can be constructed out of such an essential core.

Sample UI screen shot, extruded with labeled parts

A sample of UI, extruded to show the nesting of views, and labeled.

All Toolkit classes descend from View. View provides several collections of properties and methods which make it a foundation of several types of functionality. We'll just dig right into those here.

Node Hierarchy

A view-based user interface is a hierarchical tree of nodes. Therefore, each class that participates in the UI needs to be a type of node that can connect to the others to allow application code and other toolkit built-in functionality to navigate the tree. The View class therefore provides node properties. And since all Toolkit classes descend from view, all Toolkit classes have these properties and are all therefore types of nodes. The node properties are as few as I believe it is possible to have and still have it work. Each view has a parent pointer, a first-child pointer, and a next-sibling pointer.

Don't confuse the node hierarchy with the inheritance hierarchy pictured above. In a real UI views will be nested within views, and buttons may be stacked one above the other or put inside scroll views. Labels will be put before inputs and so on. The inheritance hierarchy is hardcoded and never changes, but the node hierarchy of a UI could be totally different for every application.

Every node has exactly one containing parent node. The parent node pointer points to that containing node. There is always one root view, which usually fills the screen, but doesn't have to, and has no parent node. Its parent node pointer is null, ($0000), which allows code that navigates the tree to know when it has reached the root view.

Any node can theoretically contain multiple child nodes. However, only the View class has draw logic which is designed to deal with multiple children. Typically, if you want a node to handle multiple children, you would rely on View's implementation, because it's long and complex. Each node has a pointer only to a single child node. But if that child node is one of many children of the same parent, then the first child uses its next sibling pointer to link to the parent's second child. The second child can link to the third, and so on. Each child points back to its parent even if it is not the first child. The last child of a parent is the last node in a chain of siblings; it has its next sibling pointer set to null. This is how code can determine that it has reached the final sibling.

If a node's first child pointer is null, it has no children. If a node's first child's next sibling pointer is null, then the parent has only one child. And so on. These three pointers are enough to describe the entire tree and allow recursive code to navigate it. Navigating up the tree is very efficient, because every node has a pointer directly to its parent, all the way back to the root node with its null parent pointer. An advantage of one child pointer plus sibling pointers is that we don't need an intermediate data structure, such as an array, to hold an ordered set of child pointers.
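
As a sketch, the three node pointers could be laid out as property offsets like the following, with a small routine that walks to the last sibling. The exact offsets and the ZP pointer named this are assumptions for illustration; only the three pointers themselves come from the design described above.

        ;hypothetical node property offsets on every view
view_par  = 0            ;pointer to parent ($0000 on the root view)
view_chld = 2            ;pointer to first child ($0000 = no children)
view_sib  = 4            ;pointer to next sibling ($0000 = last child)

        ;walk from the node pointed to by "this" (a ZP pointer)
        ;to its last sibling
lastsib ldy #view_sib+1
        lda (this),y     ;high byte of next sibling pointer
        dey
        ora (this),y     ;OR with the low byte
        beq done         ;both zero: this is the last sibling
        lda (this),y     ;follow the sibling pointer...
        tax
        iny
        lda (this),y
        stx this         ;...by making the sibling the current node
        sta this+1
        jmp lastsib
done    rts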

Sample UI node hierarchy

Above is visualized the node hierarchy of a very plausible C64 OS application UI. Note that the node hierarchy defines the structural relationship between the user interface objects, but alone it doesn't define where on the screen siblings will display relative to each other. That part comes in the next section about view metrics.

Here we see that there is a single root view, in the top row. Its parent pointer is null. It has three children, a Scroll view, and two more Views. But the root view only points directly to the Scroll view. The other two Views are linked horizontally as siblings to the Scroll view. The rightmost View in the second row is the last child of the root view, so its next sibling pointer is null. The Scroll view has one child, a multi-line text view. The middle View has four children, two Labels and two Inputs. Presumably these would be laid out on screen such that each Input has one Label. The final View has three children, three Buttons.

Layout Metrics

The node properties are merely structural. In addition to structure, each view needs to know where it should position itself. The positioning of a node is always relative to its parent. In modern UI Toolkits (I'm most familiar with Cocoa and Cocoa Touch, from macOS and iOS respectively), views support special layout constraint objects. This allows a node to align, position and size itself relative to not just its parent but to its siblings as well. While this does enable the production of incredibly flexible and responsive layouts, C64 OS's Toolkit won't implement anything like that. Firstly, it's far too complex for the C64's memory constraints, and secondly, the size and orientation of a C64's screen has been what it is since 1982. There would be little advantage to having a complex system of constraints meant to adapt a UI flexibly to a variety of different screen sizes and orientations.

Each view must be able to describe its size and position, relative to its parent. To handle this the View class provides several metrics properties, which all other Toolkit classes inherit from View. These properties are:

  • view_top
  • view_bot
  • view_abot
  • view_hght

  • view_left
  • view_rght
  • view_argt
  • view_wdth

  • view_rsmask

First let's look at view_rsmask. This is a bitfield for flags that affect resizing behavior. At the time of this writing, the low nybble is well defined, but exactly how the upper nybble is being used is still in flux while I write the code so I won't talk about those yet. Here are the values for the lower four bits.

  • %0000 0001 — rs_ankt
  • %0000 0010 — rs_ankb
  • %0000 0100 — rs_ankl
  • %0000 1000 — rs_ankr

These flags stand for: Anchor Top, Anchor Bottom, Anchor Left, and Anchor Right. These declare which sides of the view have a fixed offset from its parent. These define which of the other 8 metrics properties are pre–defined and which are computed dynamically. So let's look at some examples of how this might work.

A view must have at least one vertical anchor and at least one horizontal anchor. If no vertical anchor is set, it defaults to top; if no horizontal anchor is set, it defaults to left. When the view is anchored, say, to the top, the view_top property defines the distance (in 8x8 cells, not pixels) that the view's top edge sits down from the top edge of its parent. These values are all unsigned 16-bit, so a view cannot be offset negatively from its parent. A view can either be flush with the top of its parent or offset down by anything up to 65535 text rows from its parent's top.

If the view is anchored top, but not bottom, then its view_hght property must be set and is used to figure out how tall the view is. In such a case, vertically resizing the view's parent has no effect on its own height. The situation is similar if the view is anchored bottom but not top. The view_bot property holds the number of rows that this view's bottom edge is positioned up from the bottom of its parent. The view_hght is still relevant to determine how tall the view is, and it is still unaffected by vertical resizes to its parent.

If the description is hard to follow, here's a visualization that should help.

Visualization of Top and Bottom anchoring

The anchor flags can be OR'd together, of course. So view_rsmask could be: rs_ankt | rs_ankb.

Things get more complicated when a view is anchored both top and bottom. You can see how this works in the third example above. view_top defines how far its top edge is from its parent's top, and view_bot defines how far its bottom edge is from its parent's bottom. But when the view is anchored on both sides, resizing the parent vertically changes the height of the view, tracking the height changes of its parent.

Whatever way the view is anchored, some of its properties get computed automatically. Let's start with view_hght. If the view is rs_ankt | rs_ankb, the view_hght is computed and set automatically. When it comes to actually drawing the view, what the drawing code really needs to know is the absolute top and absolute bottom offsets, from the draw origin, for where the view will render after any anchoring and positioning logic has been applied. As it happens, the draw origin is at the top,left of any given on–screen rectangle. The reason for this is the way the VIC-II's memory addresses map to positions on the screen. The top,left corner of the screen is the smallest memory address. And the bottom,right corner of the screen is the biggest memory address. This is true for any arbitrary rectangle you draw on the screen. The top,left corner of that rectangle will always have the smallest memory address of any address within that rectangle.

The consequence of this is that view_top is both the relative offset from the top of the parent and the absolute top of the view from the draw origin. This is not true of view_bot. view_bot is relative to the bottom of the parent. So the smaller the view_bot value, the lower that edge sits on the screen. A view_bot value of 5 says nothing about where the bottom edge of the view is going to be relative to the draw origin. That's what view_abot is about. view_abot stands for absolute bottom, and it is always a computed property. If the view is rs_ankt, then view_abot is view_top + view_hght. If the view is rs_ankb, then the height of the parent has to be taken into account: view_abot is parent->view_hght - view_bot. And view_top is then computed as view_abot - view_hght.

Visualization of Absolute Bottom computed property

At the end of the day, no matter how the view is anchored, the resizenode routine (called internally, never manually) makes sure that all 4 properties are set correctly: view_top, view_bot, view_abot, and view_hght. After this point, the drawing code can completely disregard any positioning complexities due to anchoring and offsets. It simply draws the view between view_top and view_abot, and view_hght is readily available as a reference. Drawing is way beyond the scope of this post, however, and I'll have to return to it at a later date.
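
To make the rs_ankb case concrete, the computation view_abot = parent->view_hght - view_bot is just a 16-bit subtraction through the object pointers, along these lines. The ZP pointers this and parent, and the idea that the properties are read with indirect indexed addressing, are assumptions for this sketch; the property names and the formula are from the description above.

        ;view_abot = parent->view_hght - view_bot   (rs_ankb case)
        sec
        ldy #view_hght
        lda (parent),y   ;low byte of the parent's height
        ldy #view_bot
        sbc (this),y     ;minus low byte of view_bot
        ldy #view_abot
        sta (this),y
        ldy #view_hght+1
        lda (parent),y   ;high byte of the parent's height
        ldy #view_bot+1
        sbc (this),y     ;minus high byte, with borrow
        ldy #view_abot+1
        sta (this),y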

I intentionally limited myself in the above description only to the vertical sizing and anchoring properties. The horizontal sizing and anchoring work in exactly the same way. view_left is analogous to view_top. It is relative to the left side of the parent, but also is the absolute offset from the left coordinate of the draw origin. Therefore a view_argt (absolute right) is computed depending on how the left and right anchoring are configured. The logic here is exactly the same, just along a different axis. And of course, rs_ankl and rs_ankr are represented by different bits in the view_rsmask than rs_ankt and rs_ankb, so all 4 can be set independently and simultaneously.

One more brief note before moving on to a different topic. The above description does not even attempt to broach the issue of scrolling, and what effect scrolling has on the calculation of where a view's children will render, or how they get clipped if they are only partially visible. This is of course all taken into consideration in the design of Toolkit, but is way beyond the scope of this introduction.

As we'll no doubt see in concrete examples in future posts, I believe this anchoring and offsets system (which is based loosely on springs and struts which predate autolayout constraints in cocoa/cocoa touch), can be used to make user interfaces that are very flexible. Probably more flexible than anything else available on the C64, and yet simple enough to be eminently suitable for use in C64 OS.

Event Propagation

One of the main talking points I use when describing C64 OS is that it is event driven. Part of what this means has already been discussed in an earlier post about The Event Model. The IRQ service routine, which in C64 OS is implemented in the service module, updates the mouse and keyboard which converts user input activity into 3 queues of input events. Mouse events, Key Command events and Printable Key events. The mouse events are particularly relevant to Toolkit because the mouse cursor is passing over Toolkit views rendered on–screen, and the user is clicking on those views feeling him or herself to be interacting with them.

However, mouse events contain only screen coordinates, key modifier flags and an event type. Somehow a simple event struct like this has to be translated into meaningful interaction with the underlying on–screen display.

The Toolkit needs to be able to determine the target view, which is the first view that has an opportunity to do something with the event. And then each view needs a way to figure out how to pass notification about the existence of the event if it can't or doesn't handle it itself. This behavior is called event propagation.

Every class in the Toolkit needs to know how to, at a minimum, propagate an event to another view. And so for this reason the View class has a set of methods for event propagation. And every other view subclass automatically inherits the basic ability to propagate events.

To handle the first requirement, Toolkit has a hit test routine. This routine effectively walks the node hierarchy, recursively, searching for the uppermost view that is rendered under the screen coordinates where the mouse event occurred. This is actually easier than it sounds. All of a view's child views are constrained (clipped even, when drawn) to within the bounds of their parent. Therefore, if the mouse event does not fall within the bounds of a given view none of its subviews, or their subviews, etc. need to be checked. Siblings must be checked, however, and it is possible for two sibling views to overlap each other.

This introduces a small complication, because first children are drawn first, and their siblings are drawn later. This means that if a late drawn sibling has view metrics that cause it to overlap with an earlier sibling, it will render above that earlier sibling. What this means is that when doing hit testing, the walking routine must start with the last sibling, and if the event did not occur within its bounds the hit test should then move to the previous sibling. Otherwise, sibling 1 could claim the hit, even though it is covered by a section of sibling 2 that the user believes he or she is clicking on. See the visualization below to understand how this works.

Visualization of hit testing priority on overlapping siblings

In the example above, first and second child are two children of the same parent. Their metrics are such that their bounds overlap. The red point represents where the user clicked. That coordinate is technically within the bounds of both children. However, because the first child renders before the second child, the second child draws itself overtop of the first child, where they overlap. Thus, when performing the hit test, it is necessary to test the second child prior to the first child.

Actually propagating the event is the second requirement. Once a final leaf node (a node which has no children of its own) is found to be the target, that node needs to get the event. In C64 OS, the event data itself is not passed, i.e. it is not copied in memory, to the target node. The node is merely notified that an event is targeting it. Toolkit does this by calling the appropriate event routine. The View class has 3 event handling routines, one for each of the 3 event types: mouse, key command and printable key. In our example of the mouse event, then, the Toolkit calls the target node's view_mouse routine.

This merely informs the view that a mouse event is available. If the view handles mouse events in some way, then it can call readmouse in the input module. This will give it the current mouse event details. I'll discuss this further in the responder section below.

By default, the View class is mostly just a generic superclass and container for multiple children. The default implementation of view_mouse is not to read or analyze the event details at all, but just to propagate the event to the next view. It does this simply by calling view_mouse on its own parent in the node hierarchy. This continues up the node hierarchy until a node either handles the event and returns without propagating it further, or the root node is encountered. The root node has no parent node to propagate the event to, so it simply returns.
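
A sketch of that default implementation: load the parent pointer, and if it's null just return, otherwise make the parent the current object and hand the event to its view_mouse. view_par is the hypothetical parent-pointer offset from the earlier sketch, and how the call actually reaches the parent's own (possibly overridden) implementation depends on the class dispatch mechanism, so dispmouse below is purely a placeholder.

        ;View's default view_mouse: propagate up, or stop at the root
vmouse  ldy #view_par
        lda (this),y     ;low byte of the parent pointer
        tax
        iny
        lda (this),y     ;high byte of the parent pointer
        bne up
        cpx #0
        beq atroot       ;parent is $0000: we are the root view
up      stx this         ;make the parent the current object
        sta this+1
        jmp dispmouse    ;placeholder: dispatch to the parent's own
                         ;view_mouse implementation
atroot  rts              ;nothing handled the event; just return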

Key events, both Key Commands and Printable Key events, are handled similarly to each other. They inherently do not contain screen coordinates, so they do not target arbitrary views in the node hierarchy. Instead, they target special views that claim keyboard focus. Only one view may claim focus at a time, and Toolkit maintains a pointer to that view. If no view wants to be in focus, or if a view in focus wants to lose its focus (also, rather delightfully, said to blur), then the root view is assigned as the focus view.

At a lower level than the Toolkit, the screen module, which implements the main event loop, divides the screen renderer into 4 compositing layers. The topmost (4th) layer is fixed as the layer onto which the menubar (with CPU busy indicator, top level menus, and clock) and any pull-down menus are rendered. This causes them to always render above everything else. The lower three screen layers can be pushed and pulled by the application from a 3-layer stack. Typically the application's initialization routine pushes layer 1 onto the screen stack. Each screen layer struct has pointers to 4 routines:

  • sldraw
  • slmouse
  • slkcmd
  • slkprnt

During initialization, the application, if it makes 100% use of the Toolkit, can wire these four routines directly to the four equivalent routines of Toolkit. So that when the main event loop interacts with the low level screen compositor the Toolkit routines are called directly without needing the application to intermediate. The separation of screen compositor and Toolkit is to allow a more advanced application to completely forgo use of the Toolkit and handle the low level events manually (most useful for a game or other special purpose), or to share responsibility with the Toolkit. The application can get the low level events first, do with them as it pleases and choose when and if to forward them to the Toolkit.

That was a bit of a tangent. Regardless of how the Key Command and Printable Key events are handed over to the Toolkit, it directs them to the view in focus, and it does this by calling that view's view_kcmd or view_kprnt routine respectively.

These routines, in their default implementation by the View class, do exactly what view_mouse does. They merely propagate the events, passing notification, by calling the same routine on their parent node, until something handles the event or the root node is encountered.

Responding

The last topic I'll cover in this Toolkit Introduction, is responding. As we saw in event propagation above, although the events are propagated through the node hierarchy, the default behavior, implemented by the View class, doesn't do anything. Except pass the notification to the next view which does equally little. The question is, how does anything get accomplished by means of these events?

All Toolkit classes descend from View, and so they inherit this do–nothing–but–propagate behavior for free. Some subclasses, however, are meant to respond to events. In some cases this response behavior is entirely internal to the class, requiring no special involvement from the application code. An example of this is an input field. An input view is meant to allow the user to input and edit the text that the field manages. If a mouse event is propagated to an input, the input reads the mouse event details from the input module. If its type is a click event, it tells the Toolkit that it should be the new focus view for all incoming key events. And it configures the system's cursor settings so the cursor begins blinking in the right place, and it returns without propagating the click event any further.

Inputs also have a disabled flag, however, which can be configured by the application's code at runtime. If the input receives notice that a mouse event targets it, it first checks its own disabled flag. If it is disabled, it calls its superclass's implementation of this routine. This is very likely to be View's implementation, which, as we already know, propagates the event to the input's parent view. Neither of these responses to the event necessitates calling back to the application.

Other subclasses of View, such as Button, have both internal and external responses to mouse events. When the left mouse button goes down, it generates a leftdown event. This event is propagated, but when a button gets this event, it marks itself to be redrawn, and when it redraws it draws itself with an inverted color. This is an entirely internal behavior; it requires no interaction with the application to appear to be highlighting in response to the user mousing down on it. Similarly, when the button goes up, a leftup event is generated. (How mouse up events are routed is a bit complicated, and out of scope for this introduction.) The button responds to a leftup by redrawing itself without the inverted color, thus appearing to no longer be highlighted.

Buttons are different from Inputs in at least one important way, though: when the input was clicked, it prepared itself to handle key events, which is an internal change. A button, however, has no inherent, internal change as a result of being clicked. The application needs to be informed that the button was clicked. For this reason, the button subclass adds an additional but_clicked property. This is a pointer to a routine. When the application is being initialized, or at some later time in response to some other activity, the application sets the button's but_clicked pointer to a routine the application has implemented.

The button's own reimplementation of view_mouse receives all mouse events as part of the node hierarchy event propagation system. But button's implementation of this routine calls readmouse to get the event details. If it's leftdown or leftup it changes its highlight state as described earlier, but if it's leftclick it checks to see if its but_clicked property is set. If it is, it calls it, passing a pointer to itself in .X and .Y. And then returns without propagating the event.

The application's own routine is now running. It can, in the simple case, do some fixed behavior, something that should always be done as a result of this button having been pressed. Under more complex conditions, more than one button may be set up to call the same clicked routine. The routine then needs to distinguish which button was clicked to trigger the call. It can do this by comparing the pointer to the button in .X and .Y with some reference the app maintains to that button, or it can write the pointer to a ZP address and look up and access properties on the button itself. This could be used to change the button's text, change its hidden state, change its layout metrics, or read its TK_ID to figure out by ID which button this is and decide what to do next.
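
As a hedged example, an application's clicked routine might look something like the following. Which of .X and .Y carries the low byte, the ZP location ptr, the tk_id offset and the ID constants are all assumptions made up for this sketch; but_clicked and TK_ID are the properties described above.

        ;a clicked handler wired to one or more buttons via but_clicked
btnclkd stx ptr          ;button pointer assumed to arrive low in .X,
        sty ptr+1        ;high in .Y; stash it in a free ZP pair
        ldy #tk_id       ;offset of the TK_ID property
        lda (ptr),y
        cmp #id_save     ;IDs the app assigned when building its UI
        beq dosave
        cmp #id_load
        beq doload
        rts

dosave  ;...do the save...
        rts
doload  ;...do the load...
        rts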


Okay, so that concludes this rather long blog post, which I've been putting off writing for so long while the Toolkit module has been under such active design and development. In fact, just in writing this post it's given me some ideas about how to improve the inheritance model.

Inheritance was the topic I left off at the end of Object Orientation in 6502 with just a little preview. The truth is, I hadn't quite figured out in my head the full logic of how it would be possible, in your own application, to create custom subclasses of Toolkit classes. But I think I'm almost there.

So, expect that post to be coming sometime in the not too distant future, a part two to Object Orientation in 6502. And after that it will make sense to return to look in more detail at how some of the Toolkit views make use of their object orientation, especially when it comes to drawing themselves.

Lastly, I suspect large swaths of this post will eventually make it into the C64 OS technical documentation. Especially the images I put together to illustrate some of the ideas.

  1. Although it's neither the fastest nor the most reliable way to host a website, I host many websites on my own hardware and internet infrastructure, because it's something of an enjoyable hobby for me. I offload heavy assets to an AWS S3 bucket to distribute load and cut down on bandwidth costs. This doesn't help when we lose internet altogether, as was the case this past week.
  2. It's a bit tricky to tell the difference between a method and a direct access to a property in modern languages like Swift. The notation to read a property is exactly the same as the notation to call a method, and the underlying implementation can be changed from a direct read to a method call at any time without changing the code that accesses it. 164 comes from my rough count in the documentation, and includes property accessors.
  3. Every feature built into the toolkit classes is a feature that doesn't have to be explicitly implemented by the application. Yet it becomes automatically available simply because the UI is built from these standard classes. Think about momentarily inverting the color of a button to provide feedback when it is clicked. That comes free in every C64 OS application.
August 15, 2017Programming Theory

Organizing Module Layout

At the beginning of the year I wrote a post, Organizing a Big Project. At that time, I had just split apart what had been a monolithic code file into a series of more manageable modules. At the time of this writing they're documented in the C64 OS technical documentation as:

  1. memory
  2. petscii
  3. input
  4. string
  5. screen
  6. menu
  7. service
  8. file
  9. toolkit, and
  10. network

However, this list is in such flux that even though that documentation was updated just 2 weeks ago, the set of modules has already changed. I've merged petscii (which only had 3 exports: asc2pet, pet2asc and pet2scr) into the string module, which is likely to gain far more functionality in the coming months. Plus, I've added a math module: the 16-bit division and multiplication routines that started their life in toolkit, because of their usefulness there, have been moved into their own module so they can be reused by other modules and by C64 OS applications. And lastly, the system jump table has been extricated from memory and made into its own module, making the memory module more of a peer to the others.

Suffice it to say, the modules are in a state of flux. As I add new routines, optimize old ones and move routines from one module to another, the assembled object sizes of the modules keep changing. However, they have to be packed together in memory, such that the end of one module is followed immediately by the start of the next. If they overlap, the module lower in memory will overwrite the first part of the module above it that's already loaded in, corrupting it and leading to instant, hair-pulling crashes. If there is space left between the modules, well, then that's just wasted memory that can never be allocated, found or utilized.

In addition to the problem of packing the modules together in memory, the main system jump table also has to know where each module starts, so that it knows where to find the module's table of vectors through which to jump. If that weren't enough, there is also the issue that the jump table itself is in a state of flux. With the addition of the math module, I just added mul16 and div16 to the middle of the jump table. Thus, the modules that consume the exports of other modules have to know where in the jump table to find each routine.

Each of these issues has a solution that I worked out at the beginning of the year. But those solutions have proven to be quite laborious to keep up.


In order to pack the modules, I have to know how big each is. So I start with the first, the one highest in memory, and assemble it to somewhere arbitrary. The assembler tells me the start and end address of the object code. (And hopefully I don't get any phase errors, which seem to be fairly easy to produce if you use a label prior to defining it.) I write down that size. Then I go to the next module, load it up, assemble it and record its size. I do this for all 10 (or so) modules.

Then I pull out my calculator, a hex/oct/bin/dec converting Casio my brother helpfully gave me for Christmas many moons ago. I start with the last address plus one, so $CFFF+1, or $D000, and subtract the size of the first module to find where it should start, and I write that down. From that address I subtract the size of the next module, and write down its start address. Repeat this for all the modules. Eventually I end up with a table like this:

Module size and start address table

You can see in the image of my notebook that as I work on the code, I have to recompute the sizes and start addresses when it comes time to test my work. Sometimes, such as in the top middle column (second set of size/start columns), I've only worked on a few modules, so I can leave the others alone and just recompute the offsets for the few I worked on. Still, this process is painstaking and boring, and prone to mistakes leading to nasty bugs.

After I've calculated and written down this table of start addresses, I have to go back to each module, open its main file and manually set the initial start address (*= $ce34 for memory, for example). Then I have to reassemble that module to an object file.
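To make the arithmetic concrete, here is the kind of chain of subtractions involved. The sizes are made up for illustration, chosen so that the first result matches the $ce34 example above:

  $D000 - $01CC (size of memory) = $CE34  →  memory assembles with *= $ce34
  $CE34 - $0312 (size of input)  = $CB22  →  input assembles with  *= $cb22
  $CB22 - $0468 (size of string) = $C6BA  →  string assembles with *= $c6ba

  ...and so on, down through all ten modules.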

The system Jump Table is another issue. The Jump Table needs to know where each of these modules starts, so I open its source and update a set of labels with the new start addresses. Each jmp is an indirect jump through a vector found at the start of the module, plus the offset to the routine. If a module exports 5 routines, then it starts with a 10-byte table of five 2-byte vectors, and the Jump Table correspondingly has 5 entries.

  JMP (input+0)
  JMP (input+2)
  JMP (input+4)
  JMP (input+6)
  JMP (input+8)

That sort of thing. So every time a module grows in size, not only does it have to be reassembled, but every module lower in memory shifts down and has to be reassembled too. And then the jump table has to be reassembled also.
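For context, the top of a module with five exports would look roughly like this. The routine names are borrowed from the input module's header (shown below); the layout is my own sketch of the scheme just described.

              ;start of the input module: five 2-byte vectors, one per export
  input       .word readmouse   ;reached via input+0
              .word mouserc     ;input+2
              .word deqmouse    ;input+4
              .word readkcmd    ;input+6
              .word deqkcmd     ;input+8

              ;...followed by the implementations themselves
  readmouse   ;the actual implementation of readmouse
              rts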

This is a big pain. But let's not forget how a module that wants to call an exported routine from another module finds the entry in the jump table. Each module has a .h header file that defines a label for each routine, set to the base of the module's block of jumps within the jump table, plus a 3-byte offset per routine. Such as:

  readmouse = inputbase+0
  mouserc   = inputbase+3
  deqmouse  = inputbase+6
  readkcmd  = inputbase+9
  deqkcmd   = inputbase+12
  readkprnt = inputbase+15
  deqkprnt  = inputbase+18

  etc…

Thus we have another file that needs to be updated. If a routine that needs an entry in the system Jump Table is added to a module, it shifts the jump table base addresses of other modules' blocks. Those modules then need to have their header files updated… and any module that includes one of those .h files and calls one of those routines needs to be reassembled so it has the right jump table offsets.

Needless to say, it's easy to forget any of these many places that need to be updated. And it's hard to remember exactly which modules depend on which other modules. I end up having to load the source code for each module and look through its includes to see if it includes an affected header file.


I need a better way to work, because the burden is so heavy that it's discouraging me from wanting to work on the project. And it gets worse the more code I write and the more modules I add.

Finally, it occurred to me that all those calculations of offsets that I'm doing could actually be done by the assembler, if only it had some basic information:

  • How big each module is, and
  • How many routines each module exports

Instead of writing these numbers down on paper, I could put them all together into a header file to be included by everything that needs one of these numbers. Here's what it ended up looking like:

The modules.h header file, top The modules.h header file, bottom

The top half of the file declares, in one place, all the variable data I need to provide: one label per module (prefixed with x, for exports) giving how many routines the module exports, and one label per module (prefixed with y, because it's close to x) giving the size of the assembled module. Unfortunately I still have to determine the size of a module manually, but this only needs to be done for the modules I'm working on, which need to be reassembled anyway.

The bottom half of the file is where all the magic happens. First, the highest thing in memory is the Jump Table. And as of just a few nights ago, this is no longer part of the memory module but its own standalone file. One label is defined for each module (prefixed with j, for jump table). The start of the jump table entries for memory is $d000 (hardcoded as the last address + 1) minus the number of exports for this module times 3. Three, because each jmp() takes three bytes.

This gives us a label jmemory that indicates where the jump table entries for memory begin. But this also becomes the starting point for the next module's jump table entries. Thus jinput starts at jmemory minus xinput*3, and so on, all the way through all the modules. These offset labels are computed automatically, simply by adjusting the number of exports each module offers at the top of the file.

The memory module is now just a standard module, structurally the same as any of the others, with its own table of vectors to its exported routines. Its start address is offset from the final (lowest in memory) jump table block. There is one label per module (prefixed with s, for start address). The Jump Table itself is treated as the first module, and its start address is simply the address of that final jump table block. In this case that's jtoolkit, because toolkit happens to be the last module. Thus there is a label, sjumptbl, that is set equal to jtoolkit. smemory, the start address for the memory module, is therefore sjumptbl minus ymemory, the size of the memory module. And the next module starts at smemory minus its own size, and so on through the modules.

Thus, modules.h is a central place where all the variable data goes, the size of each module and the count of its exports, and it produces a set of j-prefixed and s-prefixed labels for the start addresses of each module's jump table block and code, respectively.
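As a simplified sketch of what that amounts to, here is a hypothetical modules.h with only three modules plus the jump table, and made-up sizes and export counts. (In the real file the chain runs through all the modules and sjumptbl is set from jtoolkit; the expression syntax, such as the parentheses, may need adjusting for your assembler.)

              ;--- top half: the data I maintain by hand (made-up values) ---
  xmemory     = 10                   ;number of routines memory exports
  xinput      = 7
  xstring     = 12

  ymemory     = $01cc                ;assembled size of the memory module
  yinput      = $0312
  ystring     = $0468

              ;--- bottom half: everything else is computed ---
              ;jump table blocks pack down from $d000, 3 bytes per jmp()
  jmemory     = $d000 - (xmemory*3)
  jinput      = jmemory - (xinput*3)
  jstring     = jinput - (xstring*3)

              ;module code packs down below the last jump table block
  sjumptbl    = jstring              ;string is the last module in this sketch
  smemory     = sjumptbl - ymemory
  sinput      = smemory - yinput
  sstring     = sinput - ystring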


How can this be used, now?

Each module includes modules.h, then sets its own assembly address to its s-prefixed label. Toolkit, for example, includes modules.h and positions itself by declaring *= stoolkit.

Each module's .h header file declares the labels for its jump table entries as offsets from its j-prefixed label. These declarations are always right, and so the header files themselves don't need to be touched.

  readmouse = jinput+0
  mouserc   = jinput+3
  deqmouse  = jinput+6
  readkcmd  = jinput+9
  deqkcmd   = jinput+12
  readkprnt = jinput+15
  deqkprnt  = jinput+18

  etc…

And lastly, you have the jump table itself. The actual jmp() instructions still need to be written out, because Turbo Macro Pro (the native version) doesn't support the code-generating pseudo labels that TMPx provides. But where those JMPs jump to is computed automatically from the s-prefixed labels, simply by including modules.h. Thus:

  JMP (sinput+0)
  JMP (sinput+2)
  JMP (sinput+4)
  JMP (sinput+6)
  JMP (sinput+8)
Module size and export count table

I still need to keep track of some things on paper. My table now looks like what is shown in the image above. But I don't have to make any of the intermediate calculations, and I don't have to enter any of those intermediate values manually across a myriad of files. I just update the modules.h file with the new module sizes and export counts, then update the jump table itself with any additional entries, and reassemble to object files everything affected by the changes.

It's still quite a bit of work if I make changes to a module high in memory that push the others below it to new start addresses. But, like my recursive backup script, it saves a huge amount of effort and takes a big burden off my shoulders, so I can spend more time coding and less time worrying about how to fit things into memory.


There is one last interesting benefit that has popped up. If I decide I'm going to start working on the string module a lot, because I'm building out the set of routines for string editing and manipulation, ordinarily I'd be in a bit of a tough spot. string is 3rd highest in memory, with six or more modules below it. I'd hate to have to reassemble 7 modules(!) every time I make string a bit bigger.

Well, actually I don't have to. It's easy peasy in the modules.h file simply to move the string module to the bottom of the pile. Its jump table entries can stay where they are, although it would be handy to build out the entries in the jump table, even with placeholders for routines I know will be there soon. There is nothing inherently special about having string be the 3rd module instead of the 9th. It was just bothersome having to redo all the math.

Now that modules.h does all that math, it becomes trivial to rearrange the modules. If I know I'm going to work on a specific couple for a few days, I can just move them to the bottom of the pile at the start of my session, and all of a sudden the cycle time for testing them becomes very short again.


Just a quick update on toolkit. It has been the primary focus of the last month or so of work. I've put a lot of thought and a lot of code into it. However, it's still so in flux in my head that I haven't wanted to commit to writing anything about it yet. But soon. I've just recently worked out what I think will be a big score in making the drawing of views and clipping them to the insides of their bounds rect much more efficient and easier to program. As soon as I get a slightly better handle on how well that's going to work out, I plan to write my first blog post discussing the general architecture of the toolkit. Stay tuned.

August 4, 2017Programming Reference

6502 / 6510 Instruction Set

Every good Commodore 64 programmer needs to have the 6502/6510 instruction set at his or her fingertips. There are already many reference texts like this out there; however, I find all of them to be lacking.

It is my goal for the presentation below to be the fastest, easiest–to–use and best organized 6502/6510 instruction set reference text on the internet. If you have ideas for how I can improve it, please let me know in the comments.

Full credit to 6502.org for the original source of this content, which I have formatted, rearranged and styled, and whose HTML I have fixed.

Alphabetically Ordered

A: ADC AND ASL
B: BCC BCS BEQ BIT BMI BNE BPL BRK BVC BVS
C: CLC CLD CLI CLV CMP CPX CPY
D: DEC DEX DEY
E: EOR
I: INC INX INY
J: JMP JSR
L: LDA LDX LDY LSR
N: NOP
O: ORA
P: PHA PHP PLA PLP
R: ROL ROR RTI RTS
S: SBC SEC SED SEI STA STX STY
T: TAX TAY TSX TXA TXS TYA

Execution Times

Op code execution times are measured in machine cycles; one machine cycle equals one clock cycle. Many instructions require one extra cycle for execution if a page boundary is crossed; these are indicated by a + following the time values shown.



Bitwise Instructions

AND (bitwise AND with accumulator)

Affects Flags: S Z

MODE          SYNTAX        HEX LEN TIM
Immediate     AND #$44      $29  2   2
Zero Page     AND $44       $25  2   3
Zero Page,X   AND $44,X     $35  2   4
Absolute      AND $4400     $2D  3   4
Absolute,X    AND $4400,X   $3D  3   4+
Absolute,Y    AND $4400,Y   $39  3   4+
Indirect,X    AND ($44,X)   $21  2   6
Indirect,Y    AND ($44),Y   $31  2   5+

+ add 1 cycle if page boundary crossed

EOR (bitwise Exclusive OR with accumulator)

Affects Flags: S Z

MODE          SYNTAX        HEX LEN TIM
Immediate     EOR #$44      $49  2   2
Zero Page     EOR $44       $45  2   3
Zero Page,X   EOR $44,X     $55  2   4
Absolute      EOR $4400     $4D  3   4
Absolute,X    EOR $4400,X   $5D  3   4+
Absolute,Y    EOR $4400,Y   $59  3   4+
Indirect,X    EOR ($44,X)   $41  2   6
Indirect,Y    EOR ($44),Y   $51  2   5+

+ add 1 cycle if page boundary crossed

ORA (bitwise OR with Accumulator)

Affects Flags: S Z

MODE          SYNTAX        HEX LEN TIM
Immediate     ORA #$44      $09  2   2
Zero Page     ORA $44       $05  2   3
Zero Page,X   ORA $44,X     $15  2   4
Absolute      ORA $4400     $0D  3   4
Absolute,X    ORA $4400,X   $1D  3   4+
Absolute,Y    ORA $4400,Y   $19  3   4+
Indirect,X    ORA ($44,X)   $01  2   6
Indirect,Y    ORA ($44),Y   $11  2   5+

+ add 1 cycle if page boundary crossed

ASL (Arithmetic Shift Left)

Affects Flags: S Z C

MODE          SYNTAX        HEX LEN TIM
Accumulator   ASL A         $0A  1   2
Zero Page     ASL $44       $06  2   5
Zero Page,X   ASL $44,X     $16  2   6
Absolute      ASL $4400     $0E  3   6
Absolute,X    ASL $4400,X   $1E  3   7

ASL shifts all bits left one position. 0 is shifted into bit 0 and the original bit 7 is shifted into the Carry.

LSR (Logical Shift Right)

Affects Flags: S Z C

MODE          SYNTAX        HEX LEN TIM
Accumulator   LSR A         $4A  1   2
Zero Page     LSR $44       $46  2   5
Zero Page,X   LSR $44,X     $56  2   6
Absolute      LSR $4400     $4E  3   6
Absolute,X    LSR $4400,X   $5E  3   7

LSR shifts all bits right one position. 0 is shifted into bit 7 and the original bit 0 is shifted into the Carry.

ROL (ROtate Left)

Affects Flags: S Z C

MODE          SYNTAX        HEX LEN TIM
Accumulator   ROL A         $2A  1   2
Zero Page     ROL $44       $26  2   5
Zero Page,X   ROL $44,X     $36  2   6
Absolute      ROL $4400     $2E  3   6
Absolute,X    ROL $4400,X   $3E  3   7

ROL shifts all bits left one position. The Carry is shifted into bit 0 and the original bit 7 is shifted into the Carry.

ROR (ROtate Right)

Affects Flags: S Z C

MODE          SYNTAX        HEX LEN TIM
Accumulator   ROR A         $6A  1   2
Zero Page     ROR $44       $66  2   5
Zero Page,X   ROR $44,X     $76  2   6
Absolute      ROR $4400     $6E  3   6
Absolute,X    ROR $4400,X   $7E  3   7

ROR shifts all bits right one position. The Carry is shifted into bit 7 and the original bit 0 is shifted into the Carry.


Program Counter

When the 6502 is ready for the next instruction it increments the program counter before fetching the instruction. Once it has the op code, it increments the program counter by the length of the operand, if any. This must be accounted for when calculating branches or when pushing bytes to create a false return address (i.e. jump table addresses are made up of addresses-1 when it is intended to use an RTS rather than a JMP).

The program counter is loaded least significant byte first. Therefore the most significant byte must be pushed first when creating a false return address.

When calculating branches, a forward branch of 6 skips the following 6 bytes, so effectively the program counter points to the address that is 8 bytes beyond the address of the branch opcode; and a backward branch of $FA (256-6) goes to an address 4 bytes before the branch instruction.

Branch Instructions

Affect Flags: none

All branches are relative mode and have a length of two bytes. Syntax is "Bxx Displacement" or (better) "Bxx Label". See the notes on the Program Counter for more on displacements.

Branches are dependent on the status of the flag bits when the op code is encountered. A branch not taken requires two machine cycles. Add one if the branch is taken and add one more if the branch crosses a page boundary.

MNEMONIC                       HEX
BPL (Branch on PLus)           $10
BMI (Branch on MInus)          $30

BVC (Branch on oVerflow Clear) $50
BVS (Branch on oVerflow Set)   $70

BCC (Branch on Carry Clear)    $90
BCS (Branch on Carry Set)      $B0

BNE (Branch on Not Equal)      $D0
BEQ (Branch on EQual)          $F0

There is no BRA (BRanch Always) instruction but it can be easily emulated by branching on the basis of a known condition. One of the best flags to use for this purpose is the oVerflow which is unchanged by all but addition and subtraction operations.

A page boundary crossing occurs when the branch destination is on a different page than the instruction AFTER the branch instruction. For example:

SEC
BCS LABEL
NOP

A page boundary crossing occurs (i.e. the BCS takes 4 cycles) when (the address of) LABEL and the NOP are on different pages. This means that

CLV
BVC LABEL
LABEL NOP

the BVC instruction will take 3 cycles no matter what address it is located at.


Compare Instructions

CMP (CoMPare accumulator)

Affects Flags: S Z C

MODE          SYNTAX        HEX LEN TIM
Immediate     CMP #$44      $C9  2   2
Zero Page     CMP $44       $C5  2   3
Zero Page,X   CMP $44,X     $D5  2   4
Absolute      CMP $4400     $CD  3   4
Absolute,X    CMP $4400,X   $DD  3   4+
Absolute,Y    CMP $4400,Y   $D9  3   4+
Indirect,X    CMP ($44,X)   $C1  2   6
Indirect,Y    CMP ($44),Y   $D1  2   5+

+ add 1 cycle if page boundary crossed

Compare sets flags as if a subtraction had been carried out. If the value in the accumulator is equal or greater than the compared value, the Carry will be set. The equal (Z) and sign (S) flags will be set based on equality or lack thereof and the sign (i.e. A>=$80) of the accumulator.
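For example (VALUE and ATLEAST are labels of my own choosing):

LDA VALUE      ; the byte to test
CMP #$20       ; carry set if A >= $20, Z set if A = $20
BCS ATLEAST    ; taken when VALUE >= $20
               ; execution falls through here when VALUE < $20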

CPX (ComPare X register)

Affects Flags: S Z C

MODE          SYNTAX        HEX LEN TIM
Immediate     CPX #$44      $E0  2   2
Zero Page     CPX $44       $E4  2   3
Absolute      CPX $4400     $EC  3   4

Operation and flag results are identical to equivalent mode accumulator CMP ops.

CPY (ComPare Y register)

Affects Flags: S Z C

MODE          SYNTAX        HEX LEN TIM
Immediate     CPY #$44      $C0  2   2
Zero Page     CPY $44       $C4  2   3
Absolute      CPY $4400     $CC  3   4

Operation and flag results are identical to equivalent mode accumulator CMP ops.

BIT (test BITs)

Affects Flags: N V Z

MODE          SYNTAX        HEX LEN TIM
Zero Page     BIT $44       $24  2   3
Absolute      BIT $4400     $2C  3   4

BIT sets the Z flag as though the value in the address tested were ANDed with the accumulator. The S and V flags are set to match bits 7 and 6 respectively in the value stored at the tested address.

BIT is often used to skip one or two following bytes as in:

CLOSE1 LDX #$10   If entered here, we
.BYTE $2C         effectively perform
CLOSE2 LDX #$20   a BIT test on $20A2,
.BYTE $2C         another one on $30A2,
CLOSE3 LDX #$30   and end up with the X
CLOSEX LDA #12    register still at $10
STA ICCOM,X       upon arrival here.

Beware: a BIT instruction used in this way as a NOP does have effects: the flags may be modified, and the read of the absolute address, if it happens to access an I/O device, may cause an unwanted action.


Processor Flags

The Interrupt flag is used to prevent (SEI) or enable (CLI) maskable interrupts (aka IRQs). It does not signal the presence or absence of an interrupt condition. The 6502 will set this flag automatically in response to an interrupt and restore it to its prior status on completion of the interrupt service routine. If you want your interrupt service routine to permit other maskable interrupts, you must clear the I flag in your code.

The Decimal flag controls how the 6502 adds and subtracts. If set, arithmetic is carried out in packed binary coded decimal. This flag is unchanged by interrupts and is unknown on power-up. The implication is that a CLD should be included in boot or interrupt coding.
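A quick illustration of decimal mode (a minimal sketch):

SED            ; decimal mode on
CLC
LDA #$19       ; BCD 19
ADC #$03       ; 19 + 03 = 22 in BCD, so A = $22 (in binary mode A would be $1C)
CLD            ; decimal mode back off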

The Overflow flag is generally misunderstood and therefore under-utilised. After an ADC or SBC instruction, the overflow flag will be set if the twos complement result is less than -128 or greater than +127, and it will be cleared otherwise. In twos complement, $80 through $FF represents -128 through -1, and $00 through $7F represents 0 through +127. Thus, after:

CLC
LDA #$7F ;   +127
ADC #$01 ; +   +1

the overflow flag is 1 (+127 + +1 = +128), and after:

CLC
LDA #$81 ;   -127
ADC #$FF ; +   -1

the overflow flag is 0 (-127 + -1 = -128). The overflow flag is not affected by increments, decrements, shifts and logical operations i.e. only ADC, BIT, CLV, PLP, RTI and SBC affect it. There is no op code to set the overflow but a BIT test on an RTS instruction will do the trick.
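For example, since the RTS op code is $60, which has bit 6 set, a BIT test on the location of any RTS will set the overflow flag:

SETV  BIT RTSOP  ; bit 6 of $60 is 1, so V is now set
      RTS
RTSOP RTS        ; any byte with bit 6 set would do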

Flag Instructions (Processor Status)

Affect Flags: as noted

These instructions are implied mode, have a length of one byte and require two machine cycles.

MNEMONIC                    HEX LEN TIM
CLC (CLear Carry)           $18  1   2
CLI (CLear Interrupt)       $58  1   2
CLV (CLear oVerflow)        $B8  1   2
CLD (CLear Decimal)         $D8  1   2

SEC (SEt Carry)             $38  1   2
SEI (SEt Interrupt)         $78  1   2
SED (SEt Decimal)           $F8  1   2

Jump Instructions

JMP (JuMP)

Affects Flags: none

MODE          SYNTAX        HEX LEN TIM
Absolute      JMP $5597     $4C  3   3
Indirect      JMP ($5597)   $6C  3   5

JMP transfers program execution to the following address (absolute) or to the location contained in the following address (indirect). Note that there is no carry associated with the indirect jump so:

AN INDIRECT JUMP MUST NEVER USE A VECTOR BEGINNING ON THE LAST BYTE OF A PAGE

For example if address $3000 contains $40, $30FF contains $80, and $3100 contains $50, the result of JMP ($30FF) will be a transfer of control to $4080 rather than $5080 as you intended i.e. the 6502 took the low byte of the address from $30FF and the high byte from $3000.

JSR (Jump to SubRoutine)

Affects Flags: none

MODE          SYNTAX        HEX LEN TIM
Absolute      JSR $5597     $20  3   6

JSR pushes the address-1 of the next operation on to the stack before transferring program control to the following address. Subroutines are normally terminated by an RTS op code.

RTS (ReTurn from Subroutine)

Affects Flags: none

MODE          SYNTAX        HEX LEN TIM
Implied       RTS           $60  1   6

RTS pulls the top two bytes off the stack (low byte first) and transfers program control to that address+1. It is used, as expected, to exit a subroutine invoked via JSR which pushed the address-1.

RTS is frequently used to implement a jump table where addresses-1 are pushed onto the stack and accessed via RTS eg. to access the second of four routines:

LDX #1
JSR EXEC
JMP SOMEWHERE

LOBYTE
.BYTE <ROUTINE0-1,<ROUTINE1-1
.BYTE <ROUTINE2-1,<ROUTINE3-1

HIBYTE
.BYTE >ROUTINE0-1,>ROUTINE1-1
.BYTE >ROUTINE2-1,>ROUTINE3-1

EXEC
LDA HIBYTE,X
PHA
LDA LOBYTE,X
PHA
RTS

RTI (ReTurn from Interrupt)

Affects Flags: all

MODE          SYNTAX        HEX LEN TIM
Implied       RTI           $40  1   6

RTI retrieves the Processor Status Word (flags) and the Program Counter from the stack in that order (interrupts push the PC first and then the PSW).

Note that unlike RTS, the return address on the stack is the actual address rather than the address-1.


Math Instructions

ADC (ADd with Carry)

Affects Flags: S V Z C

MODE          SYNTAX        HEX LEN TIM
Immediate     ADC #$44      $69  2   2
Zero Page     ADC $44       $65  2   3
Zero Page,X   ADC $44,X     $75  2   4
Absolute      ADC $4400     $6D  3   4
Absolute,X    ADC $4400,X   $7D  3   4+
Absolute,Y    ADC $4400,Y   $79  3   4+
Indirect,X    ADC ($44,X)   $61  2   6
Indirect,Y    ADC ($44),Y   $71  2   5+

+ add 1 cycle if page boundary crossed

ADC results are dependent on the setting of the decimal flag. In decimal mode, addition is carried out on the assumption that the values involved are packed BCD (Binary Coded Decimal).

There is no way to add without carry.
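So you clear the carry yourself before adding the low byte, and the carry then chains naturally into the high byte. For example, adding $0040 to a 16-bit value (NUML/NUMH are labels of my own choosing, low byte first):

CLC            ; no carry into the low byte
LDA NUML
ADC #$40       ; add the low byte of $0040
STA NUML
LDA NUMH
ADC #$00       ; add only the carry into the high byte
STA NUMH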

SBC (SuBtract with Carry)

Affects Flags: S V Z C

MODE          SYNTAX        HEX  LEN TIM
Immediate     SBC #$44      $E9   2   2
Zero Page     SBC $44       $E5   2   3
Zero Page,X   SBC $44,X     $F5   2   4
Absolute      SBC $4400     $ED   3   4
Absolute,X    SBC $4400,X   $FD   3   4+
Absolute,Y    SBC $4400,Y   $F9   3   4+
Indirect,X    SBC ($44,X)   $E1   2   6
Indirect,Y    SBC ($44),Y   $F1   2   5+

+ add 1 cycle if page boundary crossed

SBC results are dependent on the setting of the decimal flag. In decimal mode, subtraction is carried out on the assumption that the values involved are packed BCD (Binary Coded Decimal).

There is no way to subtract without the carry, which works as an inverse borrow; i.e., to subtract, you set the carry before the operation. If the carry is cleared by the operation, it indicates that a borrow occurred.
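For example (NUM and BORROW are labels of my own choosing):

SEC            ; set carry = no borrow coming in
LDA NUM
SBC #$10       ; A = NUM - $10
BCC BORROW     ; carry clear afterwards means NUM was less than $10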


Wrap-Around

Use caution with indexed zero page operations as they are subject to wrap-around. For example, if the X register holds $FF and you execute LDA $80,X you will not access $017F as you might expect; instead you access $7F i.e. $80-1. This characteristic can be used to advantage but make sure your code is well commented.

It is possible, however, to access $017F when X = $FF by using the Absolute,X addressing mode instead. That is, instead of:

LDA $80,X    ; ZeroPage,X - the resulting object code is: B5 80

which accesses $007F when X=$FF, use:

LDA $0080,X  ; Absolute,X - the resulting object code is: BD 80 00

which accesses $017F when X = $FF (at a cost of one additional byte and one additional cycle). All of the ZeroPage,X and ZeroPage,Y instructions except STX ZeroPage,Y and STY ZeroPage,X have a corresponding Absolute,X and Absolute,Y instruction. Unfortunately, a lot of 6502 assemblers don't have an easy way to force Absolute addressing, i.e. most will assemble a LDA $0080,X as B5 80. One way to overcome this is to insert the bytes using the .BYTE pseudo-op (on some 6502 assemblers this pseudo-op is called DB or DFB, consult the assembler documentation) as follows:

.BYTE $BD,$80,$00  ; LDA $0080,X (absolute,X addressing mode)

The comment is optional, but highly recommended for clarity.

In cases where you are writing code that will be relocated you must consider wrap-around when assigning dummy values for addresses that will be adjusted. Both zero and the semi-standard $FFFF should be avoided for dummy labels. The use of zero or zero page values will result in assembled code with zero page opcodes when you wanted absolute codes. With $FFFF, the problem is in addresses+1 as you wrap around to page 0.

Memory Instructions

LDA (LoaD Accumulator)

Affects Flags: S Z

MODE          SYNTAX        HEX LEN TIM
Immediate     LDA #$44      $A9  2   2
Zero Page     LDA $44       $A5  2   3
Zero Page,X   LDA $44,X     $B5  2   4
Absolute      LDA $4400     $AD  3   4
Absolute,X    LDA $4400,X   $BD  3   4+
Absolute,Y    LDA $4400,Y   $B9  3   4+
Indirect,X    LDA ($44,X)   $A1  2   6
Indirect,Y    LDA ($44),Y   $B1  2   5+

+ add 1 cycle if page boundary crossed

LDX (LoaD X register)

Affects Flags: S Z

MODE          SYNTAX        HEX LEN TIM
Immediate     LDX #$44      $A2  2   2
Zero Page     LDX $44       $A6  2   3
Zero Page,Y   LDX $44,Y     $B6  2   4
Absolute      LDX $4400     $AE  3   4
Absolute,Y    LDX $4400,Y   $BE  3   4+

+ add 1 cycle if page boundary crossed

LDY (LoaD Y register)

Affects Flags: S Z

MODE          SYNTAX        HEX LEN TIM
Immediate     LDY #$44      $A0  2   2
Zero Page     LDY $44       $A4  2   3
Zero Page,X   LDY $44,X     $B4  2   4
Absolute      LDY $4400     $AC  3   4
Absolute,X    LDY $4400,X   $BC  3   4+

+ add 1 cycle if page boundary crossed

STA (STore Accumulator)

Affects Flags: none

MODE          SYNTAX        HEX LEN TIM
Zero Page     STA $44       $85  2   3
Zero Page,X   STA $44,X     $95  2   4
Absolute      STA $4400     $8D  3   4
Absolute,X    STA $4400,X   $9D  3   5
Absolute,Y    STA $4400,Y   $99  3   5
Indirect,X    STA ($44,X)   $81  2   6
Indirect,Y    STA ($44),Y   $91  2   6

STX (STore X register)

Affects Flags: none

MODE          SYNTAX        HEX LEN TIM
Zero Page     STX $44       $86  2   3
Zero Page,Y   STX $44,Y     $96  2   4
Absolute      STX $4400     $8E  3   4

STY (STore Y register)

Affects Flags: none

MODE          SYNTAX        HEX LEN TIM
Zero Page     STY $44       $84  2   3
Zero Page,X   STY $44,X     $94  2   4
Absolute      STY $4400     $8C  3   4

 

DEC (DECrement memory)

Affects Flags: S Z

MODE          SYNTAX        HEX LEN TIM
Zero Page     DEC $44       $C6  2   5
Zero Page,X   DEC $44,X     $D6  2   6
Absolute      DEC $4400     $CE  3   6
Absolute,X    DEC $4400,X   $DE  3   7

INC (INCrement memory)

Affects Flags: S Z

MODE          SYNTAX        HEX LEN TIM
Zero Page     INC $44       $E6  2   5
Zero Page,X   INC $44,X     $F6  2   6
Absolute      INC $4400     $EE  3   6
Absolute,X    INC $4400,X   $FE  3   7

Register Instructions

Affect Flags: S Z

These instructions are implied mode, have a length of one byte and require two machine cycles.

MNEMONIC                    HEX
TAX (Transfer A to X)       $AA
TAY (Transfer A to Y)       $A8
TXA (Transfer X to A)       $8A
TYA (Transfer Y to A)       $98

DEX (DEcrement X)           $CA
DEY (DEcrement Y)           $88
INX (INcrement X)           $E8
INY (INcrement Y)           $C8

Stack Instructions

These instructions are implied mode, have a length of one byte and require machine cycles as indicated. The "PuLl" operations are known as "POP" on most other microprocessors. With the 6502, the stack is always on page one ($100-$1FF) and works top down.

MNEMONIC                        HEX TIM
PHA (PusH Accumulator)          $48  3
PHP (PusH Processor status)     $08  3
PLA (PuLl Accumulator)          $68  4
PLP (PuLl Processor status)     $28  4

TSX (Transfer Stack ptr to X)   $BA  2
TXS (Transfer X to Stack ptr)   $9A  2

Other Instructions

BRK (BReaK)

Affects Flags: B

MODE          SYNTAX       HEX LEN TIM
Implied       BRK          $00  1   7

BRK causes a non-maskable interrupt and increments the program counter by one. Therefore an RTI will go to the address of the BRK +2 so that BRK may be used to replace a two-byte instruction for debugging and the subsequent RTI will be correct.

NOP (No OPeration)

Affects Flags: none

MODE          SYNTAX        HEX LEN TIM
Implied       NOP           $EA  1   2

NOP is used to reserve space for future modifications or effectively REM out existing code.

 

The original source for the above content is: http://6502.org/tutorials/6502opcodes.html

In case there are any errors in my rendition, or in the 6502.org source, here is the relevant section from the official C64 Programmer's Reference Guide: Chapter 5: Basic to Machine Language (PDF)

August 4, 2017Programming Reference

Commodore 64 PETSCII Codes

Here is the second most frequent table that I find myself using, after screen codes. The PETSCII table.

Each row of the table below lists four column pairs: the PETSCII code (dec, hex), followed by the character or function it produces (in the up/gfx and lo/up character sets).
0 $00   64 $40 @ 128 $80   192 $C0
1 $01   65 $41 A a 129 $81 orange 193 $C1 A
2 $02   66 $42 B b 130 $82   194 $C2 B
3 $03 Stop 67 $43 C c 131 $83 Run 195 $C3 C
4 $04   68 $44 D d 132 $84   196 $C4 D
5 $05 white 69 $45 E e 133 $85 F1 197 $C5 E
6 $06   70 $46 F f 134 $86 F3 198 $C6 F
7 $07   71 $47 G g 135 $87 F5 199 $C7 G
8 $08 disable C=-Shift 72 $48 H h 136 $88 F7 200 $C8 H
9 $09 enable C=-Shift 73 $49 I i 137 $89 F2 201 $C9 I
10 $0A   74 $4A J j 138 $8A F4 202 $CA J
11 $0B   75 $4B K k 139 $8B F6 203 $CB K
12 $0C   76 $4C L l 140 $8C F8 204 $CC L
13 $0D Return 77 $4D M m 141 $8D Shift-Return 205 $CD M
14 $0E lo/up charset 78 $4E N n 142 $8E up/gfx charset 206 $CE N
15 $0F   79 $4F O o 143 $8F   207 $CF O
16 $10   80 $50 P p 144 $90 black 208 $D0 P
17 $11 cursor down 81 $51 Q q 145 $91 cursor up 209 $D1 Q
18 $12 reverse on 82 $52 R r 146 $92 reverse off 210 $D2 R
19 $13 Home 83 $53 S s 147 $93 Clear 211 $D3 S
20 $14 Delete 84 $54 T t 148 $94 Insert 212 $D4 T
21 $15   85 $55 U u 149 $95 brown 213 $D5 U
22 $16   86 $56 V v 150 $96 pink 214 $D6 V
23 $17   87 $57 W w 151 $97 dark grey 215 $D7 W
24 $18   88 $58 X x 152 $98 grey 216 $D8 X
25 $19   89 $59 Y y 153 $99 light green 217 $D9 Y
26 $1A   90 $5A Z z 154 $9A light blue 218 $DA Z
27 $1B   91 $5B [ 155 $9B light grey 219 $DB
28 $1C red 92 $5C pound 156 $9C purple 220 $DC
29 $1D cursor right 93 $5D ] 157 $9D cursor left 221 $DD
30 $1E green 94 $5E up arrow 158 $9E yellow 222 $DE
31 $1F blue 95 $5F left arrow 159 $9F cyan 223 $DF
32 $20 Space 96 $60 160 $A0 Shift-Space 224 $E0
33 $21 ! 97 $61 161 $A1 225 $E1
34 $22 " 98 $62 162 $A2 226 $E2
35 $23 # 99 $63 163 $A3 227 $E3
36 $24 $ 100 $64 164 $A4 228 $E4
37 $25 % 101 $65 165 $A5 229 $E5
38 $26 & 102 $66 166 $A6 230 $E6
39 $27 ' 103 $67 167 $A7 231 $E7
40 $28 ( 104 $68 168 $A8 232 $E8
41 $29 ) 105 $69 169 $A9 233 $E9
42 $2A * 106 $6A 170 $AA 234 $EA
43 $2B + 107 $6B 171 $AB 235 $EB
44 $2C , 108 $6C 172 $AC 236 $EC
45 $2D - 109 $6D 173 $AD 237 $ED
46 $2E . 110 $6E 174 $AE 238 $EE
47 $2F / 111 $6F 175 $AF 239 $EF
48 $30 0 112 $70 176 $B0 240 $F0
49 $31 1 113 $71 177 $B1 241 $F1
50 $32 2 114 $72 178 $B2 242 $F2
51 $33 3 115 $73 179 $B3 243 $F3
52 $34 4 116 $74 180 $B4 244 $F4
53 $35 5 117 $75 181 $B5 245 $F5
54 $36 6 118 $76 182 $B6 246 $F6
55 $37 7 119 $77 183 $B7 247 $F7
56 $38 8 120 $78 184 $B8 248 $F8
57 $39 9 121 $79 185 $B9 249 $F9
58 $3A : 122 $7A 186 $BA 250 $FA
59 $3B ; 123 $7B 187 $BB 251 $FB
60 $3C < 124 $7C 188 $BC 252 $FC
61 $3D = 125 $7D 189 $BD 253 $FD
62 $3E > 126 $7E 190 $BE 254 $FE
63 $3F ? 127 $7F 191 $BF 255 $FF

Notes:

  1. Codes $00-$1F and $80-$9F are control codes. Printing them will cause a change in screen layout or behavior, not an actual character to be displayed. (See the short example after these notes.)
  2. Codes $60-$7F and $E0-$FE are not used. Although you can print them, they are actually copies of codes $C0-$DF and $A0-$BE.
  3. Code $FF is the BASIC token of the π (pi) symbol. It is converted internally to code $DE when printed and, vice versa, code $DE is converted to $FF when fetched from the screen. However, when reading the keyboard buffer, you will find code $DE for Shift-↑ (up arrow) as no conversion takes place there yet.
  4. Cleaned up from the original source: http://sta.c64.org/cbm64pet.html
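As a small example of note 1, printing a control code through the KERNAL's CHROUT routine at $FFD2 performs an action rather than drawing a glyph:

LDA #$93       ; PETSCII $93 = Clear, from the table above
JSR $FFD2      ; CHROUT: the screen clears, nothing visible is printed
LDA #$41       ; PETSCII $41 = "A", a printable character
JSR $FFD2      ; this one actually appears on the screen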
Older Posts

Click titles to see full content

The home page shows the full content of the 10 most recent posts.
Below are the titles of 5 posts older than the most recent 10.
Click their titles to view the complete post and read and leave comments.

August 3, 2017Programming Reference

Commodore 64 Screen Codes

August 1, 2017Programming Theory

Base Conversion in 6502 (2/2)

July 21, 2017Hardware

Commodore Logo Mark Patch

July 5, 2017Programming Theory

Object Orientation in 6502

June 30, 2017Programming Theory

Base Conversion in 6502 (1/2)

Archive

Full Archive of Past Posts…