February 20, 2020 • Post #97 • Programming Theory

Stack Manipulation, Some Gotchas


The 6502/6510 has a fairly limited stack. This is one reason why the processor is really not designed to do pre-emptive multi-tasking. The CPU is 8-bit, meaning all of its internal registers (besides the program counter) are 8 bits wide. The stack pointer, too, is only 8 bits wide, which means the stack can only be 256 bytes deep.

There are two points here. On the one hand, that's way more than enough. Most 6502 programmers realize, after they've been programming for a while, that they rarely if ever use more than around one quarter of the total available stack. In normal 6502 code, the return address of a JSR is what the stack is most often used for. But that's only 2 bytes. You could call 120 subroutines, each one nested within the previous one, and still not run out of stack. This is a huge amount of stack. It's so huge that things like BASIC string conversion use the end of stack memory as temporary working space, because almost never is anything using that memory.

On the other hand, if you program in C, it's a different story. The nature of C is that all the arguments of a function, and all the additional local variables of a function, are passed via and stored on the stack. This has the advantage that C functions are re-entrant by default, and recursion is therefore a breeze. This is particularly handy in a multi-tasking environment where those functions may be shared concurrently by more than one process. And this technique is helped along by the fact that other, more powerful CPUs have more flexible stack-relative addressing modes.

However, as long as we're not writing in C, or some other stack heavy high-level language, we do have a lot of stack available.


Recursion, though, even if written just in 6502 assembly, can still use a lot of stack. And although we have a lot, there are limits. So we have to be conscious of how deeply we will likely be recursing. If there is a lot of horizontal movement, but a limited amount of vertical depth, we'll probably be okay. Let's look at a few examples and just think about our likely stack use.

Navigating a File System

This is a classic use case for recursion. You have a hard drive, or an SD Card, and it's got nested subdirectories. You write a program to search or navigate through the file system. I wrote a backup program (in BASIC no less) for recursively backing up my entire C64 OS codebase from the CMD HD to a uIEC/SD.

You start at a given root folder, and you search for all the subdirectories. Then you loop over the subdirectories, navigate into one of them, and recurse. Here, you don't want to push a list of subdirectory names onto the stack; that would use way too much. Instead, the directory names are added to the end of an array in main memory, and a set of counters and indexes into that array can be pushed to and pulled from the stack as you go.

Still, if you calculate that one recursive depth costs you, say, 8 bytes, it doesn't take a lot of math to see that you can't recurse deeper than 32 nested folders. In practice it would be fewer. On the other hand, check around your own file system. You will be hard pressed to find in the deepest bowels of even a modern file system folders nested deeper than around 10 levels. In C64 OS's codebase, nesting is never deeper (so far) than about 4 levels.

C64 OS Hierarchical Menus

Another example, this one is specific to C64 OS, is the hierarchical menu system. The code that parses the menu structures out of a menu definitions file has to do so recursively.

These never exceed 4 layers deep. No problem. I actually poked around in many macOS applications looking for examples of deeply nested menus. Besides those that are dynamically generated from, say, the file system (like popup folders on the right end of the Dock), it is exceedingly rare for regular menus to be more than just a couple of levels deep.

Sorting Algorithms

In 2018 I wrote a weblog post about implementing QuickSort in 6502. This would apply to anything you need to sort, and I'm now using it in my C64 OS file system library. This isn't about navigating the file system tree, but about sorting the entries of one directory so the files come either in alphabetical order or ordered by file size or type, etc.

These algorithms frequently make use of a divide and conquer strategy. Each division then gets subprocessed by recursively applying the same divide and conquer on only half the set, and then half of that set, etc. Each recursive level, if you're careful, takes up only a handful of bytes, say 4 to 8. So now we're talking about only being able to recurse 32 levels deep again. But, since you're dividing the group in half at each level, that's 2^32 (about 4 billion) items. If you start with 4 billion items, and you divide the group in half, and then in half again, you only have to do that 32 times until you're down to a group with only one item in it.

Due to other 8-bit limitations, I opted to limit the C64 OS quick sort routine (and file system library) to supporting 256 items. You could have more than 256 files in a directory, but to see and sort them all, you'd have to apply some pattern or filter. You only need to recurse a maximum of 8 levels deep. This will use no more than 1/4 of our available stack.

Object-Oriented View Hierarchy

I've posted about the C64 OS Toolkit before. And also about object-oriented 6502 code more generally. The view hierarchy, by its very nature, is recursively nested. A root view contains a split view, which contains a tab view, which contains a scroll view, which contains a form, which contains a text label, and so on.

When you're hit-testing, the mouse clicks somewhere, and you have to figure out which view is immediately beneath the click. To do this you have to recursively search all of the children of the root view, and check to see if the click is within the bounds of each child. If it's outside the bounds of a child, you can abandon searching the rest of that branch of the hierarchy. But if it is within the bounds of a child, then you have to recurse and start searching in the same way through all of its children, then all of that child's children, until eventually you find a clicked view that itself has no children, and boom, that's the one.

It's similar when drawing the view hierarchy. You draw the root view. But then the root view has two children, so you have to tell each of them to draw; but then each of them has its own children, and they need to be told to draw. Until, as in hit-testing, a given view is either scrolled out of bounds, or is marked as hidden by its own internal properties, in which case you can skip it and not worry about how many children it may contain. That's the general idea.

Once again, I've looked through many advanced macOS applications, and thought about the depth of view nesting required to build up their UI. The surprising finding is that, even in modern sophisticated apps, the nesting typically doesn't go very deep. In fact, if it does go too deep that's probably symptomatic of a poorly designed UI.

Cornerstone SVN Client for macOS

Take, for example, that blue "courses" tag in the ignore box on the far right. There is a lot of UI going on in this application. More complicated than most programs on the Commodore 64 ever get, though not necessarily more complicated than a C64 OS app's UI could be. The tag is in a box. The box is in a tab view. The tab view is in a horizontal split. That horizontal split is possibly in another horizontal split (unless it's a single split view that supports more than two split areas), and the split view is in a window. It may be that the macOS layout has got intermediate nesting going on, but to replicate something of this sophistication in C64 OS's toolkit, it's only about 5 nested levels deep. Even if it were 6 or 7, that's not anywhere near 32 levels deep.

Recursive View Drawing, and Context Backup and Restore

Let's stick with the recursive view hierarchy for the next leg of this journey. It isn't merely that you recurse through the drawing routines or through the hit-testing routines. I've described in an earlier post how the Context Drawing System works. In short, a child view's sizing and anchoring properties¹ position it on the screen relative to the position of its parent view. Therefore, it is difficult (or impossible) to start from a pointer to a view object and, by its properties alone, know where on the screen it is.

Whether drawing or hit-testing, the context is configured and the root view is sized to conform to it. Before the root view recurses to its first child, it must contract the context, move it, or both. The child view then does just what the root view did: it sizes itself relative to the context. This continues recursively from child to child to child. However, when a branch of recursion terminates and unwinds, a parent view must change the context from how it was configured for the previous child to how it should be for the next sibling of that child.

The parent could manually undo the changes that were applied for the previous child, and then reapply new changes for that child's next sibling. This would work, but modifying the context is slow. It makes much more sense to back up the context as it is, before modifying it. Then, to switch to the previous child's next sibling, it can restore the context from the backup and modify it for that next sibling. That's how context drawing works with the HTML canvas element, for example, which is derived from the Quartz drawing system in macOS.

This is a recursive process, though. We can't just back up the context to a single fixed place. Instead we can push the bare minimum context onto the stack. The context is 11 bytes. So we are certainly going to use up a lot more stack with each level of recursion, and thus limit the number of levels deep we can go.

origin   2 bytes  ;Character memory origin
coloro   2 bytes  ;Color memory origin
bwidth   1 byte   ;Buffer width
width    1 byte   ;Context draw width
height   1 byte   ;Context draw height
otop     2 bytes  ;Top Scroll Offset
oleft    2 bytes  ;Left Scroll Offset

Given our screen resolution, memory constraints, etc., I've created many mockups of toolkit-based UIs for C64 OS utilities and applications, and they won't go more than about 4 or 5 levels deep. If we assume we have to push the THIS pointer (2 bytes), the context (11 bytes), and the return address (2 bytes), each recursive level requires 15 bytes; call it 16. That's enough for 16 levels of nesting. We're far from that limit.

Here's the thing though. To push and to pull the context requires small loops, at least 10 bytes each, as well as hardcoded references to the context address and size. That feels like a waste of space and an unfortunate repetition of code for every time we want to back up and restore the context. It also requires the app developer to know details about the internal workings of the drawing context. All of this is stuff we'd rather avoid.
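To make the size argument concrete, the in-lined backup loop described above might look something like the following sketch. The label ctx as the context's base address is an assumption for illustration (it is not necessarily what the actual C64 OS source calls it); the 11-byte size comes from the structure listed earlier.

```asm
        ; In-lined context backup: the small loop described above.
        ; "ctx" is an assumed label for the context's base address.
        ldx #0
bakctx  lda ctx,x   ;Read one byte of the context...
        pha         ;...and push it to the stack.
        inx
        cpx #11     ;Hardcoded context size.
        bne bakctx
```

Every place that backs up the context repeats these roughly ten bytes, plus a mirror-image pull loop to restore it. That repetition, and the hardcoded ctx and size, are exactly what we'd like to factor out.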

The HTML canvas, on the other hand, sensibly puts this behind a pair of methods, .save() and .restore(). What would be nice is a system call to push the current state of the context. Then we could manipulate the context and recurse into a child. Upon returning from the child, to process its next sibling we could just restore the context with another system call and modify it again, and on it goes.

So I added two system calls to the toolkit module: pushctx and pullctx. Both operate, implicitly, on the currently configured context. There is a bit of overhead in calling them, but if you call them in more than just two places in your code, it saves memory for them to be in their own routines.

That's perfect, right?

Stack Manipulation

Not perfect.

As soon as you think about it, it is obvious this can't be as simple as it sounds. The golden rule for pushing anything to the stack is: However many bytes you push to the stack, you must pull that many bytes off the stack before returning.

This is because the stack is used by JSR/RTS automatically. When you call JSR, the two-byte return address is pushed to the stack. When RTS executes, the last two bytes on the stack are pulled (then incremented by one) and used as the new program counter, the address whence to carry on execution. You cannot JSR to a subroutine, push several data bytes onto the stack, and then RTS. The RTS would pull your data bytes from the stack and try to use them as the return address, and, unless you're doing this on purpose with carefully crafted data, a crash would instantly follow.

But this is exactly what we're trying to do with a routine like pushctx. We JSR to pushctx. Then it loops over the context and pushes 11 bytes to the stack. Now that it's done doing that, it needs to RTS. The very nature of this routine seems to violate the way that subroutines use the stack. Or consider pullctx. You JSR to pullctx; the last two bytes on the stack are now the return address, but we have to pull the context bytes from the stack before we return.

I considered making pushctx and pullctx macros. That helps abstract what's going on from the developer. But it doesn't save space. And it still hardcodes the address and size of the context. This is not great.

The trick to making this work involves some stack manipulation.

Let's consider what happens when you call pushctx. After pushctx returns, we want the stack to hold the context, just as if we'd in-lined the loop. But when we first call pushctx, the last 2 bytes on the stack are the return address. If we just pushed the context to the stack, we would sandwich the return address 11 bytes down in the stack. So the trick is quite simple. The first thing that pushctx does is pull 2 bytes off the stack. Those are its own return address. It writes these to a temporary variable. Then it pushes the context data to the stack. And in order to return, the last 2 bytes on the stack must be the return address. So it pushes the backed-up return address back onto the stack, and then it calls RTS.
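As a concrete sketch, pushctx could be written something like this. The labels ctx, ctxsz and retadr are illustrative assumptions, not the actual C64 OS source; the stack manipulation is the part that matters.

```asm
ctxsz = 11           ;Size of the drawing context (assumed layout)

pushctx pla          ;Pull our own return address...
        sta retadr   ;...(the low byte comes off first)...
        pla
        sta retadr+1 ;...and back it up in a temporary.

        ldx #0
pshlp   lda ctx,x    ;Now push the context to the stack,
        pha          ;one byte at a time.
        inx
        cpx #ctxsz
        bne pshlp

        lda retadr+1 ;Restore the return address, high byte
        pha          ;first, so it sits on the stack in the
        lda retadr   ;order RTS expects.
        pha
        rts          ;Return, leaving the context below.
```

Note that the retadr temporary makes pushctx itself non-re-entrant, which is fine here: it never recurses, and its use of the temporary is complete by the time it returns.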

It adds a little bit of overhead. The stack manipulation, plus calling pushctx and returning, altogether double the size of the loop. If we call this routine from just two different places, we'll break even in memory usage, but with the benefit of abstracting the work. And if we call it three times, we're saving memory in addition to the other benefits.

It will work exactly the same way for pullctx. When we first JSR to pullctx, we want to pull the context from the stack. But the top 2 bytes on the stack are pullctx's own return address. Once again, the first thing this routine does is pull its own return address and back it up. Then it pulls the context data from the stack. Lastly, it pushes its own return address back onto the stack before doing an RTS.
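And the mirror image, pullctx, under the same illustrative assumptions about the labels:

```asm
pullctx pla          ;Pull and back up our own return
        sta retadr   ;address, exactly as pushctx does.
        pla
        sta retadr+1

        ldx #ctxsz-1
pullp   pla          ;Pull the context bytes back off the
        sta ctx,x    ;stack, in the reverse of the order
        dex          ;pushctx pushed them.
        bpl pullp

        lda retadr+1 ;Restore our own return address...
        pha
        lda retadr
        pha
        rts          ;...and return, stack now balanced.
```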

Things to Watch Out For

So, the above works. It's a bit weird, but it works, and we know why we're doing it.

Calls to pushctx and pullctx have to be balanced before returning, just like the individual instructions that push and pull data to and from the stack have to be balanced. The CPU instructions push and pull one byte at a time. But these two routines can be thought of as complex versions of those simpler instructions, ones that push and pull a larger number of bytes at a time.

In a sense, "JSR pushctx" should be thought of as a unit, akin to PHA or PHP. Just as you cannot do this:

PHP ;Push processor status register to stack
RTS ;Return to sender

Similarly, you cannot do the following either:

JSR pushctx ;Push toolkit context to stack
RTS         ;Return to sender

Rather, just as you have to balance the stack instructions, something like this:

PHP ;Push processor status register to stack

;Do some work...

PLP ;Pull processor status register from stack
RTS ;Return to sender

You also have to balance your pushctx and pullctx calls, like this:

JSR pushctx ;Push toolkit context to stack

;Manipulate for another child

JSR pullctx ;Pull toolkit context from stack
RTS         ;Return to sender

Some Gotchas

If you've been doing 6502 assembly programming for a little while, you'll likely already know this one. But if you don't, here's a little optimization you can use in your own code.

Let's say you write a routine that does some things, including calling other subroutines. If the last thing your routine does before its RTS is to JSR to some other subroutine, like this:

JSR markdirty ;Mark a layer as dirty... or whatever.
RTS           ;Return to sender

Then you can switch the JSR to a JMP and drop the RTS. It saves a byte (the RTS doesn't have to be in memory), and it saves stack usage and execution time too. When you JSR, the CPU has to push the return address to the stack. This takes time and stack space. But, in this case, the return address just points at another RTS. That RTS then immediately pulls 2 more bytes from the stack and returns again. When you JMP instead, you don't push a return address to the stack. Therefore what is last on the stack is the return address that your own RTS would have used, had you left your code as JSR/RTS.

It saves one byte of program memory and it saves two bytes of stack memory (not much under normal circumstances, but if you're doing recursion it allows the recursion to go deeper). And it's faster. A JMP takes only 3 cycles, but a JSR takes 6 cycles, and then the RTS takes 6 more cycles! It's 4 times faster to use a JMP in place of a JSR/RTS pair. It's often worthwhile to see if you can rearrange your code a little bit so that calling a subroutine is the last step.

Once you know this little trick, it's easy to start thinking that any JSR immediately followed by an RTS is equivalent to a JMP. And, under normal circumstances it is. But not in the circumstance where the routine you've JSR'd to manipulates the stack. As mentioned above, if you JSR pushctx and then immediately RTS, the program will crash.

But here's what's twice as confusing. You actually can JMP pushctx… and it won't crash. Let's get this straight. Normally you optimize by converting a JSR immediately followed by an RTS to a JMP. But if you JSR pushctx and then RTS, you'll crash. So if we convert JSR pushctx RTS to JMP pushctx, it should crash too, right? Nope, that actually works. Why? Because when you JMP to pushctx, the last 2 bytes on the stack are the return address not of the routine you just JMP'd from, but of the routine that called it. And that's the address that gets pulled off and backed up. Then the context is pushed onto the stack, the return address is restored, and the RTS at the end of pushctx returns you successfully to the caller of the routine you just JMP'd from.

So, while a JSR immediately followed by an RTS is usually equivalent to a JMP, that isn't necessarily the case, because the subroutine could be manipulating the stack.

All that said, you could leave the current routine by JMP pushctx. However, the routine it returns to would have to know that that's the expected behavior. And before it does its own RTS it would have to be responsible for the pullctx. Honestly, I really don't recommend ever doing that. I recommend never using JMP pushctx.

How about JMP pullctx? This you can never do. And why is that? Because the expectation of pullctx is that, before you call it, the context is the last thing on the top of the stack. When you call JSR pullctx, it has to remove the 2-byte return address first, then pull the context. If you JMP pullctx, there is no return address on the top of the stack to pull. The first 2 bytes pulled will actually be part of the context. Then the rest of the context will be pulled incorrectly,² and finally it will restore an incorrect return address, and the RTS at the end of pullctx will jump off to no-man's-land, and the program will crash.

So while JMP pushctx is not recommended, JMP pullctx is not possible. Just the fact of this asymmetry is, in my opinion, a pretty good reason to avoid JMP pushctx.

Final Thought

Assembly is a wild and wonderful world. There are no safety checks though, and you have access to very low-level features of the CPU.

Just remember the Peter Parker principle,

With great power comes great responsibility.
  1. These properties were known in early versions of Mac OS X as springs and struts.
  2. It'll be misaligned by 2 bytes, all the way to the end, and then the actual return address will be pulled and copied into the last 2 bytes of the current context! Oops.