Playing Back PCM Audio
Terry Raymond is an old school North American, Expo-going Commodore 64 fan. Throughout the years he has asked me and others to give him helpful tips on how to program his C64. He has been interested in what it would take to make a Wave audio player, particularly for GEOS or Wheels. A valiant goal, in my personal opinion. And a fun project.
The problem is that, to really do justice to the questions, I feel like I'd have to write a rather long response. I wouldn't want to do that in an email, because the time spent would not benefit anyone other than the sole recipient of the email. Instead, I've decided that this is a great opportunity to turn the dialog of the email, with my long-form responses, into a blog post. I do love the topic of digital audio, after all. This seems as good a time as any to talk about it.
Hey back in 2008 I had wanted to code something in GEOS possibly to play Riff 8-bit Mono wave files, well later in 2009 I contacted Werner Weicht he said to do this would need Wheels since it can handle its memory better, since it requires some sort of memory expansion to boot.
Yes, I remember that you were interested in making a Riff/Wave player for GEOS or Wheels. Digital audio requires a lot of data for only a little playback, and an OS is going to use a bunch of main memory. So, I agree that you would be best off targeting a platform that either requires an REU or that provides some extra REU support.
Well Werner helped me enough to have a barely working app:
WavePlay 128 40/80 column support
The drawback to the app is that it needs some type of memory of course for the buffer, so we went with the App code space (memory for the App code) this can be expanded to 30K by tapping into the Background Screen Ram.
So WavePlay can only play approximately 1 second or so of a Wave file.
To get everyone on the same page, let's talk about how digital audio works and then about sound quality, and calculating memory requirements.
You start with an analog sound wave. You want to digitally encode that sound wave, so that it can be reproduced later by a computer. The sound wave has amplitude, the distance above and below the zero center line that the wave goes. There is a maximum range for this amplitude, even if your actual sampled sound wave doesn't utilize this full maximum.
When converting to digital, we need to decide what range of numbers is appropriate to be able to represent the wave's amplitude at any given moment. A 4-bit sample would mean you only get a range of 0 to 15, or, measured from the center line, 0 to 7 above and -1 to -8 below. This is a very low resolution and you'd get only very rough sounding reproduction from this. The sort of thing you hear in Space Taxi when the little man waves and calls out "Hey Taxi."
An 8-bit wave file doesn't just double the quality of 4-bit, each additional bit doubles the quality. So 5-bit is twice the resolution of 4-bit, 6-bit doubles the resolution of 5-bit, 7 doubles 6, and 8-bit doubles the resolution of 7-bit. So, 8-bit samples are a radical step up from 4-bit. I personally feel that 8-bit samples are sufficient for most things, sound effects and speech, such as for podcasts. Additionally, an 8-bit CPU, like the C64 has, is ideally suited to work with and process 8-bit samples. You had said you wanted to work with 8-bit samples, so that's a good choice.
But this is far below CD Quality. Compact discs use 16-bit samples. And remember, 9-bit is double the resolution of 8-bit. So 16-bit is 8 doublings more than 8-bit. 16-bit is ideal to reproduce high quality music. But 16-bit samples require twice the data storage (and, ergo, twice the loading time), twice the memory capacity, and more CPU time to playback than 8-bit samples.
When we talk about the bits per sample, what we're saying is this: At a given moment in time you sample the analog audio wave. And you use a single number (4, 8 or 16-bit number) to represent the position of the wave at that moment, vertically, from the bottom to the top. -8 to 7 for a 4-bit sample, -128 to 127 for an 8-bit sample, or -32768 to 32767 for a 16-bit sample.
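To make those ranges concrete, here is a small Python sketch. The function name `sample_range` is only illustrative; it simply shows how each extra bit doubles the number of levels and where the signed limits come from.

```python
def sample_range(bits):
    """Return (levels, lowest, highest) for a signed sample of the given bit depth."""
    levels = 2 ** bits            # each additional bit doubles the number of levels
    lowest = -(levels // 2)       # half of the levels sit below the center line
    highest = levels // 2 - 1     # the other half, including zero, sit at or above it
    return levels, lowest, highest


for bits in (4, 8, 16):
    print(bits, sample_range(bits))
```

Going from 8-bit to 16-bit multiplies the number of levels by 256, which is the "8 doublings" mentioned earlier.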
But to reconstitute the form of the wave over time, we need many many samples, each taken at a regular interval. CD Quality audio has a sample rate of 44.1kHz. A sample rate of 1 kilohertz means 1000 samples per second, so 44.1kHz is 44,100 samples every second. If the samples were each 8-bit, that's 1 byte times 44,100, that's 44,100 bytes or ~43 kilobytes for just 1 second of audio. If the samples are 16-bit, then it's ~86 kilobytes for just 1 second of audio. If you don't have some form of expanded ram, these samples are effectively impossible to work with. They're just too big.
I wouldn't go down to 4-bit samples. Because, the 8-bit CPU works on memory in 8-bit chunks. Even if you had 4-bit samples packed together on disk, you'd probably need to expand them in memory, such that the data sat in the low nybble and the upper nybble was just garbage. So you wouldn't actually save memory by using 4-bit samples, but you'd lose a ton of sound quality.
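As a rough illustration of that expansion step, here is a Python sketch that unpacks two 4-bit samples per byte into one sample per byte. The function name and the high-nybble-first ordering are assumptions for the example, not part of any particular file format.

```python
def unpack_nybbles(packed):
    """Expand bytes holding two 4-bit samples each into one sample per byte."""
    out = bytearray()
    for byte in packed:
        out.append(byte >> 4)       # high nybble first (an assumed ordering)
        out.append(byte & 0x0F)     # then the low nybble
    return bytes(out)


packed = bytes([0xA5, 0x3C])
print(len(packed), len(unpack_nybbles(packed)))
```

The unpacked data is twice as long as the packed data, so in memory it costs exactly what 8-bit samples would.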
You don't have to use the CD Quality sample rate of 44.1kHz though. 22.050kHz is, in my opinion, quite suitable for many purposes, even music. And that would give you 1 second of playback in just 22,050 bytes or ~21.5 kilobytes. You could also go down to 11.025kHz, and get 1 second of playback in just 11 kilobytes of memory. I personally think 11kHz is acceptable for many situations. You hear the scratchiness, but it's often quite tolerable considering that we're doing this on vintage computer hardware. I personally wouldn't recommend going below 11kHz.
Channels: Mono, Stereo
There is another metric of a wave file's size. It's the number of channels of audio. Everything I discussed above is for a single channel, aka mono audio. If you have stereo audio, you need all of that storage capacity per channel!
Thus, calculating the size of data per 1 second of playback is pretty easy. It's bits per sample times samples per second times number of channels.
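Here is that calculation as a minimal Python sketch. The function name is mine; the only addition to the formula is dividing by 8 to turn bits into bytes.

```python
def bytes_per_second(bits_per_sample, sample_rate_hz, channels):
    """Storage needed for one second of uncompressed PCM audio, in bytes."""
    return bits_per_sample * sample_rate_hz * channels // 8


for label, args in [
    ("8-bit mono 11.025kHz", (8, 11025, 1)),
    ("8-bit mono 22.05kHz", (8, 22050, 1)),
    ("8-bit mono 44.1kHz", (8, 44100, 1)),
    ("16-bit stereo 44.1kHz", (16, 44100, 2)),
]:
    size = bytes_per_second(*args)
    print(f"{label}: {size} bytes (~{size / 1024:.1f} KB) per second")
```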
Well after we got the Wave file to load the app would play the Riff Wave file too fast, sounded like the chipmunks. So Werner worked up a small routine as a delay using Decimal numbers from 1 to 255. I tinkered with this and adjusted it to where the Wave file sounds the best. Would this be about CD quality?
So you said you wanted to support 8-bit mono Riff's. "RIFF" is a special file format, Resource Interchange File Format. It's a way of storing data in a file in a standard way that can be interpreted by the player software. The Windows .WAV file format is a type of RIFF file. You can read more about RIFF here.
What we really need to care about, the meat and potatoes of playing (uncompressed) digital audio, is how the audio wave samples are stored in the file. The most common way to do this is PCM, Pulse-Code Modulation. You can read more about PCM here.
PCM is used by .WAV, .AIFF, .PCM and .AU files, as well as being the encoding format used on CDs. The specific file wrapper needs to be supported so that you know how far into the file the PCM data (the raw audio sample data) begins, and also the relevant details for this file: bit depth (bits per sample), sample rate, and number of channels. You could also allow the user to manually set the bit depth, the sample rate and the number of channels, and the offset into the file for the start of the data, and then not worry about interpreting the file's headers, at least as a first step, or an advanced option.
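To show what interpreting the headers involves, here is a hedged Python sketch that walks the chunks of a RIFF/WAVE file and pulls out those details plus the offset of the PCM data. It assumes the common layout of a `fmt ` chunk followed eventually by a `data` chunk; the function name and the dictionary keys are my own, and a player on the C64 would of course do the equivalent in 6502 code.

```python
import struct


def parse_wav_header(data):
    """Return format details and the PCM data offset from RIFF/WAVE bytes."""
    riff, _size, wave_id = struct.unpack_from("<4sI4s", data, 0)
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    info = {}
    pos = 12                                    # first chunk follows the 12-byte RIFF header
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        body = pos + 8
        if chunk_id == b"fmt ":
            fmt, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, body)
            info.update(format_tag=fmt, channels=channels,
                        sample_rate=rate, bits_per_sample=bits)
        elif chunk_id == b"data":
            info.update(data_offset=body, data_size=chunk_size)
            return info
        pos = body + chunk_size + (chunk_size & 1)  # chunks are padded to an even length
    raise ValueError("no data chunk found")
```

A format tag of 1 means uncompressed PCM, which is the only kind this post is concerned with. Everything from `data_offset` onward is the raw sample data.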
If the audio sounds slow and the pitch is too low, that's because you're playing back audio that was sampled at a higher frequency than you're playing it back at. It was sampled at 22kHz, but you're playing it back at 11kHz, for example. That would cause it to take twice as long to playback, and it would sound low and slow.
If you're hearing chipmunks, that's because you're playing it back at a frequency that is higher than it was originally sampled at. It was sampled at 22kHz, but you're playing it at 44kHz, say. Not necessarily doubled or halved, but just any rate which is not the original sampling frequency, will cause the sound to be distorted.
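The relationship is just a ratio, as this small Python sketch shows. The function name is illustrative; it returns the speed factor and how long a clip takes to play at the wrong rate.

```python
def playback_effect(recorded_hz, playback_hz, clip_seconds=1.0):
    """Return (speed_factor, playback_seconds) when a clip is played at the wrong rate."""
    speed = playback_hz / recorded_hz       # above 1.0 is chipmunks, below 1.0 is low and slow
    return speed, clip_seconds / speed


print(playback_effect(22050, 11025))   # half speed, takes twice as long
print(playback_effect(22050, 44100))   # double speed, takes half as long
```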
I tried playing the different frequencies of Wave files on a PC they sometimes sound funny anyway, so yeah I guess CD quality is the only frequency that would sound best I suppose, what do you think?
The only thing that should be described as CD Quality (no more and no less than CD Quality), is 2 channels (stereo), of 16-bit samples, sampled at 44.1kHz. And the only way to reproduce that quality is to have an output digital audio device (DAC, Digital-Analog Converter) that supports two channels and 16-bit resolution. A C64, straight out of the box, definitely doesn't include any hardware that can handle that.
A single SID chip, for the longest time at least, was thought to be able to output a single channel of 4-bit resolution samples. Newer techniques, that I've never looked into and don't yet understand, have been able to squeak 8-bit resolution mono samples from the SID. I suppose you could get 8-bit stereo (depending on how much CPU time is required) if you had two SIDs. Dual SIDs is a very common expansion for the C64. And some modern mainboards like the Ultimate64 and the C64 Reloaded have two SID sockets built-in.
Whenever I did anything with digital audio on the C64, I used an external digital audio expansion device. Namely, a DigiMAX, by Digital Audio Concepts. These are commercially available from Shareware Plus, and they're listed in the Commodore 8-Bit Buyer's Guide. Along with a bunch of options for Dual SID expansions if you'd rather go that route.
DigiMAX Stereo Digital Audio Expansion
The DigiMAX is quite simple. It plugs into the UserPort and more or less connects a quad 8-bit DAC directly to CIA 2's Port B. This board makes it very easy to playback stereo digital samples. It uses 10 data lines total on the UserPort, two bits allow you to specify which one of its four DACs you want to write to, and then 8 bits to write a byte to the selected DAC. Two DACs are hardwired to be mixed together on a single output channel, so its output is stereo. But even this board only supports 8-bit samples. You'd need 16-bit samples to attain CD Quality. We can come back to this later.
The first problem we ran up against is that GEOS/Wheels for years only recognized Geowrite or Geopaint files. But ole Maurice had trick code to help get around this very issue, so Werner adapted that trick code to work with WavePlay. I typed this routine in and assembled and Linked the code, tried it and it worked.
I'm not 100% sure what this is about. But my guess is that GeoWrite and GeoPaint documents, being GEOS native files, have special headers that identify what kind of file they are. When you double click a GeoPaint document it opens the GeoPaint application which then automatically opens that document, skipping the file picker dialog box.
I assume this new code is for being able to double click a non-GEOS file, perhaps just a file with a name extension like ".wav" and have it open an associated Wheels application. For me, this is getting a bit far afield from the mechanics of playing digital audio data. I don't know much about how GEOS/Wheels handles applications and their associated document files.
A few years ago I had wanted to update the buffer in WavePlay to use any larger REU's, but
then found what Bo [Zimmerman] got for Wheels notes, etc. that Maurice Randall kindly helped
with back then. Maurice created Wheels OS to recognize any type of REU's including SuperCPU['s
SuperRAM] and Ramlink DACC partitions. Maurice also donated example code on how to:
Allocate Ram Banks
My main problem is that I dont understand in Werners code for WavePlay that taps into the App codespace Ram. So I could take the App codespace code out and add in Maurices nifty Allocation and REU routine.
Types of Memory Expansion
I don't know enough about Wheels or Wheels programming to say anything useful about that. But I do understand the essentials of C64 memory, C64 programming, and digital audio.
Let's start with the basics. Unless the only thing you want to do is play a 2 second digital audio clip, you need more memory than an unexpanded C64 provides. In theory you could stream data from storage out to the digital audio playback device (SID, DigiMAX, etc.) But in practice most storage devices are too slow. The exceptions are, perhaps, an IDE64 or a RamLink, because they are plugged into the Expansion port, and an REU also connects via the expansion port, after all.
So you need to expand memory, which you use as a buffer that can be accessed faster than a standard storage device. The idea is to take whatever time is necessary to load the data from storage into the expanded memory. And when that's ready, you can play the clip back by swapping in chunks from the expanded memory very quickly. One problem you've already identified is that there are multiple types of RAM expansion.
The most common type is the Commodore 1750 REU, and all of its siblings and compatible clones. These all use the REC chip, or a modern reimplementation of the REC chip, so they are all programmed the same way. They just vary in capacity from 128KB to 16MB. The second most common type is the GEORAM. Both of these types of REU are still commercially available, which is great. That's a compelling reason to support either one of them, or both.
Much less common types would include the RamLink and its DACC (Direct Access) partitions. The RamLink is not commercially available anymore, and there aren't all that many floating around the world in the first place. I mean, you could support this, but I wouldn't support it as the first step. The other uncommon solution is the SuperCPU with a SuperRAM card. These are very rare, and very expensive. It would be great to support it, (I own two SuperCPUs, both with SuperRAM cards,) but I'd personally target the ram expansions that are still commercially available first. Even if only to give current users one more excuse to go out and buy an REU or GEORAM and support the existing hardware ecosystem.
Playback from Main Memory
To make things easy, let's start with playback from main memory only. Let's also suppose that we want to use the DigiMAX for our output device, to playback a simple 8-bit mono audio clip. We can set the data lines to select the DigiMAX's receiving DAC just once, right at the beginning. Now the only thing necessary to play the clip is to read one sample byte from memory, and write it to CIA 2's Port B. The I/O devices are found at the following addresses:
|VIC-II |$D000 - $D3FF
|SID |$D400 - $D7FF
|Color RAM |$D800 - $DBFF
|CIA 1 |$DC00 - $DCFF
|CIA 2 |$DD00 - $DDFF
|I/O 1 (External) |$DE00 - $DEFF
|I/O 2 (External) |$DF00 - $DFFF
So all of CIA 2's registers are found at $DDxx. Port B data register is register $01. Therefore we just need to read a byte from memory, and write to $DD01, something like this:
    LDA buffer,x
    STA $DD01
That outputs one sample. Next, we must delay for some precise fixed period of time, and then read the next sample byte from the buffer and write it out to $DD01. The DigiMAX's DAC handles everything else that there is to converting those repeated sample writes into something audible. How long to delay for, and how exactly to implement the delay is a great question. How long to delay depends on the sample rate. If the sample rate is 11025 Hz, then we must write another sample, to $DD01, 11025 times a second or the audio will sound distorted. Too fast? Chipmunks. Too slow? Molasses.
I can think of at least two straightforward ways to generate the timing. One way would be with a timing loop. Just killing time by dead looping the CPU between writes. But the more I think about this, the worse an idea it seems. You'd have to know how long your code will take to run, and take that into account when calculating the loop delay. And you might have to disable interrupts or even turn off the screen to have it work right. In short, this is probably not a very good idea.
Alternatively, each CIA chip also has two interval timers that can generate an interrupt. I'm pretty sure this is the reason the CIAs have interval timers, is to be able to do stuff like this. And the best part about using a timer that's coming from a source external to the CPU is that it counts very reliably, regardless of what the CPU is doing. So let's think about how this would work.
Only CIA 1's interrupt line is connected to the CPU's IRQ line though. CIA 2's interrupt line is connected to the CPU's NMI line, but this could be useful! According to the Commodore 64 Programmer's Reference Guide, each interval timer consists of a 16-bit read-only timer counter, and a 16-bit write-only timer latch. The timers can be configured in a number of ways, and can count down on each clock cycle. Therefore, if we know the frequency of the clock (and there are two clock rates, NTSC and PAL) then we can do a little napkin division.
    NTSC clock rate: 1022727 Hz
    Sample rate: 11025 Hz
    1022727 / 11025 = 92.76

    PAL clock rate: 985248 Hz
    Sample rate: 11025 Hz
    985248 / 11025 = 89.36
So, on an NTSC machine, for a sample rate of 11025 (aka 11kHz), we must write another sample byte every 93 cycles. Or if it were a 22kHz sample, we'd have to write another sample byte every 46 cycles. This is not very many cycles between sample writes. It should become clear that the higher the sample rate the less time there is for the CPU to do anything else except push data out the UserPort. For a 44.1kHz sample, there are only 23 cycles between each sample. Higher sample rates than 44.1kHz are probably physically impossible. There just isn't enough time to read that much data and write it to an output device with a clock speed of only 1MHz.
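The same napkin division for every combination, as a short Python sketch. The constant and function names are mine; the clock rates are the ones given above.

```python
NTSC_HZ = 1022727
PAL_HZ = 985248


def cycles_per_sample(clock_hz, sample_rate_hz):
    """CPU clock cycles available between two consecutive samples."""
    return clock_hz / sample_rate_hz


for rate in (11025, 22050, 44100):
    ntsc = cycles_per_sample(NTSC_HZ, rate)
    pal = cycles_per_sample(PAL_HZ, rate)
    print(f"{rate} Hz: NTSC {ntsc:.2f} cycles, PAL {pal:.2f} cycles")
```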
But let's stick with 11025Hz mono audio data. We need to delay 93 cycles before writing the next sample. We can write the number 93 into, say, Timer A of CIA 2. And then configure it for continuous mode. Without any other intervention from the CPU, the timer will tick down the counter on every clock cycle until it hits zero, at which point it will generate an interrupt, which will be an NMI on the CPU. The timer counter will then be automatically reset to 93 and continue counting down on each clock cycle even while the CPU is busy handling the NMI.
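Because the timer counts whole cycles, rounding 92.76 up to 93 means the clip plays very slightly slow. This Python sketch estimates by how much. It assumes, as the text does, that the period between interrupts is exactly the value written to the timer; the function names are illustrative.

```python
NTSC_HZ = 1022727


def effective_rate(clock_hz, period_cycles):
    """Actual playback rate in Hz when a sample is written every period_cycles cycles."""
    return clock_hz / period_cycles


def rate_error_percent(clock_hz, period_cycles, target_hz):
    """How far the actual rate is from the intended sample rate, in percent."""
    return (effective_rate(clock_hz, period_cycles) / target_hz - 1) * 100


print(round(effective_rate(NTSC_HZ, 93), 2))
print(round(rate_error_percent(NTSC_HZ, 93, 11025), 2))
```

An error of roughly a quarter of a percent is far too small to hear as chipmunks or molasses.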
We can then write a custom NMI handler, which we can point the CPU at when we begin the playback of the audio clip. Each time the NMI handler is called, we have to read a byte from memory, write it to the CIA, and increment an index into where we are in memory for the next read, and we have to do all of that in less than 93 cycles. This really should not be a problem. Let's just think about how many cycles are necessary to read in an index from an absolute memory address, then read in a sample byte from an absolute indexed memory address, then write a byte to an absolute address, and then increment the index at an absolute memory address: (If it becomes necessary we can move the index to a zero page address to save a bit of time.)
    LDX index      ; 4 cycles
    LDA buffer,X   ; 4 cycles (if it doesn't cross a page boundary)
    STA $DD01      ; 4 cycles
    INC index      ; 6 cycles
    -----------------------
                    18 cycles
We only need 18 cycles to move the sample data, and that leaves plenty of time to deal with some NMI preparatory and exit work, such as backing up and restoring the registers we'll use. That should also leave plenty of time for the CPU to continue responding to user input. So we should be able to move the mouse around and click on buttons while the sample is playing. This would be useful, such as to be able to pause the playback (or skip to another part of it if we got really fancy.)
REU Memory vs Main Memory
The problem, as we discussed earlier, is that you'll run out of memory really quickly. In fact, in the example above I'm just looping the X index register repeatedly over a single page of buffer memory. 256 bytes is enough for 256 / 11025 = 0.02 seconds of playback! We'd need a hundred pages of memory for 2 full seconds of playback. To play back for longer than 2 seconds, at some point we need to fetch more data from an REU.
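A quick Python sketch of that arithmetic, with an illustrative function name. It only covers the uncompressed PCM case discussed here.

```python
def playback_seconds(buffer_bytes, sample_rate_hz, bytes_per_sample=1, channels=1):
    """How long a buffer of PCM data lasts at the given sample rate."""
    return buffer_bytes / (sample_rate_hz * bytes_per_sample * channels)


print(round(playback_seconds(256, 11025), 4))         # one page
print(round(playback_seconds(100 * 256, 11025), 2))   # a hundred pages
```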
Let's focus just on the 1750 REU and its compatible relatives for the moment. The next thing to know is that you cannot address individual bytes in extended memory using simple CPU instructions like you can address the bytes of main memory. The CPU's addressing bus is 16-bit, and so $0000 to $FFFF are the only values that can be accepted by an instruction in absolute addressing mode. This full range will only ever reach data that is somewhere within the 64 kilobytes of main built-in memory. So how does this work? How do we access the memory in the REU?
When an REU is plugged in, it supplies a special external I/O chip that typically maps into the I/O 2 range, $DFxx. That I/O chip is the RAM Expansion Controller or REC chip. This chip can be programmed, by writing to its registers, and commanded to transfer data via DMA (direct memory access). The REC can be given an address range within the REU's memory banks, and an address within the C64's main addressing range, and then it can be instructed to do one of three types of transfer:
- Copy contents from the REU into the C64, overwriting whatever is currently there.
- Copy contents from the C64 into the REU's memory, overwriting whatever is currently in that memory range of the REU.
- Or, (at half the speed of a one-way transfer) it can be commanded to swap the two memory areas.
The maximum theoretical speed of a DMA transfer on an 8-bit bus running at 1MHz is very easy to compute. It's 8-bits per cycle, or 8 * 1022727 bits per second (on NTSC), or 1022727 bytes, or around 0.97 megabytes per second. In practice it is a bit slower than that, as we'll see, but even in theory it cannot possibly be any more than that. This is a very simple calculation, because the smallest unit of precision is one clock cycle, and the bus can only carry one byte at a time, for a theoretical maximum of one byte per clock cycle. If you have a million cycles in a second, that's a million bytes in a second.
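Here is the same best-case estimate in Python. It assumes exactly one byte per clock cycle with no pauses, so real transfers will be somewhat slower; the names are my own.

```python
NTSC_HZ = 1022727


def dma_megabytes_per_second(clock_hz=NTSC_HZ):
    """Theoretical ceiling: one byte moved on every clock cycle."""
    return clock_hz / 2 ** 20


def dma_seconds(num_bytes, clock_hz=NTSC_HZ):
    """Best-case time to move num_bytes by DMA."""
    return num_bytes / clock_hz


print(round(dma_megabytes_per_second(), 3))
print(round(dma_seconds(64 * 1024), 3))   # refilling all 64K of main memory
```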
Nearly a megabyte per second is really damn fast! But still, digital audio data is big too, and the length of time that passes between the processing of two samples is very short. So we still need to think about how the REU transfers will work to allow for continuous and uninterrupted playback. We definitely wouldn't want a skip in the audio every time we took a break to transfer in another chunk from the REU.
Interrupts and DMA Transfers
I've never actually programmed an REU-backed digital audio player for the Commodore 64 before. The sample playing programs I wrote and played around with in the early 2000s were for WiNGs. WiNGs requires a SuperCPU and SuperRAM and its APIs abstract all of this technical detail away. In many ways that makes writing the code much easier, but on the other hand, getting into the technical details is half the fun! So I'm still learning about this myself. But, I love sharing what I've learned, and I try to expand upon it to make it clearer and easier to understand for other people who may be interested but find it all kinda confusing.
When the CPU is executing your program code like normal, and I/O is patched in, you can write to the registers of the REC, and prepare it to transfer data from the REU into main memory. The moment you trigger the REC to begin the transfer, however, it gets to work and does so by initiating a DMA transfer. The way a DMA transfer works is by a device activating the /DMA line on the expansion port. The line is low active, so the device pulls the line low when it wants to perform DMA. Let's take a quick look at the schematics to see what this does:
Detail has been removed for clarity of focus.
The bar along the bottom is part of the C64's expansion port, showing part of the address bus, the complete data bus, then a line labeled /DMA, the BA line (which we'll return to in two seconds) and a few other lines that I've cropped away to the left of that. The bar over the label DMA indicates that it is low-active, activated by sinking voltage rather than sourcing voltage. There is a 3.3K pullup resistor connecting the line to +5V. That is there so that whenever nothing is plugged into the expansion port, or nothing is hooking up the DMA line explicitly, the line will be pulled up to +5V and be deactivated by default.
Amazingly, the /DMA line is connected to very few things inside the C64. Besides the pullup resistor, it has only two other connections and that's it. Both connections are to the inputs of AND gates (both AND gates are found on the same physical IC, by the way, U27. The 74LS08 provides 4 independent 2-input AND gates. The /DMA line is connected to two of those.) So what's going on here? Each AND gate has one other input as well, of course. AEC for one and BA for the other. AEC and BA both come from the VIC-II chip, and are used to allow the VIC-II to tell the CPU to disconnect itself from the address bus (AEC) and to pause execution altogether (BA). The VIC-II uses these two lines to allow itself to share accesses to memory with the CPU. On every phase 1 of the clock cycle the VIC-II asserts AEC so that it can set the address on the address bus and the CPU won't interfere. On every phase 2 of the clock cycle AEC is released and the CPU sets the address on the bus again, and this alternates back and forth constantly with every clock cycle.
Every 8th raster line of the bitmapped area of the screen, however, the VIC-II needs a bit of extra time to grab screencodes (and possibly sprite data, this part is still unclear to me), and so it asserts the BA line to fully pause the execution of the CPU for some number of cycles while it fetches the extra data. These are what are known as "bad lines". I recently wrote about my adventures with raster interrupts and bad lines in the post Raster Interrupts and Splitscreen.
So, how does this relate to DMA transfers? Okay, well, we can see in the schematic above that the VIC-II's AEC line and the /DMA line from the expansion port are both inputs to one AND gate, whose output goes to the CPU's AEC line. And we can also see that the VIC-II's BA line and the /DMA line are the inputs to another AND gate whose output goes to the CPU's RDY line. The RDY line is what pauses the CPU's execution. Ergo, when /DMA is activated (brought low) it is going to bring both AND gates low, which will take the CPU's addressing lines off the address bus and halt its execution. With the CPU halted and off the address bus, the REC chip on the expansion port is free to write data to the C64 via the address and data buses as though it were a special kind of CPU.
Note that the /DMA line does not connect directly to the VIC-II. That means that the VIC-II, and hence video output, is not disrupted by a DMA transfer. Therefore the VIC-II continues to make its memory accesses on phase 1 of each clock cycle, and the DMA transferring device accesses the bus on each cycle's phase 2, just like the CPU would. But then there is still the problem that the VIC-II occasionally needs to halt the CPU for its bad line accesses, during which it needs to access memory in both phases. To handle this, the VIC-II's BA line goes to the expansion port too, and (I looked this up in the REU schematics to make sure,) the REC chip also monitors this line. In other words, on every bad line, not only can the VIC-II pause the CPU but it can also pause a DMA transfer in progress. Very slick!
But now let's think about what this means from the perspective of your own program running on the CPU. The moment your code triggers an REU transfer, the REC chip initiates a DMA and the CPU is paused. For starters, that means that it is impossible to stop an REU transfer midway through, because the CPU isn't executing any code during the whole transfer. But for some (many, most?) cases this is actually really cool. From the CPU's perspective, every REU transfer is instantaneous. What do I mean by that? Imagine that instead of transferring digital audio sample data, you were transferring executable code from the REU. Your program contains a few instructions to setup and trigger the transfer, and yet no matter how much data you are moving, on the very next instruction of your program you are always safe to JMP directly into the code you just DMA'd into main memory. From your program's perspective, every transfer, of any size, is always completed before even the very next instruction gets executed. A DMA transfer is like putting the CPU into cryogenic stasis, and when it wakes up it hasn't experienced the passage of time, but its whole world has potentially changed around it. Very cool.
There is a downside though. There is more to the world outside the CPU than just the contents of memory. Recall that to get precise and reliable timing for the sample rate of our audio playback we configured one of the interval timers of CIA 2. While a DMA transfer is happening, the CPU may be paused, but the CIA continues to count down its counters on every clock cycle.
At 11kHz there are only 93 cycles between samples. Many of these cycles are taken up with processing one sample, and handling the NMI overhead. So, let's say we're left with 50 some-odd cycles before the next sample. Assuming we have one memory page full of samples, 256 bytes or 256 8-bit mono samples, then on most of the samples all 50 of the spare cycles can be used to continue to process your application code, move the mouse, process keyboard and click events, update the screen, etc. But when we reach the end of the page (which we could tell, for example, because the index will roll over to 0,) we need to transfer a new page of samples from the REU into main memory. The easiest thing would be to just overwrite this very same memory page. The index has already rolled over to zero and it's ready to keep on reading from the same memory page, if only the data were updated.
The problem is that despite the great speed of a DMA transfer, it will require an absolute minimum of 256 cycles to transfer a full 256-byte page of sample data. If we begin the transfer as part of the NMI handler, immediately after processing the final sample, we have only 50 cycles before the next NMI, but we need 256 cycles to DMA the full page. If we try to DMA the full page, the CPU will be paused at the time when the CIA generates the next interrupt. In fact, it'll still be paused by the time the CIA generates a second or maybe even a third interrupt. This means that at 11kHz it will get behind by 2 samples every 0.02 seconds of playback. You may also notice a distortion in the sound. And this problem will only get worse if you bump up to 22kHz.
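To see how many timer interrupts land inside such a transfer, here is a simplified Python model. It assumes the DMA starts with `spare_cycles` left before the next interrupt and moves one byte per cycle, and it ignores bad lines and setup time. The function name is illustrative.

```python
def underflows_during_dma(period, spare_cycles, transfer_bytes):
    """Count timer underflows that occur while a DMA transfer has the CPU paused."""
    start = period - spare_cycles      # cycle, within the timer period, when the DMA begins
    end = start + transfer_bytes       # best case: one byte transferred per cycle
    return end // period - start // period


print(underflows_during_dma(93, 50, 256))   # full page at 11kHz
print(underflows_during_dma(93, 50, 32))    # a 32-byte chunk fits in the gap
```

Under these assumptions the timer underflows three times while the CPU is frozen during a full-page transfer, whereas a transfer shorter than the spare cycles finishes before the next one.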
I had to ask about this on Twitter. Thanks to Jon Woods (@Kodiak64) for the link to a conversation about this on Lemon64.com. The answer came in this message from just last year, by Tobias, who I understand is the same person as Groepaz:
Interleaving DMA transfers
Somehow, we have to perform DMA transfers that last less than the number of spare cycles we have after processing an NMI for one sample but before the next NMI will fire. At 11kHz my guess is that that will be around 50 cycles. It may be a little more depending on how tight we can make the code, but we'll go with 50 as our working number because I'm reasonably sure we won't have fewer than 50.
This means we really can't transfer more than 50 bytes at a time in a single DMA transfer. If we start dividing the page up into evenly divisible chunks, cut it in half, that's 128 bytes. Still too much. Cut that in half, that's 64 bytes, still just a bit too much. Cut that in half again, and it's 32 bytes. There should be time for that. Although we must consider that there will be some additional overhead in the more complex programming of the REC chip.
How I'm thinking this would work is that 31 samples would be sent to the DigiMAX like normal, one each per NMI. But after the 32nd sample, you would spend the remaining spare ~50 cycles setting up an REU transfer and DMA transferring the next 32 bytes. But as I think about it, even that is insanely tight. 32 of those cycles would be taken up by the transfer itself, leaving only 18 cycles to setup the transfer. That's probably not enough time. So maybe 15 samples get sent normally, and then every 16th sample you prepare the REU and transfer the next 16 bytes. That's 50 spare cycles - 16 for the transfer, leaving you 34 cycles to setup the REC. I think this would work, but I'm clearly inexperienced in this area. It still feels really tight. Pulling off 22kHz, or 11kHz stereo, with this method is starting to feel like it might be impossible.
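The trade-off between chunk size and setup time is easy to tabulate. This Python sketch uses the working figure of 50 spare cycles from above and assumes one cycle per transferred byte; the function name is mine.

```python
def setup_cycles_left(spare_cycles, chunk_bytes):
    """Cycles remaining to program the REC if the transfer runs in the same gap."""
    return spare_cycles - chunk_bytes   # a negative result means the chunk does not fit


for chunk in (128, 64, 32, 16):
    print(chunk, setup_cycles_left(50, chunk))
```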
Actually, I have another idea. You could use the spare cycles from one of the other samples to do all the REC preparatory work. That would give you nearly the full 50 cycles just to prepare the REC, and on the next NMI a full 50 cycles just to perform the transfer. That makes much more sense.
So basically, the NMI fires and your playback handler routine runs. It grabs a byte from the buffer page at the current index and sends it to the DigiMAX for output. Then you increment the index. Next, take a copy of the index, clear its upper 3 bits (AND #%00011111) and check to see if the result is zero. If the result is zero, trigger the REU transfer without worrying about what or how much is being transferred where. It will transfer the next 32 bytes. And when that's done, return from the NMI.
The next time your handler processes the next NMI, it does much the same thing, it grabs a byte from the buffer at the current index, sends it to the DigiMAX for output. Increment the index, clear the upper 3 bits of the index, except this time the result is not zero so we won't be triggering an REU transfer. Instead compare it to 1. We find that it IS one, so we have a full 50 cycles to prepare the REU for the next transfer. I don't think that I have the time here to get into the details about how to program the REC chip. But, in short, you could read the index into the accumulator again, and this time shift it right 5 times, then use those shifted upper 3 bits as an index into a lookup table for where in the C64's main memory buffer page to transfer the memory to. The transfer source address inside the REU will always just be incremented by 32 bytes each time, and the transfer size is always fixed at 32 bytes. Set those values into the REC, but don't trigger the transfer yet, just return from the NMI. The transfer we just configured will be initiated the next time the index rolls back around to zero again, (or rather, to offset zero from the next 32-byte segment within the buffer page.)
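Here is a hedged Python model of that two-step scheme, just to check that the index logic hangs together. It is not C64 code, and it reflects my reading of the description: the transfer prepared at masked index 1 targets the segment that is currently playing, and it is triggered once that segment has been played out. The function name play_from_reu is hypothetical, and the first full page is assumed to be loaded into the buffer before playback starts.

```python
SEGMENT = 32   # bytes per DMA transfer
PAGE = 256     # size of the main-memory buffer page


def play_from_reu(reu_data, nmis):
    """Model the NMI handler; returns the bytes sent to the DAC, in order."""
    buffer = bytearray(reu_data[:PAGE])  # first page preloaded before playback
    reu_pos = PAGE                       # next REU offset to fetch from
    dest = 0                             # where the prepared transfer will land
    index = 0                            # 8-bit index into the buffer page
    out = []
    for _ in range(nmis):
        out.append(buffer[index])        # output one sample
        index = (index + 1) & 0xFF       # the index wraps at 256
        masked = index & 0b00011111      # position within the 32-byte segment
        if masked == 0:
            # Trigger the transfer prepared earlier. It refills the segment
            # that has just finished playing.
            chunk = reu_data[reu_pos:reu_pos + SEGMENT]
            buffer[dest:dest + len(chunk)] = chunk
            reu_pos += SEGMENT
        elif masked == 1:
            # Prepare the next transfer. The upper 3 bits of the index name
            # the segment now playing, which is the next one to be refilled.
            dest = (index >> 5) * SEGMENT
    return out
```

In the model the output comes out in exactly the order the data sits in the REU, which is the property the real handler needs.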
I think that would do it! After talking about this so much, I'm kind of excited by it. I should sit down some evening and play around with this stuff in a little C64 OS application and see if I could make it work.
Alternative RAM Expansion Option: GEORAM
All of the above was for the 17xx family of REUs and their compatible clones. The 17xx REUs use the REC chip and the DMA transfers, but the other types of RAM expansion mentioned earlier don't necessarily work this way. As we talked about at the beginning, there is the SuperCPU with SuperRAM, RamLink DACC partitions, and GEORAM. I don't know a lot about the former two, and unfortunately at this time and in this context I just don't have time to get into those. Additionally, they are quite rare and no longer commercially available.
GEORAM, on the other hand, I understand perfectly. And it is still commercially available from more than one source. It is listed in the Commodore 8-bit Buyer's Guide and even has a feature page. A 512KB version is available from Shareware Plus for just £39.95. And a 1MB and a 4MB version are available (have to check to make sure these guys are still in stock) from GGLabs. The 1541 Ultimate II+ and Ultimate 64 also have a GEORAM option built in, which can be turned on or off and configured in its menu system.
Compact 512K GEORAM from Shareware Plus.
The REC chip is technically more sophisticated than the solution employed by the GEORAM, and its DMA capabilities can be used for other things, such as moving data from one area of C64 memory to another much faster than the CPU can. The recent hit game, Sam's Journey, for example, requires an REU to run on NTSC machines. It doesn't really use the external RAM, but it uses the REC chip to DMA transfer some areas of C64 memory to other places in main memory faster than would be possible without the REC. Despite all this, the GEORAM is simpler, more straightforward, and in other ways and for some use-cases actually preferable to a 17xx REU.
After thinking about and discussing the timing considerations of the DMA transfers, I'm now realizing that a GEORAM would be much easier to work with for digital audio playback. Here's how GEORAM works:
It's a cartridge that plugs into the expansion port, as you'd expect. But it makes use of both I/O areas. It takes over all of I/O 1 (therefore GEORAM is not compatible with any other hardware expansion that requires I/O 1 in whole or in part.) And it uses the last two bytes of I/O 2 ($DFFE and $DFFF) for write-only configuration registers. You write into $DFFE and $DFFF to specify which page of the GEORAM's extended memory should be accessible (readable and writable) via I/O 1. And that's it.
The entire page of I/O 1 ($DE00 to $DEFF) becomes a sliding window that looks into a much larger expanse of extended memory inside the GEORAM. First you would have to load data into GEORAM. Set the visible page, and load one page of data as normal from disk into main memory. Except not just to anywhere in main memory, you load it into $DExx. The data is being written into the visible page of the GEORAM. Write a new page address to $DFFE/$DFFF and the GEORAM slides the window such that what you wrote to $DExx is no longer visible. But you can repeat and load another page of memory from disk into $DExx, which is being saved into a different page of GEORAM. Repeat until you've loaded all the data in.
When it comes time to play back, the process is mostly the same. You have to calculate the number of cycles to count down between samples based on the sample rate of the data. If it's 11kHz (11025Hz) then (on NTSC) the number to plug into the CIA's interval timer is 93. Then you create the NMI playback handler routine just as before. However, the buffer that we were using above had been any arbitrary single page of main memory. For GEORAM your buffer will have to be the I/O 1 page itself. ($DE00 to $DEFF). Read one byte from that page at the current index, write it to CIA 2 Port B to output to the DigiMAX and increment the index. So far everything is essentially the same.
But now, when the index rolls over to zero (for the whole page, that is, without masking for 32-byte segments), instead of initiating a DMA transfer that would take at least 256 cycles to update the whole page, you simply write two new values to $DFFE/$DFFF. This can be done in just a few cycles, and the GEORAM slides the window. No actual memory needs to be copied from one place to another, and so the contents of I/O 1 are updated immediately. And not just immediately from the perspective of a cryogenically sleeping CPU. The CPU never needs to be paused. If you read from $DE00 on the very next cycle, it will be routed to a different page inside the GEORAM and the result read back will be different, right away.
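The sliding window is easy to model. The Python sketch below is only an illustration, not C64 code. It assumes that $DFFE selects one of 64 pages inside a 16KB block and $DFFF selects the block, which is my understanding of the register split; the class and function names are invented for the example.

```python
class GeoRamModel:
    """Toy model of a GEORAM: a 256-byte window onto a larger memory."""
    PAGE = 256
    PAGES_PER_BLOCK = 64  # assumed: $DFFE picks a page within a 16KB block

    def __init__(self, size_bytes=512 * 1024):
        self.ram = bytearray(size_bytes)
        self.dffe = 0  # page within the block
        self.dfff = 0  # block number

    def select(self, dffe, dfff):
        """Model writing the two configuration registers."""
        self.dffe, self.dfff = dffe & 0x3F, dfff

    def _base(self):
        page = self.dfff * self.PAGES_PER_BLOCK + self.dffe
        return (page * self.PAGE) % len(self.ram)

    def read(self, offset):          # like LDA $DE00,X
        return self.ram[self._base() + (offset & 0xFF)]

    def write(self, offset, value):  # like STA $DE00,X
        self.ram[self._base() + (offset & 0xFF)] = value


def load_georam(geo, data):
    """Store data one page at a time through the window."""
    for start in range(0, len(data), 256):
        page = start // 256
        geo.select(page % 64, page // 64)
        for i, value in enumerate(data[start:start + 256]):
            geo.write(i, value)


def play_georam(geo, count):
    """Read samples through the window, sliding it when the index wraps."""
    out, page, index = [], 0, 0
    geo.select(0, 0)
    for _ in range(count):
        out.append(geo.read(index))
        index = (index + 1) & 0xFF
        if index == 0:               # page finished: just move the window
            page += 1
            geo.select(page % 64, page // 64)
    return out
```

Notice that play_georam never copies a page anywhere. Moving the window is two register writes, which is the whole appeal.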
Wow! The GEORAM actually makes digital audio playback much simpler! But it's funny, I know of many digital audio programs that can and do make use of a 17xx REU, but I can't recall off the top of my head any that required or could use a GEORAM. (I'd love to be informed of some if there are any! There's gotta be something out there.)
Moving beyond 8-bit Mono 11kHz
Other sample rates
At this point, it should be fairly clear what it means to move beyond 11kHz. There are other sample rates around and between 11 and 22kHz that are not unheard of. 8, 15 or 16kHz samples do exist. And you could always produce your own samples at those other rates, either by direct sampling from something like the 8-bit Stereo Sampler (also by Digital Audio Concepts, and distributed by Shareware Plus), or with some software on a PC (or even something you write on your Commodore 64) that could downsample from 44.1kHz to something like 16kHz, say.
Regardless. If you want to support a different sample rate, the two most important things to know are the precise clock rate of the computer (NTSC and PAL have different clock rates), and the exact number of samples per second of the audio data you want to play back. Divide and round to the nearest cycle, to know how many cycles there should be between samples. You don't have to do the division at runtime. You could opt to support a standard set of frequencies, say: 8kHz, 11kHz, 15kHz, 16kHz, 22kHz and 28kHz, on two different clock rates: NTSC and PAL, and just store the precomputed 12 possible values.
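As a sketch of that precomputed table, here is the division done in Python on a PC. The clock rates are the usual figures for the C64 (1,022,727 Hz NTSC and 985,248 Hz PAL). The exact sample rates behind the "15kHz" and "28kHz" labels are my assumption, as are the names used in the code.

```python
NTSC_HZ = 1_022_727  # C64 NTSC CPU clock
PAL_HZ = 985_248     # C64 PAL CPU clock

# Nominal label -> samples per second. 11kHz and 22kHz are the usual
# 11025 and 22050; the round figures for the others are assumptions.
SAMPLE_RATES = {
    "8kHz": 8000, "11kHz": 11025, "15kHz": 15000,
    "16kHz": 16000, "22kHz": 22050, "28kHz": 28000,
}


def cycles_between_samples(clock_hz, sample_hz):
    """Clock cycles per sample, rounded to the nearest whole cycle."""
    return (2 * clock_hz + sample_hz) // (2 * sample_hz)


TIMER_TABLE = {
    (system, label): cycles_between_samples(clock, rate)
    for system, clock in (("NTSC", NTSC_HZ), ("PAL", PAL_HZ))
    for label, rate in SAMPLE_RATES.items()
}

for (system, label), cycles in TIMER_TABLE.items():
    print(f"{system:4s} {label:>5s}: {cycles} cycles")
```

The 12 results could then be stored in the player as a small lookup table, so no division is needed at runtime.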
Use the correct value as the counter value in the CIA interval timer. If you're using a 17xx REU and the technique described earlier, the only thing to watch out for is that there be sufficient time to transfer the next 32 bytes of data by DMA, when playing at higher sample rates. It may be necessary to do more DMA transfers that are each smaller, say 16 bytes per transfer, when you get up to 22kHz and beyond.
Other bit rates
What if the samples are not 8-bit, but 16-bit? Well, this really depends on your output playback device. A DigiMAX only supports 8-bit playback. If your playback hardware doesn't support it, the extra data in a 16-bit sample is really just a waste of disk space, and a waste of load time and CPU processing time. So, unless you have a 16-bit playback hardware device, you're best off just converting things to 8-bit on a PC before ever bringing them to your C64.
But let's say you do have a 16-bit audio file on your C64. Maybe it's on an SD Card that someone hands you, and you don't want to mess around with it on a PC. Or you downloaded one straight from the net. Or, you just want to be able to handle them properly for completeness' sake. Nothing sucks more than having a sample, and having the C64 program either say: ERROR: I can't play this. Or maybe worse, have it play it but sound terribly broken. So, what should you do with 16-bit samples?
16-bit audio just means that each sample is composed of two adjacent bytes in the source data file. You have to know the endianness of the storage, that is, does the least significant byte come first or does it come second? Once you know this, the easiest thing to do is to read in two bytes, chuck the least significant byte, and save the most significant byte as though it were the only byte available. If you make this simple modification during load, it'll still take twice as long to load as an original 8-bit file, but you'll have effectively converted it to 8-bit on-the-fly. It will take up no extra space in memory, and you can play it back from memory just as though it was 8-bit data to begin with.
The samples are signed numbers. So in a 16-bit sample the most significant byte carries the sign. The most negative 16-bit signed number is -32,768, in binary: [1000 0000][0000 0000]. The brackets surround the most and least significant bytes. If you simply drop the low byte, and keep the high byte as a signed 8-bit number, it is still the most negative possible 8-bit number, -128, or 1000 0000 in binary. Conversely, the very largest positive 16-bit number is 32,767, in binary: [0111 1111][1111 1111]. Drop the lower byte and you are left with 0111 1111, which is the largest possible positive 8-bit signed number, 127. In other words, besides losing precision, you can convert a 16-bit number (signed or unsigned) to an 8-bit number of the same relative magnitude simply by discarding the low byte altogether.
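A minimal Python sketch of the drop-the-low-byte conversion follows (a PC-side model, not C64 code; the function names are invented). One thing worth flagging: in RIFF/WAVE files, 16-bit samples are stored signed and little-endian, while 8-bit samples are stored unsigned, so a player that expects unsigned bytes would also need to flip the top bit. The second function shows that step.

```python
def to_8bit(data, little_endian=True):
    """Keep only the most significant byte of each 16-bit sample."""
    usable = len(data) - (len(data) % 2)   # ignore a trailing odd byte
    start = 1 if little_endian else 0      # where the high byte sits
    return bytes(data[start:usable:2])


def signed_to_unsigned(samples):
    """Flip the sign bit: signed -128..127 becomes unsigned 0..255."""
    return bytes(value ^ 0x80 for value in samples)


# The two extreme values from the text: -32768 and 32767, little-endian.
sixteen_bit = b"\x00\x80\xff\x7f"
eight_bit = to_8bit(sixteen_bit)
print(eight_bit.hex())                      # high bytes only
print(signed_to_unsigned(eight_bit).hex())  # same samples, unsigned
```

On the C64 the same thing would happen during load, one pair of bytes at a time, so the converted data never takes more than one byte per sample in memory.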
I know of only one other trick that I learned a long time ago from Vanessa Dannenberg (creator of the DigiMAX and the 8BSS). The trick is a simple form of audio dithering to try to absorb some of the data from the least significant byte before you just toss it. Because, if you think about it for a second, if the low byte is really really close to rolling over, then the high byte is much closer to being +1 than to being +0 (which it would be if you just dropped the low byte.) As a first instinct you might think, okay, well, if the low byte is >127 we should add one to the high byte. Otherwise we'll leave the high byte unmodified. Unfortunately this does not lead to the best possible outcome. The cut off line is too crisp, it will result in an audible distortion that is difficult to describe in words. What you should do instead is a little more complicated, but it is still quite simple. Take the low byte, and add to it a random 8-bit number. If combined the two values overflow, then you should increment the high byte. If it does not overflow, you should leave the high byte unmodified.
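Here is the trick as a short Python sketch, so the probabilities can be checked on a PC. It is a model, not the original implementation. The guard that stops $7F from wrapping around to $80 is my own addition, on the assumption that the samples are signed, and the function name is made up.

```python
import random


def dither_sample(high, low, rng):
    """Return the 8-bit result for one 16-bit sample (both bytes 0..255)."""
    if low + rng.randrange(256) > 255:   # the 8-bit addition overflowed
        # Guard (my addition): the largest positive signed value stays put
        # instead of wrapping around to the most negative one.
        if high != 0x7F:
            high = (high + 1) & 0xFF
    return high


rng = random.Random(64)   # fixed seed so the run is repeatable
trials = 20000
bumped = sum(dither_sample(0x10, 200, rng) == 0x11 for _ in range(trials))
print(f"low byte 200 bumped the high byte {bumped / trials:.1%} of the time")
```

With a low byte of 200 the high byte should go up in roughly 200 out of every 256 cases, which is exactly the "statistically more likely" behaviour the trick relies on.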
When I first learned this trick, I thought it was very clever. I still feel like it's pretty clever. Even if the number in the low byte is quite large, like say, 200, there is still a chance that you'll pull a low random number, like 20, and combined 220 still doesn't overflow so you don't increment the high byte. Or it's also possible that a number that's quite small, like 30, will get randomly combined with a large number like 240, and overflow anyway. The randomness removes the sharp edge at exactly 127 to 128. However, it remains statistically the case that numbers above the halfway point are more likely to cause the high byte to go up by one. Taken on the whole, repeated over and over and over across thousands of samples, the effect on the audio is much smoother and sounds better. You can always test that out for yourself and take a listen.
Stereo instead of mono
In stereo audio data, the left and right channels are interleaved. You have to know which channel comes first (otherwise it'll sound okay, but your left and right sides will be swapped.) If the samples are 8-bit, then one byte is for the left channel and the next byte for the right channel, and this simply flips back and forth all the way to the end of the data. If the samples are 16-bit, two bytes are a 16-bit sample for the left channel and the next two bytes are a 16-bit sample for the right channel.
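To make the interleaving concrete, here is a small Python sketch that pulls the two channels apart. It assumes the left channel comes first, as it does in RIFF/WAVE files. The function name is invented, and this is a PC-side illustration rather than C64 code.

```python
def split_channels(data, bytes_per_sample=1):
    """Split interleaved stereo data into (left, right) byte strings."""
    frame = 2 * bytes_per_sample          # one left plus one right sample
    left, right = bytearray(), bytearray()
    for i in range(0, len(data) - frame + 1, frame):
        left += data[i:i + bytes_per_sample]
        right += data[i + bytes_per_sample:i + frame]
    return bytes(left), bytes(right)


# 8-bit stereo: L R L R ...
print(split_channels(b"LRLRLR"))
# 16-bit stereo: two bytes per channel in each frame.
print(split_channels(b"LlRrLlRr", bytes_per_sample=2))
```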
If you're converting from 16-bit to 8-bit as described above, you can just do that procedure during load without regard for which channel each sample will ultimately be played out of. If the source is 16-bit stereo, you'll be converting to 8-bit stereo. But if the source is 16-bit mono, then the very same loading procedure will automatically have you converting to 8-bit mono.
But you do have a few options here, and a lot depends on your playback hardware.
Let's say you have a DigiMAX, (or 2 SIDs, but I've been avoiding talking about them, because I don't know how to output 8-bit digital audio from a SID, that's a whole other topic that I know nothing about.) In any case, DigiMAX supports stereo output and is easy to program. You can load all the samples into memory, just as if they were mono samples. But in the playback routine you need to do some extra work. Write to the DigiMAX control lines to choose the left channel DAC, read a sample byte from the buffer at the index, write to CIA 2 Port B. Increment the index. But before returning from the NMI, write immediately to the DigiMAX control lines to choose the right channel DAC, read the next byte from the buffer and write it to CIA 2 Port B, which will output it to the other channel. Increment the index. We'll come back to show some source code for this in a moment.
You'll rip through the memory buffer twice as fast, though, because you're reading and outputting 2 bytes on each NMI. And, you're incrementing the index twice on each NMI. When the masked index rolls back around to 0, you will still trigger the REU's DMA transfer. But one thing to watch out for: after outputting two samples per NMI, the index will never end on 1, which is when we were doing the preparatory work for the next REU transfer. In a stereo playback routine, it'll end up on 2, 4, 6, etc. Given the tight time constraints, there is very little time for the playback routine to check if this is stereo data or mono data and to branch to handle the specifics of each. I recommend writing independent playback routines for each combination of channels/playback-hardware/REU-type. In other words, one routine for DigiMAX in stereo with a 17xx REU, one routine for DigiMAX in mono with a 17xx REU (you might consider outputting the same sample twice, once to each channel, in this routine), one routine for DigiMAX in stereo with a GEORAM, and one routine for DigiMAX in mono with a GEORAM. Additional routines would cover other hardware types, like the mono or stereo SIDs. Pick and assign the correct playback routine as the NMI handler right at the beginning when you load in the audio data and are ready to play it. If the number of playback routines proliferates out of control, there might be some sense in composing a playback routine. That is, having your code write code that combines the relevant elements into a custom streamlined playback routine.
Mixing stereo to mono
There are other things to think about when it comes to handling stereo data if you only have mono playback hardware.
You could choose to playback only one channel. You could do this at load time, as you load you could toss every second sample. The result is that you'd have in memory only mono data of only one channel. And then just play it back as you would any standard mono data. But then if you wanted to hear the other channel you'd have to load the whole thing from disk again, and choose the other channel. You might think, oh man, why would you EVER want only one channel? Easy. Sometimes the original source of the recording was only mono, but in PC-land where nobody cares about file size or load times, both mono channels got saved into a stereo file format. In other words, even though there are two channels on the disk, it is possible that they both contain the same audio information. If that happens to be the case, you could save memory and CPU time (though unfortunately not load time) by simply tossing every second sample during load.
More likely though both channels contain meaningful but different data. If you only have mono playback hardware, and you don't want to lose some of that meaning [3], then at some point you need to mix the two channels to mono. Mixing two channels of audio data is actually very straightforward, but it does involve some computation, and will take some cycles per sample, and so this needs to be taken into consideration when you think about the advantages and disadvantages of when to do the mixing.
Mixing two audio samples is very easy, that's the good news. It is simply the mathematical average of the two sample byte values. How do you get an average of two numbers? Add them together and divide by two. Now let's talk about some gotchas of doing this on a computer. Let's say the samples are 8-bit each. An 8-bit number has a maximum range of 0 to 255. Imagine you have two numbers that are both less than 128. Add them together, and they don't overflow 255. Dividing by two in binary is a simple right shift. Let's look at an example:
First in decimal:

      65 ;sample A
     +29 ;sample B
     ---
      94 ;sum of A + B
      /2 ;divide by 2
     ---
      47 ;mathematical average of A and B

Next in binary:

     0100 0001 ;sample A (65)
    +0001 1101 ;sample B (29)
    ----------
     0101 1110 ;sum of A + B (94)
    ----------
     0010 1111 ;Shifted Right to divide by 2 (47)
Easy peasy. But what happens if both values combined total to more than 255? Aka, what happens if the addition overflows? You might be tempted to say, I know what to do, I'll just divide each number by two first, and then add them together. That will make an overflow impossible. But let's see what happens in binary if we do that with the above example.
     0100 0001 ;sample A (65)
     0010 0000 ;Shifted Right (32)

     0001 1101 ;sample B (29)
     0000 1110 ;Shifted Right (14)

     0010 0000 ;sample A/2 (32)
    +0000 1110 ;sample B/2 (14)
    ----------
     0010 1110 ;sum of A/2 + B/2 (46)
In the first example the result was 47, but in the second example the result was 46. The point is, there was a loss of precision. But actually it's better than that. It turns out, you don't really need to worry about an overflow, because of the carry! The carry is an amazing little helper that allows calculations on a CPU of any bit size to do math on numbers of a greater bit size. The issue isn't limited to 8-bit CPUs, of course, even if you had a 16-bit CPU, a carry is what allows it to do 32-bit math. An 8-bit CPU can do math in 16-bit, 32-bit or more because of the carry. For this particular use-case though, we don't really need to do full 16-bit math. Using the carry (even if only in an ephemeral way) allows us to do 9-bit math. 2 ^ 9 = 512. Our numbers can range from 0 to 511. Let's see how this works by averaging two 8-bit numbers whose sum exceeds 255.
First in decimal:

     201 ;sample A
    +199 ;sample B
    ----
     400 ;sum of A + B
      /2 ;divide by 2
    ----
     200 ;mathematical average of A and B

Next in binary:

       1100 1001 ;sample A (201)
      +1100 0111 ;sample B (199)
    -----------
    C,7654 3210 ;Carry,Accumulator bits
    1 1001 0000 ;sum of A + B (400)
    -----------
    0 1100 1000 ;Rotate Right to divide by 2 (200)
The point is that when you add two numbers with ADC, it adds with carry. So you need to make sure the carry is clear before you add, but if the addition overflows the 8-bit range, the extra bit will go into the carry. When it comes time to divide by two, instead of doing an LSR (a logical shift right), which always brings a zero into bit 7, you use a single ROR (rotate right). This pulls the state of the overflow (from the carry) into bit 7 of the result. This is effectively 9-bit math. The result is 200, which is indeed the average of 201 and 199.
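The same carry trick can be modeled in a few lines of Python, which makes it easy to confirm the two worked examples and to compare against the halve-first approach. This treats the samples as unsigned values from 0 to 255, as the examples do; the function names are made up for the sketch.

```python
def mix(a, b):
    """Average two unsigned 8-bit samples the way CLC / ADC / ROR does."""
    total = a + b                       # CLC then ADC: a 9-bit result
    carry = total >> 8                  # the 9th bit lands in the carry
    acc = total & 0xFF                  # the low 8 bits stay in the accumulator
    return (carry << 7) | (acc >> 1)    # ROR: carry rotates into bit 7


def mix_halve_first(a, b):
    """The tempting alternative: shift each sample right, then add."""
    return (a >> 1) + (b >> 1)


print(mix(65, 29), mix_halve_first(65, 29))   # carry method vs halve-first
print(mix(201, 199))                          # a sum that exceeds 255
```

Whenever both samples are odd, the halve-first version comes out one lower than the true average, which is the loss of precision described above.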
The question now becomes, when should the mixing be done? Imagine that you have a DigiMAX, which means you have hardware stereo playback capabilities. To take advantage of the hardware, you would load samples from both channels into memory and use a stereo playback routine that outputs two samples on each NMI, alternately at two different DACs. Fine, just as we discussed earlier. But now imagine that without loading anything anew, without changing what you already have in memory, you'd love to just see how this all sounds when mixed to mono and played out a single SID. All you really have to do is select (or dynamically compose) a playback routine that includes the mixing in realtime as it handles each sample.
Let's take a super brief detour to talk about how data is written to the DigiMAX. Here's a picture of its PCB, and I've labeled the relevant edge connections. The black labels are bottom side, the white labels are topside. In each label the first listed is the CIA 2 pin, and the second listed is the pin on the TLC7226 (Quad 8-bit DAC chip.)
DigiMAX PCB labeled pinouts.
The CIA 2's Port B is wired directly to the DAC Chip's 8 data lines. So that part is very simple. The next thing to notice is that the DAC Chip's /WR line is connected to the CIA's PC line. I had to look this up to know how that PC line works. It's there for hardware handshaking, and it is pulled low, automatically, for one clock cycle when you write anything to CIA Port B. In other words, the DAC Chip doesn't just poll its own data lines. It knows that you have sent it data when you pull its "write" line low. But this happens automatically, simply by writing a byte of data to Port B. Additionally, it directs the byte that you've written to the DAC that is addressed by its two address lines, A0 and A1. These are connected to bits 2 and 3 of CIA 2's Port A.
CIA 2's Port A is used by other things too, so you have to be careful to change only bits 2 and 3. Bits 0 and 1 configure the VIC II's memory bank. And bits 3, 4, 5, 6 and 7 control the IEC serial bus, so bit 3 is overlapped with the serial bus's ATN line. So, we should read from Port A, AND to clear bits 2 and 3, OR in the address of the DAC we want, and write the value back to Port A. Then write your data byte to Port B, and repeat. If you know what the VIC bank should be, you could speed this up by just repeatedly setting Port A, with bits 4, 5, 6 and 7 (serial bus) all set to zero, bits 0 and 1 set to the VIC II bank you know it's already using, and then just worry about bits 2 and 3 for the DAC address. Let's go with that, and just assume the VIC II is looking at Bank 3 (bits 0 and 1 both set, which is the power-on default).
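The read, AND, OR sequence is just bit masking, and a short Python sketch makes the resulting values easy to check. This is a model for working out the constants on a PC, not C64 code, and the names DAC_MASK and select_dac are invented for it.

```python
DAC_MASK = 0b00001100   # bits 2 and 3 of CIA 2 Port A drive A0 and A1


def select_dac(port_a, dac):
    """Return the Port A value that addresses DAC 0..3, other bits unchanged."""
    if not 0 <= dac <= 3:
        raise ValueError("the quad DAC only has outputs 0 to 3")
    cleared = port_a & ~DAC_MASK & 0xFF   # AND to clear bits 2 and 3
    return cleared | (dac << 2)           # OR in the DAC address


# With serial bits zero and both VIC bank bits set:
for dac in range(4):
    print(f"DAC {dac}: %{select_dac(0b00000011, dac):08b}")
```

The shortcut described above amounts to precomputing these values once and storing them with a plain LDA/STA, rather than reading Port A back each time.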
Now let's compare NMI playback handling code between full stereo to a DigiMAX and stereo mixed to mono to a SID:
First full stereo to DigiMAX:

    LDA #%00000011 ; 2 cycles, DAC 0, VIC II Bank 3
    STA $DD00      ; 4 cycles, Port A
    LDX index      ; 4 cycles
    LDA buffer,X   ; 4 cycles
    STA $DD01      ; 4 cycles, Port B
    LDA #%00001011 ; 2 cycles, DAC 2, VIC II Bank 3
    STA $DD00      ; 4 cycles, Port A
    INX            ; 2 cycles
    LDA buffer,X   ; 4 cycles
    STA $DD01      ; 4 cycles, Port B
    INX            ; 2 cycles, step past the second sample
    STX index      ; 4 cycles
    -----------------------
                    40 cycles

Now stereo mixed to mono, to a SID:

    LDX index      ; 4 cycles
    LDA buffer,X   ; 4 cycles
    INX            ; 2 cycles
    CLC            ; 2 cycles
    ADC buffer,X   ; 4 cycles
    ROR A          ; 2 cycles
    STA sidaddr    ; 4 cycles (not sure how digis with SID works, sorry.)
    INX            ; 2 cycles, step past the second sample
    STX index      ; 4 cycles
    -----------------------
                    28 cycles
Wow! I was not expecting that. Mixing is surprisingly cheap. Assuming that this is all you have to do to write a digi to the SID, it actually takes fewer cycles to mix two channels into one, than it does to address an alternative DAC and write the second sample to its own output.
In any case, playing back stereo content either in stereo to a DigiMAX or mixing to mono in realtime both take more cycles than just reading and playing a mono sample. Depending on the frequency (higher frequencies will certainly decrease the amount of CPU time you have to play around with) and also depending on how much free CPU time you want to have to process user input and update the screen while the sample is playing back, you could alternatively do the mixing as part of the loading. The math for mixing is exactly the same, but when you're loading samples from the file, you could mix them and then only write the mixed byte to memory as a mono sample.
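As a sketch of the load-time option, here is the same averaging applied to a whole block of interleaved 8-bit stereo data in Python. It is a PC-side model with an invented function name, and it treats the samples as unsigned, like the worked examples earlier.

```python
def mix_to_mono(stereo):
    """Average each left/right pair of unsigned 8-bit samples into one byte."""
    usable = len(stereo) - (len(stereo) % 2)   # ignore a trailing odd byte
    return bytes((stereo[i] + stereo[i + 1]) >> 1
                 for i in range(0, usable, 2))


stereo = bytes([65, 29, 201, 199])   # two frames: (65, 29) and (201, 199)
print(list(mix_to_mono(stereo)))     # one mono sample per frame
```

The result takes half the memory of the stereo data, and the playback routine is back to handling one plain mono byte per NMI.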
If you have no stereo playback hardware, it might make more sense to mix to mono at load time, and just store the mono data in memory. But if you have stereo hardware and you may want to just test out what it sounds like in mono, you could preserve the stereo samples in memory and optionally mix to mono in realtime during playback.
Well take care Greg you have been so busy I have been trying to get ahold of ya for quite a few years, good to have you back in the Commodore scene.
Thanks for writing to me Terry. I did in fact take an 8 year hiatus from the scene. But I'm back now. And I'm having a great time with my C64.
I know the above is a lot of information to absorb. But hopefully other people will find it and be able to make some use of the information. It makes more sense for me to write up the details of this stuff and put it into a blog post for everyone to gain access to, than it does to write a series of short private emails.
Bye for now!
1. Anything on I/O 2 can be remapped with special port expanders to I/O 1 ($DExx), and vice versa, if the software is able to make use of it there.
2. DMA is the fastest possible way to move data into or out of the C64's main memory. And it can also be used to write to other C64 I/O devices, such as VIC, SID or CIA registers.
3. Think about a conversation with two people sitting at their own mics: person A speaks on the left channel only, person B speaks on the right channel only. You wouldn't want to end up with only half the conversation!
Greg Naçu — C64OS.com