6502 Emulator

TL;DR: The how and why of the new 6502 emulator in the chips project.

I wrote a new version of my 6502/6510 emulator in the last weeks which canbe stepped forward in clock cycles instead of full instructions.

6502 on Minecraft. There's a 6502 emulator in Minecraft that runs Forth, which is further discussed here. 25c3: The Ultimate Commodore 64 Talk discusses the Commodore 64 in detail, and includes a discussion on the 6502 (really 651) machine code. Monster 6502 is rebuilding a 6502 but much larger, so you can see it working (amazing. SYM-1/69 Mod Board (Replaces 6502 with 6809) Photo Synertek KTM-2 Terminal. Schematic Saturn Software Limited. SK-FORTH79 User's Guide and Glossaries; SK-FORTH79 Source Listing; SYM-PASCAL User's Guide; X-RAY (Extended RAE) User's Guide and Listing SYMulator: The SYM-1 Emulator. SYMulator - is a working SYM-1 simulator.

The actual work didn’t take a couple of weeks, more like a few evenings anda weekend, because the new emulator is more or less just a different output of the code-generation python script which I keep mutating for each newemulator version.

But while working on the new emulator I got a bit side-tracked by anotherproject:

But this is (maybe) the topic of another blog post.

Before I’m getting to the cycle-stepped 6502 emulator, a little detourinto 8-bit CPUs and CPU emulators in general:

What CPUs actually do

The general job of a CPU (no matter if old or new) is quite simple: givensome memory filled with instructions and data, process instructions one afteranother to change some of the data in memory into some other data, and repeatthis process forever. Everything else in a typical computer just exists toturn some of the data in memory into pretty pixels and sounds, or sendthe data to another computer.

The processing of a single instruction can be broken down into several steps:

  • fetch the instruction opcode from memory (on 8-bit CPUs this is thefirst - and sometimes only - byte of an instruction)
  • ‘decode’ this opcode to decide what actions must be taken to‘execute’ the instruction, and then step through those actions, like:
  • load additional data from memory into the CPU
  • change the data loaded into the CPU in some way (arithmetics, bit twiddling, etc)
  • store data from the CPU back to memory
  • repeat those steps for the next instruction

Apart from simple data manipulation, this basic ‘core execution loop’is also used for more interesting control-flow actions:

  • branches: Jump to a different memory location and continue executinginstructions from there, this is like a goto in higher level languages
  • conditional branches: Based on the result of a previous computation, either jump toa different memory location, or continue with the instruction directly following thebranch instruction. Conditional branches are the assembly-level building blocks of higher-levellanguage constructs like if(), for() and while()
  • subroutine calls: Store the current execution location on the ‘stack’(a special area in memory set aside for storing temporary values), jump to adifferent memory location (the subroutine), and at the end of that subroutine(indicated by a special ‘return’ instruction), load the stored executionlocation from the stack and continue execution right after the originalsubroutine call. This is like a function call in a high level language.
  • interrupts: Interrupts are like subroutine calls except they areinitiated by an external hardware event (such as a hardware timer/counterreaching zero, a key being pressed, or the video signal reaching a specificraster line). The CPU stops whatever it is currently doing, jumps into aspecial ‘interrupt service routine’, and at the end of the service routine,continues with whatever it did before.

CPUs usually have a few special registers to deal with control flow:

  • The current execution location (where the next instruction is loaded from)is stored in the Program Counter (PC) register.
  • The current stack location is stored in the Stack Pointer (S or SP) register.
  • A Flags or Status Register (for some mysterious reason called ‘P’ on the 6502)which stores (mostly) information about the result of previous arithmeticoperations to be used for conditional branching. For instance: if the lastinstruction subtracted two numbers and the result is 0, the Flags Registerwill have the Zero Flag bit set. The conditional branchinstructions BEQ (branch when equal) and BNE (branch when not equal) look atthe current state of the Zero Flag to decide whether to branch or not (thereare other Flag bits and associated branch instructions too, but you get the idea).

It’s interesting to note that the branch instructions of a CPU are essentiallynormal load instructions into the PC register, since that’s what they do: loadinga new location value into the PC register so that the next instruction is fetchedfrom a different memory location.

The components of an 8-bit CPU

8-bit CPUs like the Z80 or 6502 are fairly simple from today’s pointof view: a few thousand transistors arranged in a few logical blocks:

A Register Bank with somewhere between a handful and a dozen 8- and 16-bitregisters (the 6502 has about 56 bits worth of ‘register storage’, while theZ80 has over 200 bits worth of registers)

An ALU (Arithmetic Logic Unit) which implements integer operations likeaddition, subtraction, comparison (usually done with a subtraction anddropping the result), bit shifting/rotation, and the bitwise logic operationsAND, OR and XOR.

The Instruction Decoder. This is where all the ‘magic’ happens. Theinstruction decoder takes an instruction opcode as input, and in thefollowing cycles runs through a little instruction-specific hardwired programwhere each program step describes how the other componentsof the CPU (mainly the register bank and ALU) need to be wired together andconfigured to execute the current instruction substep.

How exactly the instruction decoder unit is implemented differs quite a lotbetween CPUs, but IMHO the general mental model that each instruction is asmall hardwired ‘micro-program’ of substeps which reconfigures the data flowwithin the CPU and the current action of the ALU is (AFAIK) valid for allpopular 8-bit CPUs, no matter how the decoding process actually happens indetail on a specific CPU.

And finally there’s the ‘public API’ of the CPU, the input/output pinswhich connect the CPU to the outside world.

Most popular 8-bit CPUs look fairly similar from the outside:

  • 16 address bus output pins, for 64 KBytes of directly addressable memory
  • 8 data bus in/out pins, for reading or writing one data byte at a time
  • a handful of control- and status-pins, these differ betweenCPUs, but the most common and important pins are:
    • one or two RW (Read/Write) output pins to indicate to the outside world whether the CPU wants to read data from memory (with the data bus pins acting as inputs), or write data into memory (data pins as outputs) at the location indicated by the address bus pins
    • IRQ and NMI: These are input pins to request a “maskable” (IRQ) or non-maskable (NMI) interrupt.
    • RES: an input pin used to reset the CPU, usually together with the whole computer system, this puts the CPU back into a defined starting state

Of course the CPU is only one part of a computer system. At least some memoryis needed to store instructions and data. And to be any usefulthere must be a way to get data into and out of the computer (keyboard, joystick, display, loud speakers, and a tape- or floppy-drive). These devices are usually not controlled directly by the CPU, but by additionalhelper chips, for instance in the C64:

  • 2 ‘CIA’ chips to control the keyboard, joysticks, and tape drive
  • the ‘SID’ chip to generate audio
  • and the ‘VIC-II’ chip to generate the video output

The C64 is definitely on the extreme side for 8-bit computers when it comesto hardware complexity. Most other 8-bit home computers had a much simplerhardware configuration and were made of fewer and simpler chips.

All these chips and the CPU are connected to the same shared address- and data-bus,and some additional ‘address decoder logic’ (and clever engineeringin general) was needed so that all those chips only use theshared address and data bus when it’s their turn, but diving in theregoes a bit too far for this blog post :)

Back to emulators:

How CPU emulators work

While all CPU emulators must somehow implement the tasks of real CPUs outlinedabove, there are quite a few different approaches how they reach that goal.

The range basically goes from “fast but sloppy” to “accurate but slow”. Whereon this range a specific CPU emulator lives depends mainly on the computer systemthe emulator is used for, and the software that needs to run:

  • Complex general-purpose home computers like the C64 and Amstrad CPC with anactive and recent demo scene need a very high level of accuracy down to theclock cycle, because a lot of frighteningly smart demo-sceners are exploring (andexploiting) every little quirk of the hardware, far beyond what the originalhardware designers could have imagined or what’s written down in theoriginal system specifications and chip manuals.

  • The other extreme are purpose-built arcade machines which only need to runa single game. In this case, the emulation only needs to be ‘good enough’ torun one specific game on one specific hardware configuration, and a lot ofshortcuts can be taken as long as the game looks and feels correct.

Here’s the evolution from “fast but sloppy” to “slow but accurate” CPUemulators that I’ve gone through, I’m using the words “stepped” vs “ticked”in a very specific way:

  • “stepped” means how the CPU emulator can be “stepped forward” from the outsideby calling a function to bring the emulator from a “before state” into an “afterstate”
  • “ticked” means how the entire emulated system is brought from a “before state”into an “after state”

Instruction-Stepped and Instruction-Ticked

This was the first naive implementation when I started writing emulators, aZ80 emulator which could only be stepped forward and inspectedfor complete instructions.

The emulator had specialized callback functions for memory accesses, andthe Z80’s special IO operations, but everything else happening “inside”an instruction was completely opaque from the outside.

This means that an emulated computer system using such a CPU emulator could only“catch up” with the CPU after a full instruction was executed, which on theZ80 can take anywhere between 4 and 23 clock cycles (or even more when aninterrupt is handled).

This worked fine for simple computer systems which didn’t need cycle-accurateemulation, like the East German Z- and KC-computers. But 23..30 clock cyclesat 1 MHz is almost half of a video scanline, and once I moved to more complexsystems like the Amstrad CPC it became necessary to know when exactly withinan instruction a memory read or write access happens. The Amstrad CPC’svideo system can be reprogrammed at any time, for instance in the middle of ascanline, and when such a change happens at the wrong clock cycle within ascanline, things start to look wrong, from subtle errors like a few missingpixels or wrong colors here and there to a completely garbled image.

Instruction-Stepped and Cycle-Ticked

The next step was to replace the specialized IO and memory access callbackswith a single universal ‘tick callback’, and to call this from within theCPU emulation for each clock cycle of an emulated instruction. I moved to thisemulation model for two reasons: (a) because I had to improve the Amstrad CPCvideo emulation, and (b) because I started with a 6502 CPU emulator where a memoryaccess needs to happen in each clock cycle anyway.

From the outside, it’s still only possible to execute instructions as a whole.So calling the emulator’s ‘execute’ function once will run through the entirenext instruction, calling the tick callback several times in turn, once for eachclock cycle. It’s not possible to tell the CPU to only execute one clock cycle,or suspend the CPU in the middle of an instruction.

With this approach, the CPU takes a special, central place in an emulatedsystem. The CPU is essentially the controller which ticks the system forwardby calling the tick callback function, and the entire remaining system isemulated in this tick callback. This allows a perfectly cycle accurate systememulation, and as far as I’m aware, this approach is used in most ‘serious’emulators, and to be honest, the reasons to use an even finer-grainedapproach are a bit esoteric:

Cycle-Stepped and Cycle-Ticked

This is where I’m currently at with the 6502 emulation (but not yet with theZ80 emulation).

Instead of an “exec” function which executes an entire instruction, there’s nowa “tick” function which only executes one clock cycle of an instruction and returnsto the caller. With this approach a ‘tick callback’ is no longer needed, andthe CPU has lost it’s special ‘controller status’ in an emulated system.

Instead the CPU is now just one chip amongst many, ticked forward like all theother chips in the system. The ‘system tick’ function has inverted its role:

Instead of the system tick function being called from inside the CPUemulation, the CPU emulation is now called from inside the system tickfunction. This makes the entire emulator code more straightforward andflexible. It’s now ‘natural’ and trivial to tick the entire system forward insingle clock cycles (yes, the C64 emulation now has a c64_tick() function).

Being able to tick the entire system forward in clock cycles is very usefulfor debugging. So far, the step-debugger could only step one complete CPUinstruction at a time.

That’s good enough for debugging CPU code, but not so great for debugging theother chips in the system. Each debug step would skip over several clockcycles depending on what CPU instruction is currently executed. But now thatthe CPU can be cycle-stepped, implementing a proper ‘cycle-step-debugger’ isfairly trivial.

Another situation where a cycle-stepped CPU is useful is multi-processorsystems, which actually weren’t that unusual in the 80’s: the Commodore 128had both a 6502 and Z80 CPU (although they couldn’t run at the same time),some arcade machines had sound and gameplay logic running on different CPUs,and even attaching a floppy drive to a C64 turns this into a multiprocessorsystem, because the floppy drive was its own computer with its own 6502 CPU.

With a cycle-stepped CPU, it’s much more straightforward now to writeemulators for such multi-CPU systems. If required, completely unrelatedsystems can now run cycle-synchronized without weird hacks in the tickcallbacks or variable-length ‘time slices’.

The new 6502 emulator

The new emulator only has two ‘interesting’ functions for initializinga new 6502 instance and ticking it:

Like my older emulators, a 64-bit integer is used to communicate the pin statusin and out of the CPU. What’s new is that the m6502_init() function returnsan initial pin mask to ‘ignite’ the CPU startup process, which must be passedinto the first call to m6502_tick().

The first 7 calls to m6502_tick() will then run through the 7-cyclereset-sequence, only then will the CPU emulation start to run ‘regular’instructions.

The code example above still doesn’t do anything useful yet, because wehaven’t connected the CPU to memory, it’s a bit similar to connecting a real6502 to a clock signal, but leaving all the other pins floating in the air.

Let’s fix that:

…and that’s it basically, let’s try a real C program which loads a specificvalue into the A register:

Running this program should print a A: 33 to the terminal.

Creating a complete home computer emulator from this is now “just” a matterof making the part after m6502_tick() where the pin mask is inspectedand modified more interesting. Instead of just reading and writing memory,the other system chip emulations are ticked, and some sort of ‘address decoding’needs to take place so that memory accesses to specific memory regions arererouted to access custom chip registers instead of memory.

Implementation Details

The new emulator uses a “giant switch case”, just like the old emulator. Butinstead of one case-branch per instruction, the giant-switch-case has been“unfolded” to handle one clock cycle of a specific instruction per case branch.

What looked like this before to handle the complete LDA # instruction:

…looks like this now:

The LDA # instruction costs 2 clock cycles, so there’s two case-branches.

The first case:

…puts the PC registerinto the address bus pins (with the macro _SA(), ‘set address’),increments the PC, and returns to the caller.

The caller inspects the pins, sees that it needs to do a read access from theaddress at PC - which in case of the LDA # instruction would be theimmediate byte value to load into the A register - puts this value frommemory into the data bus pins, and calls m6502_tick() again with the new pin mask.

Inside m6502_tick(), the next case branch is executed:

…this takes the data bus value (the _GD() macro, or ‘get data’) and puts it into the A register (c->A). Next it checks whether the Z(ero) Flag needs to be set with the _NZ() macro, and finally ‘calls’ the _FETCH() macroto initiate the next instruction fetch.

How instruction fetch works

Note how we conveniently skipped the whole process of fetching the opcode bytefor the LDA # instruction above, this is because the instruction fetchingprocess is like the snake which bites its own tail. A new instruction cannotbe fetched without finishing a previous instruction:

The _FETCH macro at the end of each instruction decoder switch-caseblock doesn’t actually load the next opcode byte, it only manipulates the pinmask to let the ‘user code’ do this.

All that the _FETCH macro actually does it put the current PC into theaddress bus pins, and set the SYNC pin to active (and importantly: theRW pin is set to signal a memory read, this happens automatically foreach tick, unless it’s a special write cycle.

After the m6502_tick() function returns with the _FETCH pin mask active, thecaller’s memory access code ‘sees’ a normal READ access with the current PCas address, this loads the next instruction byte into the data bus pins.

The SYNC pin isn’t interesting for the caller, but it signals the start ofa new instruction to the next call of m6502_tick():

At the start of the m6502_tick() function, and before the giantinstruction-decoder switch-case statement, the incoming pin mask is checkedfor specific control pins, among others the SYNC pin. An active SYNC pinsmarks the start of a new instruction, with the instruction’s opcode alreadyloaded into the data bus pin.

When the SYNC pin is active, the current data bus value is loaded into aninternal IR (Instruction) register, much like in a real 6502, but with alittle twist:

The actual opcode value is shifted left by 3 bits to ‘make room’ for a 3 bit cycle counter (3 bits because a 6502 instruction can be at most 7cycles long). This merged ‘opcode plus cycle counter’ is all theinstruction decoder state needed to find the proper handler code ( case branch)for the current cycle of the current instruction.

The instruction fetch and decoding process looks like this inside them6502_tick() function:

It is interesting to note that this somewhat weird ‘overlapped’ instructionfetch process, where the last processing cycle of an instruction overlapswith fetching the next instruction opcode is exactly like it works on a real 6502 CPU. In the cycle-stepped emulator this overlapping happened‘naturally’, there’s no other way to make this work while at the sametime having correct instruction cycle counts :)

How interrupt handling works

…also much like in a real 6502: Somewhat simplified, at the start ofthe m6502_tick() function where the SYNC pin is tested, it is also checkedwhether an interrupt needs to be handled (interrupts are handled onlyat the end of an instruction, and before the next instruction starts).

If the ‘interrupt condition’ is true, the next regular opcode byte (which isalready loaded into the data bus pins at this point) is discarded, and insteadthe IR register is loaded with the BRK instruction opcode, together withan internal flag that this isn’t a regular BRK, but a special ‘interrupt BRK’.

From here on, the normal instruction decoding process continues. The giant-switch-caseis called and ends up in the handling code for the BRK instruction, and thishas some special-case handling for a ‘BRK running as interrupt handler’.

This is also pretty much identical to how it works in a real 6502, bothmaskable and non-maskable interrupts, and the reset sequence are actually runningthrough the BRK instruction decoding process, with some ‘special case tweaks’.

Of course it can’t ever be quite as simple as that, the 6502 has quite a few interrupt-related quirks. For instance, the interrupt condition is only detected up to two cycles before the end of an instruction, any laterand the interrupt is delayed until the end of the next instruction. But eventhis isn’t consistent, as there is a ‘branch quirk’ in conditional brancheswhere this ‘2 cycle delay rule’ isn’t true but reduced to a 1-cycle delay.

Another interesting ‘feature’ is interrupt hijacking. If a maskableinterrupt is detected, but in the few cycles between the detection and themiddle of the BRK instruction which handles this interrupt a higher-prioritynon-maskable interrupt is detected too, than the maskable interrupt is‘hijacked’ and finishes as a non-maskable interrupt.

The whole thing gets more complicated because similar exceptions alsoexist in the external chips which trigger interrupts (in a C64: the twoCIAs and the VIC-II), so getting the interrupt handling in a completeemulated system cycle-accurate under all conditions can be a bit challenging(and it’s also not completely accurate in my emulators yet, although I’msomewhat convinced that at least the CPU side is correct).

Conclusion: Links, Demos and Test Coverage

The new 6502 emulator can be seen in action in the C64 and Acorn Atomemulators here:

Currently there’s no way to actually do cycle-stepping in the debuggerthough, this will be added at a later time.

The 6502 source code is here:

…which is code-generated from these two files:

The C64 system emulator source code is here:

There’s also quite a few new C64 demo-scene demos on the Tiny Emulators mainpage. Those demos give a good impressionabout the overall emulation quality, since they often use the hardware ininteresting ways and have strict timing requirements.

The C64 emulation is now “pretty good but still far from perfect”, while theCPU and CIA accuracy are much better now, the VIC-II emulation leaves a lotto be desired (this will be my next focus for the C64 emulation, but I’llmost likely spend a bit of time with something else first).

Here are some C64 demos which still show various rendering artefacts:

Some other demos even get stuck, which is most likely related to the VIC-IIraster interrupt firing at the wrong time, but as I said, the VIC-IIemulation quality will be my next focus on the C64.

The CPU and CIA emulation are nearly (but not quite) perfect now, there’s oneproblem related to the CIA-2 and non-maskable interrupts which I wasn’t ableto track down yet.

Here’s an overview of the conformance tests I’m currently using, and theirresults:

All NEStest CPU tests are succeeding, these are fairly “forgiving” high leveltests which only test the correct behaviour and cycle durationof documented instructions without the BCD mode.

All Wolfgang Lorenz tests are succeeding, except:

  • the nmi test has one failed subtest
  • the four CIA realtime clock related tests are failing because my currentCIA emulation doesn’t implement the realtime clock

The Wolfgang Lorenz test suite covers things like:

  • all instructions doing the “right thing” and taking the right number of clock cycles, including all undocumented and unintended instructions, and the BCD arithmetic mode
  • various ‘branch quirks’ when branches go to the same or a different 256 byte page
  • correct timing and behaviour of interrupts, including various interrupt related quirks both in the CPU and CIAs (like NMI hijacking, both CIAs requesting interrupts at the same time, various delays when reading and writing CIA registers etc)
  • various behaviour tests of the 6510’s IO port at address 0 and 1
  • and some other even more obscure stuff

The big area that’s not covered by the Wolfgang Lorenz test suite is theVIC-II, for this I’ve started to use tests from the VICE-emulator testbench now. But thesetests are “red” all over the place at the moment, and it’s not worth yetwriting about them :)

A new automated testing method I’m using for the cycle-stepped 6502 emulationis perfect6502 this is a C portof the transistor-level simulation from visual6502.

Perfect6502 allows me to compare the state of my 6502 emulation against the‘real thing’ after each clock cycle, instead of only checking the result ofcomplete instructions. The current test isn’t complete, it only checks thedocumented instructions, and some specific interrupt request situations. Butthe point of this test is that it is trivial now to create automated testswhich allow checking of specific situations against the real transistor-levelsimulation.

6502 Emulator

Introduction

In this tiny ebook I’m going to show you how to get started writing 6502assembly language. The 6502 processor was massive in the seventies andeighties, powering famous computers like theBBC Micro,Atari 2600,Commodore 64,Apple II, and the Nintendo EntertainmentSystem. Bender inFuturama has a 6502 processor for abrain. Even theTerminator was programmed in6502.

So, why would you want to learn 6502? It’s a dead language isn’t it? Well,so’s Latin. And they still teach that.Q.E.D.

(Actually, I’ve been reliably informed that 6502 processors are still beingproduced by Western Design Centerand sold to hobbyists, so clearly 6502isn’t a dead language! Who knew?)

Seriously though, I think it’s valuable to have an understanding of assemblylanguage. Assembly language is the lowest level of abstraction in computers -the point at which the code is still readable. Assembly language translatesdirectly to the bytes that are executed by your computer’s processor.If you understand how it works, you’ve basically become a computermagician.

Then why 6502? Why not a useful assembly language, likex86? Well, I don’t think learning x86 isuseful. I don’t think you’ll ever have to write assembly language in your dayjob - this is purely an academic exercise, something to expand your mind andyour thinking. 6502 was originally written in a different age, a time when the majority ofdevelopers were writing assembly directly, rather than in these new-fangledhigh-level programming languages. So, it was designed to be written by humans.More modern assembly languages are meant to written by compilers, so let’sleave it to them. Plus, 6502 is fun. Nobody ever called x86 fun.

Our first program

So, let’s dive in! That thing below is a little JavaScript 6502 assembler andsimulator that I adapted for this book.Click Assemble then Run to assemble and run the snippet of assembly language.

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

Hopefully the black area on the right now has three coloured “pixels” at thetop left. (If this doesn’t work, you’ll probably need to upgrade your browser tosomething more modern, like Chrome or Firefox.)

So, what’s this program actually doing? Let’s step through it with thedebugger. Hit Reset, then check the Debugger checkbox to start thedebugger. Click Step once. If you were watching carefully, you’ll havenoticed that A= changed from $00 to $01, and PC= changed from $0600 to$0602.

Any numbers prefixed with $ in 6502 assembly language (and by extension, inthis book) are in hexadecimal (hex) format. If you’re not familiar with hexnumbers, I recommend you read the Wikipediaarticle. Anything prefixed with #is a literal number value. Any other number refers to a memory location.

Equipped with that knowledge, you should be able to see that the instructionLDA #$01 loads the hex value $01 into register A. I’ll go into moredetail on registers in the next section.

Press Step again to execute the second instruction. The top-left pixel ofthe simulator display should now be white. This simulator uses the memorylocations $0200 to $05ff to draw pixels on its display. The values $00 to$0f represent 16 different colours ($00 is black and $01 is white), sostoring the value $01 at memory location $0200 draws a white pixel at thetop left corner. This is simpler than how an actual computer would outputvideo, but it’ll do for now.

So, the instruction STA $0200 stores the value of the A register to memorylocation $0200. Click Step four more times to execute the rest of theinstructions, keeping an eye on the A register as it changes.

Exercises

  1. Try changing the colour of the three pixels.
  2. Change one of the pixels to draw at the bottom-right corner (memory location $05ff).
  3. Add more instructions to draw extra pixels.

Registers and flags

We’ve already had a little look at the processor status section (the bit withA, PC etc.), but what does it all mean?

The first line shows the A, X and Y registers (A is often called the“accumulator”). Each register holds a single byte. Most operations work on thecontents of these registers.

SP is the stack pointer. I won’t get into the stack yet, but basically thisregister is decremented every time a byte is pushed onto the stack, andincremented when a byte is popped off the stack.

PC is the program counter - it’s how the processor knows at what point in theprogram it currently is. It’s like the current line number of an executingscript. In the JavaScript simulator the code is assembled starting at memorylocation $0600, so PC always starts there.

The last section shows the processor flags. Each flag is one bit, so all sevenflags live in a single byte. The flags are set by the processor to giveinformation about the previous instruction. More on that later. Read moreabout the registers and flags here.

Instructions

Instructions in assembly language are like a small set of predefined functions.All instructions take zero or one arguments. Here’s some annotatedsource code to introduce a few different instructions:

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

Assemble the code, then turn on the debugger and step through the code, watchingthe A and X registers. Something slightly odd happens on the line ADC #$c4.You might expect that adding $c4 to $c0 would give $184, but thisprocessor gives the result as $84. What’s up with that?

The problem is, $184 is too big to fit in a single byte (the max is $FF),and the registers can only hold a single byte. It’s OK though; the processorisn’t actually dumb. If you were looking carefully enough, you’ll have noticedthat the carry flag was set to 1 after this operation. So that’s how youknow.

In the simulator below type (don’t paste) the following code:

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

An important thing to notice here is the distinction between ADC #$01 andADC $01. The first one adds the value $01 to the A register, but thesecond adds the value stored at memory location $01 to the A register.

Assemble, check the Monitor checkbox, then step through these threeinstructions. The monitor shows a section of memory, and can be helpful tovisualise the execution of programs. STA $01 stores the value of the Aregister at memory location $01, and ADC $01 adds the value stored at thememory location $01 to the A register. $80 + $80 should equal $100, butbecause this is bigger than a byte, the A register is set to $00 and thecarry flag is set. As well as this though, the zero flag is set. The zero flagis set by all instructions where the result is zero.

A full list of the 6502 instruction set is availablehere andhere (I usually refer toboth pages as they have their strengths and weaknesses). These pages detail thearguments to each instruction, which registers they use, and which flags theyset. They are your bible.

Exercises

6502
  1. You’ve seen TAX. You can probably guess what TAY, TXA and TYA do,but write some code to test your assumptions.
  2. Rewrite the first example in this section to use the Y register instead ofthe X register.
  3. The opposite of ADC is SBC (subtract with carry). Write a program thatuses this instruction.

Branching

So far we’re only able to write basic programs without any branching logic.Let’s change that.

6502 assembly language has a bunch of branching instructions, all of whichbranch based on whether certain flags are set or not. In this example we’ll belooking at BNE: “Branch on not equal”.

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

First we load the value $08 into the X register. The next line is a label.Labels just mark certain points in a program so we can return to them later.After the label we decrement X, store it to $0200 (the top-left pixel), andthen compare it to the value $03.CPX compares thevalue in the X register with another value. If the two values are equal, theZ flag is set to 1, otherwise it is set to 0.

The next line, BNE decrement, will shift execution to the decrement label ifthe Z flag is set to 0 (meaning that the two values in the CPX comparisonwere not equal), otherwise it does nothing and we store X to $0201, thenfinish the program.

In assembly language, you’ll usually use labels with branch instructions. Whenassembled though, this label is converted to a single-byte relative offset (anumber of bytes to go backwards or forwards from the next instruction) sobranch instructions can only go forward and back around 256 bytes. This meansthey can only be used to move around local code. For moving further you’ll needto use the jumping instructions.

Exercises

  1. The opposite of BNE is BEQ. Try writing a program that uses BEQ.
  2. BCC and BCS (“branch on carry clear” and “branch on carry set”) are usedto branch on the carry flag. Write a program that uses one of these two.

Addressing modes

The 6502 uses a 16-bit address bus, meaning that there are 65536 bytes ofmemory available to the processor. Remember that a byte is represented by twohex characters, so the memory locations are generally represented as $0000 -$ffff. There are various ways to refer to these memory locations, as detailed below.

With all these examples you might find it helpful to use the memory monitor towatch the memory change. The monitor takes a starting memory location and anumber of bytes to display from that location. Both of these are hex values.For example, to display 16 bytes of memory from $c000, enter c000 and 10into Start and Length, respectively.

Absolute: $c000

With absolute addressing, the full memory location is used as the argument to the instruction. For example:

Zero page: $c0

All instructions that support absolute addressing (with the exception of the jumpinstructions) also have the option to take a single-byte address. This type ofaddressing is called “zero page” - only the first page (the first 256 bytes) ofmemory is accessible. This is faster, as only one byte needs to be looked up,and takes up less space in the assembled code as well.

Zero page,X: $c0,X

This is where addressing gets interesting. In this mode, a zero page address is given, and then the value of the X register is added. Here is an example:

If the result of the addition is larger than a single byte, the address wraps around. For example:

Zero page,Y: $c0,Y

This is the equivalent of zero page,X, but can only be used with LDX and STX.

Absolute,X and absolute,Y: $c000,X and $c000,Y

These are the absolute addressing versions of zero page,X and zero page,Y. For example:

Immediate: #$c0

Immediate addressing doesn’t strictly deal with memory addresses - this is themode where actual values are used. For example, LDX #$01 loads the value$01 into the X register. This is very different to the zero pageinstruction LDX $01 which loads the value at memory location $01 into theX register.

Relative: $c0 (or label)

Relative addressing is used for branching instructions. These instructions takea single byte, which is used as an offset from the following instruction.

Assemble the following code, then click the Hexdump button to see the assembled code.

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

The hex should look something like this:

a9 and c9 are the processor opcodes for immediate-addressed LDA and CMPrespectively. 01 and 02 are the arguments to these instructions. d0 isthe opcode for BNE, and its argument is 02. This means “skip over the nexttwo bytes” (85 22, the assembled version of STA $22). Try editing the codeso STA takes a two-byte absolute address rather than a single-byte zero pageaddress (e.g. change STA $22 to STA $2222). Reassemble the code and look atthe hexdump again - the argument to BNE should now be 03, because theinstruction the processor is skipping past is now three bytes long.

Implicit

Some instructions don’t deal with memory locations (e.g. INX - increment theX register). These are said to have implicit addressing - the argument isimplied by the instruction.

Indirect: ($c000)

Indirect addressing uses an absolute address to look up another address. Thefirst address gives the least significant byte of the address, and thefollowing byte gives the most significant byte. That can be hard to wrap yourhead around, so here’s an example:

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

In this example, $f0 contains the value $01 and $f1 contains the value$cc. The instruction JMP ($f0) causes the processor to look up the twobytes at $f0 and $f1 ($01 and $cc) and put them together to form theaddress $cc01, which becomes the new program counter. Assemble and stepthrough the program above to see what happens. I’ll talk more about JMP inthe section on Jumping.

Indexed indirect: ($c0,X)

This one’s kinda weird. It’s like a cross between zero page,X and indirect.Basically, you take the zero page address, add the value of the X register toit, then use that to look up a two-byte address. For example:

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

Memory locations $01 and $02 contain the values $05 and $07respectively. Think of ($00,X) as ($00 + X). In this case X is $01, sothis simplifies to ($01). From here things proceed like standard indirectaddressing - the two bytes at $01 and $02 ($05 and $07) are looked upto form the address $0705. This is the address that the Y register wasstored into in the previous instruction, so the A register gets the samevalue as Y, albeit through a much more circuitous route. You won’t see thismuch.

Indirect indexed: ($c0),Y

Indirect indexed is like indexed indirect but less insane. Instead of addingthe X register to the address before dereferencing, the zero page addressis dereferenced, and the Y register is added to the resulting address.

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

In this case, ($01) looks up the two bytes at $01 and $02: $03 and$07. These form the address $0703. The value of the Y register is addedto this address to give the final address $0704.

Exercise

  1. Try to write code snippets that use each of the 6502 addressing modes.Remember, you can use the monitor to watch a section of memory.

The stack

The stack in a 6502 processor is just like any other stack - values are pushedonto it and popped (“pulled” in 6502 parlance) off it. The current depth of thestack is measured by the stack pointer, a special register. The stack lives inmemory between $0100 and $01ff. The stack pointer is initially $ff, whichpoints to memory location $01ff. When a byte is pushed onto the stack, thestack pointer becomes $fe, or memory location $01fe, and so on.

Two of the stack instructions are PHA and PLA, “push accumulator” and “pullaccumulator”. Below is an example of these two in action.

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

X holds the pixel colour, and Y holds the position of the current pixel.The first loop draws the current colour as a pixel (via the A register),pushes the colour to the stack, then increments the colour and position. Thesecond loop pops the stack, draws the popped colour as a pixel, then incrementsthe position. As should be expected, this creates a mirrored pattern.

Jumping

Jumping is like branching with two main differences. First, jumps are notconditionally executed, and second, they take a two-byte absolute address. Forsmall programs, this second detail isn’t very important, as you’ll mostly beusing labels, and the assembler works out the correct memory location from thelabel. For larger programs though, jumping is the only way to move from onesection of the code to another.

JMP

JMP is an unconditional jump. Here’s a really simple example to show it in action:

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

JSR/RTS

JSR and RTS (“jump to subroutine” and “return from subroutine”) are adynamic duo that you’ll usually see used together. JSR is used to jump fromthe current location to another part of the code. RTS returns to the previousposition. This is basically like calling a function and returning.

The processor knows where to return to because JSR pushes the address minusone of the next instruction onto the stack before jumping to the givenlocation. RTS pops this location, adds one to it, and jumps to that location.An example:

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

The first instruction causes execution to jump to the init label. This setsX, then returns to the next instruction, JSR loop. This jumps to the looplabel, which increments X until it is equal to $05. After that we return tothe next instruction, JSR end, which jumps to the end of the file. Thisillustrates how JSR and RTS can be used together to create modular code.

Creating a game

Now, let’s put all this knowledge to good use, and make a game! We’re going tobe making a really simple version of the classic game ‘Snake’.

Even though this will be a simple version, the code will be substantially largerthan all the previous examples. We will need to keep track of several memorylocations together for the various aspects of the game. We can still dothe necessary bookkeeping throughout the program ourselves, as before, buton a larger scale that quickly becomes tedious and can also lead to bugs thatare difficult to spot. Instead we’ll now let the assembler do some of themundane work for us.

In this assembler, we can define descriptive constants (or symbols) that representnumbers. The rest of the code can then simply use the constants instead of theliteral number, which immediately makes it obvious what we’re dealing with.You can use letters, digits and underscores in a name.

Here’s an example. Note that immediate operands are still prefixed with a #.

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

The simulator widget below contains the entire source code of the game. I’llexplain how it works in the following sections.

Willem van der Jagt made a fully annotated gistof this source code, so follow alongwith that for more details.

Notes:Memory location $fe contains a new random byte on every instruction.Memory location $ff contains the ascii code of the last key pressed.Memory locations $200 to $5ff map to the screen pixels. Different values willdraw different colour pixels. The colours are:$0: Black$1: White$2: Red$3: Cyan$4: Purple$5: Green$6: Blue$7: Yellow$8: Orange$9: Brown$a: Light red$b: Dark grey$c: Grey$d: Light green$e: Light blue$f: Light grey

Overall structure

After the initial block of comments (lines starting with semicolons), the firsttwo lines are:

init and loop are both subroutines. init initializes the game state, andloop is the main game loop.

The loop subroutine itself just calls a number of subroutines sequentially,before looping back on itself:

First, readkeys checks to see if one of the direction keys (W, A, S, D) waspressed, and if so, sets the direction of the snake accordingly. Then,checkCollision checks to see if the snake collided with itself or the apple.updateSnake updates the internal representation of the snake, based on itsdirection. Next, the apple and snake are drawn. Finally, spinWheels makes theprocessor do some busy work, to stop the game from running too quickly. Thinkof it like a sleep command. The game keeps running until the snake collideswith the wall or itself.

Zero page usage

The zero page of memory is used to store a number of game state variables, asnoted in the comment block at the top of the game. Everything in $00, $01and $10 upwards is a pair of bytes representing a two-byte memory locationthat will be looked up using indirect addressing. These memory locations willall be between $0200 and $05ff - the section of memory corresponding to thesimulator display. For example, if $00 and $01 contained the values $01and $02, they would be referring to the second pixel of the display ($0201 - remember, the least significant byte comes first in indirect addressing).

The first two bytes hold the location of the apple. This is updated every timethe snake eats the apple. Byte $02 contains the current direction. 1 meansup, 2 right, 4 down, and 8 left. The reasoning behind these numbers willbecome clear later.

Finally, byte $03 contains the current length of the snake, in terms of bytesin memory (so a length of 4 means 2 pixels).

Initialization

The init subroutine defers to two subroutines, initSnake andgenerateApplePosition. initSnake sets the snake direction, length, and thenloads the initial memory locations of the snake head and body. The byte pair at$10 contains the screen location of the head, the pair at $12 contains thelocation of the single body segment, and $14 contains the location of thetail (the tail is the last segment of the body and is drawn in black to keepthe snake moving). This happens in the following code:

This loads the value $11 into the memory location $10, the value $10 into$12, and $0f into $14. It then loads the value $04 into $11, $13and $15. This leads to memory like this:

which represents the indirectly-addressed memory locations $0411, $0410 and$040f (three pixels in the middle of the display). I’m labouring this point,but it’s important to fully grok how indirect addressing works.

The next subroutine, generateApplePosition, sets the apple location to arandom position on the display. First, it loads a random byte into theaccumulator ($fe is a random number generator in this simulator). This isstored into $00. Next, a different random byte is loaded into theaccumulator, which is then AND-ed with the value $03. This part requires abit of a detour.

The hex value $03 is represented in binary as 00000011. The AND opcodeperforms a bitwise AND of the argument with the accumulator. For example, ifthe accumulator contains the binary value 10101010, then the result of ANDwith 00000011 will be 00000010.

The effect of this is to mask out the least significant two bits of theaccumulator, setting the others to zero. This converts a number in the range of0–255 to a number in the range of 0–3.

After this, the value 2 is added to the accumulator, to create a final randomnumber in the range 2–5.

The result of this subroutine is to load a random byte into $00, and a randomnumber between 2 and 5 into $01. Because the least significant byte comesfirst with indirect addressing, this translates into a memory address between$0200 and $05ff: the exact range used to draw the display.

The game loop

Nearly all games have at their heart a game loop. All game loops have the samebasic form: accept user input, update the game state, and render the gamestate. This loop is no different.

Reading the input

The first subroutine, readKeys, takes the job of accepting user input. Thememory location $ff holds the ascii code of the most recent key press in thissimulator. The value is loaded into the accumulator, then compared to $77(the hex code for W), $64 (D), $73 (S) and $61 (A). If any of thesecomparisons are successful, the program branches to the appropriate section.Each section (upKey, rightKey, etc.) first checks to see if the currentdirection is the opposite of the new direction. This requires another little detour.

As stated before, the four directions are represented internally by the numbers1, 2, 4 and 8. Each of these numbers is a power of 2, thus they are representedby a binary number with a single 1:

The BIT opcode is similar to AND, but the calculation is only used to setthe zero flag - the actual result is discarded. The zero flag is set only if theresult of AND-ing the accumulator with argument is zero. When we’re looking atpowers of two, the zero flag will only be set if the two numbers are not thesame. For example, 0001 AND 0001 is not zero, but 0001 AND 0010 is zero.

So, looking at upKey, if the current direction is down (4), the bit test willbe zero. BNE means “branch if the zero flag is clear”, so in this case we’llbranch to illegalMove, which just returns from the subroutine. Otherwise, thenew direction (1 in this case) is stored in the appropriate memory location.

6502 Emulator Python

Updating the game state

The next subroutine, checkCollision, defers to checkAppleCollision andcheckSnakeCollision. checkAppleCollision just checks to see if the twobytes holding the location of the apple match the two bytes holding thelocation of the head. If they do, the length is increased and a new appleposition is generated.

checkSnakeCollision loops through the snake’s body segments, checking eachbyte pair against the head pair. If there is a match, then game over.

After collision detection, we update the snake’s location. This is done at ahigh level like so: First, move each byte pair of the body up one position inmemory. Second, update the head according to the current direction. Finally, ifthe head is out of bounds, handle it as a collision. I’ll illustrate this withsome ascii art. Each pair of brackets contains an x,y coordinate rather than apair of bytes for simplicity.

At a low level, this subroutine is slightly more complex. First, the length isloaded into the X register, which is then decremented. The snippet belowshows the starting memory for the snake.

The length is initialized to 4, so X starts off as 3. LDA $10,x loads thevalue of $13 into A, then STA $12,x stores this value into $15. X isdecremented, and we loop. Now X is 2, so we load $12 and store it into$14. This loops while X is positive (BPL means “branch if positive”).

Once the values have been shifted down the snake, we have to work out what todo with the head. The direction is first loaded into A. LSR means “logicalshift right”, or “shift all the bits one position to the right”. The leastsignificant bit is shifted into the carry flag, so if the accumulator is 1,after LSR it is 0, with the carry flag set.

To test whether the direction is 1, 2, 4 or 8, the code continuallyshifts right until the carry is set. One LSR means “up”, two means “right”,and so on.

The next bit updates the head of the snake depending on the direction. This isprobably the most complicated part of the code, and it’s all reliant on howmemory locations map to the screen, so let’s look at that in more detail.

You can think of the screen as four horizontal strips of 32 × 8 pixels.These strips map to $0200-$02ff, $0300-$03ff, $0400-$04ff and $0500-$05ff.The first rows of pixels are $0200-$021f, $0220-$023f, $0240-$025f, etc.

As long as you’re moving within one of these horizontal strips, things aresimple. For example, to move right, just increment the least significant byte(e.g. $0200 becomes $0201). To go down, add $20 (e.g. $0200 becomes$0220). Left and up are the reverse.

Going between sections is more complicated, as we have to take into account themost significant byte as well. For example, going down from $02e1 should leadto $0301. Luckily, this is fairly easy to accomplish. Adding $20 to $e1results in $01 and sets the carry bit. If the carry bit was set, we know wealso need to increment the most significant byte.

After a move in each direction, we also need to check to see if the headwould become out of bounds. This is handled differently for each direction. Forleft and right, we can check to see if the head has effectively “wrappedaround”. Going right from $021f by incrementing the least significant bytewould lead to $0220, but this is actually jumping from the last pixel of thefirst row to the first pixel of the second row. So, every time we move right,we need to check if the new least significant byte is a multiple of $20. Thisis done using a bit check against the mask $1f. Hopefully the illustrationbelow will show you how masking out the lowest 5 bits reveals whether a numberis a multiple of $20 or not.

I won’t explain in depth how each of the directions work, but the aboveexplanation should give you enough to work it out with a bit of study.

Rendering the game

Because the game state is stored in terms of pixel locations, rendering thegame is very straightforward. The first subroutine, drawApple, is extremelysimple. It sets Y to zero, loads a random colour into the accumulator, thenstores this value into ($00),y. $00 is where the location of the apple isstored, so ($00),y dereferences to this memory location. Read the “Indirectindexed” section in Addressing modes for more details.

Next comes drawSnake. This is pretty simple too - we first undraw the tailand then draw the head. X is set to the length of the snake, so we can indexto the right pixel, and we set A to zero then perform the write using theindexed indirect addressing mode. Then we reload X to index to the head, setA to one and store it at ($10,x). $10 stores the two-byte location ofthe head, so this draws a white pixel at the current head position. As onlythe head and the tail of the snake move, this is enough to keep the snakemoving.

6502 Emulator Online

The last subroutine, spinWheels, is just there because the game would run toofast otherwise. All spinWheels does is count X down from zero until it hitszero again. The first dex wraps, making X#$ff.