Could 74xx logic run Linux?

Why not?

I’ve already built something that could be considered a computer. It can’t run native code beyond what is in its ROM, so it can’t run a conventional operating system, but its flaws seem like they could be easily rectified. Maybe a bigger 32-bit processor with more registers and a wider address bus could run an operating system. It could, kind of, but a real computer has more features that are necessary to run an operating system.


Privilege levels

First, the operating system doesn’t want to let random programs have free rein on the computer. A broken or malicious program could do bad things like hog all the available resources or crash the machine. So our processor should have a way to restrict what applications can do compared to the operating system. A user program has to ask the operating system for help if it needs to do anything dangerous, like accessing memory or I/O devices. This has an additional benefit of keeping hardware-specific code mostly in the operating system, so applications are more portable.

Memory Protection and Virtual Memory

A simple CPU, like my VMP, calculates memory addresses and sends them directly to the external memory bus. For a multi-tasking operating system, we don’t want processes to interfere with each other’s memory. It would also be nice to have applications load at the same address in memory every time they run. Virtual memory solves both of these problems.

On every memory access, the processor translates the virtual address (what the program sees) to a physical address (where the data is actually located in memory). It also checks that the access is allowed (for example, to prevent writes to shared libraries). The application then lands at a consistent location and can pretend that it runs all by itself. The physical layout of the application’s memory might not be contiguous, or it might not even actually exist, but the program won’t notice.

The mapping of virtual to physical memory is usually done in pages of around 4KiB (although different architectures choose different sizes), so the bottom n bits of the address are simply copied to the physical address, while the upper bits are translated. The processor first checks its Translation Lookaside Buffer (TLB) to see if it already knows how to map the page. If the page isn’t in the TLB, the processor has to find the mapping elsewhere. In a hardware approach, the processor will autonomously walk a structure in memory (the page table) to look for the appropriate mapping. In a software approach (or if the page table was unable to provide the necessary information), the processor will stop and let the operating system figure out how to map (or not map) the address, and then return to the application code and try again.
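A minimal sketch of that translation in Python, with plain dictionaries standing in for both the TLB and a single-level page table (real page tables are usually multi-level, and real TLBs have limited capacity; all names here are illustrative):

```python
PAGE_BITS = 12                # 4 KiB pages: the low 12 bits pass through
PAGE_SIZE = 1 << PAGE_BITS

class PageFault(Exception):
    """Raised when no mapping exists; the OS would handle this."""
    pass

def translate(vaddr, tlb, page_table):
    """Translate a virtual address to a physical one."""
    vpn = vaddr >> PAGE_BITS          # virtual page number (upper bits)
    offset = vaddr & (PAGE_SIZE - 1)  # page offset, copied unchanged
    if vpn in tlb:                    # fast path: TLB hit
        ppn = tlb[vpn]
    elif vpn in page_table:           # slow path: walk the page table
        ppn = page_table[vpn]
        tlb[vpn] = ppn                # cache the mapping for next time
    else:
        raise PageFault(hex(vaddr))   # let the operating system sort it out
    return (ppn << PAGE_BITS) | offset
```

Note that a successful page-table walk also fills the TLB, so the next access to the same page skips the walk entirely.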

Interrupts and exceptions

If the user code accesses a page that isn’t mapped or tries to do something that it isn’t allowed to, we need to raise an exception to tell the operating system so that it can try to sort out the problem. The processor needs to be able to interrupt what it is doing and start executing somewhere else in the operating system. The operating system will also need (at a minimum) a timer interrupt so that it has a chance to pause a thread that is taking too long and give other threads a chance to do some work.
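The dispatch itself can be modeled as a check at the top of the fetch-execute loop. This toy version loosely follows the MIPS-style "save the PC, jump to a fixed handler" scheme; the field names (pc, epc, vector, pending) are invented for illustration:

```python
def step(cpu, program):
    """One cycle of a toy fetch-execute loop with interrupt dispatch."""
    if cpu["pending"] and cpu["int_enable"]:
        cpu["epc"] = cpu["pc"]        # remember where to resume
        cpu["int_enable"] = False     # mask nested interrupts
        cpu["pending"] = False
        cpu["pc"] = cpu["vector"]     # jump into the OS handler
        return
    program[cpu["pc"]](cpu)           # otherwise, execute normally
    cpu["pc"] += 1
```

The handler would eventually re-enable interrupts and jump back through the saved epc, which is how the timer interrupt lets the OS reclaim the CPU from a runaway thread.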


Pipelining

Pipelining isn’t strictly necessary, but it would likely improve performance over a non-pipelined design. Plus it’s fun.

On the VMP, each instruction runs like this:

  1. Increment program counter
  2. Read the instruction from program memory
  3. Calculate result using registers/ALU/data memory
  4. Store result to a register or memory

Steps 1-3 really all happen at the same time. The processor just waits long enough for all the logic to settle before storing the result. Once the critical path is optimized as much as possible, there isn’t anything more that can be done to improve the processor’s speed without introducing some tricks. This is where pipelining comes in.

In the non-pipelined design, once the instruction has reached the ALU and the ALU has started its portion of the work, the parts that come before it (program memory and decoding logic) aren’t doing anything except holding the inputs to the ALU steady so that it calculates the correct result. What if the inputs to the ALU were held at the correct levels by latches? Then the portions before the ALU could move on to more useful work.

If we split the path from input to output into stages, we end up with a pipeline. Each stage works on a different instruction, then passes its result to the next stage on the next clock. Each instruction takes multiple cycles to get through the pipeline, but because the propagation delay of each stage is smaller, the clock can be faster to make up for some of the difference. The real benefit is that multiple instructions are in progress at the same time, so we still end up with (ideally) close to one instruction per cycle of the new, faster clock.

The ‘classic RISC’ pipeline has five stages:

  1. Instruction fetch: load the next instruction from memory
  2. Instruction decode: retrieve operands from registers; maybe check branch conditions or target addresses
  3. Execute: perform the ALU or other operation
  4. Memory access: read/write data to memory
  5. Writeback: write the result to a register

The first two stages could possibly be combined, depending on how much decoding needs to be done.
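In the ideal (hazard-free) case, the pipeline behaves like a shift register: every clock, each stage hands its instruction to the next one. A toy model of just that movement, ignoring hazards entirely:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def clock(pipeline, next_instr):
    """Advance every in-flight instruction one stage; return what retires."""
    retired = pipeline[-1]            # instruction leaving writeback
    pipeline[1:] = pipeline[:-1]      # everything shifts down one stage
    pipeline[0] = next_instr          # fetch the next instruction
    return retired

pipe = [None] * len(STAGES)
program = ["add", "sub", "lw", "nop", "nop", "nop"]
retired = [clock(pipe, instr) for instr in program]
```

The first instruction takes five clocks to retire (the pipeline's latency), but after that, one instruction finishes per clock (its throughput).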

Of course, there are lots of little problems that have to be worked out with a pipelined processor. What happens when the program takes a branch? What if the instruction needs to access memory? What if the next instruction depends on the result of the previous one?

Branch prediction

After a branch instruction enters the pipeline, it can take a couple of cycles before the decision is made, and during that time, the following instructions will have been loaded into the pipeline. If the branch ends up being taken, we have two choices: ‘flush the pipeline’ and make sure those instructions are not executed, or specify branch delay slots and force the programmer/compiler to deal with the extra instructions. To improve performance, we might add branch prediction. If the branch target is calculated before the next instruction needs to be loaded, we could assume that the branch will be taken, start fetching instructions from the target, and only flush the pipeline and reset the program counter if we end up not taking the branch. In a really fancy processor, the branch predictor keeps track of which branches are more likely to be taken, or even recognizes patterns in how often they are taken. A simpler method is to assume backward branches will be taken (like in a loop) and forward branches won’t be (like exiting the loop). Another way of shifting the burden to the programmer is to add separate instructions for likely/unlikely branches.
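The classic dynamic scheme is a 2-bit saturating counter per branch: the counter moves up on taken and down on not-taken, and values of 2 or above predict "taken." Two wrong outcomes in a row are needed to flip the prediction, so a loop branch that mispredicts once at loop exit isn't immediately forgotten. A sketch, indexed naively by branch address with no aliasing or table-size limit:

```python
from collections import defaultdict

class TwoBitPredictor:
    """2-bit saturating counter per branch: 0-1 predict not-taken, 2-3 taken."""

    def __init__(self):
        self.counters = defaultdict(lambda: 2)  # start weakly taken

    def predict(self, pc):
        return self.counters[pc] >= 2

    def update(self, pc, taken):
        c = self.counters[pc]
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)
```

A real table would be a small SRAM indexed by the low bits of the PC, which is why two different branches can alias to the same counter.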


Data hazards

If we have sequential instructions that, for example, load a register from memory and then do a calculation on it, we expect the calculation to use the value from the memory location we loaded, but with a pipeline, that might not happen. Since the next instruction starts executing before the previous one finishes, it may read the register before the previous instruction has written its result back. There are a few options:

  • Do nothing; make the programmer deal with it
  • Insert a bubble until the dependency is satisfied
  • Insert a bubble (maybe) and forward the result from the first instruction (before writeback) to the next instruction

Doing nothing isn’t a very good idea because lots of situations would require unavoidable extra NOPs. Waiting until the first instruction finishes is slightly better, but it still costs the same number of stalled cycles. Operand forwarding is the best of these choices: if the next instruction depends on a register-based ALU instruction, we can simply pass the result of the ALU operation to the input of the next instruction. Using the RISC pipeline from above, a memory load still needs a one-cycle bubble, since the loaded value arrives one stage later than ALU results.
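The forwarding decision boils down to comparing register numbers between pipeline stages. A sketch for the classic five-stage pipeline, with invented instruction fields (dest, src1, src2, is_load):

```python
def forward_source(src, in_mem, in_wb):
    """Pick where an EX-stage source operand should come from."""
    if in_mem and in_mem["dest"] == src and not in_mem["is_load"]:
        return "MEM"      # ALU result from the instruction one stage ahead
    if in_wb and in_wb["dest"] == src:
        return "WB"       # value currently being written back
    return "REG"          # no hazard: just read the register file

def needs_load_stall(in_id, in_ex):
    """A load followed immediately by a use of its result needs a bubble."""
    return bool(in_ex and in_ex["is_load"] and
                in_ex["dest"] in (in_id["src1"], in_id["src2"]))
```

The load case is excluded from MEM forwarding because the loaded data doesn't exist until the end of the memory stage; that is exactly the one-cycle bubble described above.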

The same problem could happen with memory writes and reads, but in a RISC load-store processor with lots of registers, it is unlikely the program would have to write to memory and then read back from the same location within a couple of instructions, so in this case, it might not be worth the hardware required to keep track of memory dependencies.

Memory accesses

On (almost) every cycle, the processor has to fetch a new instruction from memory. On some cycles, it may also have to read or write data to memory. The right way to solve this is to have separate instruction and data caches, but that gets complicated quickly. A simpler way is to split the clock so that the instruction fetch and data access happen on separate halves of the cycle. A modern processor runs way too fast to have memory accesses twice per cycle, but it is unlikely a discrete logic processor will reach the point where the memory access is the slowdown. The TLB lookup will take some time, but probably less time than the delay for a discrete 32-bit adder or shifter.


I think it is feasible to build a MIPS/RISC-V style processor capable of running Linux with a fairly reasonable size, speed, and cost. With fast logic families (like 74VHC), the clock speed could (optimistically) reach a few MHz, or at least a few hundred kHz. (The Pineapple One, a similar project implementing the RV32I base instruction set, is clocked at 500kHz.) The total cost could reach a few hundred dollars, depending on how big it gets.

We’ll see.