Programs are converted into machine code and loaded into RAM.
There’s a special CPU register - the instruction pointer (program counter) - that stores the address of the next instruction to be executed.
The kernel is the first program that gets loaded when the operating system starts. The kernel has full access to hardware, including the CPU.
Modern CPUs support rings:
- user - limited
- kernel - powerful
There are also other rings (on x86 there are 4).
Generally, the kernel and device drivers run in kernel mode, while normal apps run in user mode. The CPU starts in kernel mode. On x86, the current mode can be read from the CS register.
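For example, on x86 we can check the current privilege level from user space by reading the low two bits of CS. A minimal sketch using GCC-style inline assembly (x86 only):

```c
#include <stdio.h>

int main(void) {
    unsigned short cs;
    /* Read the code segment register. Its two lowest bits hold the
       Current Privilege Level: 0 means kernel mode, 3 means user mode. */
    __asm__("mov %%cs, %0" : "=r"(cs));
    printf("CPL = %d\n", cs & 3);  /* prints 3 for a normal userland app */
    return 0;
}
```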
Since user mode is very limited, apps need a way to ask for various resources. This is why we have System Calls.
The boundary between user space and kernel space (where OS code gets executed) can be crossed using system calls (aka syscalls). These are functions implemented by the kernel. They can be split into categories, e.g.:
- filesystem management
- process management
For example, to execute a binary file, the execve syscall should be used (more on that later).
We can see what system calls are invoked by any program by running strace: strace ls will show the system calls invoked by the ls command.
Syscalls use software interrupts. The OS, during boot, registers its interrupt vector table with the CPU, so that the CPU knows which code to invoke on various interrupts. On 32-bit x86, Linux uses interrupt 0x80 to handle system calls (on x86-64, the dedicated syscall instruction is used instead). The kernel will know which syscall to invoke, because before triggering the interrupt, some CPU registers (or the stack) will contain the appropriate data.
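As a sketch of what happens under the hood, here is a write syscall invoked directly via the syscall instruction on x86-64 Linux. The register assignments follow the x86-64 Linux syscall convention; this won't build on other architectures:

```c
/* Invoking a syscall without libc on x86-64 Linux.
   Syscall number goes in rax (1 = write), arguments in rdi, rsi, rdx. */
int main(void) {
    const char msg[] = "hello from a raw syscall\n";
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(1L),             /* syscall number: write */
                       "D"(1L),             /* arg 1: fd (stdout)    */
                       "S"(msg),            /* arg 2: buffer         */
                       "d"(sizeof msg - 1)  /* arg 3: length         */
                     : "rcx", "r11", "memory");
    return 0;
}
```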
User-space programs can invoke system calls via the abstraction provided by a standard C library (like glibc, musl, or others). Such a library covers essentially the whole spectrum of syscalls that the kernel supports.
We can see which libc functions are being used by a program by using ltrace.
For example, when we need to read a file:
- Our app calls a libc function, e.g., open().
- libc sets up the CPU registers, e.g., placing the syscall number and a pointer to the path of the file we want to read where the kernel expects them.
- An interrupt is triggered.
- The kernel executes the syscall and writes the result somewhere to memory.
- libc returns the data to our app.
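The same flow in code, using the libc wrappers (the file path here is just an example):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[128];

    /* Each libc wrapper sets up registers and triggers the syscall for us. */
    int fd = open("/etc/hostname", O_RDONLY);   /* open(2) syscall */
    if (fd < 0) return 1;

    ssize_t n = read(fd, buf, sizeof buf - 1);  /* read(2) syscall */
    if (n < 0) { close(fd); return 1; }
    buf[n] = '\0';

    printf("%s", buf);
    close(fd);                                  /* close(2) syscall */
    return 0;
}
```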
CISC CPUs (like x86) have a few instructions for entering the kernel (int, syscall, sysenter), while a RISC CPU (like ARM) would have just one (svc).
A CPU executes instructions one after another. So, how does the operating system allow us to run hundreds of programs at once? It has to constantly switch between the programs the CPU executes, even though a given program is not yet finished.
The OS does it with a timer. Timers are hardware chips that come as part of the CPU. When the timer counts down to 0, it fires a predefined interrupt routine (again, via the interrupt vector table supplied by the OS during boot).
So, before starting any program, the kernel sets up the timer and then starts a program. When the timer triggers an interrupt, the OS can switch the instruction pointer to another program (while saving the previous state to resume it later when the current program gets another slice of time in the future).
The described process of pausing the execution of one program to allow another one to run is called preemption. The time a process is given to run continuously is called a time slice.
Not only userland programs are preemptible; kernel code is too. Without kernel preemption, a long-running driver or syscall could freeze the system. Linux is a preemptive kernel.
There are various strategies for switching. The simplest is a fixed time slice per process. With lots of processes running, though, it might take a long time to get back to a given process once we switch away from it.
Another strategy is to decide how long (at most) it should take to get back to a process after we switch away from it. This is called target latency. Up to version 6.5, Linux used CFS - the Completely Fair Scheduler - with a 6 ms target latency. Since 6.6, Linux uses EEVDF.
A syscall often used to run another program is execve. The “v” stands for a vector of arguments, and the “e” stands for environment variables. So, to run a program, we supply these two pieces of data to it.
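A minimal sketch of calling execve directly; note that it only returns on failure:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* The argument vector ("v") and environment ("e") for the new program.
       Both are NULL-terminated; the environment here is just an example. */
    char *argv[] = { "ls", "-l", NULL };
    char *envp[] = { "HOME=/tmp", NULL };

    execve("/bin/ls", argv, envp);

    /* execve replaces the current program, so reaching this line
       means the call failed. */
    perror("execve");
    return 1;
}
```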
Execve is responsible for preparing the program to run and then running it. Part of its job is figuring out how to run the program by inspecting the first 256 bytes of its content. The kernel has various handlers registered, and it gives this 256-byte buffer to each of them until one returns a success code, which signifies that the handler knows how to execute the program. For example, it could be the ELF handler.
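We can mimic the idea of such a handler check in user space - the ELF handler recognizes files by the four magic bytes at the start. This is only an illustration; real binfmt handlers live inside the kernel:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of a check similar to what the ELF handler performs:
   ELF files start with the bytes 0x7f 'E' 'L' 'F'. */
static int looks_like_elf(const unsigned char *buf, size_t len) {
    return len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0;
}

int main(int argc, char **argv) {
    unsigned char buf[256] = {0};  /* same size the kernel inspects */
    FILE *f = fopen(argc > 1 ? argv[1] : "/bin/ls", "rb");
    if (!f) return 1;
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    printf("%s\n", looks_like_elf(buf, n) ? "ELF binary" : "not ELF");
    return 0;
}
```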
While the shebang is handled by the kernel, when a file is not understood by any handler, the shell will try to execute it as a shell script. This is why we can execute scripts without a shebang or any special extension like .sh (kernel handlers can look at file extensions as well). This behavior is specified in POSIX.
ELF programs are handled by the binfmt_elf handler.
ELF contains a header that links to various sections within that ELF file:
- .text - contains actual program code
- .data - contains data to be placed in memory
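For intuition, here is roughly where the pieces of a tiny C program end up with a typical toolchain (the .rodata section for read-only data is the usual convention, beyond the two sections listed above):

```c
int counter = 42;         /* initialized global variable: stored in .data */
const char msg[] = "hi";  /* read-only data: typically stored in .rodata  */

int main(void) {          /* machine code for main: stored in .text       */
    counter += 1;
    return counter;
}
```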
Programs may use other libraries via static or dynamic linking. Statically linked libraries end up as part of our program. Dynamically linked libraries may be loaded once into memory and reused by many programs. E.g., libc - it will most likely be used by every program.
Dynamic libs are in .so files. These are ELF files too. They contain a special .dynsym section that lists the exported functions that other programs may use. If a program uses dynamically linked libraries, its ELF file will name an interpreter - the dynamic linker (e.g., /lib64/ld-linux-x86-64.so.2) - which the kernel invokes to load the required libraries before the program runs.
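We can also load a dynamic library by hand and resolve one of the functions it exports via .dynsym, using dlopen/dlsym. A minimal sketch (link with -ldl on older glibc):

```c
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* Load the math library and resolve the `cos` symbol, which
       libm exports through its .dynsym section. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (!cosine) { fprintf(stderr, "%s\n", dlerror()); dlclose(handle); return 1; }

    printf("cos(0) = %f\n", cosine(0.0));
    dlclose(handle);
    return 0;
}
```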
ELF files specify where program data should get loaded. These addresses are not used directly, though. Otherwise, different programs would conflict with each other, because it’s impossible to assign unique physical addresses to specific programs during compilation.
When the CPU starts, it uses physical RAM addresses. When the OS boots, it sets up page tables for the MMU (Memory Management Unit - a piece of hardware that is part of the CPU itself) and tells the CPU to use the MMU to translate virtual addresses to physical ones.
When programs execute, they use a virtual memory space. Different programs may use the same (virtual) address to store some variable, yet that address points to different physical locations, because the page table is swapped before switching between programs. An example:
Program A runs. It writes data to address 0x04; the MMU translates it to a physical address - let’s say 0x57. Preemption occurs. Program B runs. It reads data from address 0x04, and the MMU translates it to 0x76. During the context switch, the page table used by the MMU was replaced, so the same virtual address - 0x04 - points to a different physical location!
This happens on every preemption, and this is why context switching is not cheap.
Besides isolation, the page table also adds security: a process cannot directly access another process’s memory.
The kernel also needs to store its data somewhere. In the virtual address space, the “higher half” is kernel memory. A userland program can’t access it due to another security measure - page flags. Pages belonging to the kernel are marked as kernel (supervisor) pages, and the CPU will not let user-mode code access them.
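A sketch of what happens if we try anyway - the exact address below is illustrative, but any kernel-half address behaves the same from user space:

```c
#include <stdio.h>

int main(void) {
    /* An address in the kernel half of the x86-64 address space
       (illustrative value). User-mode code may not touch it. */
    volatile long *kernel_addr = (volatile long *)0xffff800000000000UL;

    printf("about to read kernel memory...\n");
    long v = *kernel_addr;    /* the CPU faults here: the process gets SIGSEGV */
    printf("read %ld\n", v);  /* never reached */
    return 0;
}
```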
So, during preemption, the CPU switches to the kernel ring, OS code gets executed, the page table gets modified to point to the RAM addresses of the next process, and the userland ring is entered again for that new process.
If a program wants to start another program and keep running in parallel with it, it cannot just use “execve”. Execve replaces the current program with the new one, effectively killing the original.
Instead, the program should call “fork”. It clones a process so that it runs in two instances. Now, one of the instances should just continue to run the parent code, while the other one should call “execve” (or a similar syscall).
The process explained above is called “fork-exec” and the entire process tree is created this way.
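A minimal sketch of the fork-exec pattern:

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();  /* clone the current process */

    if (pid == 0) {
        /* Child: replace ourselves with another program. */
        char *argv[] = { "ls", NULL };
        char *envp[] = { NULL };
        execve("/bin/ls", argv, envp);
        perror("execve");  /* only reached if execve failed */
        return 1;
    }

    /* Parent: keep running in parallel, then wait for the child. */
    printf("parent: started child %d\n", pid);
    waitpid(pid, NULL, 0);
    printf("parent: child finished\n");
    return 0;
}
```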
Forking creates a separate process with the same memory as the parent had. To optimize this, the kernel doesn’t copy the whole physical memory upfront. Instead, it initially points both processes at the same physical pages (but via separate virtual address spaces!). When either process tries to write to a page, the kernel copies that page, so the writer gets its own private version. Processes stay isolated. This is called COW (copy-on-write) pages.
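We can observe COW indirectly: after fork, parent and child see the same virtual address for a variable, yet a write in the child doesn’t affect the parent. A minimal sketch:

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int value = 1;

int main(void) {
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: the write triggers copy-on-write - the kernel gives the
           child its own physical copy of the page holding `value`. */
        value = 99;
        printf("child:  &value=%p value=%d\n", (void *)&value, value);
        return 0;
    }

    waitpid(pid, NULL, 0);
    /* Parent: same virtual address as the child prints, but the child's
       write did not touch the parent's physical page. */
    printf("parent: &value=%p value=%d\n", (void *)&value, value);
    return 0;
}
```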