From BIOS to init().
marco corvi
2003-10-28

  1. Creating a bootable linux floppy.
  2. Booting from the floppy: the BIOS.
  3. The bootsector: bootsect
  4. From 16 to 32: setup
  5. From 32 to protected: head
  6. start_32
  7. init

The Linux code is very well commented. You are urged to read this
document while browsing the source code at the same time.
The linux sources can be browsed at http://lxr.linux.no/
Line numbers reported on the left refer to version 2.4.19

------------------------------------------------------------------
References

  Intel 80386 Programming reference manual at
    http://www.itu.dk/courses/OMP/notes/386reference/toc.htm
  Randy Dunlap "Linux 2.4.x Initialization for IA-32 HOWTO" v1.0 2001-05-17
  Bri and Mark Feldman "Programming the Microsoft Mouse"
    http://www.geocities.com/SiliconValley/2151/mouse.html

------------------------------------------------------------------
[1] Creating a bootable linux floppy.

Refs.
  Phrack 0x0b3c08

To make a bootable floppy with linux one has to do

  make zdisk

(or make bzdisk if the kernel is too big).
This make command is contained in arch/i386/boot/Makefile and makes a copy
of the bootimage (for example BOOTIMAGE=zImage) on the floppy:

  dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0

The bootimage (zImage in our example) is obtained with the rule

  zImage: $(CONFIGURE) bootsect setup compressed/vmlinux tools/build
          $(OBJCOPY) compressed/vmlinux compressed/vmlinux.out
          tools/build bootsect setup compressed/vmlinux.out $(ROOT_DEV) > zImage

OBJCOPY is "objcopy -O binary -R .note -R .comment -S".
The first option specifies a raw binary output format.
The option '-S' means not to copy relocation and symbol information.
Finally the sections ".note" and ".comment" are removed (option '-R').
Essentially objcopy produces a memory dump of the compressed kernel
vmlinux in the temporary file vmlinux.out.
See the compressed directory for the construction of a compressed kernel.

"tools/build" creates a disk-image from three files:
  - bootsect: 512 bytes of 8086 machine code
  - setup:    8086 code that sets up the system and parameters
  - system:   80386 code of the system.
It is also possible to specify the root device: if the CURRENT device
is used it stat's "/" and uses it as root device.

Assembly sources "bootsect" and "setup" are compiled with $(AS) and
linked with

  $(LD) -Ttext 0x0 -s --oformat binary

This means that address 0x0 is used as the starting address (for the
TEXT segment). The option '-s' says to omit all symbol information.
The output format is specified as "binary".
"setup" is linked with the option "-e begtext" that specifies the entry
point for beginning execution, ie, the label "begtext" in the code.

- - - - - - - - - - - - - - - - - - - -
vmlinux is made by linking together head.o misc.o and piggy.o

  $(LD) $(ZLINKFLAGS) -o vmlinux $(OBJECTS) piggy.o

piggy.o is the data file containing the compressed kernel image;
making it requires a few steps:

  tmppiggy=_tmp_$$$$piggy;
  rm -f $$tmppiggy $$tmppiggy.gz $$tmppiggy.lnk;
  $(OBJCOPY) $(SYSTEM) $$tmppiggy;
  gzip -f -9 < $$tmppiggy > $$tmppiggy.gz;
  echo "SECTIONS { .data : { input_len = .; \
        LONG(input_data_end - input_data) input_data = .; \
        *(.data) input_data_end = .; }}" > $$tmppiggy.lnk;
  $(LD) -r -o piggy.o -b binary $$tmppiggy.gz -b elf32-i386 -T $$tmppiggy.lnk;
  rm -f $$tmppiggy $$tmppiggy.gz $$tmppiggy.lnk;

The code vmlinux is linked to start at address PAGE_OFFSET+1MB (0xc0100000).
All symbols have addresses with the PAGE_OFFSET offset, eg, swapper_pg_dir is
0xc0101000, while the code is physically loaded at 0x100000, so that
swapper_pg_dir is physically at 0x101000 (see arch/i386/vmlinux.lds).

The kernel image is made of bootsect.o setup.o and vmlinux.
Therefore it has the structure

  [bootsect] [setup] [ [head] [misc] [system] ]

---------------------------------------------------------------------
[2] Booting from the floppy: the BIOS.

Refs. S.
Ghosh "Bootstrapping a Linux system - an Analysis"
  Linux Gazette 70 (year ?)
  Documentation/i386/boot.txt

When you press the on/off button of your PC a special hw circuit
raises the logical value of the RESET pin of the CPU.
After RESET the register cr0 settings are paging disabled, protection
disabled, coprocessor disabled and not monitored, and no task switch.
The flag register is all 0 except bit 1 (which is Intel reserved).
RESET also sets the CS (code segment) and IP (instruction pointer)
registers to specific fixed values: the code at 0xfffffff0 is then executed.

This address is hardware mapped to permanent memory (ROM: read only
memory) which stores the BIOS (basic i/o system).
The BIOS consists of several interrupt driven routines that handle the
hardware devices.

The BIOS carries out a series of tests of the hardware (POST: power-on
self test), then it initializes the hardware (IRQ and I/O ports).
Then it displays a table of the PCI devices installed on the system.

Finally it searches for a program to execute. Where it looks depends
on the BIOS settings. Supposing that the search sequence is floppy,
cdrom, harddisk, the BIOS loads the bootsector (512 bytes) of the first
device found into memory, starting at physical address 0x7c00, and
jumps there.

On the linux bootable floppy the first sector contains "bootsect".
This is the code executed after the BIOS.

The traditional memory map for the kernel loader, used for Image or
zImage kernels, typically looks like:

        |                        |   High memory (for protected-mode code)
100000  +------------------------+   1 MB
        |                        |
0A0000  +------------------------+
        |  Reserved for BIOS     |   Do not use. Reserved for BIOS EBDA.
09A000  +------------------------+
        |  Stack/heap/cmdline    |   Kernel real-mode code.
098000  +------------------------+
        |  Kernel setup          |   The kernel real-mode code.
090200  +------------------------+
        |  Kernel boot sector    |   boot sector relocates here
090000  +------------------------+
        |  Protected-mode kernel |   Kernel zImage protected-mode code.
010000  +------------------------+
        |  Boot loader           |   <- Boot sector entry point 0000:7C00
001000  +------------------------+
        |  Reserved for MBR/BIOS |
000800  +------------------------+
        |  Typically used by MBR |
000600  +------------------------+
        |  BIOS use only         |
000000  +------------------------+

PAGE_OFFSET is defined as 0xc0000000.
Linear addresses between 0 and PAGE_OFFSET-1 can be addressed in user
and in kernel mode. Those above PAGE_OFFSET are common to all processes
and can be addressed only in kernel mode.
The physical memory is mapped starting at PAGE_OFFSET. Kernel addresses
are translated to physical addresses by subtracting PAGE_OFFSET
(see include/asm/page.h).
128 MB are reserved for vmalloc; thus the kernel has 896 MB (MAXMEM) of
addressable memory.

---------------------------------------------------------------------
[3] The bootsector: bootsect.

The BIOS has copied the first sector (512 bytes) from the boot device
to memory address 0x07c0:0. So here we are ...

arch/i386/boot/bootsect.S

The bootsector is relatively simple. It does the following actions
  1. move itself to INITSEG (= 09000.0)
  2. get disk parameters, probe number of sectors
  3. read the setup to SETUPSEG (= 09020.0)
     and the system to SYSSEG (= 01000.0 or 10000.0)
  4. jump to SETUPSEG:0

10  BIG FAT NOTE: We're in real mode using 64k segments.
    Thus segment addresses must be multiplied by 16 to obtain their
    linear addresses.
    Linear addresses are written using leading hex while segment
    addresses are written as segment:offset.
    In real mode, linear address equals physical address.
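
The "multiply by 16" rule in the note above can be written as a one-line
helper. This is only an illustrative sketch in C: the name real_mode_linear
is made up for this document and does not appear in the kernel sources, and
no 20-bit wrap-around (A20 disabled) is modelled.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the kernel sources): convert a real-mode
 * segment:offset pair to a linear address.
 *
 *   linear = segment * 16 + offset
 *
 * The result can exceed 1 MB by a little less than 64 KB (segment 0xffff);
 * whether such addresses wrap to low memory depends on the A20 line,
 * which this sketch deliberately ignores.
 */
static inline uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + (uint32_t)offset;
}
```

With the constants used by bootsect.S, BOOTSEG:0 (0x07c0:0) gives 0x07c00,
INITSEG:0 (0x9000:0) gives 0x90000 and SETUPSEG:0 (0x9020:0) gives 0x90200,
which are the linear addresses quoted in the definitions that follow.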
/*
 * To begin, a few definitions:
 */
SETUPSECTS = 4        // number of setup sectors
BOOTSEG    = 0x07c0   // where the BIOS puts bootsect (0x07c00)
INITSEG    = 0x9000   // where bootsect moves itself (0x90000)
SETUPSEG   = 0x9020   // where setup starts (0x90200)
SYSSEG     = 0x1000   // where the system is loaded (0x10000)
SYSSIZE    = 0x7F00   // system size in 16-byte units

.code16               // 16 bit realmode

/* 1.
 * First of all, move to 0x90000:
 */
    memcpy_w( INITSEG:0, BOOTSEG:0, 256)     // 256 words = 512 bytes
    jump INITSEG:go                          // this sets CS = INITSEG

/* 2.
 * Copy the disk parameters. Reset their address (0x78)
 */
82  go:   // when we get here bootsect has moved to 0x90000.
    DS = SS = ES = INITSEG;     // CS = INITSEG
    SP = INITSEG:0x4000-12;     // set the stack, SS = INITSEG
                                // this is high enough to contain bootsect + setup + stack
                                // 12 bytes for the disk parameters
    FS = 0                      // using CX which is 0 at this point
                                // GS not used
    (SI,DS) = &(0:0x78);
    memcpy_w( INITSEG:0x4000-12, DS:SI, 6);  // copy 12 bytes
    INITSEG:(0x4000-12) + 4 = 36;            // patch sector count
    (0:0x78,0:0x7a) = (0x4000-12,INITSEG);   // reset disk param address

/*
 * Probe the number of sectors by trying 36, 18, 15. If none use 9.
 */
    int[0x13] (0x0201, 0x0200, track.sector, 0x0000)   // disk i/o
                                // AX: service 2, sector 1
                                // BX: address 512 in INITSEG
    loop if CF

142 int[0x10] (0x03??, 0x0000, ...)          // video int.
    int[0x10] (0x1301, 0x0007, 0x0009, BP=message)
                                // read cursor position and write message "\r\nLoading"
                                // BX: page 0, attribute 7 (normal)
                                // CX: message length 9

/* 3.
 * Load the setup and the system
 */
157 sread = 0x0001;             // read sectors: BIOS has already read one
    int[0x13] (0x0000, ?, ?, 0x??00)   // reset floppy controller
    AX = setup_sects;           // initially 4
    do {                        // read the setup code
      if ( sectors - sread < setup_sects) {
        AX = sectors - sread;
      }
      read_track( AX );
      set_next( BX );           // set BX properly
      setup_sects -= AX;
    } while ( setup_sects != 0 );

181 ES = SYSSEG;                // read the system
    read_it();
    kill_motor();
    print_nl();

/*
 * Check which root device to use
 */
195 if ( root_dev == 0 ) {
      switch (sectors) {
        case 15: root_dev = 0x0208; break;
        case 18: root_dev = 0x021c; break;
        case 36: root_dev = 0x0220; break;
        default: root_dev = 0x0200;
      }
    }

/* 4.
 * jump to setup code
 * At this time CS = DS = SS = INITSEG (= 9000)
 *              ES = SYSSEG (= 1000: the kernel is here)
 *              FS = 0
 *              GS unused
 */
219 goto SETUPSEG:0;            // jump to 0x90200 (setup is here)

/*
 * Next there are a few utility routines.
 */
213 read_it:                    // read the system
    if ( ES & 0x0fff ) die;
    do {
      if ( __BIG_KERNEL__ ) {
        call bootsect_kludge(); // the routine is in setup.S
      } else {
        AX = ES - SYSSEG;
        AX += (BX >> 4);
      }
      if ( AX >= SYSSIZE ) return;
      CX = ( (sectors - sread) << 9) + BX;
      if ( ... ) {
        AX = (-BX) >> 9;
      }
      read_track();
      set_next( BX );
    } while (1);                // the loop ends with the return above

270 read_track:
      int[0x10] (0x0e2e, 0x007, ..., ...);   // print '.'
      int[0x13] (0x02??, ..., track.sread, head.00);

299 set_next
    ...
367 print_nl
377 print_hex
380 print_digit
397 kill_motor
      int[0x13] (0x0000, ..., ..., 0x0000);

417 .org 497
/* At the very end bootsect.S advances the location counter to byte 497
   in the sector (512 bytes).
   Then follows some room for the data: */
0x01f1 setup_sects: .byte SETUPSECTS
0x01f2 root_flags:  .word ROOT_RDONLY
0x01f4 syssize:     .word SYSSIZE     // of compressed image (in 16-byte units)
0x01f6 swap_dev:    .word SWAP_DEV    // obsolete
0x01f8 ram_size:    .word RAMDISK
0x01fa vid_mode:    .word SVGA_MODE
0x01fc root_dev:    .word ROOT_DEV    // high=major low=minor
0x01fe boot_flag:   .word 0xaa55

The zero page is used to transfer data recovered during the booting
process to the kernel. Besides the above information, and that listed
below in setup.S, it is filled with:

0x0000 screen info
0x0002 memory
0x0040 APM (20B)
0x0080 hd0
0x0090 hd1             // if present
0x00a0 MCA table (16B)
0x01e0 memory (by e801)
0x01e8 E820NR
0x01f2                 // mount root rdonly
0x01f8                 // ramdisk flags
0x01fc                 // orig root device
0x01ff aa              // if mouse detected
0x02d0 E820 map        // up to 0x0600
0x0800 command line

---------------------------------------------------------------------
[4] From 16 to 32: setup

Refs.
  A20 - pain from the past
    http://www.win.tue.nl/~aeb/linux/kbd/A20.html
  Documentation/i386/boot.txt
  /usr/src/linux/include/asm/e820.h
  http://www.symonds.net/~abhi/files/mm/

The setup code is more complicated than the bootsect. Its task is to
prepare the C environment for the kernel. To quote the source:

780 # Well, that certainly wasn't fun :-(. Hopefully it works, and we don't
781 # need no steenking BIOS anyway (except for the initial loading :-).
782 # The BIOS-routine wants lots of unnecessary data, and it's less
783 # "interesting" anyway. This is how REAL programmers do it.

It does a lot of things:
  1. Reset the disk controller, make sure that setup is all at the right place
  2. Get the size of the memory
  3. Set the keyboard repeat rate
  4. Check the video adapter
  5. Get the hard disk data, and check if hd1 is there
  6. Check for the Micro Channel bus (MCA)
  7. Check the PS/2 mouse
  8. Check the APM
  9. Move to 32-bit mode
 10. If the system is not "big" move it down to 0x0100:0
 11.
If setup is not at 0x9000:0 move it there
 12. Enable A20 (high memory)
 13. Set up IDT and GDT
 14. Jump to the system head

arch/i386/boot/setup.S

73  begtext:                    // label where code starts
80  start:                      // CS=0x9020 IP=0x0 and we are in 16 bit mode
      goto trampoline;          // -> line 164
                                // this is the jump instruction at 0x0200
/* This leaves space for additional header data.
0x0202 header "HdrS"                    // 0x53726448
0x0206 version          .word           // boot protocol version
0x0208 realmode_swtch:  .word 0, 0
0x020c start_sys_seg:   .word SYSSEG
0x020e kernel_version                   // linux version is at kernel_version+0x200
0x0210 type_of_loader:  .byte 0         // 0 LILO, 1 LoadLin, 2 bootsect, ...
0x0211 loadflags:       .byte 0         // 0x01 LOADED_HIGH
                                        // 0x80 CAN_USE_HEAP
0x0212 setup_move_size: .word 0x8000
0x0214 code32_start:    .long 0x1000    // 0x100000 for __BIG_KERNEL__
0x0218 ramdisk_image:   .long 0
0x021c ramdisk_size:    .long 0
0x0220 bootsect_kludge  .long
0x0224 heap_end_ptr     .word           // relative to start_of_setup (0x200)
0x0226 pad
0x0228 cmd_line_ptr     .long
0x022c initrd_addr_max  .long
*/

164 trampoline:
      call start_of_setup       // -> line 168
      .space 1024               // and skip to byte 1024
                                // so that start_of_setup is at offset 0x0400

/* 1.
 * Reset the disk controller. Check if "setup" has been completely loaded:
 * setup has a "signature" at the end; if it is missing the loader has
 * loaded part of setup at the beginning of the system.
 */
168 start_of_setup:
      int[0x13] (0x1500, ..., ..., 0x??81)   // read DASD type of the second drive
      int[0x13] (0x0000, ..., ..., 0x??80)   // reset disk controller
      if (SETUPSEG:signature is bad) goto bad_sig;   // -> line 228
      goto good_sig;                                 // -> line 267

/*
 * A few utility routines
 */
194 prtstr:     // routine to print an asciiz (0-terminated) string
                // repeatedly calls prtchr; terminates when the char to print is 0
205 prtsp2:     // routine to print two spaces
206 prtspc:     // routine to print one space
209 prtchr:     // routine to print one char (in AL)
      int[0x10] (0x0eAL, 0x00BL, 0x0001, ...)
                // BL = color (0x07 = white)
                // BH = page
                // CL = number of chars
219 beep:       // routine to print a "beep" (0x07)

/*
 * If the setup signature could not be found at the first try, it
 * means that we still need to move the rest of "setup" from
 * the "system" in the right place.
 */
228 bad_sig:                    // try to find the rest of "setup"
      BX = INITSEG:497 - 4;     // INITSEG:(497) contains SETUPSECTS
                                // - 4 since LILO loads 4 sectors of "setup"
      CX = BX * 256;            // words to load
      BX = (BX >> 3) + SYSSEG;  // real beginning of system segments
      CS:start_sys_seg = BX     // update start_sys_seg -> line 89
      memcpy_w( SETUPSEG:2048, SYSSEG:0, CX);
                                // N.B. 2048 since 4 sectors already loaded by LILO
      if (SETUPSEG:signature is bad) panic;   // endless no_sig_loop

/*
 * Check if the system is loaded low, or if the loader is o.k.
 */
267 good_sig:
      DS = CS - DELTA_INITSEG;  // INITSEG
      if ( ! CS:loadflags & LOADED_HIGH ) goto loaded_ok;   // -> line 288
      if ( CS:type_of_loader == 0 ) panic;                  // -> line 95

/* 2.
 * Get the memory size.
 * Try three schemes (first two only if STANDARD_MEMORY_BIOS_CALL is not defined)
 *  - 0xe820 which assembles a memory map;
 *  - 0xe801 which returns 32-bit memory size;
 *  - 0x88 which returns between 0 and 64 M.
 * The E820MAP is defined at 0x02d0 with a max of E820MAX=32 entries.
 * The location for the number of entries in the list, E820NR, is 0x01e8
 *
 * The memory data are stored as a list of address/size.
 * In arch/i386/kernel/setup.c, this information is transferred into the
 * e820map, and in arch/i386/mm/init.c, that new information is used to
 * mark pages reserved or not.
 *
 * Each E820MAP record is 20 bytes (addr:8, size:8, type:4)
 * Memory types: 0x01 available to OS
 *               0x02 reserved (not-available)
 *               0x03 ACPI (usable by OS after reading ACPI tables)
 *               0x04 ACPI NVS (OS must save this memory between NVS sessions)
 */
288 loaded_ok:
      INITSEG:0x01e0 = 0;
      DI = E820MAP;
      for ( E820NR=0; E820NR video.S

/* 5.
 * Get hard disk data and check if there is hd1
 *   hd0 data in INITSEG:0x0080
 *   hd1 data in INITSEG:0x0090
 */
402 DS = 0;
    (SI:DS) = &(4*0x41);
    ES = CS - DELTA_INITSEG;            // ES = INITSEG
    memcpy_b( ES:0x0080, DS:SI, 0x10)
    (SI:DS) = &(4*0x46);
    memcpy_b( ES:0x0090, DS:SI, 0x10)
    ret = int[0x13] (0x1500, ..., ..., 0x??81)
    if (ret != 0) {
      ES = CS - DELTA_INITSEG;
      memset_b( ES:0x0090, 0, 0x10);
    }

/* 6.
 * Check for the microchannel bus (MCA)
 *   the table goes in INITSEG:0x00a0
 */
445 DS = CS - DELTA_INITSEG;
    DS:0xa0 = 0;                        // set table length to 0
    ret = int[0x15] (0xc000, ...)       // get table features in ES:BX
    if (ret != 0) goto no_mca;
    DS = ES;
    ES = CS - DELTA_INITSEG;
    CX = DS:BX + 2;                     // table size + 2 bytes for the "length"
    if ( CX != 0x10 ) CX = 0x10;
    memcpy_b( ES:0xa0, DS:BX, CX );     // see Howto for further info

/* 7.
 * Check the PS/2 mouse.
 *   The mouse goes at 0x01ff
 */
475 DS = CS - DELTA_INITSEG;
    DS:0x01ff = 0;                      // by default no mouse
    ret = int[0x11] (...)
    if (ret & 0x04) DS:0x01ff = 0xaa;   // mouse present

/* 8.
 * Check APM if defined(CONFIG_APM)
 *   APM BIOS data is at 0x0040 - 0x0053
 */
489 DS:0x0040 = 0;                      // default no APM
    ret = int[0x15] (0x5300, 0x0000, ...)
    if (ret != 0 || BX != 0x504d || !(CX & 0x02) )
      goto done_apm_bios;               // no APM, or no 32-bit support
    ret = int[0x15] (0x5304, 0x0000, ...);   // ignore ret
    ESI = 0;
    DI = 0;
    ret = int[0x15] (0x5303, 0x0000, 0x0000, 0x0000)
    if (ret != 0) {
      DS:0x4c &= 0xfffd;                // no_32_apm_bios;
    } else {
      DS:0x42 = AX;                     // BIOS code segment
      DS:0x44 = EBX;                    // BIOS entry point offset
      DS:0x48 = CX;                     // BIOS 16-bit code segment
      DS:0x4a = DX;                     // BIOS data segment
      DS:0x4e = ESI;                    // BIOS code segment length
      DS:0x52 = DI;                     // BIOS data segment length
      ret = int[0x15] (0x5300, 0x0000, 0x0000, ...)
      if (ret != 0 || BX != 0x504d) {
        int[0x15] (0x5304, 0x0000, ...);   // apm_disconnect
      } else {
        DS:0x40 = AX;                   // APM BIOS version
        DS:0x4c = CX;                   // APM BIOS flags
      }
    }

/* 9.
 * Finally get ready to leave real mode
 * and copy the starting 32-bit address to the ljump instruction
 */
547 if (CS:realmode_swtch != 0) {   // if the loader has installed a realmode
      lcall CS:realmode_swtch;      // switch, execute it
    } else {                        // do the normal realmode switch
      save CS
      call default_switch;
    }
561 CS:code32 = CS:code32_start;    // this modifies the jump at line 813
                                    // see 0x214 for code32_start

/* 10.
 * If the system is not big move it to the right place
 * from 0x1000:0 to 0x0100:0
 */
566 if ( (CS:loadflags & LOADED_HIGH) == 0 ) {
      BP = CS - DELTA_INITSEG;
      ES = 0x0100;
      for ( DS=CS:start_sys_seg; DS= SS) { AX -= SS + INITSEG; }
      CX = CS:setup_move_size;
      memcpy_b( INITSEG:CX-1, AX:CX-1, CX-(move_self_here+0x200))
                                    // N.B. the copy is downward (std)
      goto SETUPSEG:move_self_here
645 move_self_here:
      memcpy_b( INITSEG:..., DS:..., move_self_here+0x200 )
      DS = SETUPSEG;
      SS = AX;

/* 12.
 * Now enable A20 (necessary to access high memory >= 1M).
 * Use the output port of the keyboard: bit 0 resets the CPU (-> go to
 * real mode when it is 0), bit 1 enables A20 (when it is 1).
 * To set the output port write 0xd1 to port[0x64] and then write
 * to port[0x60]. Alternatively one can use system control port 0x92.
 */
674 for (a20_tries=A20_MAX_TRIES; a20_tries>0; a20_tries--) {
      if ( a20_test() != 0) goto a20_done;   // A20 already enabled
      int[0x15] (0x2401, ...);               // try BIOS int 15
                                             // 0x2400 disable, 0x2401 enable
      if ( a20_test() != 0) goto a20_done;
      call empty_8042;                       // try keyboard controller
                                             // 8042 is the kbd controller
      if ( a20_test() != 0) goto a20_done;
      port[0x64] = 0xd1;
      call empty_8042;
      port[0x60] = 0xDF;
      call empty_8042;
      for (i=0; i<...; i++)
        if ( a20_test() != 0) goto a20_done;
      port[0x92] = (port[0x92] | 0x02) & 0xfe;   // system control A
      for (i=0; i<...; i++)
        if ( a20_test() != 0) goto a20_done;
    }
    hlt;                                     // die with error message;
751 a20_done:
    // The routines a20_test, empty_8042, etc. are further down in the code
    // as well as delay()

/* 13.
 * Before moving to protected-mode (PE) one must set the IDT and GDT.
 * Use a flat memory system: both code and data have 0-4GB range segments.
 * Reset any possible coprocessor, ...
 */
754 lidt idt_48;
    *(gdt_48+2) = (DS << 4) + gdt;   // convert DS:gdt to a linear address
    lgdt gdt_48;                     // gdt_48 contains limit+address of gdt
    port[0xf0] = 0;                  // reset coprocessor
    delay;
    port[0xf1] = 0;
    delay;
    port[0xa1] = 0xff;               // mask interrupts
    delay;
    port[0x21] = 0xfb;               // mask irq (except cascade)

/* 14.
 * move to protected mode (lmsw and the intrasegment jump)
 * ... and now jump into the kernel
 * go to address 0x1000 (or 0x100000) in segment __KERNEL_CS__
 * which is instruction boot/head.S:start_32
 *
 */
    CR0 = 1;                         // lmsw (load machine status word): set the PE
                                     // (Protection Enable) bit
    jump flush_instr;                // this jump flushes instruction prefetch queue
797 flush_instr:
    BX = 0;                          // flag indicating a boot
    ESI = (CS - DELTA_INITSEG) << 4; // 32-bit pointer 0x90000
812 .byte 0x66, 0xea                 // intersegment jump with immediate address
                                     // prefix 0x66 changes the operand sizes
    code32: .long 0x1000             // address
            .word __KERNEL_CS__      // segment selector 0x10 (segm. 2, gdt, DPL 00)

/*
 * Now follows some info ... and useful routines.
 */
829 default_switch:
      port[0x70] = 0x80              // disable NMI

840 bootsect_helper:
      if ( CS:bootsect_es == 0 ) {
        CS:type_of_loader = 0x20;
        CS:bootsect_src_base+2 = (ES >> 4);
        CS:bootsect_es = ES;
        return SYSSEG;
      } else {
        if ( BX != 0 ) {
          ES = CS;
          SI = bootsect_gdt;
          ret = int[0x15] (0x8700, ...)
          if (ret != 0) bootsect_panic();
          ES = CS:bootsect_es;       // ES points to 0x10000
          *(CS:bootsect_dst_base+2) ++;
        }
        return ( (CS:bootsect_dst_base+2) << 4 ) & 0xff00;
      }

880 // here is the memory for bootsect data

927 a20_test:
      FS = 0;                        // write to low memory 0:0x0200
      GS = 0xffff;                   // and check if it appears at 0xffff:0x0210
      AX = FS:A20_TEST_ADDR;         // (high memory, linear 0x100200)
      for ( ...; ...0; CX--) {
        port[0x80] = AL;
        AL = port[0x64];
        if (AL & 0x01 && !
            AL & 0x10 ) break;
        port[0x80] = AL;
        AL = port[0x60];             // read scancode
      }
      return;

989 gettime:                         // read the cmos clock
      int[0x1a] (0x0200, ...)
      return (DH>>4) * 10 + (0x0f & DH);

1004 delay:
      port[0x80] = AL

/*
 * finally the room for a provisional idt and gdt
 *
 * gdt has four descriptors equal to the first four of the final gdt
 * (see kernel/head.S). Selectors are defined in include/asm-i386/segment.h
 *   __KERNEL_CS 0x10
 *   __KERNEL_DS 0x18
 *   __USER_CS   0x23
 *   __USER_DS   0x2B
 */
gdt:
  0 dummy     // 0 0 0 0
  1 unused    // 0 0 0 0
  2 code      // 00cf.9a00.0000.ffff
  3 data      // 00cf.9200.0000.ffff
idt_48:
  0           // limit 0
  0 0         // base 0
gdt_48:
  0x8000      // limit 2048 ( 256 descriptors )
  0 0         // base is filled at runtime with the address of gdt

1034 #include "video.S"
     // the signature
     // and some spare space
1043 modelist:

---------------------------------------------------------------------
5 vmlinux.lds.S a script to put together the kernel.

arch/i386/vmlinux.lds.S

This script is processed with the C preprocessor using the macros in
asm-i386/page_offset.h. The output is vmlinux.lds (see the Makefile
in arch/i386).
That header defines PAGE_OFFSET_RAW as 0xc0000000 if the kernel is
CONFIG_1GB and as other values if configured differently.

First the absolute address is specified (0xc0100000 = 3GB+1MB):

  . = PAGE_OFFSET_RAW + 0x100000
  --------------------------------
  .text   *(.text)             text section
          *(.fixup)
          *(.gnu.warning)      0x9090
          *(.text.lock)        out-of-line lock text
          *(.rodata)           read-only data
          *(.kstrtab)          kernel string table
          *(__ex_table)        exception table
          *(__ksymtab)         kernel symbol table
  _etext
  --------------------------------
  .data   *(.data)             data section
          CONSTRUCTORS
  _edata
  --------------------------------
          *(.data.init_task)   init task
  __init_begin:
          *(.text.init)        init code
          *(.data.init)        init data
  __setup_start:
          *(.setup.init)
  __initcall_start:
          *(.initcall.init)
  __init_end
  ----------------------------
          *(.data.idt)
          *(.data.cacheline_aligned)
  --------------------------------
  __bss_start:
          *(.bss)              bss section
  _end                         end of kernel in memory
  --------------------------------
          *(.text.exit)        section to be discarded
          *(.data.exit)
          *(.exitcall.exit)
          ...                  stab debug sections
          *(.comment)          comments

--------------------------------------------------------------------
6. From 32 to protected

In arch/i386/kernel there are three targets:
  head.o init_task.o kernel.o

---------------------------------------------------------------------
7. head.S

arch/i386/boot/compressed/head.S

This contains the 32-bit startup code.
It sets up the segments and the stack pointer, and it clears the BSS.
All the work is done by decompress_kernel().
The only thing left is to make sure the kernel is at 0x100000 (1MB)
(since the kernel code is linked with 1MB offset from PAGE_OFFSET)
and to jump there.

31  startup_32() {
      clear_interrupts();
      DS = ES = FS = GS = __KERNEL_DS__;   // 0x18: descr.
                                           // 3, gdt, DPL 00
      ESP = &(stack_start);      // stack_start -> kernel/head.S:388
      SS = &(stack_start+4);     // SP immediately after SS
      for (i=0; ; i++) {         // check that A20 is enabled
        *(00000) = i;
        if ( *(100000) != *(000000) ) break;
      }
52    flags = 0;
      memset_b(_edata, 0, _end - _edata);   // clear BSS
67    place 16 bytes on the stack;          // for high and low buffers
      // ESI = real_mode_pointer;
      if ( decompress_kernel( stack, ESI ) != 0) {
        // this happens if we were loaded high:
        // copy the move_routine to low (0x1000) address
        // and goto it (with buffer addresses in the 16B stack)
        memcpy_l(0x1000, move_routine_start,
                 (move_routine_end-move_routine_start+3)/4 );
        EBX = real_mode_pointer;
        get low and high buffers (and counts) from the stack
        EDI = 0x100000                      // address to move to
        goto __KERNEL_CS, 0x1000            // goto move_routine_start
      } else {
        goto __KERNEL_CS, 0x100000
      }
      // this jumps to instruction startup_32 (addr. 0x100000) in kernel/head.S
      // boot/compressed/head.S and misc (decompression) get overwritten by
      // the kernel image

110 move_routine_start:
      memcpy_b( 0x100000, low_buffer_start, lcount);   // EDI=0x100000
      memcpy_b( ..., high_buffer_start, hcount);
      ESI = real_mode_pointer;              // ESI = EBX
      EBX = 0;
      goto __KERNEL_CS, 0x100000

---------------------------------------------------------------------
8. Decompressing the kernel.

arch/i386/boot/compressed/misc.c

4 * This is a collection of several routines from gzip-1.0.3
5 * adapted for Linux.

---------------------------------------------------------------------
9. Kernel "startup_32":

arch/i386/kernel/head.S

The kernel is at physical address 0x100000 and is linked with addresses
offset by __PAGE_OFFSET, which is defined as PAGE_OFFSET_RAW (0xc0000000)
in include/asm/page.h

/* 1. entry point
 */
44  startup_32:
      DS = ES = FS = GS = __KERNEL_DS;

/* 2. initialize page tables pg0 and pg1
 * By the end of this loop the provisional initial 8MB (each pg0 and pg1
 * have 1K entries, four bytes per entry) are page mapped.
 * This mapping assigns the first 8MB to pg0+pg1.
 */
84  EAX = 007;                  // lowest three bits are USER RW PRESENT
    for (EDI = pg0-__PAGE_OFFSET; EDI!=empty_zero_page-__PAGE_OFFSET; EDI++) {
      *EDI = EAX;
      EAX += 0x1000;
    }

/* 3. Set up paging
 *
 * The page directory is swapper_pg_dir.
 * The kernel is linked with addresses at offset 0xC010.0000:
 * page dir is 1100.0000.00 (768): thus the kernel page pg0 must be entry 768
 * of swapper_pg_dir.
 *
 * The provisional page table maps 8MB of kernel addresses (with PAGE_OFFSET)
 * to 0-8MB. After enabling paging all addresses are translated, therefore
 * there must be valid page tables. Furthermore, until the EIP has gone through
 * the address translation the physical addresses and the kernel addresses
 * must point to the same memory locations.
 * Hence the page directory swapper_pg_dir replicates the pages for addresses
 * at 0xc0000000 and 0x00000000: this way kernel virtual addresses (with
 * offset PAGE_OFFSET) and physical addresses that go through address
 * translation point to the same physical addresses:
 *
 * Example     virtual addr.
 * 0xc0101000 = 1100.0000.00  01.0000.0001  0000.0000.0000
 *              pg_dir 0x300  page 0x101    addr in the page
 * 0x00101000 = 0000.0000.00  01.0000.0001  0000.0000.0000
 *              pg_dir 0x0    page 0x101
 * page 0x101 is in pg0 and has offset 0x101*0x1000, therefore the
 * linear addr. equals the physical addr.
 *
 * This identity mapping is discarded when the MM is started.
 */
96  CR3 = swapper_pg_dir-__PAGE_OFFSET;   // setup page-table pointer
    CR0 |= 0x80000000;                    // enable paging (PG, bit 31)
    SS:ESP = &(stack_start);              // setup stack pointer and SS --> 322
122 for (EDI=__bss_start; EDI<_end; EDI++) {
      *EDI = 0;                           // clear BSS
    }

/* 4. setup_idt() is defined at line 306:
 *    it loads IDT with 256 entries pointing to ignore_int() [line 330]
 *    which printk "Unknown interrupt\n"
 */
133 call setup_idt();

/* 5.
      copy the real mode information into the zero-page
 *    2KB of zero-page for boot params
 *    2KB for command line
 *    ESI still points to real-mode data (INITSEG = 0x90000)
 */
139 flags = 0;                  // clear flags
    memcpy_l( empty_zero_page, real_mode_data, 512);
    memset_l( empty_zero_page+4*512, 0, 512);
    // find the command line and copy it to the second 2K
    ESI = empty_zero_page + NEW_CL_POINTER;   // NEW_CL_POINTER = 0x228
    ...
    // checkCPUtype:
    //   first it is 386 if AC can't be flipped
    //   second it is 486 if cannot change ID
    //   last read the cpu data with
209 cpuid( EAX=... )
    // check the math coprocessor [line 280]
242 check_x87()
    ...

/* 6. Reload segments
 */
244 load gdt_descr
    load idt_descr
    DS = ES = FS = GS = __KERNEL_DS;
    SS:ESP = &(stack_start);
    load ldt (0)

270 call start_kernel()
    // everything is in place for the kernel ( init/main.c:start_kernel() )
    // So, when the kernel starts there are 8 MB of memory addresses mapped
    // to page tables (pg0 and pg1)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Now come the structures:

364 idt_descr: IDT_ENTRIES*8-1   // limit: 256*8-1
    idt:       idt_table         // address
370 gdt_descr: GDT_ENTRIES*8-1   // limit
    gdt:       gdt_table         // address

// A few pages are defined in head.S:
// Page table entry syntax:
//   address[31-12] . avl[11-9] 0 0 D A 0 0 U W P
//   D dirty
//   A accessed
//   U user(1) supervisor(0)
//   W write(1) read(0)
//   P present(1)
//
// for example entry 768 has
//   address 00102.000 (pages are 4KB aligned)
//   user (U), write (W) and present (P)

// address translation example: consider c010.200a (ie. address of pg0)
//   CR3 -------------------> swapper_pg_dir
//   PDE=1100000000 (768) --> 00102007 --> pg0
//   PTE=0100000010 (258) --------------------> 00102007 ---> 00102000
//   OFF=00000000000a --------------------------------------> 0010200a

0x1000: swapper_pg_dir   --> 381 addr.
                                        00101000
          0x00102007      // pointer to pg0
          0x00103007      // pointer to pg1
          0               // 766 null entries (mapping up to 3GB)
          0x00102007      // entry 768 is the first kernel page
          0x00103007
          0               // 254 null entries (last GB)
0x2000: pg0 (it is ref. also in include/asm/pgtable.h).   --> 396 addr. 00102000
0x3000: pg1
0x4000: empty_zero_page
0x5000: stext or _stext   // real beginning of normal text segment

// Then there is the data section.
// Numbers are selectors (index+gdt+DPL), the table is gdt, segments are 4GB
//
// Data descriptor syntax:
//   base[31-24] . G B 0 avl limit[19-16] . P dpl 1 0 E W A . base[23-16]
//   base[15-0] . limit[15-0]
// Code descriptor syntax
//   base[31-24] . G D 0 avl limit[19-16] . P dpl 1 1 C R A . base[23-16]
//   base[15-0] . limit[15-0]
//
// all segments have base 0
// kernel and user segments have limit fffff: G=1 thus 4GB for kernel and
//   user; G=0 (byte granularity) thus 1MB for APM
// data big (B) is set, code default size is 32-bit (D=1), except 16-bit APM
// all segments are present (P=1)
// privilege levels (dpl) are 00 for kernel and APM, 11 for user
// types are 11 for code, and 10 for data/setup
// expand down (E) or conforming (C) are clear
// readable (R) or writable (W) are set
// accessed bit (A) is clear

gdt_table:
        NULL descriptor
        not used
  0x10  kernel code         // 00.cf.9a.00.00.00.ff.ff
  0x18  kernel data         // 00.cf.92.00.00.00.ff.ff
  0x23  user code           // 00.cf.fa.00.00.00.ff.ff
  0x2b  user data           // 00.cf.f2.00.00.00.ff.ff
        not used
        not used
  0x40  APM setup           // 00.40.92.00.00.00.00.00
  0x48  APM code            // 00.40.9a.00.00.00.00.00
  0x50  APM code 16 bit     // 00.00.9a.00.00.00.00.00
  0x58  APM data            // 00.40.92.00.00.00.00.00
        space for TSS and LDT   // filled with 0

--------------------------------------------------------------
10.
arch/i386/kernel/init_task.c initializes the basic structures of the
task 'init':

      vm_area_struct init_mmap = INIT_MMAP;
 9    fs_struct init_fs = INIT_FS;
10    files_struct init_files = INIT_FILES;
11    signal_struct init_signals = INIT_SIGNALS;
12    init_mm = INIT_MM(init_mm);
21    init_task_union = { INIT_TASK(init_task_union.task) };
32    tss_struct init_tss[NR_CPUS] = { [0 ... NR_CPUS-1] = INIT_TSS };
      // allowed to go in the (.data.cacheline_aligned)

INIT_FS:                                  include/linux/fs_struct.h:13
      count       = 1
      lock        = RW_LOCK_UNLOCKED
      umask       = 0022
      root        = NULL
      pwd         = NULL
      altroot     = NULL
      rootmnt     = NULL
      pwdmnt      = NULL
      altrootmnt  = NULL

INIT_FILES:                               include/linux/sched.h:190
      count              = 1
      file_lock          = RW_LOCK_UNLOCKED
      max_fds            = NR_OPEN_DEFAULT (=BITS_PER_LONG)
      max_fdset          = __FD_SETSIZE
      next_fd            = 0
      fd                 = &(init_files.fd_array[0])
      close_on_exec      = &(init_files.close_on_exec_init)
      open_fds           = &(init_files.open_fds_init)
      close_on_exec_init = { {0, } }      // no close-on-exec
      open_fds_init      = { {0, } }      // none opened
      fd_array           = { NULL, }      // empty file* array

INIT_SIGNALS:                             include/linux/sched.h:260
      count       = 1
      siglock     = SPIN_LOCK_UNLOCKED
      action      = { {{0, }}, }

INIT_MM( init_mm ):                       include/linux/sched.h:242
      mm_rb           = RB_ROOT = (rb_root_t) { NULL, }
                                          include/linux/rbtree.h:117
      pgd             = swapper_pg_dir
      mm_users        = 2                 // users with user-space
      mm_count        = 1                 // refs to mm_struct
      mmap_sem        = __RWSEM_INITIALIZER( init_mm.mmap_sem )
      mmlist          = LIST_HEAD_INIT( init_mm.mmlist )
      page_table_lock = SPIN_LOCK_UNLOCKED

// INIT_TASK is the first task table
//
INIT_TASK(init_task):                     include/linux/sched.h:443
      state         = 0
      flags         = 0
      sigpending    = 0
      addr_limit    = KERNEL_DS
      exec_domain   = &default_exec_domain
      lock_depth    = -1
      counter       = 100 msec
      nice          = 0
      policy        = SCHED_OTHER
      ...
      mm            = NULL
      active_mm     = &init_mm
      cpus_runnable = -1
      cpus_allowed  = -1
      run_list      = LIST_HEAD_INIT(init_task.run_list)
      next_task, prev_task, p_opptr, p_pptr = &init_task
      thread_group  = LIST_HEAD_INIT(init_task.thread_group)
      wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(...)
      real_timer
      cap_effective
      cap_inheritable
      cap_permitted = CAP_FULL_SET        // (~0)
      keep_capabilities = 0
      ...
      rlim          = INIT_RLIMIT         ( --> include/asm/resource.h )
      user          = INIT_USER = root_user = { count:1, processes:1, files:0 }
      comm          = "swapper"
      thread        = INIT_THREAD
      fs            = &init_fs
      files         = &init_files
      sig           = &init_signals
      sigmask_lock  = ... UNLOCKED
      pending       = { NULL, tsk.pending_head, { {0} } }
      blocked       = { {0} }
      alloc_lock    = ... UNLOCKED
      journal_info  = NULL
      ...

INIT_THREAD:                              include/asm-i386/processor.h:390
      esp0          0
      eip           0
      esp           0
      fs            0
      gs            0
      debugreg[8] = { [0 ... 7] = 0 }
      cr2           0
      trap_no       0
      error_code    0
      i387          { {0, } },
      vm86_info     0
      screen_bitmap 0
      v86flags      0
      v86mask       0
      saved_esp0    0
      ioperm        0
      io_bitmap     { ~0, }

---------------------------------------------------------------------
11. start_kernel in init/main.c:348

356   lock_kernel()                       // lock kernel_flag
357   printk( linux_banner )
358   setup_arch( & command_line )
360   parse_options( command_line )
361   trap_init()
362   init_IRQ()
363   sched_init()
364   softirq_init()
365   time_init()
372   console_init()
374   init_modules()                      # ifdef CONFIG_MODULES
386   kmem_cache_init()
387   sti()
388   calibrate_delay()
397   mem_init()
398   kmem_cache_sizes_init()
399   pgtable_cache_init()
408   if (num_mappedpages == 0) num_mappedpages = num_physpages
411   fork_init(num_mappedpages)
412   proc_caches_init()
413   vfs_caches_init(num_physpages)
414   buffer_init(num_physpages)
415   page_cache_init(num_physpages)
419   signals_init()
421   proc_root_init()                    #ifdef CONFIG_PROC_FS
424   ipc_init()                          #ifdef CONFIG_SYSVIPC
426   check_bugs()
434   smp_init()
435   rest_init()

- - - - - - - - - - - - - - - - - - -
[1] setup_arch( cmdline )                 arch/i386/kernel/setup.c:871

First it retrieves some hardware information and stores it in global
variables (PARAM is the empty_zero_page):

      ROOT_DEV = to_kdev_t(ORIG_ROOT_DEV);        // PARAM + 0x01fc
      drive_info = DRIVE_INFO;                    // PARAM + 0x0080
      screen_info = SCREEN_INFO;                  // PARAM + 0
      apm_info.bios = APM_BIOS_INFO;              // PARAM + 0x0040
      if( SYS_DESC_TABLE.length != 0 ) {          // PARAM + 0x00a0
        MCA_bus = SYS_DESC_TABLE.table[3] &0x2;
        machine_id = SYS_DESC_TABLE.table[0];
        machine_submodel_id = SYS_DESC_TABLE.table[1];
        BIOS_revision = SYS_DESC_TABLE.table[2];
      }
      aux_device_present = AUX_DEVICE_INFO;       // PARAM + 0x01ff

Then it reads the memory regions from e820

      setup_memory_region();      // in arch/i386/kernel/setup.c

Now it assigns _text, _etext, _edata, _end (start/end) to init_mm and
code_resource, data_resource (for i386 virt_to_bus is virt_to_phys,
which is __pa, ie, subtract PAGE_OFFSET).

Then it parses the command line, and calls setup_memory(). This defines

      start_pfn        // first frame above _end
      max_pfn          // last frame
      max_low_pfn      // last frame of low memory
      highstart_pfn    // first frame of high memory

and initializes the boot-time memory allocator

      bootmap_size = init_bootmem(start_pfn, max_low_pfn);
      register_bootmem_low_pages(max_low_pfn);
      reserve_bootmem(from 1MB to the bootmem-bitmap included);
      reserve_bootmem(0-page);
      // may also reserve bootmem for the initial ramdisk

Initialize paging

      paging_init();              // arch/i386/mm/init.c
        // pagetable_init()
        //   allocates and initializes kernel page tables
        // load cr3 with swapper_pg_dir
        // if the CPU has PAE set the bit in cr4
        // flush TLB
        // zone_sizes_init()
      register_memory(max_low_pfn);

- - - - - - - - - - - - - - - - - - -
rest_init()                               init/main.c:336
      kernel_thread( init, NULL, CLONE_[FS,FILES,SIGNAL] )
      unlock_kernel()             // unlock kernel_flag
      current->need_resched = 1
      cpu_idle()                  // endless idle loop (with power-management)

kernel_thread( fct, arg, flag )           arch/i386/kernel/process.c:488
      ESI = ESP
      int_80( clone, flag|CLONE_VM )
      if ( ESP == ESI ) return EAX;   // parent
      fct( arg )
      int_80( exit )
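
The control flow of kernel_thread() (parent returns at once, child runs
fct(arg) and then exits) can be tried out in user space. What follows is
only a rough analogue and not the kernel code: it uses fork(), so the child
gets a copy of the address space instead of sharing it as with CLONE_VM, and
parent and child are told apart by the return value of fork() rather than by
comparing ESP with the saved ESI. The names spawn_like_kernel_thread,
child_body and run_and_wait are hypothetical and exist only in this sketch.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not a kernel function): user-space analogue of
 * kernel_thread( fct, arg, flag ), built on fork() instead of
 * int 0x80 clone with CLONE_VM. */
static pid_t spawn_like_kernel_thread(int (*fct)(void *), void *arg)
{
    pid_t pid = fork();

    if (pid != 0)
        return pid;          /* parent: child pid, or -1 on error   */
    _exit(fct(arg));         /* child: run fct(arg), then exit      */
}

/* Example body for the child: returns the int that arg points to. */
static int child_body(void *arg)
{
    return *(int *)arg;
}

/* Spawn a child that exits with `code`; return the exit status seen
 * by the parent, or -1 on failure. */
static int run_and_wait(int code)
{
    int status = 0;
    pid_t pid = spawn_like_kernel_thread(child_body, &code);

    assert(pid != 0);        /* only the parent ever gets here      */
    if (pid < 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_and_wait(7) from main() starts a child that exits with
status 7, in the same way as the thread started by rest_init() runs
init() and never returns into the caller.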
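
The address translation example given with the head.S page tables
(c010.200a --> 0010200a) can be replayed with a small model of the two-level
walk. This is a sketch under simplifying assumptions: only pg0 is modelled
(pg1 is left out), the arrays are ordinary C arrays and not real page frames,
and build_boot_tables and translate are hypothetical names. The entry values
(0x00102007 in slots 0 and 768, pg0 mapping the first 4MB one-to-one with
flags 007) are the ones listed in the notes on head.S.

```c
#include <assert.h>
#include <stdint.h>

#define PG_PRESENT 0x001u        /* P bit of a page table entry      */
#define PG_FRAME   0xfffff000u   /* address[31-12]                   */

static uint32_t swapper_pg_dir[1024];   /* model of the page directory */
static uint32_t pg0[1024];              /* model of the first page table */

/* Fill the tables as described for head.S: pg0 maps the first 4MB
 * one-to-one (U, W, P set), and directory entries 0 and 768 both
 * point to pg0 at physical 0x00102000. */
static void build_boot_tables(void)
{
    uint32_t i;

    for (i = 0; i < 1024; i++)
        pg0[i] = (i << 12) | 0x007u;
    swapper_pg_dir[0]   = 0x00102007u;
    swapper_pg_dir[768] = 0x00102007u;
}

/* Two-level walk: 10 bits directory index, 10 bits table index,
 * 12 bits offset. */
static uint32_t translate(uint32_t vaddr)
{
    uint32_t pde = swapper_pg_dir[vaddr >> 22];
    uint32_t pte;

    assert(pde & PG_PRESENT);
    assert((pde & PG_FRAME) == 0x00102000u);   /* only pg0 is modelled */
    pte = pg0[(vaddr >> 12) & 0x3ffu];
    assert(pte & PG_PRESENT);
    return (pte & PG_FRAME) | (vaddr & 0xfffu);
}
```

After build_boot_tables(), translate(0xc010200a) takes directory entry 768
and table entry 258, the same steps as in the example, and the low address
0x0010200a maps to itself through directory entry 0.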