Since I got really annoyed by the current state of partitioning for the PC, I decided to wack up this short specification for a new LBA based division using a simple doubly linked list. The current way of partitioning a drive is based on the old CHS (cylinder-head-sector) addressing of the device - this is no longer appropriate as the disk does internal conversion of these values to match their actual geometry. The proposition is based solely on the LBA numbers on the disc and is thus independent of the discs geometry. Apart from that, the design has focus on the need to be able to recover the original division of the disc even if the chain is broken. This is achieved by the use of checksums and magic strings. By maintaining a simple scheme for partitioning a drive, the boot code can remain so simple it can still be implemented in the first physical block of the boot device.
As this is for the IA-32 architecture, all fields use the IA-32 byte-order - little-endian.
The physical disk is divided into slices, each of which has the following structure placed in the first sector of the slice. The structure describes the slice size and where to find the next slice, thus yielding a linked list. For safety purposes, the record also contains a pointer to the previous element. This has the purpose of making integrity checks of the entire division into slices and facilitate an easier recovery if a record should be accidentially destroyed.
The records has the following layout:
Ofs. | Len. | Name | Description |
---|---|---|---|
-2 | 2 | jump | A short jump to the location immediately after the table (usually eq to ?? ??). |
0 | 7 | magic | == "B-Slice", magic used in case of a search through the device. |
7 | 1 | headerVersion | Version of the header to allow later extensions. Later extensions MUST maintain compatibility with this specification. |
8 | 8 | prevLBA | LBA number of the previous slice descriptor. |
16 | 8 | nextLBA | LBA number of the next slice descriptor. |
24 | 8 | hiddenBlocks | Number of blocks following the Slice Descripter, which should be hidden by the OS and the bootstrap. |
32 | 8 | length | Number of blocks for use in this slice. Unless the slice is not entirely utilised, this value should be nextBlock-curBlock-1, thus the hidden sectors are included in the length value, and the length value is thus the largest valid offset from the curBlock. |
40 | 2 | systemID | ID of the system stored in the partition. The most significant byte MUST indicate the OS, that created this slice and the least significant byte SHOULD indicate the filesystem stored on the slice. hi(0) and lo(0) are reserved for indicating an uninitialised slice. |
42 | 2 | flags | As described in section "Flags". |
44 | 12 | sliceName | Unique name for the slice. Can be used to identify the slice later. |
56 | 8 | checksum | 64 bit checksum using a rotating sum initialised with the block LBA number. |
The following values are reserved in certain fields.
Field | Value | Remark | Description |
---|---|---|---|
prevLBA | 0xffffffffffffffff | (signed -1) | Indicates first slice |
nextLBA | 0xffffffffffffffff | (signed -1) | Indicates last slice |
The reason for reserving certain LBA numbers are to ensure a proper definition of the ends of the chain. The reason for not reserving 0 is that 0 is already a valid LBA number.
The flags are stored in a 2-byte field - 15:8 denotes bits 15 through 8 counted from LSB=0:
Bits | Name | Description |
---|---|---|
15:8 | resv | Reserved for future extensions |
7 | hideBlocks | A value of 1 indicates that when using this slice, the system should hide the first hiddenBlocks blocks |
6 | defaultBoot | A value of 1 indicates that this is the default slice to boot. |
5:0 | loadOnBoot | The number of blocks to load into memory before transferring execution. A value of 0 indicates that the slice is not bootable. |
As the bootstrap most likely still will run in 8086-Real-Mode, the amount of blocks to load has been limited to 63. With a block-size of 512B, the total size loaded from disc will thus be 32KiB, which will fit neatly into a real-mode segment. It is recommended to hide the blocks which will be loaded by letting hiddenBlocks be at least loadOnBoot blocks.
The reason for having a checksum is to facilitate easier recovery of the slices on the disk in case of a failure. It was inspired by the ATM way of locking on to a stream - when the checksum match, we might have a sync to the stream. Likewise here - if the checksum match, we probably have a valid slice descriptor.
The checksum MUST be initialised with the LBA number of the block in which the descriptor is placed. This is to ensure that it was an intentional placing of the descriptor and not a backup of another descriptor.
I considered different sorts of checksums, but decided finally, that a rotating sum would be easiest to implement in the restricted space available. If memory serves me right, a CRC requires quite a bit of memory for LUTs or code to make it efficient, which would consume too much of the available space in the first slice descriptor. One could also use a XOR, or maybe a rotating XOR, but I just felt it was to easy to "construct" (by malfunction) a valid, but erroneous, block.
uint64 cs = {LBA of the block in which the slice desc is placed} uint64 list data = {offset 2 to offset 56 in blocks of 8 bytes} foreach uint64 d in data cs = uint64(cs+d) cs = (cs << 8) | (cs >> 56)
The block-slices may be enforced by hooking up to the BIOS, so the operating system being booted just knows its slice-name, which can be passed to the BIOS to get the slice number, and use the slice-number as the base of its block-slice.
A way of achiving this feature without altering OSes would be to hide the original BIOS behind a layer only presenting the currently valid slive to the OS.
Alternatively, you could hook up to the BIOS read sector functions and override those with new functionality, that would hide all slices but the one the booting OS is placed on. This way, the OS would think it was alone until its kernel was entirely loaded, in which case the BIOS functions are no longer used.
The purpose of the checksum and the magic field is to ensure that if we are forced to go looking through the disk for partitions, the likelyhood of false positives is very low. The same is applicable for the doublely linked lists - its unlikely that a block would fit in a chain if it was really really old.
Algorithms for removing elements from doubly linked lists are aboundant, so repeating them here serves no purpopse. Just remeber that only the prev→next and next → prev should be updated and this record should be left unchanged to increase the chances of a proper recovery should it strike.
+------+ +------+ +------+ | prev |<---| prev |<---| prev | | next |--->| next |--->| next | +------+ +------+ +------+ | | | | | | | | | | +------+ +------+ | | | | +------+ Disk split in three slices. +------+ +------+ | prev |<---------------| prev | | next |--------------->| next | +------+<-+ +------+ -> +------+ | | +-| prev | + | | | | | next |-+ +------+ +------+ +------+ | | | | | | | | +------+
The middle slice has been removed from the list and will be overwritten if length of the first slice is adjusted. If recover is called right away it can find the slice descriptor and see it matches perfectly in the chain - that it, the prev and next pointer actually points to slice descriptors.
near-jmp end_of_desc (2 bytes op. - +64 bytes) char[7] magic = "B-Slice"; uint8 headerVersion; == 0x01 uint64 prevLBA; uint64 nextLBA; uint64 hiddenBlocks; uint64 length; uint16 systemID; uint16 flags; char[12] slice_name; uint64 checksum; // use rotating 64 bit sum initialised with the block number // this ensures that even though we are forced to start searching // from the beginning of the device, we can detect if this block // is a back-up in a filesystem. end_of_desc ; 64 bytes struct flags { unsigned loadOnBoot: 6; unsigned defaultBoot:1; unsigned hideBlocks:1; unsigned reserved: 12; }This text was written by Rune when he was really annoyed with the current partitioning system, so he decided to propose a really simple replacement, which would fulfill the goals of fitting in a sector and having all the needed facilities of multibooting serveral OS's, which he is currently doing. If you have any comments besides "that will never happen", mail me...