From 2be3f60b48fdc44de26162dbcbb6245842e90d64 Mon Sep 17 00:00:00 2001
From: Alcaro
Date: Thu, 4 Jun 2020 02:14:15 +0200
Subject: [PATCH] Add BPS spec

---
 bps_spec.md | 238 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 238 insertions(+)
 create mode 100644 bps_spec.md

diff --git a/bps_spec.md b/bps_spec.md
new file mode 100644
index 0000000..caf300a
--- /dev/null
+++ b/bps_spec.md
@@ -0,0 +1,238 @@
+# BPS File Format Specification
+
+_by byuu. Public domain._ https://byuu.org/
+
+BPS patches encode the differences between two files.
+
+    string "BPS1"
+    number source-size
+    number target-size
+    number metadata-size
+    string metadata[metadata-size]
+    repeat {
+      number action | ((length - 1) << 2)
+      action 0: SourceRead {
+      }
+      action 1: TargetRead {
+        byte[] length
+      }
+      action 2: SourceCopy {
+        number negative | (abs(offset) << 1)
+      }
+      action 3: TargetCopy {
+        number negative | (abs(offset) << 1)
+      }
+    }
+    uint32 source-checksum
+    uint32 target-checksum
+    uint32 patch-checksum
+
+## Synonyms
+
+A few terms are used interchangeably as appropriate.
+
+* source = original file (input)
+* target = modified file (output)
+* number = variable-length integer encoding
+
+## Linear creation
+
+Rather than encoding a list of changes to files (insert data here, delete data
+here, modify data here, ...), beat patches encode the steps needed to create a
+new file from an old file. That is to say, the target file starts off as an
+empty, zero-byte file. The patch commands tell us how to write each and every
+byte of the target file sequentially.
+
+## Variable-length number encoding
+
+Rather than limit the maximum file size supported to 16MB (24-bit) or 4GB
+(32-bit), beat patches use a variable-length encoding to support any number of
+bits, and thus, any possible file size.
+
+The basic idea is that we encode the lowest seven bits of the number, and then
+the eighth bit of each byte is a flag to say whether the full number has been
+represented or not.
+If set, this is the last byte of the number. If not, then we shift out the low
+seven bits and repeat until the number is fully encoded.
+
+One last optimization is to subtract one from the remaining value after each
+non-final byte is written. Without this, one could encode '1' with 0x81 or
+0x01 0x80, producing an ambiguity.
+
+Decoding is the inverse of the above process.
+
+Below are C++ implementations of this idea. Note that we are using uint64 for
+the data type: this will limit beat patches created with these algorithms to
+64-bit file sizes. If 128-bit integers were available, they could be used
+instead. Of course, it's silly to even imagine patching a file larger than 16
+exabytes, but beat does allow it.
+
+**Encoding**
+
+    void encode(uint64 data) {
+      while(true) {
+        uint8 x = data & 0x7f;
+        data >>= 7;
+        if(data == 0) {
+          write(0x80 | x);  // high bit set: this is the final byte
+          break;
+        }
+        write(x);
+        data--;  // removes redundant encodings of the same number
+      }
+    }
+
+**Decoding**
+
+    uint64 decode() {
+      uint64 data = 0, shift = 1;
+      while(true) {
+        uint8 x = read();
+        data += (x & 0x7f) * shift;
+        if(x & 0x80) break;  // high bit set: number is complete
+        shift <<= 7;
+        data += shift;  // undoes the data-- performed by encode()
+      }
+      return data;
+    }
+
+--------------------------------------------------------------------------------
+
+## Header
+
+First, we have the file format marker, "BPS1". We then encode the source and
+target file sizes. Next, we encode optional metadata. If no metadata is present,
+store an encoded zero here (0x80 per above). Otherwise, specify the length of
+the metadata, followed by the metadata itself.
+
+Note that officially, metadata should be XML version 1.0 encoding UTF-8 data,
+and the metadata-size specifies the actual length. As in, there is no
+null-terminator after the metadata. However, the actual contents here are
+entirely domain-specific, so literally anything can go here and the patch will
+still be considered valid.
+
+## Transfer lengths
+
+We store lengths as length - 1 to prevent ambiguities. There is no sense in
+encoding a command that ultimately does nothing. This also slightly helps with
+patch size reduction in some cases.
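The number encoding and the length - 1 convention can be tried out in standalone C++. The block below is only a sketch: it swaps the spec's `write()` and `read()` for a `std::vector<uint8_t>` buffer, and the names `encodeNumber`, `decodeNumber` and `packAction` are hypothetical helpers, not part of the format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `data` to `out` using the variable-length number encoding.
void encodeNumber(std::vector<uint8_t>& out, uint64_t data) {
  while (true) {
    uint8_t x = data & 0x7f;
    data >>= 7;
    if (data == 0) {
      out.push_back(0x80 | x);  // high bit set marks the final byte
      break;
    }
    out.push_back(x);
    data--;  // avoids two encodings for the same value
  }
}

// Read one number from `in`, starting at `pos` and advancing it.
uint64_t decodeNumber(const std::vector<uint8_t>& in, size_t& pos) {
  uint64_t data = 0, shift = 1;
  while (true) {
    uint8_t x = in.at(pos++);  // at() throws if the input is truncated
    data += (x & 0x7f) * shift;
    if (x & 0x80) break;
    shift <<= 7;
    data += shift;
  }
  return data;
}

// Pack an action (0..3) and a transfer length (>= 1) into one number,
// following `action | ((length - 1) << 2)` from the format overview.
uint64_t packAction(uint64_t action, uint64_t length) {
  assert(action <= 3 && length >= 1);
  return action | ((length - 1) << 2);
}
```

With these helpers, 0 encodes to the single byte 0x80 and 128 encodes to the two bytes 0x00 0x80. A SourceCopy (action 2) of 16 bytes packs to the number 62, from which a decoder recovers the action as `62 & 3` and the length as `(62 >> 2) + 1`.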
+
+## Relative offsets
+
+beat patches keep track of the current file offsets in both the source and
+target files separately. Reading from either increments its respective offset
+automatically.
+
+As such, offsets are encoded relative to the current positions. These offsets
+can move the read cursors forward or backward. Supporting negative numbers with
+the variable-length encoding requires us to store the negative flag as the
+lowest bit, followed by the absolute value (e.g. abs(-1) = 1).
+
+Note, and this is very important, for obvious reasons you cannot read from
+before the start or after the end of the file. Further, you cannot read beyond
+the current target output offset, as that data is not yet available.
+Attempting to do so instantly makes the patch invalid and will abort patching
+entirely.
+
+**outputOffset:** this is a value that starts at zero. Every time a byte is
+written to the target file, this offset is incremented by one.
+
+**sourceRelativeOffset:** this is a value that starts at zero. SourceCopy will
+adjust this value by a signed amount, and then increment the value by one for
+each read performed by said command. Whenever a byte is read, this value must
+be at least zero and less than the source file size.
+
+**targetRelativeOffset:** this is a value that starts at zero. TargetCopy will
+adjust this value by a signed amount, and then increment the value by one for
+each read performed by said command. Whenever a byte is read, this value must
+be at least zero and less than the outputOffset.
+
+## Repeat
+
+Commands repeat until the end of the patch. This can be detected by testing the
+patch read location, and stopping when offset() >= size() - 12, where 12 is the
+number of bytes in the patch footer.
+
+    void action() {
+      uint64 data = decode();
+      uint64 command = data & 3;        // low two bits select the action
+      uint64 length = (data >> 2) + 1;  // stored as length - 1
+    }
+
+## SourceRead
+
+This command copies bytes from the source file to the target file.
+Since both the patch creator and applier will have access to the entire source
+file, the actual bytes to output do not need to be stored here.
+
+This command is rarely useful in delta patch creation, and is mainly intended to
+allow for linear-based patchers. However, at times it can be useful even in
+delta patches when data is the same in both source and target files at the same
+location.
+
+    void sourceRead() {
+      while(length--) {
+        // source and target are read at the same position
+        target[outputOffset] = source[outputOffset];
+        outputOffset++;
+      }
+    }
+
+## TargetRead
+
+When a file is modified, new data is created. This command stores said data so
+that it can be written to the target file. This time, the actual data is not
+available to the patch applier, so it is stored directly inside the patch.
+
+    void targetRead() {
+      while(length--) {
+        target[outputOffset++] = read();  // next literal byte from the patch
+      }
+    }
+
+## SourceCopy
+
+This command treats the entire source file as a dictionary, similarly to LZ
+compression. An offset is supplied to seek the sourceRelativeOffset to the
+desired location, and then data is copied from said offset to the target file.
+
+    void sourceCopy() {
+      uint64 data = decode();
+      // lowest bit is the sign, remaining bits are the absolute offset
+      sourceRelativeOffset += (data & 1 ? -1 : +1) * (data >> 1);
+      while(length--) {
+        target[outputOffset++] = source[sourceRelativeOffset++];
+      }
+    }
+
+## TargetCopy
+
+This command treats all of the data that has already been written to the target
+file as a dictionary. By referencing already written data, we can optimize
+repeated data in the target file that does not exist in the source file.
+
+This can allow for efficient run-length encoding. For instance, say 16MB of
+0x00s appear in a row in only the target file. We can use TargetRead to write a
+single 0x00. Now we can use TargetCopy to point at this byte, and set the length
+to 16MB-1. The effect will be that the target output size grows as the command
+runs, thus repeating the data.
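The run-length effect described above can be sketched in standalone C++. This is an illustration under assumptions rather than a reference implementation: `targetCopyRun` and `zeroRun` are hypothetical names, the target is modelled as a growing `std::vector<uint8_t>`, and the read position is passed in as an absolute offset instead of being derived from targetRelativeOffset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper mirroring the TargetCopy loop: append `length` bytes to
// `target`, reading from `target` itself starting at `readOffset`.
// Bytes are copied one at a time, so the read cursor may run into bytes that
// this same call has just written.
void targetCopyRun(std::vector<uint8_t>& target, size_t readOffset, size_t length) {
  while (length--) {
    // Reading at or past the current output offset makes the patch invalid.
    if (readOffset >= target.size()) {
      throw std::out_of_range("TargetCopy read beyond output offset");
    }
    uint8_t byte = target[readOffset++];  // copy first: push_back may reallocate
    target.push_back(byte);
  }
}

// Build `count` zero bytes the way the text describes: one literal byte
// (TargetRead), then a TargetCopy of length count - 1 pointing at it.
std::vector<uint8_t> zeroRun(size_t count) {
  assert(count >= 1);
  std::vector<uint8_t> target;
  target.push_back(0x00);               // TargetRead, length 1
  targetCopyRun(target, 0, count - 1);  // TargetCopy, length count - 1
  return target;
}
```

The same loop repeats longer patterns too: starting from the two bytes 1, 2 and copying four bytes from offset 0 leaves 1, 2, 1, 2, 1, 2 in the target. The copy has to proceed byte by byte for this to work; a bulk copy of a fixed source range would not see the bytes written during the command.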
+
+    void targetCopy() {
+      uint64 data = decode();
+      // lowest bit is the sign, remaining bits are the absolute offset
+      targetRelativeOffset += (data & 1 ? -1 : +1) * (data >> 1);
+      while(length--) {
+        target[outputOffset++] = target[targetRelativeOffset++];
+      }
+    }
+
+## Footer
+
+Checksum information appears at the bottom of the file. The idea is to allow a
+patcher to calculate this information as the patch is being produced. The source
+checksum verifies that the input file is correct, and the target checksum
+verifies that the patch has been applied successfully. Finally, the patch itself
+has a checksum. The patch checksum is the checksum of every byte before it. In
+other words, it does not include the last four bytes, which hold the patch
+checksum itself. This ensures the patch itself has not been corrupted.
+
+Note that checksums are stored in CRC32 format. The intention of checksums is to
+guard against corruption and against patching the wrong file. The idea was to
+keep the file format simple, so cryptographically secure hashes were not used
+here. If security is a grave concern, SHA256 or better hashes can be stored in
+the metadata. Otherwise, beat is not the right file format for such uses.
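As an illustration of the patch checksum rule, the sketch below verifies that the last four bytes of a patch match the CRC32 of everything before them. It rests on two assumptions that the text above does not spell out: that "CRC32" means the common CRC-32 variant used by zlib (reflected polynomial 0xEDB88320), and that each uint32 is stored least significant byte first. The names `bpsCrc32`, `readU32LE` and `patchChecksumMatches` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (assumed variant).
uint32_t bpsCrc32(const uint8_t* data, size_t size) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < size; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++) {
      // XOR in the polynomial only when the bit shifted out is set.
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Read a 32-bit value stored least significant byte first (assumed byte order).
uint32_t readU32LE(const std::vector<uint8_t>& bytes, size_t pos) {
  assert(pos + 4 <= bytes.size());
  return uint32_t(bytes[pos]) | uint32_t(bytes[pos + 1]) << 8 |
         uint32_t(bytes[pos + 2]) << 16 | uint32_t(bytes[pos + 3]) << 24;
}

// True if the final four bytes equal the CRC-32 of every byte before them.
bool patchChecksumMatches(const std::vector<uint8_t>& patch) {
  if (patch.size() < 4) return false;
  size_t body = patch.size() - 4;
  return bpsCrc32(patch.data(), body) == readU32LE(patch, body);
}
```

The source and target checksums could be checked with the same `bpsCrc32` helper, applied to the whole source file before patching and to the whole target file afterwards, and compared against the two uint32 values that precede the patch checksum in the footer.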