mirror of
https://github.com/Alcaro/Flips.git
synced 2026-03-22 01:54:26 -05:00
Add BPS spec
This commit is contained in:
parent
a2184545b5
commit
2be3f60b48
238
bps_spec.md
Normal file
238
bps_spec.md
Normal file
|
|
@ -0,0 +1,238 @@
|
|||
# BPS File Format Specification
|
||||
|
||||
_by byuu. Public domain._ https://byuu.org/
|
||||
|
||||
BPS patches encode the differences between two files.
|
||||
|
||||
string "BPS1"
|
||||
number source-size
|
||||
number target-size
|
||||
number metadata-size
|
||||
string metadata[metadata-size]
|
||||
repeat {
|
||||
number action | ((length - 1) << 2)
|
||||
action 0: SourceRead {
|
||||
}
|
||||
action 1: TargetRead {
|
||||
byte[] length
|
||||
}
|
||||
action 2: SourceCopy {
|
||||
number negative | (abs(offset) << 1)
|
||||
}
|
||||
action 3: TargetCopy {
|
||||
number negative | (abs(offset) << 1)
|
||||
}
|
||||
}
|
||||
uint32 source-checksum
|
||||
uint32 target-checksum
|
||||
uint32 patch-checksum
|
||||
|
||||
## Synonyms
|
||||
|
||||
A few terms are used interchangeably as appropriate.
|
||||
|
||||
* source = original file (input)
|
||||
* target = modified file (output)
|
||||
* number = variable-length integer encoding
|
||||
|
||||
## Linear creation
|
||||
|
||||
Rather than encoding a list of changes to files (insert data here, delete data
|
||||
here, modify data here, ...); beat patches encode the steps needed to create a
|
||||
new file from an old file. That is to say, the target file starts off as an
|
||||
empty, zero-byte file. The patch commands tell us how to write each and every
|
||||
byte of the target file sequentially.
|
||||
|
||||
## Variable-length number encoding
|
||||
|
||||
Rather than limit the maximum file size supported to 16MB (24-bit) or 4GB
|
||||
(32-bit), beat patches use a variable-length encoding to support any number of
|
||||
bits, and thus, any possible file size.
|
||||
|
||||
The basic idea is that we encode the lowest seven bits of the number, and then
|
||||
the eighth bit of each byte is a flag to say whether the full number has been
|
||||
represented or not. If set, this is the last byte of the number. If not, then
|
||||
we shift out the low seven bits and repeat until the number is fully encoded.
|
||||
|
||||
One last optimization is to subtract one after each encode. Without this, one
|
||||
could encode '1' with 0x81 or 0x01 0x80, producing an ambiguity.
|
||||
|
||||
Decoding is the inverse of the above process.
|
||||
|
||||
Below are C++ implementations of this idea. Note that we are using uint64 for
|
||||
the data type: this will limit beat patches created with these algorithms to
|
||||
64-bit file sizes. If 128-bit integers were available, they could be used
|
||||
instead. Of course, it's silly to even imagine patching a file larger than 16
|
||||
exabytes, but beat does allow it.
|
||||
|
||||
**Encoding**
|
||||
|
||||
void encode(uint64 data) {
|
||||
while(true) {
|
||||
uint8 x = data & 0x7f;
|
||||
data >>= 7;
|
||||
if(data == 0) {
|
||||
write(0x80 | x);
|
||||
break;
|
||||
}
|
||||
write(x);
|
||||
data--;
|
||||
}
|
||||
}
|
||||
|
||||
**Decoding**
|
||||
|
||||
uint64 decode() {
|
||||
uint64 data = 0, shift = 1;
|
||||
while(true) {
|
||||
uint8 x = read();
|
||||
data += (x & 0x7f) * shift;
|
||||
if(x & 0x80) break;
|
||||
shift <<= 7;
|
||||
data += shift;
|
||||
}
|
||||
return data;
|
||||
}
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
## Header
|
||||
|
||||
First, we have the file format marker, "BPS1". We then encode the source and
|
||||
target file sizes. Next, we encode optional metadata. If no metadata is present,
|
||||
store an encoded zero here (0x80 per above.) Otherwise, specify the length of
|
||||
the metadata.
|
||||
|
||||
Note that officially, metadata should be XML version 1.0 encoding UTF-8 data,
|
||||
and the metadata-size specifies the actual length. As in, there is no
|
||||
null-terminator after the metadata. However, the actual contents here are
|
||||
entirely domain-specific, so literally anything can go here and the patch will
|
||||
still be considered valid.
|
||||
|
||||
## Transfer lengths
|
||||
|
||||
We store lengths as length - 1 to prevent ambiguities. There is no sense in
|
||||
encoding a command that ultimately does nothing. This also slightly helps with
|
||||
patch size reduction in some cases.
|
||||
|
||||
## Relative offsets
|
||||
|
||||
beat patches keep track of the current file offsets in both the source and
|
||||
target files separately. Reading from either increments their respective offsets
|
||||
automatically.
|
||||
|
||||
As such, offsets are encoded relatively to the current positions. These offsets
|
||||
can move the read cursors forward or backward. To support negative numbers with
|
||||
variable-integer encoding requires us to store the negative flag as the lowest
|
||||
bit, followed by the absolute value (eg abs(-1) = 1)
|
||||
|
||||
Note, and this is very important, for obvious reasons you cannot read from
|
||||
before the start or after the end of the file. Further, you cannot read beyond
|
||||
the current target write output offsets, as that data is not yet available.
|
||||
Attempting to do so instantly makes the patch invalid and will abort patching
|
||||
entirely.
|
||||
|
||||
**outputOffset:** this is a value that starts at zero. Every time a byte is
|
||||
written to the target file, this offset is incremented by one.
|
||||
|
||||
**sourceRelativeOffset:** this is a value that starts at zero. SourceCopy will
|
||||
adjust this value by a signed amount, and then increment the value by one for
|
||||
each read performed by said command. This value can never be less than zero, or
|
||||
greater than or equal to the source file size.
|
||||
|
||||
**targetRelativeOffset:** this is a value that starts at zero. TargetCopy will
|
||||
adjust this value by a signed amount, and then increment the value by one for
|
||||
each read performed by said command. This value can never be less than zero, or
|
||||
greater than or equal to the outputOffset.
|
||||
|
||||
## Repeat
|
||||
|
||||
Commands repeat until the end of the patch. This can be detected by testing the
|
||||
patch read location, and stopping when offset() >= size() - 12. Where 12 is the
|
||||
number of bytes in the patch footer.
|
||||
|
||||
void action() {
|
||||
uint64 data = decode();
|
||||
uint64 command = data & 3;
|
||||
uint64 length = (data >> 2) + 1;
|
||||
}
|
||||
|
||||
## SourceRead
|
||||
|
||||
This command copies bytes from the source file to the target file. Since both
|
||||
the patch creator and applier will have access to the entire source file, the
|
||||
actual bytes to output do not need to be stored here.
|
||||
|
||||
This command is rarely useful in delta patch creation, and is mainly intended to
|
||||
allow for linear-based patchers. However, at times it can be useful even in
|
||||
delta patches when data is the same in both source and target files at the same
|
||||
location.
|
||||
|
||||
void sourceRead() {
|
||||
while(length--) {
|
||||
target[outputOffset] = source[outputOffset];
|
||||
outputOffset++;
|
||||
}
|
||||
}
|
||||
|
||||
## TargetRead
|
||||
|
||||
When a file is modified, new data is thus created. This command can store said
|
||||
data so that it can be written to the target file. This time, the actual data is
|
||||
not available to the patch applier, so it is stored directly inside the patch.
|
||||
|
||||
void targetRead() {
|
||||
while(length--) {
|
||||
target[outputOffset++] = read();
|
||||
}
|
||||
}
|
||||
|
||||
## SourceCopy
|
||||
|
||||
This command treats the entire source file as a dictionary, similarly to LZ
|
||||
compression. An offset is supplied to seek the sourceRelativeOffset to the
|
||||
desired location, and then data is copied from said offset to the target file.
|
||||
|
||||
void sourceCopy() {
|
||||
uint64 data = decode();
|
||||
sourceRelativeOffset += (data & 1 ? -1 : +1) * (data >> 1);
|
||||
while(length--) {
|
||||
target[outputOffset++] = source[sourceRelativeOffset++];
|
||||
}
|
||||
}
|
||||
|
||||
## TargetCopy
|
||||
|
||||
This command treats all of the data that has already been written to the target
|
||||
file as a dictionary. By referencing already written data, we can optimize
|
||||
repeated data in the target file that does not exist in the source file.
|
||||
|
||||
This can allow for efficient run-length encoding. For instance, say 16MB of
|
||||
0x00s appear in a row in only the target file. We can use TargetRead to write a
|
||||
single 0x00. Now we can use TargetCopy to point at this byte, and set the length
|
||||
to 16MB-1. The effect will be that the target output size grows as the command
|
||||
runs, thus repeating the data.
|
||||
|
||||
void targetCopy() {
|
||||
uint64 data = decode();
|
||||
targetRelativeOffset += (data & 1 ? -1 : +1) * (data >> 1);
|
||||
while(length--) {
|
||||
target[outputOffset++] = target[targetRelativeOffset++];
|
||||
}
|
||||
}
|
||||
|
||||
## Footer
|
||||
|
||||
Checksum information appears at the bottom of the file. The idea is to allow a
|
||||
patcher to calculate this information as the patch is being produced. The source
|
||||
checksum verifies that the input file is correct, and the target checksum
|
||||
verifies that the patch has been applied successfully. Finally, the patch itself
|
||||
has a checksum. The patch checksum is the checksum of every byte before it. In
|
||||
other words, it does not include the last four bytes for obvious reasons. This
|
||||
ensures the patch itself has not been corrupted.
|
||||
|
||||
Note that checksums are stored in CRC32 format. The intention of checksums is to
|
||||
verify against corruption and mistakenly incorrect files. The idea was to keep
|
||||
the file format simple, so cryptographically secure hashes were not used here.
|
||||
If security is a grave concern, SHA256 or better hashes can be stored in the
|
||||
manifest data. Otherwise, beat is not the right file format for such uses.
|
||||
Loading…
Reference in New Issue
Block a user