Binary File Format Compatibility

Most of the software applications I have worked on have demonstrated the following properties:
  • the application has a longevity far beyond the original forecast,
  • the binary file formats change dramatically over time,
  • these binary files can be shared between different versions of the same application,
  • more and more binary file handling code is developed to handle the different file formats,
  • sooner or later, applications break trying to read the various binary file versions.
This article proposes a method to prevent such breakages.
First I will present a scenario, one that I have faced on three occasions, which shows the sort of nightmare that can occur. Then I will discuss the solutions I have personally used to reduce this nightmare to something manageable, and which let you sleep at night.
I don't claim any rights over these techniques; some other very well known applications use similar structures. Nor do I pretend that they are 'best practices'; they are simply 'good enough for me'.

The Scenario

You take on a job to develop a Windows based program for a company that has previously developed scientific equipment on a proprietary platform. The project involves connecting a PC to the proprietary hardware, and storing the waveform data to hard disk. The system also requires a database, and is connected to a LAN so that acquired data can be viewed on a normal PC.
Five years later, you're still working part-time on the contract, you're into the tenth release, and now there are three proprietary platforms to connect to, the waveform data can be 8 bit or 16 bit, and from four to twenty channels. Some of the first release software is still in use.
One of the proprietary systems stores the waveform data locally, which is then downloaded to the PC, so it also uses your binary file format libraries. The PC software has evolved a sophisticated backup system which stores the waveforms and database data to CD-ROM, and retrieves them by advising the user which CD-ROM is needed. The database structure has had to evolve to take on ten new fields in two new tables.
External companies have developed extensive analysis and reporting software using your binary file format libraries to read in the raw data. In two months' time the latest hardware product will be released, which stores 32 channels of data with 32 bit resolution.
So what magic did you use to avoid compatibility nightmares? How do you know you're not going to break all those applications (especially those developed independently) when you change the binary file format yet again?
The next section presents some solutions to these difficult problems, but first, a few words of warning.
There is no 'all terrain' solution to the problem. In the scenario just described it won't be possible for old 8 bit readers (the first release) to handle 32 bit resolution (the eleventh release). It may be that you were forward thinking enough to store the original data in 8 bits and represent the data internally to the application as 16 bits, but that's as far as it goes. We just want to make sure that release one software doesn't die trying to read the new format, and that release eleven can still read release one binary files. Even though release one will not be able to read the release eleven waveform data, it should still be able to read the database fields and other data types in the binary file format which have not radically changed. Finally, we want the release one application to tell the user that it can't read the waveform data, and tell him or her exactly which release they need to upgrade to.

The Macro Solution

There are two basic techniques which can be used to solve such a binary file format problem. Firstly, we can distinguish between different types of data – the macro solution. In our scenario we have to store database data, and waveform data. Perhaps tomorrow we'll have to store audio data too. Secondly, we can determine the raw data file format for each data type – the micro solution, which I'll cover later.
Don't be confused about the database data: that goes in the database of course, probably using SQL. Here I'm talking about the backup or storage binary file format. If user A stores a waveform file, with database information and text annotations, and then sends it to user B for analysis, then the database information and text annotations need to be read from the file format, because they won't be in user B's database.
The macro solution uses a compound file approach. It is probable that each separate data type is viewed individually, so the application will want to pick out only those elements which are needed for the view. Some of these data types may already have a 'standard' binary file format, such as an image or audio file.
What is the de facto standard cross-platform compound file format, with freely available libraries for most major programming languages? Answer: the Zip file format, developed by the late Phil Katz and distributed by PKWARE. This file format, which is both mature and stable, allows you to bundle binary or text files in multiple directories, and additionally allows those files to be compressed, which reduces disk space consumption.
Modern examples of the use of this file format are the Java Archive file (jar) which uses the Zip format to store manifest information, and numerous class files. Similarly OpenOffice.org uses this file format to store a document's styles, content, and external binary files, such as images and applets.
Some programming languages, such as Java, provide Zip file handling as part of their standard libraries. For other languages, such as C#, C and C++, there are both open source and commercial libraries available. A web search for Zip libraries in your language is a good starting place.
The next step is to define a file structure for all the binary (and text) files that make up the data of a single measurement. In our scenario, release one started with just two elements: the database fields, and the waveform data. By release eleven, there may also be pixel images (say jpeg), and audio files too.
The directory structure of the Zip file format is extremely helpful when designing an overall binary file format, because directories can so easily be extended with nested sub-directories. One important point is to always supply a header or manifest which can be read quickly, without having to read large file streams, and which supplies versioning information to the application. For example in version one we would have:
  /version.inf
  /db/record.rec
  /data/waveform.sig
The application can read /version.inf to quickly establish which software modules and data structures will be needed to read the files. Probably /version.inf will contain versioning information for each data type, for example:
  db.version=1.0.0
  data.version=1.0.0
If you add well known, stable and mature file formats, such as .png or .jpeg, you probably wouldn't put versioning information in your header, but you might be well advised to at least add a MIME descriptor which the application can recognise:
  .jpeg=image/jpeg
  .png=image/png
so that the application knows what to expect, including the file extensions.
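As a rough illustration of the layout above, here is a minimal sketch in Python, whose standard library includes a zipfile module. The function names, the placeholder record and waveform contents, and the use of entry names without a leading slash are my own assumptions for the example, not part of any released format.

```python
import io
import zipfile

def write_measurement(waveform: bytes, record: bytes) -> bytes:
    """Bundle one measurement into an in-memory Zip archive (release-one layout)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The manifest is small, so it can be read without touching the big entries.
        zf.writestr("version.inf", "db.version=1.0.0\ndata.version=1.0.0\n")
        zf.writestr("db/record.rec", record)
        zf.writestr("data/waveform.sig", waveform)
    return buf.getvalue()

def read_manifest(archive: bytes) -> dict:
    """Read only version.inf; the waveform entry is never decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        text = zf.read("version.inf").decode("utf-8")
    manifest = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            manifest[key.strip()] = value.strip()
    return manifest

# Placeholder payloads, purely for illustration.
archive = write_measurement(waveform=bytes(1024), record=b"placeholder")
manifest = read_manifest(archive)
print(manifest["data.version"])
```

The point of the sketch is that the reader opens the archive, pulls out one small entry, and decides what to do next before any waveform data is read.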

From Release One to Release Eleven

What does this format look like by release eleven? Perhaps the /version.inf file now includes information about the waveform data size, sampling frequency, and number of channels, and the new notes function (see below):
  db.version=2.3.1
  data.version=3.1.4
  notes.version=1.1.0
  data.size=16
  data.frequency=1024
  data.channels=20
For each new item in the header file a sensible default value should be chosen which can be used when this information is missing in earlier file formats, for example:
  notes.version=1.0.0
  data.size=8
  data.frequency=256
  data.channels=4
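One possible way to apply these defaults, sketched in Python, is to start from a table of default values and let whatever the header file contains override them. The name parse_manifest and the decision to keep every value as a string are assumptions made for this example only.

```python
# Defaults used when an older header file does not mention an item.
DEFAULTS = {
    "notes.version": "1.0.0",
    "data.size": "8",
    "data.frequency": "256",
    "data.channels": "4",
}

def parse_manifest(text: str) -> dict:
    """Parse key=value lines, falling back to DEFAULTS for missing keys."""
    values = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # ignore blank or malformed lines rather than failing
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values

release_one = parse_manifest("db.version=1.0.0\ndata.version=1.0.0\n")
release_eleven = parse_manifest(
    "db.version=2.3.1\ndata.version=3.1.4\nnotes.version=1.1.0\n"
    "data.size=16\ndata.frequency=1024\ndata.channels=20\n"
)
print(release_one["data.channels"], release_eleven["data.channels"])
```

With this approach the rest of the application never needs to ask whether a key was present; it always sees a complete set of values.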
Text annotations or notes can now be added to the waveform data file, and the waveform data file has been split into segments each 60 seconds long:
  /version.inf
  /db/record.rec
  /data/wf0000.sig
  /data/wf0001.sig
  /data/wf0002.sig
  /data/wf0003.sig
  /data/wf0004.sig
  /data/wf0005.sig
  /data/wf0006.sig
  /data/wf0007.sig
  /data/wf0008.sig
  /notes/notes.rec
Clearly, this macro solution can be extended in the future by adding items to the header or manifest, and by creating new directories or even sub-directories. Once the application has read the header, it knows what other file format modules it will need to use, and also where it expects to find the files in the zip file itself.
Be warned, though: in this example I deliberately changed the number and names of the waveform data files. This is incompatible with previous releases of the software, which expected a single, differently named file. This does happen in real life: perhaps the file size became difficult to handle when long samples were taken, or the increased sample size and rate in later models made the same maximum timespan too big to handle.
At least the earlier released software can tell the user that he or she will need version 3.1 or later to see the data correctly. Of course, release eleven can handle release one file formats.
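The check an older release might perform can be sketched as follows. This is only an illustration: the constant SUPPORTED_DATA_VERSION, the function names, the message wording, and the rule that only the major and minor numbers decide compatibility are all assumptions of mine.

```python
from typing import Optional

# Highest data format this hypothetical release-one reader understands.
SUPPORTED_DATA_VERSION = (1, 0)

def parse_version(text: str) -> tuple:
    """Turn '3.1.4' into (3, 1, 4) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def upgrade_message(manifest: dict) -> Optional[str]:
    """Return None if the waveform data is readable, else a message for the user."""
    found = parse_version(manifest.get("data.version", "1.0.0"))
    major_minor = found[:2]
    if major_minor <= SUPPORTED_DATA_VERSION:
        return None
    needed = ".".join(str(n) for n in major_minor)
    return ("This file stores waveform data in format " + needed
            + "; please upgrade to a release that supports version "
            + needed + " or later.")

print(upgrade_message({"db.version": "2.3.1", "data.version": "3.1.4"}))
```

Comparing tuples of integers rather than strings avoids the classic mistake of treating version 10.0 as older than version 9.0.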

The Micro Solution

The next stage of the solution works at the single file level. Each data file will contain a series of data structures.
Here we will look at two examples: the database file, and the waveform file, which we'll look at in a moment. In the database file we want to store all the fields and their values. We could use a text file, similar to the manifest file of the previous section, but let's say the user can store newlines in description fields, so we wouldn't reliably know where a field value ends.
We solve the problem by using a tagging mechanism. We define a tag as an integer identifier that is unique within each file, followed by a field length, followed by the field data:
  Identifier  Length  Data...
The value of the identifier isn't important, just as long as it is unique to the specific file. The programming language you use can help here, for example with an enum in C, C++, C# or Java 1.5; otherwise you can define constants.
So, from the database we store an acquisition machine number, the date, the sample title, and a description, each containing its own data:
  Field        Identifier  Length  Data
  Machine      1           7       'IRQ2244'
  Date         2           14      '10032004175203'
  Title        3           11      'Test sample'
  Description  4           61      'Test using a sine wave generator'
                                   'John Leach Chief technician'
For release one we can set the value 4 as the last known identifier for the database.
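To make the tagging mechanism concrete, here is a minimal Python sketch that writes the four release-one database fields. The article does not fix the byte-level encoding, so the 16 bit identifier, 32 bit length and little-endian byte order used here are assumptions, as are the names DbTag, encode_field and encode_record. I have also assumed that the two description lines are separated by a carriage return and line feed, which is what makes the length come to 61.

```python
import struct
from enum import IntEnum

class DbTag(IntEnum):
    """Identifiers for the database file; 4 is the last known one in release one."""
    MACHINE = 1
    DATE = 2
    TITLE = 3
    DESCRIPTION = 4

# Assumed layout: 16 bit identifier, 32 bit length, little-endian, no padding.
TAG_HEADER = struct.Struct("<HI")

def encode_field(tag: int, data: bytes) -> bytes:
    """Wrap one field as identifier + length + data."""
    return TAG_HEADER.pack(tag, len(data)) + data

def encode_record(fields: dict) -> bytes:
    """Encode all fields one after another, in insertion order."""
    return b"".join(
        encode_field(tag, text.encode("ascii")) for tag, text in fields.items()
    )

record = encode_record({
    DbTag.MACHINE: "IRQ2244",
    DbTag.DATE: "10032004175203",
    DbTag.TITLE: "Test sample",
    # The embedded newline is harmless because the length says where the field ends.
    DbTag.DESCRIPTION: "Test using a sine wave generator\r\nJohn Leach Chief technician",
})
print(len(record))
```

Because every field carries its own length, the newline inside the description no longer causes any ambiguity.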
The waveform data consists of a header, and a sequence of raw data blocks. The header partially repeats the manifest described before, but also specifies the block length in samples, and the total number of samples:
  Field         Identifier  Length  Data
  Bits          1           1       8
  Frequency     2           2       256
  Channels      3           1       32
  Block length  4           2       4
  Total         5           4       1123
The raw data block has just one tag which defines its length:
  Field       Identifier  Length  Data
  Data block  6           8192    Data...
and would be repeated 1123 times.
In this case, for the waveform data the value of the last known identifier is 6.
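The waveform header differs from the database record in that its values are small binary integers whose width is given by the length field. The following Python sketch encodes and decodes the header values listed above. As before, the 16 bit identifier, 32 bit length and little-endian byte order are assumptions of mine, and encode_uint and decode_header are hypothetical helper names.

```python
import struct

# Assumed layout: 16 bit identifier, 32 bit length, little-endian.
TAG_HEADER = struct.Struct("<HI")

def encode_uint(tag: int, value: int, width: int) -> bytes:
    """Store an unsigned integer in exactly `width` bytes behind its tag."""
    return TAG_HEADER.pack(tag, width) + value.to_bytes(width, "little")

header = b"".join([
    encode_uint(1, 8, 1),      # Bits
    encode_uint(2, 256, 2),    # Frequency
    encode_uint(3, 32, 1),     # Channels
    encode_uint(4, 4, 2),      # Block length
    encode_uint(5, 1123, 4),   # Total
])

def decode_header(raw: bytes) -> dict:
    """Map each identifier to its integer value, whatever width was used."""
    values = {}
    offset = 0
    while offset < len(raw):
        tag, length = TAG_HEADER.unpack_from(raw, offset)
        offset += TAG_HEADER.size
        values[tag] = int.from_bytes(raw[offset:offset + length], "little")
        offset += length
    return values

print(decode_header(header))
```

A side effect of reading the width from the file is that a later release could widen a field, say from one byte to two, without older readers mis-parsing the rest of the header.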
So what do we achieve by 'wrapping' each data element? We achieve isolation. Any release of the application can now walk along the data, reading the identifier and length, and if it recognises the identifier it can read in the data. If it does not recognise the identifier, it at least knows the data length, which it can skip over to the next data element.
This effectively provides both forward and backward binary compatibility, because older applications will simply ignore newer data elements, but still retain and display known data elements. Similarly, newer applications can ignore older tags if necessary, perhaps where the tag has become redundant or insignificant. The newer applications can also generate missing expected data tags with default values, as we will see next.
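The 'walk along the data' loop might look like the following Python sketch, written from the point of view of a release-one reader that knows only identifiers 1 to 4. The header layout (16 bit identifier, 32 bit length, little-endian), the names KNOWN_TAGS, read_known_fields and tagged, and the sample field with identifier 5 are all assumptions for the sake of the example.

```python
import io
import struct

# Assumed layout: 16 bit identifier, 32 bit length, little-endian.
TAG_HEADER = struct.Struct("<HI")

# Identifiers this hypothetical release-one reader recognises.
KNOWN_TAGS = {1: "machine", 2: "date", 3: "title", 4: "description"}

def read_known_fields(stream) -> dict:
    """Read recognised fields and skip over everything else."""
    fields = {}
    while True:
        header = stream.read(TAG_HEADER.size)
        if not header:
            break  # clean end of file
        if len(header) < TAG_HEADER.size:
            raise ValueError("truncated tag header")
        tag, length = TAG_HEADER.unpack(header)
        if tag in KNOWN_TAGS:
            data = stream.read(length)
            if len(data) < length:
                raise ValueError("truncated field data")
            fields[KNOWN_TAGS[tag]] = data
        else:
            # Unknown identifier: the length tells us how far to jump.
            stream.seek(length, io.SEEK_CUR)
    return fields

def tagged(tag: int, data: bytes) -> bytes:
    return TAG_HEADER.pack(tag, len(data)) + data

# A file from a newer release, containing a field (5) unknown to this reader.
newer_file = tagged(1, b"IRQ2244") + tagged(5, b"BK000022") + tagged(3, b"Test sample")
fields = read_known_fields(io.BytesIO(newer_file))
print(fields)
```

The reader never inspects the contents of the unknown field, so it does not matter what a later release chooses to put there.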

From Release One to Release Eleven and Back

So what happens to these binary file formats in later releases? First the database. In a later release we want to add information as to where the data sample has been backed up, and where it has been archived. Backup means that the file remains on disk and in the database, but is also stored on a CD-ROM. Archiving means that the file has been removed from disk and is only available on CD-ROM, though the database record remains.
We need to add four more fields, thus:
  Field             Identifier  Length  Data
  Backup CD-ROM     5           8       'BK000022'
  Backup filename   6           12      'wf011402.bkf'
  Archive CD-ROM    7           8       'AR000003'
  Archive filename  8           12      'wf000113.arf'
Since previous releases didn't have these fields we can use the empty string as a default.
Older applications will skip these fields, as they wouldn't know what to do with them anyway, because they had no backup or archiving functionality.
Next the waveform data. In the first release, the user could only capture data in continuous time. However, user feedback showed that a stop/start function was required, perhaps to check the cabling. In later releases, a new time data element was added each time the machine (re)started recording:
  Field       Identifier  Length  Data
  Start time  7           18      '10032004151733.003'
Older applications won't store this field, so some 'null' default should be used, such as the Unix epoch, or perhaps the empty string.
Newer applications display this information, when present, in the trace view, whereas older applications will still work with the newer waveform file, but will skip over this extra information.
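A newer application could handle the start time field along the lines of this Python sketch, which falls back to the Unix epoch when the field is missing or empty. I am assuming here that the timestamp is laid out as day, month, year, hours, minutes, seconds and milliseconds; the example value does not settle the day and month order, and parse_start_time is a hypothetical name.

```python
from datetime import datetime

# Assumed layout of the timestamp: DDMMYYYYhhmmss.mmm
START_TIME_FORMAT = "%d%m%Y%H%M%S.%f"

# The 'null' default for files written before the field existed.
UNIX_EPOCH = datetime(1970, 1, 1)

def parse_start_time(raw: bytes) -> datetime:
    """Decode a start time field, or return the default if it is absent."""
    if not raw:
        return UNIX_EPOCH
    return datetime.strptime(raw.decode("ascii"), START_TIME_FORMAT)

print(parse_start_time(b"10032004151733.003"))
print(parse_start_time(b""))
```

The trace view can then compare the result against UNIX_EPOCH to decide whether there is a real start time worth displaying.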
This concludes the article. I hope some of this information will be of use on your next application project.