Supported File Formats

To check single files or collections of files, the digests computed for the original files must be stored in message digest files. These digest files are normally kept together with the original files, for example on magnetic tapes that are then moved to a safe place such as a bank safe, so that the backups can be restored and checked in an emergency. On the Internet, digest files are found predominantly on FTP and other download servers. Digest files are usually named after the original file with a distinctive suffix appended, such as .md5 for MD5 message digest files.

GNU File Format

The most common file format for message digest files on the Internet is the GNU format. The data is stored as text lines, each of which holds the calculated digest for exactly one original file. Every line has a fixed structure: the digest value, a SPACE character, an asterisk * indicating binary data, and finally the reference to the original file. Normally just the file name without a path is used, although the GNU format also permits a full path. If full paths such as C:\Temp\leisenfels_48.png are stored, the references might not resolve on another computer with a different file system structure, and the digests could not be checked there. Therefore only file names without paths should be stored. Files generated with the Digester software contain only names without paths. A sample message digest file leisenfels_48.png.md5 in GNU format looks as follows:

       497f82bf72503317c62ecf6b2645a735 *leisenfels_48.png
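The line structure described above can be split mechanically. The following is a minimal sketch, not the Digester implementation; the names GnuEntry and parse_gnu_line are hypothetical, and the handling of a second SPACE in place of the asterisk is an assumption borrowed from md5sum-style tools.

```python
import string
from typing import NamedTuple


class GnuEntry(NamedTuple):
    digest: str   # hexadecimal digest value
    binary: bool  # True if the file reference was prefixed with an asterisk
    name: str     # reference to the original file


def parse_gnu_line(line: str) -> GnuEntry:
    """Split one GNU-format line into digest, binary flag and file reference."""
    digest, sep, rest = line.strip().partition(" ")
    if not sep or not rest:
        raise ValueError(f"not a GNU digest line: {line!r}")
    if not digest or any(c not in string.hexdigits for c in digest):
        raise ValueError(f"digest is not hexadecimal: {digest!r}")
    binary = rest.startswith("*")
    # Drop the asterisk, or the second SPACE that text-mode lines use instead
    # (assumption: text-mode lines follow the md5sum convention).
    name = rest[1:] if rest[0] in "* " else rest
    return GnuEntry(digest.lower(), binary, name)


entry = parse_gnu_line("497f82bf72503317c62ecf6b2645a735 *leisenfels_48.png")
print(entry.digest, entry.binary, entry.name)
```

A file with several lines can be handled by calling the function once per non-empty line.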

The GNU format also allows the digests of multiple files in a directory to be stored together in a single file, which then contains one line per original file as described above. The Digester software supports creating such per-directory GNU files, although most digest files on the Internet cover a single original file. GNU digest file names usually carry a characteristic suffix representing the algorithm used, which is why the message digest file in the example is named leisenfels_48.png.md5. The Digester software appends these suffixes automatically depending on the configuration, so any base name may be used for message digest files. Because the BSD and GNU formats use the same file naming scheme, BSD and GNU files cannot be stored simultaneously. The newer Digester XML format provides more flexibility than the older GNU/BSD formats.
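A per-directory GNU file as described here could be produced roughly as follows. This is a sketch under stated assumptions: the function name gnu_lines_for_directory is hypothetical, each file is read into memory in one piece for brevity, and only plain file names without paths are written, as recommended above.

```python
import hashlib
import tempfile
from pathlib import Path


def gnu_lines_for_directory(directory: Path, algorithm: str = "md5") -> list[str]:
    """Return one GNU-format line per regular file, using names without paths."""
    lines = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue  # subdirectories are skipped in this sketch
        digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        lines.append(f"{digest} *{path.name}")
    return lines


# Small demonstration on a temporary directory with two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "first.txt").write_bytes(b"first file")
    (Path(tmp) / "second.txt").write_bytes(b"second file")
    print("\n".join(gnu_lines_for_directory(Path(tmp))))
```

Writing the returned lines to a file whose name ends in the algorithm suffix (for example .md5) would follow the naming scheme described above.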

BSD File Format

Another widely used file format for message digest files on the Internet is the BSD file format. As with GNU files, the digest values are stored as one text line per original file, but the line format is slightly different:

       MD5 (leisenfels_48.png) = 497f82bf72503317c62ecf6b2645a735

The first element names the algorithm used for the digest calculation (here MD5). After a SPACE character follows the reference to the original file in parentheses, then the equal sign, and finally the hexadecimal digest value. Normally just the file name without a path is used, although the BSD format also permits a full path. If full paths such as C:\Temp\leisenfels_48.png are stored, the references might not resolve on another computer with a different file system structure, and the digests could not be checked there. Therefore only file names without paths should be stored. Files generated with the Digester software contain only names without paths.
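The BSD line structure can be matched with a regular expression. The sketch below is illustrative only; BsdEntry and parse_bsd_line are hypothetical names, and the set of characters allowed in the algorithm name is an assumption.

```python
import re
from typing import NamedTuple


class BsdEntry(NamedTuple):
    algorithm: str  # for example "MD5"
    name: str       # reference to the original file
    digest: str     # hexadecimal digest value


# ALGORITHM (file reference) = hexdigest
# The greedy ".+" keeps parentheses that belong to the file name itself.
_BSD_LINE = re.compile(
    r"^(?P<algorithm>[A-Za-z0-9-]+) \((?P<name>.+)\) = (?P<digest>[0-9A-Fa-f]+)$"
)


def parse_bsd_line(line: str) -> BsdEntry:
    """Split one BSD-format line into algorithm, file reference and digest."""
    match = _BSD_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a BSD digest line: {line!r}")
    return BsdEntry(match["algorithm"], match["name"], match["digest"].lower())


bsd_entry = parse_bsd_line("MD5 (leisenfels_48.png) = 497f82bf72503317c62ecf6b2645a735")
print(bsd_entry.algorithm, bsd_entry.name, bsd_entry.digest)
```

Because the algorithm is named in every line, a reader of BSD files does not have to infer it from the file suffix.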

If digests for multiple original files are to be stored in a single message digest file, the same rules apply as for GNU files (one line per original file). The line structure of the BSD file format is closer to the mathematical notation of hash functions. A typical line can be read like this: applying the hash function MD5 to the parameter file leisenfels_48.png yields the value X.

BSD digest file names usually also carry a characteristic suffix representing the algorithm used, which is why the message digest file in the example is named leisenfels_48.png.md5. The Digester software appends these suffixes automatically depending on the configuration, so any base name may be used for message digest files. Because the BSD and GNU formats use the same file naming scheme, BSD and GNU files cannot be stored simultaneously. The newer Digester XML format provides more flexibility than the older GNU/BSD formats.

OpenPGP File Format

OpenPGP signatures are by far the most reliable mechanism for checking files from the Internet, since they allow both the integrity of a file to be checked and its origin to be determined. Each signature based on the OpenPGP standard is generated with one or more secret keys, whose key identifiers are included in the signature file. To verify a signature, the public key of the signature's producer is required. Public keys are usually downloaded from the website of the respective download provider and can easily be imported into the Digester software.

A typical signature file in accordance with the OpenPGP standard looks as follows:

-----BEGIN PGP SIGNATURE-----
Version: Digester 1.6.3

iEYEABECAAYFAk/4NgcACgkQJFLzC/zoww/htgCfY0Denq1XxXT83PGa9LcdAkeC
57AAoI7YXdmme2SrI5JrpxT/ueUwn3UT
=civA
-----END PGP SIGNATURE-----

Signature files may be created either in ASCII format (see listing) or in binary format. Binary files need less storage space. Inside Digester XML files, OpenPGP signatures are always stored in ASCII format, since binary data may cause problems there. Unlike the other formats presented here, each signature file contains exactly one signature; signature files with multiple signatures are not supported. The exact format of OpenPGP signatures is defined by RFC 4880.
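The relation between the ASCII and the binary form can be illustrated by decoding the listing. The following sketch only unwraps the ASCII armor; it does not verify the signature and it skips the checksum line starting with "=" instead of checking it. The name parse_armored_signature is hypothetical.

```python
import base64

ARMOR_BEGIN = "-----BEGIN PGP SIGNATURE-----"
ARMOR_END = "-----END PGP SIGNATURE-----"

SIGNATURE = """-----BEGIN PGP SIGNATURE-----
Version: Digester 1.6.3

iEYEABECAAYFAk/4NgcACgkQJFLzC/zoww/htgCfY0Denq1XxXT83PGa9LcdAkeC
57AAoI7YXdmme2SrI5JrpxT/ueUwn3UT
=civA
-----END PGP SIGNATURE-----
"""


def parse_armored_signature(text: str) -> tuple[dict[str, str], bytes]:
    """Return the armor headers and the decoded binary signature data."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) < 2 or lines[0] != ARMOR_BEGIN or lines[-1] != ARMOR_END:
        raise ValueError("not an ASCII-armored OpenPGP signature")
    headers = {}
    index = 1
    # Header lines ("Key: Value") run up to the first blank line.
    while index < len(lines) - 1 and lines[index]:
        key, _, value = lines[index].partition(": ")
        headers[key] = value
        index += 1
    # Join the Base64 body lines; the checksum line ("=....") is skipped.
    body = "".join(
        line for line in lines[index:-1] if line and not line.startswith("=")
    )
    return headers, base64.b64decode(body)


headers, binary_signature = parse_armored_signature(SIGNATURE)
print(headers)
print(len(binary_signature), "bytes of binary signature data")
```

The decoded data is noticeably shorter than the ASCII text, which is the storage advantage of the binary format mentioned above.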

Typical extensions for signature files are .asc and .sig, but arbitrary file extensions can be configured within the Digester software. To label generated signature files, a single-line text can be configured that is stored in the header of the ASCII signatures, for example the name of your company or organization.

Digester XML Format

Due to the limitations described above, the GNU and BSD file formats are not recommended for storing message digests. These older formats can still be a good solution for files on FTP servers, since customers are often better prepared for GNU/BSD files. Wherever possible, the new XML-based file format should be preferred; it can be used very efficiently together with the Digester software. It also allows a single digest file per original file, but digests can additionally be bundled in arbitrary combinations, even across drive boundaries.

Thus, with the Digester XML format it is possible to store the digest information for complete directory trees of a hard drive in a single message digest file, which gives a much clearer view than thousands of separate digest files. In addition, such files can be stored on arbitrary drives, since absolute file paths can be included. Here is an example of a Digester message digest file in XML format (jnlp-1_5-mr-spec.pdf.digest):

<?xml version="1.0" encoding="UTF-8"?>
<summary version="1.1" date="Sat Jul 07 15:13:43 CEST 2012" targets="1">
  <comment>Created with Leisenfels Digester</comment>
  <target relpath="jnlp-1_5-mr-spec.pdf" length="227496"
          modified="Tue May 29 10:56:56 CEST 2012" digests="6" pgpsigs="1">
      <digest algorithm="SHA-1" size="20"
              format="hex">9346cb18d8cf8d5066e661dc468540b14275285f</digest>
      <digest algorithm="SHA-1" size="20" pos="8192"
              format="hex">0ad56058ddd6fba142c38f4552bf4fe16f93103f</digest>
      <digest algorithm="SHA-1" size="20" pos="16384"
              format="hex">68bb5167eb62667d3f5ea9bfd529e438daed0586</digest>
      <digest algorithm="SHA-1" size="20" pos="32768"
              format="hex">36b31b4fe0a50557bf66f1976eb2c07b5228616f</digest>
      <digest algorithm="SHA-1" size="20" pos="65536"
              format="hex">57010aa5db867eecc6553b5842f04f158df09cb2</digest>
      <digest algorithm="SHA-1" size="20" pos="131072"
              format="hex">128e90d30b816addb24b1106195accdf78952a6a</digest>
      <pgpsig keyid="0x2452F30BFCE8C30F" keyname="Leisenfels Development"
              keyemail="devel@leisenfels.com"
      keyurl="http://ftp.leisenfels.com/security/pgpkeys/devel@leisenfels.com.asc"
              size="194" format="ascii">-----BEGIN PGP SIGNATURE-----
Version: Digester 1.6.3

iEYEABECAAYFAk/4NgcACgkQJFLzC/zoww/htgCfY0Denq1XxXT83PGa9LcdAkeC
57AAoI7YXdmme2SrI5JrpxT/ueUwn3UT
=civA
-----END PGP SIGNATURE-----</pgpsig>
   </target>
</summary>

As you can see, these XML files are also simple text files that can be opened with most text editors or text processing programs. The XML format has the advantage that a wide range of character encodings (character sets) can be used for the data (in this case UTF-8), so that, for example, Japanese characters can be used in file names. Every XML file consists of so-called tags, which are expressions in angle brackets; each tag is closed by a second tag of the same name with a leading "/" character. Between the two associated tags, either data or further structuring tags can be found. A tag can also be closed by placing the "/" character directly at the end of the opening tag, so that the second tag can be omitted (short form).

The XML files refer to the document format summary in version 1.0 or 1.1. The date attribute records the date and time at which the message digest file was written. When such XML files are moved or copied, the file system timestamp of the copy is normally lost; for this reason the time of writing is documented separately within each digest file.

The targets attribute gives the number of original files referenced by the digest file, since an arbitrary number of original files can be referenced. The comment tag that follows contains a description to identify the file. This text can be configured in the Digester software, so that notes for the FTP server or terms of use can be stored.

The digest values are stored within the target tag, which can be repeated any number of times; the example shows that digests were calculated only for the file jnlp-1_5-mr-spec.pdf. The relpath attribute (short for relative path) contains the plain file name without any path, as stored in GNU/BSD files. Such references can generally be resolved as long as the Digester XML file stays in the same directory as the original files. In addition, the Digester software can be configured to store the abspath attribute (short for absolute path), which holds the full access path to the original file, such as C:\Temp\jnlp-1_5-mr-spec.pdf.

Since absolute references can be resolved at least on the computer on which the Digester XML file was created, the digest file can be moved to any place on that computer and the data can still be processed correctly. For example, you could let the Digester software scan all operating system drives in a single pass, calculate the digests as configured, and store the overall results, including all digests and additional data, in one single Digester XML file. This gives a good overview, because no separate digest file has to be written for each original file or each directory, as is necessary with the GNU/BSD formats.

An important detail about the original files is the file length in bytes (length attribute). This information can indicate whether an examination of the contained digests needs to be started at all. For example, you could decide that the original files on an FTP server only need to be recalculated if either the timestamp or the length of a file has changed. This additional information, which is available only within Digester XML files, helps to avoid unnecessary digest checks or calculations. This can bring significant speed advantages, especially when original files on remote servers are checked against mirrored local examinee files.

Like the date attribute, the modified attribute contains a timestamp, but here it is the timestamp that the original file had when the digests were calculated. As described above, such additional information can help to decide whether checking must be performed or not. Because this date is stored within the digest file, changes to the original files can be detected very easily. Furthermore, timestamps at the level of the operating system are usually lost when original files are copied from one drive to another.
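The decision described in the last two paragraphs can be sketched as a cheap pre-check. The function name needs_recheck is hypothetical, and the sketch assumes that the caller has already converted the modified attribute into seconds since the epoch, since parsing time zone abbreviations such as CEST is not covered here. A result of False only means that length and timestamp are unchanged, not that the content is proven identical.

```python
import os
import tempfile
from typing import Optional


def needs_recheck(path: str, recorded_length: int,
                  recorded_mtime: Optional[float] = None) -> bool:
    """Return True if the digests of the file should be recalculated."""
    try:
        info = os.stat(path)
    except OSError:
        return True  # missing or unreadable file: treat as changed
    if info.st_size != recorded_length:
        return True  # the length differs from the length attribute
    if recorded_mtime is not None and int(info.st_mtime) != int(recorded_mtime):
        return True  # the timestamp differs (compared with one-second resolution)
    return False


# Small demonstration with a temporary file of five bytes.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "demo.bin")
    with open(demo_file, "wb") as handle:
        handle.write(b"hello")
    print(needs_recheck(demo_file, 5))  # recorded length matches
    print(needs_recheck(demo_file, 6))  # recorded length differs
```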

The digests attribute gives the number of digests that follow. Unlike GNU/BSD files, Digester XML files can contain digests calculated by different algorithms in any combination. Thus, for example, MD5 and SHA-1 digests can be stored together in one Digester XML file.

Now to the digests in the digest tags. The algorithm attribute names the hash algorithm used (MD5 or SHA-1 in the examples shown here). Each digest tag holds the information about exactly one digest, so a file contains as many digest sections as digests were stored. The size attribute denotes the length of the digest value in bytes; this information is important if the data from a Digester XML file is imported into a database with fixed-length fields. The format attribute denotes the encoding of the digest; currently the Digester software supports the values "hex" for hexadecimal digests and "base64" for digests in Base64 encoding. Base64 is often used, for example, to encode e-mail attachments, but digests should normally be stored in hexadecimal format. The digest value follows right before the closing digest tag. With the "hex" format an MD5 digest is written exactly as in GNU/BSD files; with the "base64" format the same digest would be written differently.
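The attributes described above can be read with a standard XML parser. The following is a sketch, not the Digester implementation: the function name read_digests and the record layout are hypothetical, the sample is a shortened Digester XML file with an MD5 and a SHA-1 digest, and the size attribute is interpreted as the length of the decoded digest in bytes.

```python
import base64
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<summary version="1.0" date="Sat Feb 07 17:10:18 CET 2009" targets="1">
  <target relpath="leisenfels_48.png" length="1906"
          modified="Sun Jul 13 16:51:04 CEST 2008" digests="4">
    <digest algorithm="MD5" size="16"
            format="hex">497f82bf72503317c62ecf6b2645a735</digest>
    <digest algorithm="MD5" size="16" pos="1024"
            format="hex">13e5314fc870d2ec9ef84515cc34abdc</digest>
    <digest algorithm="SHA-1" size="20"
            format="hex">6e09d8f73240cefcaf15167ce3749d8e99ecba63</digest>
    <digest algorithm="SHA-1" size="20" pos="1024"
            format="hex">59877ebbf5854b8ac798b46a7ab5fd80339eacd2</digest>
  </target>
</summary>
"""


def read_digests(xml_bytes: bytes) -> list[dict]:
    """Return one record per digest tag, with the digest decoded to raw bytes."""
    root = ET.fromstring(xml_bytes)
    records = []
    for target in root.findall("target"):
        digests = target.findall("digest")
        if len(digests) != int(target.get("digests")):
            raise ValueError("digests attribute does not match the number of digest tags")
        for digest in digests:
            text = (digest.text or "").strip()
            fmt = digest.get("format")
            if fmt == "hex":
                raw = bytes.fromhex(text)
            elif fmt == "base64":
                raw = base64.b64decode(text)
            else:
                raise ValueError(f"unsupported format: {fmt!r}")
            if len(raw) != int(digest.get("size")):
                raise ValueError("size attribute does not match the digest length")
            pos = digest.get("pos")
            records.append({
                "file": target.get("relpath"),
                "algorithm": digest.get("algorithm"),
                "pos": int(pos) if pos is not None else None,
                "raw": raw,
            })
    return records


records = read_digests(SAMPLE)
first = records[0]
print(first["algorithm"], "hex:   ", first["raw"].hex())
print(first["algorithm"], "base64:", base64.b64encode(first["raw"]).decode("ascii"))
```

Both printed values encode the same 16 bytes; only the "hex" form matches the text found in the GNU/BSD files.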

In addition to digests such as MD5 or SHA-1, OpenPGP signatures may also be stored within the Digester XML format (pgpsig element). Besides the signature itself, which is always stored in ASCII format, a variety of information can be stored to identify the OpenPGP keys used. The key attributes to be added can be selected within the Digester software; the example above contains keyid, keyname, keyemail and keyurl.

Another advantage of the Digester XML format is the capability to store intermediate digests, which can be processed efficiently with the Digester software. While GNU/BSD files can contain only a single digest for each original file, Digester XML files can store as many intermediate digests as desired. A digest tag without a pos attribute holds the final message digest value resulting from processing all bytes of the original file. The Digester software can additionally be configured to create interim results at defined positions; these intermediate digests are stored in the Digester XML file together with their exact position (pos attribute).

This has the advantage that deviations between the original file and the examinee file can often be detected more quickly than if the entire examinee file had to be processed. It exploits the fact that examinee files often differ from the original files already in the first data blocks. Especially when checking files on remote servers, this can bring substantial speed advantages, since less data must be transferred or processed.

The following example shows a Digester XML file with intermediate digests created after 1 kilobyte (pos="1024"). The mode for creating intermediate digests can be configured within the Digester software. Currently intermediate digests can be created linearly at fixed intervals (for example every 10 megabytes) or exponentially (for example starting at 16 kilobytes, then 32 kilobytes, 64 kilobytes etc.). For both modes a maximum number of intermediate digests can be configured, so that Digester XML files cannot grow without limit.

<?xml version="1.0" encoding="UTF-8"?>
<summary version="1.0" date="Sat Feb 07 17:10:18 CET 2009" targets="1">
   <comment>Created with Leisenfels Digester</comment>
    <target relpath="leisenfels_48.png" length="1906"
           modified="Sun Jul 13 16:51:04 CEST 2008" digests="4">
        <digest algorithm="MD5" size="16"
                format="hex">497f82bf72503317c62ecf6b2645a735</digest>
        <digest algorithm="MD5" size="16" pos="1024"
                format="hex">13e5314fc870d2ec9ef84515cc34abdc</digest>
        <digest algorithm="SHA-1" size="20"
                format="hex">6e09d8f73240cefcaf15167ce3749d8e99ecba63</digest>
        <digest algorithm="SHA-1" size="20" pos="1024"
                format="hex">59877ebbf5854b8ac798b46a7ab5fd80339eacd2</digest>
   </target>
</summary>
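How intermediate digests allow an early abort can be sketched as follows. This is an illustration under assumptions, not the Digester implementation: an intermediate digest at position pos is taken to be the digest of the first pos bytes, and the names exponential_positions, digests_at and matches are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, Optional


def exponential_positions(start: int, count: int) -> list[int]:
    """Positions that double each time, e.g. 16384, 32768, 65536."""
    return [start << i for i in range(count)]


def digests_at(stream: BinaryIO, positions: list[int],
               algorithm: str = "sha1") -> Iterator[tuple[Optional[int], str]]:
    """Yield (pos, hexdigest) for each position reached, then (None, final hexdigest)."""
    hasher = hashlib.new(algorithm)
    consumed = 0
    for pos in sorted(positions):
        chunk = stream.read(pos - consumed)
        hasher.update(chunk)
        consumed += len(chunk)
        if consumed < pos:
            break  # the data ends before this position
        # hexdigest() does not finalize the hash object, so updating can continue.
        yield pos, hasher.hexdigest()
    for chunk in iter(lambda: stream.read(65536), b""):
        hasher.update(chunk)
    yield None, hasher.hexdigest()


def matches(stream: BinaryIO, recorded: dict, algorithm: str = "sha1") -> bool:
    """Compare a stream against recorded digests (key None holds the final digest)."""
    positions = [pos for pos in recorded if pos is not None]
    for pos, value in digests_at(stream, positions, algorithm):
        if pos in recorded and recorded[pos] != value:
            return False  # stop early; the rest of the stream is never read
    return True


original = bytes(range(256)) * 16                      # 4096 bytes of sample data
recorded = dict(digests_at(io.BytesIO(original), [1024, 2048]))
examinee = io.BytesIO(b"X" + original[1:])             # differs in the very first byte
print(matches(examinee, recorded), "after reading", examinee.tell(), "bytes")
```

In the demonstration the examinee differs in the first block, so the comparison stops after the first intermediate position instead of reading all of the data.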

Unlike for GNU/BSD files, the extension of Digester XML files is not determined by the algorithm used, since any algorithms can be used in arbitrary combinations. The default extension can be configured in the Digester software (it defaults to .digest). You can also specify the .xml extension, for instance, which is normally used for XML files.

For the DTD (Document Type Definition) of the Digester XML format see section DTD Digester XML.

Additional Links