Working with File Formats

posted on 2008-08-03

Working with file formats is either great fun (when it works) or very frustrating (when it doesn’t). The last format I worked on was PDF. I had several PDF files and I wanted to extract some textual data from them.

The first step was to look at a file with a text editor: do I need to understand the way the file is stored, or is there a way to directly extract the data I need without understanding the whole thing? No luck here: PDF is a binary file format that usually looks like garbage when opened with a text editor.

The second step is then to look for the file format documentation. A bit of Google and Wikipedia and you can find the following link:

http://www.adobe.com/devnet/pdf/pdf_reference.html

There you’ll find the PDF File Format Reference… in PDF of course. After skipping the introductory blabla, you can go straight to the interesting sections, which describe the Syntax and the File Structure. Great! It’s not very often that a file format is officially documented, so it’s always nice to find some complete documentation, although additional C source code would be helpful as well.

Anyway, after a bit of coding, I was able to parse my first PDF. A PDF consists of a list of “objects” which are referenced by an ID and which, depending on their type, might contain text, graphics or font data. I was of course interested in the text data, but it was compressed with a so-called “filter”, which is basically zlib compression.
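To make the object and filter idea concrete, here is a minimal Python sketch (not the haXe code I actually wrote). It assumes the naive layout "N G obj ... stream ... endstream ... endobj" and only handles the /FlateDecode filter; the name inflate_streams is made up for this example, and a real parser should rely on the cross-reference table and the /Length entry rather than on regular expressions.

```python
import re
import zlib

# Naive patterns, good enough for a sketch but not a real PDF parser.
OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(pdf_bytes):
    """Map object number -> inflated stream bytes (None if zlib rejects it)."""
    results = {}
    for obj in OBJECT.finditer(pdf_bytes):
        obj_id, body = int(obj.group(1)), obj.group(3)
        stream = STREAM.search(body)
        if stream is None:
            continue  # object without a stream (dictionary, number, ...)
        header = body[:stream.start()]
        if b"/FlateDecode" not in header:
            continue  # some other filter, or no filter at all
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream"
            results[obj_id] = zlib.decompressobj().decompress(stream.group(1))
        except zlib.error:
            # Typical symptom of an encrypted stream
            results[obj_id] = None
    return results
```

A stream that fails to inflate even though it declares /FlateDecode is a good hint that something else, such as encryption, was applied on top of the compression.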

But after trying several zlib parameters to decompress the data, I still failed to unzip the text sections. Looking back at the file format reference, searching for answers, I found that PDF supports something called “encryption”.

The encryption used is the RC4 algorithm, with a key built from different pieces of information present in the PDF, plus a user password that is empty by default. The encryption data also contains bit flags that tell which operations people can do with this PDF, like saving, printing, editing… This is one of the most stupid security schemes I have ever seen!

In fact, since people are able to open the PDF without entering a password, the PDF can be decrypted without a password (that is, with the empty password). So it should be very easy to remove the “encryption” from such a PDF in an automated manner, including modifying the “user rights” on it.

Back when this “security” was added, it was surely “security through obscurity”: since nobody knew how to obtain the RC4 key from a given PDF, nobody could read such a PDF. But the way the RC4 key is computed is also documented as part of the PDF Reference. And this actually is very funny (look at page 125 of the PDF Reference).
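The key computation described above can be sketched in Python as follows. This is my reading of the standard security handler in its simplest form (revision 2, 40-bit keys), not the code from my haXe reader: later revisions add extra MD5 rounds and longer keys, which this sketch ignores. The function names and parameters (file_key, object_key, owner_entry, file_id) are made up for the example; the inputs correspond to the /O and /P entries of the encryption dictionary and the first element of the file /ID.

```python
import hashlib
import struct

# 32-byte padding string from the PDF Reference, used to pad the password.
PAD = bytes.fromhex(
    "28BF4E5E4E758A4164004E56FFFA0108"
    "2E2E00B6D0683E802F0CA9FE6453697A"
)

def rc4(key, data):
    """Plain RC4; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:  # keystream generation, XORed with the data
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)

def file_key(owner_entry, permissions, file_id, user_password=b"", key_len=5):
    """Document-wide key; assumes revision 2 of the standard handler."""
    padded = (user_password + PAD)[:32]
    digest = hashlib.md5(
        # "<i" packs the (usually negative) /P value as 4 little-endian bytes
        padded + owner_entry + struct.pack("<i", permissions) + file_id
    ).digest()
    return digest[:key_len]

def object_key(key, obj_num, gen_num):
    """Per-object key: file key salted with object and generation numbers."""
    salt = struct.pack("<I", obj_num)[:3] + struct.pack("<I", gen_num)[:2]
    digest = hashlib.md5(key + salt).digest()
    return digest[:min(len(key) + 5, 16)]
```

With the default empty password, every input of file_key is read straight from the file, so rc4(object_key(key, n, g), stream_bytes) yields the plain (still compressed) stream without asking the user for anything.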

Well, since I was not very lucky this time, it turned out the PDFs I was trying to read were “encrypted”. So I had to implement this whole nonsense security algorithm… It actually took me almost half a day, because some value in my PDF reader was wrongly parsed (I forgot to handle the minus sign in front of the number) and thus was giving me an invalid RC4 key…
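To illustrate the kind of bug this is, here is a small Python sketch. It assumes the affected number was the /P permissions entry, which is a likely candidate since it is usually negative, but that is only an assumption for the example; permission_bytes and the sample dictionary are made up.

```python
import re
import struct

# Optional sign, then digits: the sign is the part that is easy to forget.
P_ENTRY = re.compile(rb"/P\s+([+-]?\d+)")

def permission_bytes(encrypt_dict):
    """Return the /P value as the 4 little-endian bytes fed to the key hash."""
    match = P_ENTRY.search(encrypt_dict)
    if match is None:
        raise ValueError("no /P entry found")
    value = int(match.group(1).decode("ascii"))
    # Two's complement view of the signed value as an unsigned 32-bit integer
    return struct.pack("<I", value & 0xFFFFFFFF)
```

Parsing "/P -44" as 44 changes all four bytes that go into the hash, so the resulting RC4 key is completely different and every stream decrypts to noise.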

After spending hours trying to make things work, I downloaded a Python PDF library which supported PDF decryption, then ran it and added some traces to display the key it was computing. Since its key was different from mine, I was able to track down my bug by comparing the intermediate values at each step of the key computation.

And finally, after a day of hardcore coding, it worked! The text sections of the PDF contain some PostScript-like data, but I didn’t need to parse it fully, so I instead used some specific regular expressions to extract the things that I needed.
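The regular expressions I used were specific to my files, so the following Python sketch only shows the general idea. It assumes the text is drawn with the two usual text-showing operators, Tj for a single string and TJ for an array of strings with spacing adjustments, and it ignores fonts, encodings, hexadecimal strings and nested parentheses; extract_text and unescape are made-up names.

```python
import re

# A literal string: "(...)" with backslash escapes, no nested parentheses.
LITERAL = rb"\((?:\\.|[^\\()])*\)"
# Either "(text) Tj" or "[(te) -20 (xt)] TJ".
SHOW_TEXT = re.compile(rb"(" + LITERAL + rb")\s*Tj|\[(.*?)\]\s*TJ", re.DOTALL)

def unescape(literal):
    """Strip the outer parentheses and undo the simplest backslash escapes."""
    return re.sub(rb"\\([\\()])", rb"\1", literal[1:-1])

def extract_text(content):
    """Return the strings shown by Tj/TJ operators, in order of appearance."""
    pieces = []
    for match in SHOW_TEXT.finditer(content):
        if match.group(1) is not None:
            pieces.append(unescape(match.group(1)))
        else:
            # TJ: join the string fragments, dropping the spacing numbers
            parts = re.findall(LITERAL, match.group(2))
            pieces.append(b"".join(unescape(p) for p in parts))
    return pieces
```

This is enough to pull out labels or numbers from a known layout, but it says nothing about where the text sits on the page, which a real text extractor would have to reconstruct.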

Looking at the different formats I have been working on over the past years in haXe, and with the recent addition in haXe 2.0 of the crossplatform haxe.io.Input/Output, haxe.io.Bytes and haxe.Int32, which are the most used classes when working with file formats, I decided to group all of these formats together into one single library: hxFormat.

It currently supports FLV and AMF (taken from haxeVideo), ZIP, TAR and GZ (taken from the neko.zip package) and PDF. I’m planning to work on PE and DMG support as well at some point, since it would be nice to be able to create DMG files in a crossplatform way (see my previous post about OSX and its comments). I’m also accepting contributions from other people, so I hope the library will grow to support more file formats!

Anyway, it’s always nice to see a library that parses a binary file and makes some sense of the garbage that is stored as bytes. It looks like some kind of magic to me. And working with file formats is also a very good way to learn (more) about programming. A file structure, when it’s well designed, gives a lot of information about the architecture of the program that reads and writes it. It’s rare to see a good file format with bad software, and in general good software has good file formats as well.
