Thursday, February 16, 2012

Reading bytes in Python

Today's lesson is in why you should RTFM, but more on that later.

One of the biggest successes I had while learning C was reading text and binary data from files. I'm going to tackle the same task in Python. The first step is going to be reading in some data (the first 136 bytes) of the file header.

The file I'm reading is the same sort of file I learned to read in C - a file with some header data with various things like the date it was acquired and some comments, and then large chunks of digitized analog data (like a .wav file in a way).

I'm starting small - I only want to read in the first 136 bytes of the file. The first 4 bytes represent an integer that is always in the file (sort of a marker that tells us where it came from). The next 4 bytes are the version of the file format (also an integer), and the next 128 bytes represent a string of 128 characters (1 byte each).

I've spent a good deal of time prepping for this task - most of the info I needed was in the documentation for file objects and the struct module. In short, I'm going to read in a known number of bytes using the read() method for file objects, and then "unpack" those bytes into a specific format (integers and character strings) using the unpack() function in the struct module. So, here's a start:



I once read somewhere that using a class full of empty instance (self) variables is a good way to mimic how structures in C look. I don't know if that's very "Pythonic" but it works for me. In the PHeader class I've defined three variables that I'm going to fill in with data from the file. The actual code that executes is under "if __name__ == '__main__'", which is just a fancy Python way of saying "if this .py file is run on its own then do what's underneath".
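A minimal sketch of that class-as-struct idea might look like the following. The three attribute names come from the text; starting them at None and wrapping the body in a main() function are assumptions made for illustration.

```python
class PHeader:
    """Plain container whose attributes mirror the fields of a C struct."""

    def __init__(self):
        self.MagicNumber = None  # first 4 bytes of the file, an integer
        self.Version = None      # next 4 bytes, an integer
        self.Comment = None      # next 128 bytes, a character string


def main():
    # The attributes exist right away but stay empty until the file is read.
    s = PHeader()
    return s


if __name__ == '__main__':
    # Only runs when this .py file is executed directly, not when imported.
    header = main()
    print(vars(header))
```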

First I open the file as "p", and initialize "s" as an instance of PHeader. I know that p.read(N) will read in N bytes of the file, so I need to somehow tell Python to interpret those four bytes as an integer (as opposed to another data type that is 4 bytes) and then make s.MagicNumber equal that resulting number.

So that's where the struct module's unpack() function comes in. unpack() has this prototype: unpack(fmt, string). fmt is the format of the bytes being read (we want an integer so we pass it "i") and string is the bytes to unpack. Well, the result of p.read(4) is our string, so this line...

s.MagicNumber = unpack('i', (p.read(4)))

...gets our four bytes, interprets them as an integer, and passes the result to s.MagicNumber. A big caveat that I missed while writing this, and that caused a great deal of confusion, is that it doesn't ACTUALLY pass JUST the integer. It passes a Python data type called a tuple, with the integer I wanted as the first element of that tuple. Tuples (and other Python sequence types) work a lot like arrays in other languages - but more on that in a second.
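The tuple behaviour is easy to see without the data file at all. In this small sketch the four bytes are made up with pack() and stand in for whatever p.read(4) would return; the value 1234 is arbitrary.

```python
from struct import pack, unpack

# Four made-up bytes standing in for p.read(4); pack() builds them from an int.
raw = pack('i', 1234)

result = unpack('i', raw)
print(type(result))  # a tuple type, not an int
print(result)        # (1234,)

# Two ways to get at the integer itself:
value = result[0]                 # index the tuple
(same_value,) = unpack('i', raw)  # or unpack the one-element tuple directly
print(value == same_value)        # True
```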

The next line does pretty much the same thing...

s.Version = unpack('i', (p.read(4)))

Ok cool, we have now read two integers from the file and stored them in some variables. This next part is tricky and caused a lot of wailing and gnashing of teeth on my part. The struct documentation told me that "i" is the format character for integers, and "s" is the format character for character arrays (strings). Since the next 128 bytes of the file are a row of 128 characters (a string) I figured I could just replace the "i" with an "s" and then do p.read(128). This was incorrect. After a lot of pondering over error messages I carefully read through the struct module documentation and found that you have to precede the "s" with the number of characters to be read, like "128s". So that resulted in this line...

s.Comment = unpack('128s', (p.read(128)))

...and all was well.
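The reason a bare "s" fails is that, without a count, it describes a string exactly one byte long, so handing it 128 bytes is a size mismatch. A quick sketch, using a made-up 128-byte comment field in place of p.read(128):

```python
from struct import calcsize, error, unpack

print(calcsize('s'))     # 1
print(calcsize('128s'))  # 128

# Made-up comment field: some text padded with NUL bytes out to 128 bytes.
data = b'hello'.ljust(128, b'\x00')

try:
    unpack('s', data)  # format expects 1 byte, but 128 were supplied
except error as exc:
    print('unpack failed: %s' % exc)

(comment,) = unpack('128s', data)  # sizes match, so this works
print(len(comment))                # 128
```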

Remember I said that unpack() returns a tuple, and in our case the first element of that tuple is the actual integer or character array we asked for from the file? Getting the first element out of a tuple is a lot like getting the first element out of a C array. If I have a tuple called MyTuple I can get the first element by asking for MyTuple[0]. So the print lines...


 
print('Magic Number: %s' % hex(s.MagicNumber[0]))
print('Version: %d' % s.Version[0])
print('Comment: %s' % s.Comment[0])

...do exactly that. Oh - the first line says hex(s.MagicNumber[0]) because I want the integer returned to be printed out as a hexadecimal number.
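Putting the pieces together, a minimal sketch of the whole script might look like this. Several things here are assumptions rather than the original code: the header bytes are made up and read from an in-memory io.BytesIO object instead of the real data file, read_header is a hypothetical helper name, and the [0] is applied right at the unpack() call so the attributes hold plain values rather than tuples.

```python
import io
from struct import pack, unpack


class PHeader:
    def __init__(self):
        self.MagicNumber = None
        self.Version = None
        self.Comment = None


def read_header(p):
    """Read the 136-byte header from a file object opened in binary mode."""
    s = PHeader()
    s.MagicNumber = unpack('i', p.read(4))[0]   # 4 bytes -> integer
    s.Version = unpack('i', p.read(4))[0]       # 4 bytes -> integer
    s.Comment = unpack('128s', p.read(128))[0]  # 128 bytes -> byte string
    return s


if __name__ == '__main__':
    # Made-up header bytes; a real run would pass open(some_path, 'rb') instead.
    fake_file = io.BytesIO(
        pack('i', 0xABCD) + pack('i', 2) + b'test file'.ljust(128, b'\x00')
    )
    s = read_header(fake_file)
    print('Magic Number: %s' % hex(s.MagicNumber))
    print('Version: %d' % s.Version)
    # Strip the NUL padding before printing the comment.
    print('Comment: %s' % s.Comment.rstrip(b'\x00').decode('ascii'))
```

Indexing at the unpack() call keeps the tuple detail in one place, so the print lines no longer need their own [0].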

All said and done, those dozen lines of code took about an hour, which isn't bad considering I started out with only a superficial knowledge of how to read bytes. Hopefully the next step of reading the more important data from the file won't be so traumatic now.


