Reading OpenStreetMap Big Data with D - Part 3
he_the_great
With a decent Protocol Buffer Library it was now possible to create a simple PBF reader. The wiki page about the PBF Format provides a good description of what this data looks like.

At first I thought this would be pretty easy, since I had a pbcompiler to build all the parsing for me, and I assumed the wiki page was just describing the details of how Protocol Buffers work. This was not the case, and it makes complete sense.

Protocol Buffers are two things: a specification for how to write a schema and a specification for a binary encoding. They are not a file format. Like any other data, a message can be written to a file without any special handling, but Protocol Buffers on their own don't meet the needs of most data storage.

The expectation is that a message is self-contained. Deserializing a message expects all of the information for that message to exist in the data given, and that no other information is passed. In the case of OpenStreetMap, that would mean your 20 gigabyte planet file would have to be read into memory to provide access to the entire message.

To avoid this problem, the OpenStreetMap file format is defined specifically so that only the messages of interest need to be deserialized. Specifically, it is a repeating sequence of BlobHeader and Blob messages. This was actually the first curious choice I've come across.

In order to decode a message we need to know the size of that message, as I mentioned. For PBF the Blob follows right after the BlobHeader, so before every BlobHeader there are 4 bytes which state the number of bytes in the BlobHeader. This length is defined to be in network byte order. Had it been defined as a varint, as described in the Protocol Buffer encoding, the BlobHeader would have just become an "inner" message. Remember, I was working on a PB parser, so I knew that I could easily request that the BlobHeader be read in as though it were just part of a larger message.
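To make the difference between the two length encodings concrete, here is a minimal sketch in Python (not the D code from this series). The function names read_header_length and read_varint are made up for illustration and do not come from any library. The first reads the fixed 4-byte big-endian prefix that PBF uses; the second decodes a base-128 varint, which is what the prefix would have been had the BlobHeader been an ordinary embedded message.

```python
import io
import struct


def read_header_length(stream):
    """Read the 4-byte network-byte-order length that precedes a BlobHeader.

    Hypothetical helper: returns None at a clean end of file.
    """
    raw = stream.read(4)
    if len(raw) == 0:
        return None
    if len(raw) < 4:
        raise EOFError("truncated length prefix")
    # ">I" means big-endian (network byte order) unsigned 32-bit integer.
    (length,) = struct.unpack(">I", raw)
    return length


def read_varint(stream):
    """Decode a Protocol Buffers base-128 varint, shown only for comparison."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated varint")
        value = byte[0]
        # The low 7 bits carry data, least significant group first.
        result |= (value & 0x7F) << shift
        # A clear high bit marks the final byte of the varint.
        if not value & 0x80:
            return result
        shift += 7


if __name__ == "__main__":
    print(read_header_length(io.BytesIO(b"\x00\x00\x00\x0d")))
    print(read_varint(io.BytesIO(b"\xac\x02")))
```

The fixed-width prefix always costs 4 bytes, while the varint grows with the value, which is why a generic Protocol Buffer parser cannot treat the PBF prefix as just another length-delimited field.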

The Blob just contains a compressed chunk of OSM data as defined in the osmformat.proto. Later I'll go over the code to help clear up any confusion about how these files are parsed.
https://gist.github.com/JesseKPhillips/6051600
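
Separately from the gist, here is a rough Python sketch of what "a compressed chunk" means once a Blob has been decoded. It assumes the Blob's fields have already been pulled out by a generated parser and uses the field names raw, raw_size and zlib_data from the fileformat.proto described on the wiki. The blob_payload function itself is hypothetical, and other compression fields are ignored.

```python
import zlib


def blob_payload(raw, raw_size, zlib_data):
    """Return the uncompressed bytes carried by a Blob.

    Hypothetical helper: the arguments mirror the Blob fields raw,
    raw_size and zlib_data; pass None for a field that is absent.
    """
    if raw is not None:
        # The data was stored without compression.
        return raw
    if zlib_data is None:
        raise ValueError("Blob has neither raw nor zlib_data")
    data = zlib.decompress(zlib_data)
    # raw_size states the uncompressed length, so use it as a sanity check.
    if raw_size is not None and len(data) != raw_size:
        raise ValueError("uncompressed size does not match raw_size")
    return data


if __name__ == "__main__":
    original = b"osm" * 10
    print(blob_payload(None, len(original), zlib.compress(original)) == original)
```

The bytes returned here are themselves an encoded message from osmformat.proto, so they would be handed back to the Protocol Buffer parser for a second round of decoding.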
