he_the_great (he_the_great) wrote,

Reading OpenStreetMap Big Data with D - Part 4

The PBF format is a binary format for storing OpenStreetMap data. The data primitives are described by the osmformat.proto and the file format uses a Header and Blob combination. Since the file doesn't store a pure Protocol Buffer message there is some special handling to get through all the data.

If you're interested in general usage of this data, these specifics are not important: there are many libraries and programs that abstract away the details I'm going to go over. But if you're interested in such things, or want to build a library in a new language (without binding to C), then this may be of value.


while(!datastream.empty) {
    if(datastream.bufferLength < 4) {
        // This shows bytes which haven't been read.
        assert(false, "Should parse all bytes?");
    }

Starting off, we want to parse the data until there is none left. The body of the loop should parse everything up to the next BlobHeader. This means we should always have 4 bytes available (they specify the length of the BlobHeader in network byte order). If, at the beginning of the loop, there are fewer than 4 bytes, then something is wrong. I actually use a question in the assert message here because the bremen file I downloaded had a single byte, 0xa, left over after parsing. I cannot identify an issue with the parsing, and thus conclude the bremen file is wrong.

auto size = toNative(datastream[0..4]);
datastream.popFrontN(4);
bytesCount += 4;

So now we begin our read by taking the first four bytes, which tell us how many bytes will be needed for the BlobHeader (defined in fileformat.proto). Those bytes are no longer needed, so all 4 are removed with std.range.popFrontN. To monitor how many bytes have been read, I also increase a counter by 4 (this has no bearing on parsing the file).
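In a language-neutral sketch (Python here; the helper name and return shape are mine, only the 4-byte big-endian prefix comes from the format), that length read looks like:

```python
import struct

def read_blob_header_size(stream_bytes):
    """Interpret the first 4 bytes as a network-byte-order
    (big-endian) uint32 giving the BlobHeader length."""
    assert len(stream_bytes) >= 4, "need at least 4 bytes"
    (size,) = struct.unpack(">I", stream_bytes[:4])
    # Return the size plus the remaining, unconsumed bytes,
    # mirroring the popFrontN in the D code above.
    return size, stream_bytes[4:]

# A length prefix of 13 followed by 13 payload bytes:
data = b"\x00\x00\x00\x0d" + b"x" * 13
size, rest = read_blob_header_size(data)
```

The `>` in the format string is what makes this network byte order; a native-endian read would silently misparse the file on little-endian machines.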

auto osmData = datastream[0..size];
datastream.popFrontN(size);
bytesCount += size;
auto header = BlobHeader(osmData);

Again we slice the data we need, remove it from the stream, count the bytes read, and this time decode it into our Protocol Buffer-defined BlobHeader.

osmData = datastream[0..header.datasize];
datastream.popFrontN(header.datasize);
bytesCount += header.datasize;
auto blob = Blob(osmData);

The BlobHeader gives us the information we need to parse the next item, the Blob. This is handled exactly the same way as before.

if(!blob.zlib_data.isNull)
    osmData = cast(ubyte[]) uncompress(blob.zlib_data);
else if(!blob.raw.isNull)
    osmData = blob.raw;

Our Blob holds either a HeaderBlock or a PrimitiveBlock, both defined in osmformat.proto. The data can come in several forms: raw (uncompressed) or a compressed form (zlib). The spec mentions an obsolete bzip2 option and a proposed lzma one. Frankly I think it is too late for those; zlib is very common and it would be dangerous to support an assortment of compressions.
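The raw-versus-zlib choice can be sketched like so (Python; the helper and its signature are illustrative, only the `raw` and `zlib_data` field names come from fileformat.proto):

```python
import zlib

def blob_payload(raw=None, zlib_data=None):
    """Pick the payload out of a Blob: raw bytes if present,
    otherwise inflate the zlib_data field."""
    if raw is not None:
        return raw
    if zlib_data is not None:
        return zlib.decompress(zlib_data)
    # bzip2 / lzma would slot in here, but see the note above.
    raise ValueError("unsupported or missing compression")

payload = blob_payload(zlib_data=zlib.compress(b"PrimitiveBlock bytes"))
```

Either way the result is the same: the serialized bytes of a HeaderBlock or PrimitiveBlock, ready for the Protocol Buffer decoder.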

if(header.type == "OSMHeader") {
    auto osmHeader = HeaderBlock(osmData);

The BlobHeader's type lets us know our Blob is holding an OSMHeader, so we take our extracted data and have Protocol Buffers work their magic on it. When I ran this against the entire planet there was a total of 1 HeaderBlock, so this is likely a currently unused design choice to support the random access mentioned.

The header here provides information about the features required to parse the file, so it is important to verify that the parser understands, and can claim to support, each required feature.
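A minimal version of that check might look like this (Python sketch; the supported set and helper are mine, "OsmSchema-V0.6" and "DenseNodes" are common required-feature strings in the format):

```python
# Features this hypothetical parser claims to support.
SUPPORTED_FEATURES = {"OsmSchema-V0.6", "DenseNodes"}

def check_required_features(header_required):
    """Fail fast if the file requires a feature this parser
    does not implement (from the HeaderBlock's required features)."""
    missing = set(header_required) - SUPPORTED_FEATURES
    if missing:
        raise NotImplementedError(
            "unsupported features: %s" % sorted(missing))
    return True

ok = check_required_features(["OsmSchema-V0.6", "DenseNodes"])
```

Failing fast here is the point: a parser that ignores a required feature will produce silently wrong data rather than an error.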

} else if(header.type == "OSMData") {
    auto osmDataBlock = PrimitiveBlock(osmData);

Finally we are past all the metadata and can start looking at the map data. So far it is the same as before.

enforce(osmDataBlock.stringtable.s.get().front.empty);
auto stringTable = osmDataBlock.stringtable.s.get()
    .map!(x => cast(char[]) x);
writeln("OSM String: ", stringTable.take(5), "...");

It is guaranteed that a StringTable will exist, with the first entry being empty; I'm documenting this with a simple enforce. I then pull out the StringTable. The Protocol Buffer library specifies all types as Nullable!T, but I don't want the wrapper, so it is removed by calling get().

It should be noted that the StringTable actually stores ubyte[] and not strings; to deal with this I map the array to create a char[] range, then print the first 5 strings.

// Nodes can be encoded one of two ways: as a Node or in a
// special dense format.
if(!osmDataBlock.primitivegroup.front.nodes.isNull) {
    Node[] nodes = osmDataBlock.primitivegroup.front.nodes.get;
    writefln("Node Tags: %s...",
        zip(nodes.front.keys.get, nodes.front.vals.get)
        .map!(x => tuple(stringTable[x[0]], stringTable[x[1]]))
        .map!(x => x[0] ~ "=" ~ x[1]).take(2));

Instead of getting at all the data I'm concentrating on the tags. The StringTable that we pulled out earlier is the text for our key/value pair. The nodes (and other primitive data types) store just an index location into that string table.

The std.range.zip function lets me walk the key and val ranges together. The next step is to turn our indices into actual strings; this could have been done before the zip, but it is easy enough to do in a single map. Since I don't want to print the data as a Tuple, I also map to a string of the form key=val. And for ease of reading, I only take the first two tags.
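With made-up data shaped like the PBF fields, the key/val-to-string-table resolution works like this (Python sketch; the table contents are hypothetical, the empty first entry and parallel index arrays come from the format):

```python
# Index 0 of the string table is always the empty string.
stringtable = ["", "highway", "residential", "name", "Main Street"]

# A node's tags: parallel arrays of string-table indices.
keys = [1, 3]
vals = [2, 4]

# Pair each key index with its val index, resolve both through
# the string table, and format as key=val (first two only).
tags = ["%s=%s" % (stringtable[k], stringtable[v])
        for k, v in zip(keys, vals)][:2]
```

Storing indices instead of the strings themselves is what makes the table pay off: popular keys like "highway" appear once per block, not once per node.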

if(!osmDataBlock.primitivegroup.front.dense.isNull) {
    DenseNodes nodes = osmDataBlock.primitivegroup.front.dense.get;
    if(!nodes.keys_vals.isNull) {
        auto nodeTags = nodes.keys_vals.get()
            .splitter(0).front.chunks(2)
            .map!(x => stringTable[x[0]] ~ "=" ~ stringTable[x[1]])
            .take(2);
        if(!nodeTags.empty)
            writefln("Node Tags: %s...", nodeTags);

Dense nodes, which are more common, don't have separate key/val arrays per node. This is done to avoid the per-message length bytes that Protocol Buffers impose. Instead all nodes are packed into a single message; their tags are separated by StringTable index 0, and the data is each key index followed by its val index.

To work in per-node terms I first use std.algorithm.splitter. I'm not concerned with going over every node at this point, so I use front to get the first node. Since each val follows its key, this new range is grabbed in chunks of two. Now the dense node looks like the normal nodes, so we just finish it off with our map and take. I didn't want to print an empty range, so I check that it isn't empty first; this can happen since I'm grabbing the first node instead of the first node with tags.
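The same decoding, sketched in Python over hypothetical data (split on 0, pair up, resolve through the string table):

```python
stringtable = ["", "amenity", "cafe", "name", "Rose", "shop", "bakery"]

# keys_vals packs every node's tags into one array: key index,
# val index, key index, val index, ..., with a 0 ending each
# node's tag list. Two tagged nodes here.
keys_vals = [1, 2, 3, 4, 0, 5, 6, 0]

def first_node_tags(keys_vals, stringtable, limit=2):
    # Take indices up to the first 0: the first node's tags.
    first = keys_vals[:keys_vals.index(0)]
    # Pair them up (key, val) and resolve via the string table.
    pairs = zip(first[0::2], first[1::2])
    return ["%s=%s" % (stringtable[k], stringtable[v])
            for k, v in pairs][:limit]

tags = first_node_tags(keys_vals, stringtable)
```

Note that a first node with no tags makes `first` empty and the result an empty list, which is exactly the case the emptiness check in the D code guards against before printing.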

And there is nothing special about the tags used on Ways and Relations. That concludes the layout of the data; there are still further encodings related to the data, such as the delta compression used. In my next post I'll talk about reading large files and the file range I put together, shown at the bottom.
Tags: dlang, openstreetmap, programming, protobuf
