Reading OpenStreetMap Big Data with D - Part 5
he_the_great
Throughout development I had been working with the entire data set read into memory.

auto datastream = cast(ubyte[])read("bremen-latest.osm.pbf");

This worked well, as I could concentrate on looking through the data. But I was interested in being able to parse an entire planet. I'm sure one can guess what happened when I downloaded the 20 GB file and asked my program to parse through it. Nonetheless it was still disappointing for it to fail right out of the gate with an OutOfMemoryError. Something needed to change.

The D programming language does provide a stream module, but I was already well aware that std.stream is slated for deprecation. It is still there only because a replacement hasn't been created, and I prefer sticking to ranges anyway. So I created one:
https://gist.github.com/JesseKPhillips/6051600#file-osmpbfexample-d-L201


I needed to be able to read the data in by chunk while also being sure I had read in enough that a message wasn't cut off. My existing code handled this by slicing from the start to whatever size I needed.

auto size = toNative(datastream[0..4]);
...
auto osmData = datastream[0..size];
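The toNative call here converts the 4-byte, big-endian length prefix that the PBF format places before each BlobHeader. A sketch of such a conversion using Phobos's std.bitmanip (the real implementation in the gist may differ):

```d
import std.bitmanip : bigEndianToNative;

// Hypothetical reconstruction: the PBF format prefixes each BlobHeader
// with its length as a 4-byte big-endian integer.
uint toNative(ubyte[] data) {
    ubyte[4] raw = data[0 .. 4];
    return bigEndianToNative!uint(raw);
}

void main() {
    ubyte[] stream = [0, 0, 1, 2];  // 0x00000102 == 258
    assert(toNative(stream) == 258);
}
```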

It is fairly easy to read through a file chunk by chunk; it is quite another thing to slice an input range when all the data may not be in memory. So I built my range to provide slicing: when the buffer does not hold the requested slice, it pulls in more data (appending to the existing buffer array) and then returns a slice of that data.
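Reading a file by chunk is handled in Phobos by std.stdio.File.byChunk, which is presumably what backs the chunks range used below. A self-contained sketch (the file name is made up for the example):

```d
import std.file : remove, write;
import std.stdio : File;

void main() {
    // Create a small sample file so the sketch is self-contained.
    write("sample.bin", new ubyte[10_000]);
    scope(exit) remove("sample.bin");

    size_t total;
    // byChunk yields ubyte[] buffers of at most the requested size.
    // The buffer is reused between iterations, so any bytes you need
    // to keep must be copied out, as the append in prime does below.
    foreach(ubyte[] chunk; File("sample.bin").byChunk(4096))
        total += chunk.length;
    assert(total == 10_000);
}
```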

auto opSlice(size_t x, size_t y) {
    // Fast path: the requested slice is already buffered.
    if(y + index <= buff.length)
        return buff[x + index .. y + index];
    // Otherwise pull in at least the missing number of bytes.
    prime(y + index - buff.length);
    return buff[x + index .. y + index];
}

The method opSlice is what defines the [x..y] syntax for a type. I chose to use an index to identify where my range was located in the buffer; reviewing it now, I could instead have done a popFront on my buffer. When the ending index surpasses the available buffer length, I call a function named prime. The name prime is a convention used to signify code which is shared between setting up a range and calling popFront. In this case I'm telling prime how many extra bytes I need (and it will grab at least that many). Once that is complete I can return the slice of the buffer.

void popFront() {
    index++;
    // Refill once index comes within half a chunk of the end of the
    // buffered data, provided there are still chunks left to pull.
    if(index > buff.length - size/2 && !chunks.empty) {
        buff = buff[index..$];  // drop consumed bytes so the GC can free them
        index = 0;
        prime();
    }
}

Moving forward in the buffer, I've chosen to extend the data in the buffer once index comes within half a chunk's worth of bytes from the end. Also, there is nothing more to grab once chunks is empty.

Also, once the buffer is refilled, the index is reset to 0. The buffer can't grow indefinitely, so I need to let the GC know which blocks of memory I'm done with.

It should be noted that taking a slice as I have does not create a copy, nor will the GC free any of the memory which exists before index. Instead this relies on the operations in prime.

private void prime(size_t total = size) {
    // Append chunks until at least `total` bytes have been added
    // or the source is exhausted.
    while(!chunks.empty) {
        buff ~= chunks.front;  // copies the chunk's bytes into buff
        chunks.popFront();
        if(total > size)
            total -= size;
        else
            break;
    }
}

The main operation here is appending the chunk to the buffer and then moving to the next chunk. This append operation is magical. Sometimes it will be a fast change in capacity; other times the buffer will be copied to a new memory location and the chunk copied with it. This behavior, combined with the slicing in popFront, is what allows unused memory to be freed. For more details, read about D Slices. Note also that the Protocol Buffer library uses slices when reading data. I have not verified this, but that would likely mean the StringTable holds the same bytes we pulled out of our buffer.
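That append behavior can be observed directly by comparing .ptr before and after an append; a small sketch, not from the original post:

```d
void main() {
    auto buff = new ubyte[16];
    auto slice = buff[4 .. 8];  // no copy: both views share one memory block
    assert(slice.ptr == buff.ptr + 4);

    auto before = buff.ptr;
    buff ~= new ubyte[1024 * 1024];  // large append: likely reallocates
    if(buff.ptr != before) {
        // buff moved to new memory, but the old block stays alive
        // because slice still refers to it; the GC can only collect it
        // once slice is unreachable.
        assert(slice.ptr == before + 4);
    }
}
```

This also illustrates why popFront reslices buff before priming: dropping the consumed prefix means that after the next reallocating append, nothing references the old block and the GC can reclaim it.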

This new range allows the original read operation to become a FileRange construction.

auto datastream = FileRange(File("bremen-latest.osm.pbf"));

However the work isn't done.

datastream = datastream[4..$];
...
datastream = datastream[size..$];

Originally I'd moved through the stream by slicing to the end. This was no longer possible with my new range type. I tried to find a way to make it work, but with no success. I had already defined a slice to be of ubytes, but datastream is a FileRange, so I'd need to return a new FileRange for the slice. I could easily define opDollar so there was no issue with not knowing the full length of my range; I just couldn't figure out a way to mimic the awesomeness of the slicing that arrays provide.
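For reference, defining opDollar is indeed simple when the remaining length is known. A hypothetical sketch (the type and field names are made up, and unlike the real FileRange this one keeps everything in memory):

```d
struct ByteRange {
    ubyte[] buff;
    size_t index;

    // `$` inside an index or slice expression lowers to opDollar.
    size_t opDollar() { return buff.length - index; }

    ubyte[] opSlice(size_t x, size_t y) {
        return buff[x + index .. y + index];
    }
}

void main() {
    ubyte[] data = [1, 2, 3, 4, 5];
    auto r = ByteRange(data, 1);
    assert(r[0 .. $] == data[1 .. 5]);
    assert(r[1 .. $] == data[2 .. 5]);
}
```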

datastream.popFrontN(4);
...
datastream.popFrontN(size);

The solution was simply to popFrontN the number of bytes I'd read. Simple enough, and I'll consider it an alternative to slicing the array to the end.
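popFrontN comes from std.range; it pops up to n elements and returns how many were actually popped. A quick illustration on a plain slice:

```d
import std.range : popFrontN;

void main() {
    auto data = [10, 20, 30, 40, 50];
    // popFrontN advances the range and returns the count popped.
    assert(data.popFrontN(2) == 2);
    assert(data == [30, 40, 50]);

    // Asking for more than remains stops at the end of the range.
    assert(data.popFrontN(10) == 3);
    assert(data.length == 0);
}
```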

After all this I finally got it working. I decided to remove the print statements, as there is a lot of information to print. I let it run to completion, and here is the information I got back:

Headers: 1
Datas: 1516
Bytes: 21485717112

real	43m28.156s
user	41m40.788s
sys	0m25.280s

These are by no means benchmark numbers, and the memory usage on my system climbed by about 1 GB. If I were to guess, most of the time was probably spent on zlib decompression, followed by the GC cleaning up the mess I made. But I don't have any frame of reference to say whether 40 minutes is good or bad for reading and decompressing a 20 GB file.

UPDATE: I decided to give this a run removing the zlib decompression:

Bytes: 21485717112

real	5m10.354s
user	3m29.864s
sys	0m21.916s


Conclusion


One of the things I really like about using D is how simple the building blocks can be. Ranges are such an integral part of the D style, and they make a nice glue layer for communication between components. The standard library still hasn't received updates in all areas since their introduction. I'm looking forward to seeing whether this general parsing design will be able to handle some real-world tasks, though I can't say I ever will try; it was fun.

This is the last post in this series. I may continue to post new material, for which I will use tags to keep topics organized. Thank you for reading.
