Donnerstag, 9. Oktober 2014

PostGIS raster and out-db storage

The following is a story about data management, terminology, GIS and frustration.
The story starts with weather data, which is what I deal with most of my working hours. And dealing with weather and climate data is far from trivial. Not only are there large amounts of data (and boy, there are!) which one has to handle, it is usually good to understand how that data was produced and for which location in space and time it applies. The former is not strictly necessary, the latter is essential, which leads me to GIS. GIS means geo information system and is an acronym used for pretty much anything that in one way or the other relates to maps and/or positions on earth (or, for that matter, positions on or maps of any sphere-like shaped body). The fact that the earth is not an exact sphere and the myriads of different ways to express that fact makes dealing with GIS hard. Also, all the different ways of flattening the surface in order to create a 2d representation (i.e. a map) are overwhelming. That is why there is software to make life a little easier, software, that manages to convert from different types of maps back and forth, and to generally do stuff with the spatial data.
Luckily, there is a number of free and open source software packages for GIS, be it desktop applications such as QGIS or GRASS, libraries such as proj.4 or GDAL, or things as complex as an extension to an SQL database to allow spatial data in the context of the relational database (i.e. PostGIS). And, as the title of this post suggests, PostGIS is what I am writing about. The first thing to note is that, generally speaking, PostGIS is a spatial enabled database, i.e. a database that allows to store and manipulate spatial data. Typically, in GIS two main types of spatial data are considered: shapes (or: vector-based, as in computer graphics) data and rasters (or: bitmap-based, as in computer graphics). PostGIS can deal with both types, and it does handle it pretty damn good. For example, you can tell PostGIS to get raster data out of a specific raster at a given geographic position (say, you have a raster with temperature data and you would like to know the temperature at a given longitude/latitude). The good news is that dealing with GIS data is pretty much standardized, there is even a standard for handling GIS data with SQL (and PostGIS follows this standard). And, there are open standards for file formats, which takes away much of the pain that dealing with different competing (or proprietary) standards generates in other areas.
Now let's talk a bit about the actual data. One of the commonly used weather models is called GFS (global forecast system), it is a global model for a wide variety of parameters (temperature, pressure, wind speed, just to name a few), usually more than 300 parameters are computed. The file format in which this data is distributed is called grib, a name that probably should give a hint towards gridded binary data. Despite the fact that this data format is awfully complicated and requires a lot of additional information (so-called tables) about the dataset, which may or may not be available, it is widely used in meteorology. For the GFS, it is indeed not a bad choice, as it supports JPEG2000 compression, which really makes a difference in file size. So, to state a few numbers, one computed time-step of the highest accuracy GFS (i.e. 0.5 degrees horizontal and vertical resolution) with all of its hundreds of parameters needs about 50MB. Note that this is only one time step, i.e. the complete forecast for 8 days in 3-hour steps takes 64 times that amount. Since the GFS is updated four times a day, the complete dataset of one day amounts to about 10GB. This is quite a large amount.
Back to the data handling with PostGIS. PostGIS supports importing raster data from files, so in theory it is indeed possible to simply stick the complete grib file into the database and go on with the really important work. However, in this case, the data compression of the grib files is lost, which increases the required memory by a factor of 5 - 10 (!). Not to mention that the import takes an awkward amount of time. Luckily, there is a simple solution to that problem: PostGIS is able to import raster data without the actual data, by simply referencing the original file instead of importing it. Splendid! However, there is one small drawback in the whole raster-data-into-PostGIS loop: the data format in which the actual data is stored inside PostGIS is not documented, despite the fact that it is called Well-Known-Binary (WKB) format. After some googling the reader may object that there is a well documented standard for that file format. Indeed there is, but it only refers to shape data, not to raster data! I think that the WKB for rasters was created by the PostGIS/raster creators. But since it is pretty much the opposite of well-known, we'll probably never know.
Let's get back to the data for a moment. The grib files that hold the GFS data usually contain the complete parameter set for one time step, meaning it is not really one single raster, but merely a collection of rasters. One raster per parameter, or, to express that in the terminology of GDAL, which is the library that PostGIS uses to handle raster data, one raster band within the raster file per parameter. So, the data storage solution seems obvious: store the raster as references (called out-db raster) in PostGIS! This allows to keep only one copy of the original files, it allows to do all the nice things that PostGIS is designed to do transparently with the data which is actually stored in the grib files, so PostGIS handles the data access and everything is beautiful. Right? Wrong. Taking a very close look at what PostGIS actually stores for out-db rasters (thanks open source!), it appears that the WKB format only allows one byte to store the band number of the data file that is referenced. In other words, PostGIS actually stores some metadata of the raster and the pairs (referenced band number, referenced file name), but the referenced band number is only allowed numbers between 0 and 255 (GDAL starts counting raster bands from one, so that translates to band numbers between 1 and 256). So for band numbers 1-256 that works fine. But larger band numbers cannot be stored in one byte, and even worse, there is an overflow, i.e. it is not only impossible to access band 257 of the original grib file, instead band number 1 is accessed.
Why is that bad? Because there instead of a meaningful error message (or, even better, the correct band), PostGIS happily returns the wrong values when requesting a band that cannot be accessed. If I had not looked into the source code, I would not have spotted this shortcoming, so at some point the data would have been all wrong. And that is bad. To be fair, now that I know what to look for, I found the corresponding issue in the PostGIS raster bug tracker and there is a fix promised for version 2.2, but that is not available yet. And the really annoying thing is: there are only a very inconvenient work-arounds for this bug, namely to split the grib file into smaller grib files containing less than 256 bands or, of course, to import the data itself into the database and live with the 10-fold increase in required disk-memory.

New blog about technical stuff

I am a bit upset about missing/wrong documentation of various software products, so I decided to write a couple of posts about how to handle the difficulties. Or, in my particular case, abandon a simple and elegant approach just because a number can not be larger than 255. Of course it can't. Nobody ever encountered a larger number, d'uh.