Montag, 24. November 2014

Regridding spatial data

This post is about one of the most difficult tasks in processing spatial data, in particular gridded data: regridding. Regridding is all about "warping" a given data blob into a different grid. First of all, it is tremendously helpful to really know what grid the original data is in. Here are a few examples:
Regular lon/lat grid:
Given the corner coordinate as a longitude/latitude pair and the per-data-pixel offset in degrees, this is one of the easiest format for gridded spatial data. In terms of projections, this can be expressed as a standard epsg:4326 projection (disregarding subtile differences in the meaning of longitude and latitude depending on the used "datum", i.e. the used model of the earth), meaning that all longitude/latitude coordinate pairs are simply "flattened" to the plane. Dealing with this kind of data is usually pleasant, gdal and other spatial libraries fully support this, no big deal. Except, the notion of longitude/latitude in commonly used data files in the data format "grib" (which is often used in numerical weather models) is slightly (usually neglectible) different from the standard WGS84 datum.
Rotated lon/lat grid:
The regular lon/lat grid has (at least) one major shortcoming: Considering the whole globe, it is far too detailed near the poles and too coarse at the equator. The reason is simply that the number of grid points per latitude, starting from the north pole going south at constant latitude steps, is constant. So if the region of interest is a small area it might be a considerable improvement to the accuracy if the pole is "virtually" rotated to that area, such that the highest accuracy is obtained for this ROI. Although this rotation does not pose a huge problem computationally, not all libraries can properly deal with data files that contain data in this form (again, "grib" supports this). Also, it is usually a bit more tricky to formulate a simple projection definition for this kind of data.
Gaussian grid:
The Gaussian grid basically uses non-evenly spaced latitude values in combination with evenly spaced longitude values (per one latitude the longitude values are evently spaced, but the number of longitude values may be different at different latitudes). Again, this grid type is not terribly complicated, but it is also not commonly supported (no easy projection definition). Luckily, the errors introduced if treating a Gaussian grid like a regular grid are usually neglectible (and mostly close to the poles anyway).
Triangular/hexagonal grid:
This is probably the most complicated (apart from completely irregular grids) grid. There is no projection definition, no support in most libraries, and dealing with this kind of grid usually boils down to handle it just like a complete irregular grid, meaning the need to extract the coordinates of the cells (usually: the center).

So, in order to use the data, some kind of transformation is necessary. That can be either the knowledge of the coordinates (in lon/lat) for each point, or the warping of the whole data blob into some well-known grid. For simple grids, it is as easy as using the standard gdal tool gdalwarp. For more complicated grids however, this is not enough. One solution that seems to work for transforming many kinds of grids back and forth is "cdo", which is also able to interpolate data from one grid to another. It appears that "cdo" is not very widely known, but if grid transformations is an issue, it is definitely worth a try!

Although cdo is indeed capable of dealing with/regridding triangular or unstructured grids, unfortunately it has some issues when the input is a grib file (issues = segmentation faults). So the "easy" solution is to convert the grib file to a different file format (this, for some reason, is possible with cdo), such as NetCDF. Again, there is a "but": gribs with triangular meshes cannot be translated to NetCDF directly. Instead, the grib data needs to be transferred to a different format, along with a manual definition of the grid. The result can then be converted to NetCDF and finally warped to a different grid.

Update #2:
Luckily, cdo is under active development and the bug using grib files has been fixed (will be in version 1.6.7)!

Donnerstag, 9. Oktober 2014

PostGIS raster and out-db storage

The following is a story about data management, terminology, GIS and frustration.
The story starts with weather data, which is what I deal with most of my working hours. And dealing with weather and climate data is far from trivial. Not only are there large amounts of data (and boy, there are!) which one has to handle, it is usually good to understand how that data was produced and for which location in space and time it applies. The former is not strictly necessary, the latter is essential, which leads me to GIS. GIS means geo information system and is an acronym used for pretty much anything that in one way or the other relates to maps and/or positions on earth (or, for that matter, positions on or maps of any sphere-like shaped body). The fact that the earth is not an exact sphere and the myriads of different ways to express that fact makes dealing with GIS hard. Also, all the different ways of flattening the surface in order to create a 2d representation (i.e. a map) are overwhelming. That is why there is software to make life a little easier, software, that manages to convert from different types of maps back and forth, and to generally do stuff with the spatial data.
Luckily, there is a number of free and open source software packages for GIS, be it desktop applications such as QGIS or GRASS, libraries such as proj.4 or GDAL, or things as complex as an extension to an SQL database to allow spatial data in the context of the relational database (i.e. PostGIS). And, as the title of this post suggests, PostGIS is what I am writing about. The first thing to note is that, generally speaking, PostGIS is a spatial enabled database, i.e. a database that allows to store and manipulate spatial data. Typically, in GIS two main types of spatial data are considered: shapes (or: vector-based, as in computer graphics) data and rasters (or: bitmap-based, as in computer graphics). PostGIS can deal with both types, and it does handle it pretty damn good. For example, you can tell PostGIS to get raster data out of a specific raster at a given geographic position (say, you have a raster with temperature data and you would like to know the temperature at a given longitude/latitude). The good news is that dealing with GIS data is pretty much standardized, there is even a standard for handling GIS data with SQL (and PostGIS follows this standard). And, there are open standards for file formats, which takes away much of the pain that dealing with different competing (or proprietary) standards generates in other areas.
Now let's talk a bit about the actual data. One of the commonly used weather models is called GFS (global forecast system), it is a global model for a wide variety of parameters (temperature, pressure, wind speed, just to name a few), usually more than 300 parameters are computed. The file format in which this data is distributed is called grib, a name that probably should give a hint towards gridded binary data. Despite the fact that this data format is awfully complicated and requires a lot of additional information (so-called tables) about the dataset, which may or may not be available, it is widely used in meteorology. For the GFS, it is indeed not a bad choice, as it supports JPEG2000 compression, which really makes a difference in file size. So, to state a few numbers, one computed time-step of the highest accuracy GFS (i.e. 0.5 degrees horizontal and vertical resolution) with all of its hundreds of parameters needs about 50MB. Note that this is only one time step, i.e. the complete forecast for 8 days in 3-hour steps takes 64 times that amount. Since the GFS is updated four times a day, the complete dataset of one day amounts to about 10GB. This is quite a large amount.
Back to the data handling with PostGIS. PostGIS supports importing raster data from files, so in theory it is indeed possible to simply stick the complete grib file into the database and go on with the really important work. However, in this case, the data compression of the grib files is lost, which increases the required memory by a factor of 5 - 10 (!). Not to mention that the import takes an awkward amount of time. Luckily, there is a simple solution to that problem: PostGIS is able to import raster data without the actual data, by simply referencing the original file instead of importing it. Splendid! However, there is one small drawback in the whole raster-data-into-PostGIS loop: the data format in which the actual data is stored inside PostGIS is not documented, despite the fact that it is called Well-Known-Binary (WKB) format. After some googling the reader may object that there is a well documented standard for that file format. Indeed there is, but it only refers to shape data, not to raster data! I think that the WKB for rasters was created by the PostGIS/raster creators. But since it is pretty much the opposite of well-known, we'll probably never know.
Let's get back to the data for a moment. The grib files that hold the GFS data usually contain the complete parameter set for one time step, meaning it is not really one single raster, but merely a collection of rasters. One raster per parameter, or, to express that in the terminology of GDAL, which is the library that PostGIS uses to handle raster data, one raster band within the raster file per parameter. So, the data storage solution seems obvious: store the raster as references (called out-db raster) in PostGIS! This allows to keep only one copy of the original files, it allows to do all the nice things that PostGIS is designed to do transparently with the data which is actually stored in the grib files, so PostGIS handles the data access and everything is beautiful. Right? Wrong. Taking a very close look at what PostGIS actually stores for out-db rasters (thanks open source!), it appears that the WKB format only allows one byte to store the band number of the data file that is referenced. In other words, PostGIS actually stores some metadata of the raster and the pairs (referenced band number, referenced file name), but the referenced band number is only allowed numbers between 0 and 255 (GDAL starts counting raster bands from one, so that translates to band numbers between 1 and 256). So for band numbers 1-256 that works fine. But larger band numbers cannot be stored in one byte, and even worse, there is an overflow, i.e. it is not only impossible to access band 257 of the original grib file, instead band number 1 is accessed.
Why is that bad? Because there instead of a meaningful error message (or, even better, the correct band), PostGIS happily returns the wrong values when requesting a band that cannot be accessed. If I had not looked into the source code, I would not have spotted this shortcoming, so at some point the data would have been all wrong. And that is bad. To be fair, now that I know what to look for, I found the corresponding issue in the PostGIS raster bug tracker and there is a fix promised for version 2.2, but that is not available yet. And the really annoying thing is: there are only a very inconvenient work-arounds for this bug, namely to split the grib file into smaller grib files containing less than 256 bands or, of course, to import the data itself into the database and live with the 10-fold increase in required disk-memory.

New blog about technical stuff

I am a bit upset about missing/wrong documentation of various software products, so I decided to write a couple of posts about how to handle the difficulties. Or, in my particular case, abandon a simple and elegant approach just because a number can not be larger than 255. Of course it can't. Nobody ever encountered a larger number, d'uh.