Thursday, December 24, 2015

NWP tutorial @32c3

I am going to give a Numerical Weather Prediction (NWP) tutorial at the 32nd Chaos Communication Congress in Hamburg. It will be mainly about installing and running WRF, with some excursions to related topics here and there. And I am really excited about this!

Tuesday, September 15, 2015


Is it just me, or is there a surge of "challenges" / "competitions" / "contests" of all kinds and colors lately? It appears to be a very cheap way to get people working for you (or for your objectives). So you need to boost your business and you don't have good ideas yourself and/or the capabilities to realize them? No problem, start a "hey-give-me-your-ideas" challenge! Hopefully a number of good ideas will emerge, and eventually you can take advantage of many of those ideas while "paying" only the winner, who probably spent a lot more time on it than you pay him for (if there is any prize money involved at all).
Don't get me wrong, I do like the general idea of these challenges, and in particular I like participating - for example in the yearly Copernicus Masters Challenge. But the balance between how much effort the participants put in and the actual rewards needs to be fair. Anyway, just my two cents on that topic (and maybe this year I'll make it to the finals :-) ).

Sunday, September 13, 2015

EMS/ECAM meeting

Last week I attended the EMS (European Meteorological Society) / ECAM (European Conference on Applications of Meteorology) meeting in Sofia, Bulgaria. In summary, it was a very nice meeting, I learned a lot and now I am even more convinced that I am on the right track developing the web-based model verification system.
The first part of the conference was a bit off target for me, but still very interesting - there were sessions about observations, climate and a lot about severe weather warnings (including communication and impact research - very, very interesting stuff). The second part of the conference was more about verification and model evaluation, which is also what my poster was about. The verification sessions were all very interesting and sparked a lot of new ideas... moreover, there is a "find new verification measures" competition by the WMO working group - I'll probably give that a shot! Finally, there was the MesoVICT meeting, which is more of a "meta"-comparison of verification schemes; maybe I'll also take part in that, using it as a testbed for the verification system and the new verification measures that I've come up with.
Apart from the lack of a continuous coffee supply, the conference was well organized, and I really enjoyed it! There is just one more thing that struck me as rather odd: some of the sessions (at least 2) turned out to be more-or-less one-man shows, and in my opinion that is just not what a conference like this should be about: in one of the sessions the chair already had 2 scheduled talks (which is questionable in itself), but ended up giving 5 (out of 6) talks in that very session.

Wednesday, April 1, 2015

Custom SRIDs in PostGIS

One small, but occasionally very important, feature of PostGIS is its ability to deal with custom spatial reference systems (SRS; in PostGIS an SRS is identified by an SRID). Given data in some obscure projection, it is not necessary to transform the data to some well-known SRS; the data can be stored directly, provided that a custom SRID entry is added to the spatial_ref_sys relation in PostGIS. The columns in that table are basically the SRID (anything > 100000 should be safe to use for custom projections) and the projection definition in two formats, namely well-known text and a proj4 string. Note that certain proj4 strings cannot be properly expressed as well-known text (e.g. rotated poles); in that case the well-known text definition can be extended by EXTENSION["PROJ4","<some proj4 string> +wktext"]. Also note that the "-m" option in the proj4 string seems to be ignored, so it is better to use the equivalent "+to_meter" option (but I haven't tested this in every detail).
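To make this concrete, here is a minimal sketch of adding such an entry, wrapped in Python with psycopg2 (the connection parameters, the SRID value and the rotated-pole definition are placeholder assumptions, and the WKT is simplified down to the EXTENSION trick mentioned above):

import psycopg2

# assumed connection parameters - adjust to your setup
conn = psycopg2.connect(dbname="gisdb", user="gisuser")

srid = 900914  # example value; anything > 100000 should be safe for custom projections
proj4 = ("+proj=ob_tran +o_proj=longlat +o_lon_p=0 +o_lat_p=40 +lon_0=10 "
         "+to_meter=0.0174532925199433 +wktext")
# simplified WKT: a full definition would contain a proper GEOGCS etc.;
# here only the EXTENSION node carries the rotated-pole information
srtext = 'PROJCS["Custom rotated pole",EXTENSION["PROJ4","%s"]]' % proj4

with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO spatial_ref_sys (srid, auth_name, auth_srid, srtext, proj4text) "
        "VALUES (%s, %s, %s, %s, %s)",
        (srid, "custom", srid, srtext, proj4))
conn.close()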

Tuesday, February 24, 2015

KNN with PostGIS

This article is basically a note to my future self, in case I need to look up how to do KNN searches with PostGIS again. The technique is taken from the talk "Writing better PostGIS queries" by Regina Obe (given at FOSS4G 2014 in Portland).

The problem statement

Given a set of geometries G1, find the k nearest neighbours for each of these geometries from a second set of geometries G2. For simplicity, I assume G1 and G2 to be sets of points which are stored as simple POINT geometries in PostGIS. Note that G1 and G2 could be the same set; I will use this in the example code.

First, create a small test table with a spatial index and a few sample points:
CREATE TABLE knntst (
  gid serial NOT NULL,
  geom geometry,
  CONSTRAINT knntst_pk PRIMARY KEY (gid)
);
CREATE INDEX knntst_geom_gist
  ON knntst
  USING gist (geom);
INSERT INTO knntst (geom) VALUES
  ('POINT (0 0)'::geometry), ('POINT (1 0)'::geometry), ('POINT (0 1)'::geometry), ('POINT (1 1)'::geometry),
  ('POINT (3 0)'::geometry), ('POINT (4 0)'::geometry), ('POINT (3 3)'::geometry), ('POINT (4 5)'::geometry);

Using the KNN operator

There are many ways to solve this problem in PostGIS, but the trouble is to find one that actually uses the spatial index, that does not require setting a maximum distance beforehand, and that provides the complete output, i.e. all k nearest neighbours for each of the points. The solution is to use the KNN operator "<->", which may be included in the ORDER BY clause. However, one argument to that operator needs to be constant in order to actually take advantage of the spatial index. So, for the given problem, a LATERAL join is required, in the following way:
SELECT g1.gid, g2.gid,
       ST_AsText(g1.geom), ST_AsText(g2.geom),
       ST_Distance(g1.geom, g2.geom)
  FROM knntst AS g1
  CROSS JOIN LATERAL (
    SELECT g3.gid, g3.geom
      FROM knntst AS g3
     WHERE g1.gid != g3.gid
     ORDER BY g3.geom <-> g1.geom
     LIMIT 3
  ) AS g2;
The output table contains the 3 nearest neighbours to each of the points in the source table, excluding the point itself.
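For example, for the first point, POINT(0 0), the three nearest neighbours are POINT(1 0), POINT(0 1) and POINT(1 1), with distances 1, 1 and roughly 1.414.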

Thursday, February 19, 2015

WRF+MET tutorial review

So, I've spent the past 10 days in Boulder, CO, attending the WRF tutorial and the MET tutorial. WRF is the Weather Research and Forecasting model and MET is the Model Evaluation Tool; both are (to some degree) developed and maintained at the National Center for Atmospheric Research (NCAR) in Boulder, where the tutorials were held (NCAR is somehow part of another acronym, UCAR, the University Corporation for Atmospheric Research - I don't get the overall connection of all those acronyms; NOAA would be another one, or MMM, or NESL, or NCEP, and there are certainly more). First, a general remark about the whole trip: it was certainly worth it!

Important notice: This report describes my very personal and thus totally subjective impression of the tutorials. It is by no means intended to offend anybody or anything. Just my personal view. Feel free to agree or disagree in the comments.

Part 1: the WRF (pronounce: worf, as in the Klingon original) tutorial

The instructors

There were a number of instructors for all the different aspects of running the WRF system - basically the WRF core developers and some of the support staff. In short: have a look at the authors' list of the WRF manual; these are the people who taught us.
In the following, I will try to explain both the content of the tutorial (like the basic steps for running WRF) and my impression of the tutorial itself.

The build process

Ok, so WRF is an old piece of software. But why on earth hang on to some custom, home-made, weird build system, including a very counter-intuitive directory structure? Why not use available build tools - for the more traditional software developers, autotools (automake/autoconf), or, the more modern way, cmake? There are certainly reasons for the custom system, but they are beyond me. That being said, the build usually works more or less out-of-the-box, provided a reasonable set of dependencies is installed (basically make, C & Fortran compilers, m4, NetCDF, mpich). After successful compilation of WRF and then WPS, there are a number of executables (easily spotted by the very disturbing .exe suffix) in the different directories. Most notably, there are ungrib.exe, geogrid.exe and metgrid.exe as part of WPS, and real.exe (or ideal.exe) and wrf.exe as part of WRF itself. The explanations of the actual WRF build within the tutorial were short, but contained what was needed to build WRF (which was part of the practical sessions).

Data preprocessing

In order to use WRF for a real climate/weather simulation, it is necessary to feed in some data. In particular, an initial state of the whole system is needed, and on top of that some sort of boundary conditions that apply to the spatial borders at as many time steps as possible. The only exception to the need for boundary conditions would be a WRF run using the whole earth as the spatial domain; in this case periodic boundary conditions kick in. However, running WRF as a global model is generally discouraged and using a completely different system (MPAS) is advised. So from now on I assume that boundary conditions are given. Moreover, there is (more or less) static data required, such as topography, land use, the land/sea mask, and so forth. The WRF preprocessing system WPS basically does the following:
  1. Prepare (pronounce: unpack) the data for initial and boundary conditions that come from an external source (such as GFS) and that are given as GRIB (or GRIB2) files. This is what the executable ungrib.exe does. More precisely, it converts the data from GRIB files into an intermediate file format, which is again something completely custom-made for WRF.
  2. Prepare the static data such as land-sea mask. This is what the executable geogrid.exe does.
  3. Combine the two and reproject the data blob to the projection that WRF is going to use (either Lambert conformal, Polar stereographic, Mercator or regular Lat-Lon). This is what metgrid.exe does. Note that only horizontal interpolation is performed at this stage, the vertical interpolation is part of the next step. The resulting files consist of the complete initial conditions and the boundary conditions for successive timesteps.
The output of WPS consists of NetCDF files that contain the initial and boundary conditions for the spatial and temporal domain and that can then be processed by WRF. Important configuration options for WPS include the valid times of the data source, the projection and the extent of the spatial domain. To actually run ungrib.exe, the given GRIB files need to follow a very specific naming convention, which is another weird inconvenience, but not a show-stopper (just like most of the system's little and large weirdnesses). Also, it is not entirely clear which GRIB files, and in particular which projections, are supported (there are GRIB files which contain Gaussian grids, or triangular grids). Moreover, different GRIB files require different tables that define which data is to be extracted from the files. Those tables are again in a very specific format, but luckily there are pre-defined tables for the most common GRIB files. The explanations in the tutorial may have been confusing for people who have never dealt with GRIB files, but for me they were fine.
The static data ships as a separate tarball with WPS and it is actually possible to enhance it with custom data (for example higher resolution data in specific areas). However, and again, the file format for this static data is totally non-standard, and GIS people will probably throw their hands up in horror. There are a number of standard formats for raster data out there and, again, it is beyond me why none of them is used. But anyway, running the three preprocessing steps usually poses no big trouble and is also quite fast, as it basically consists of data conversion and (horizontal) interpolation.
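To make the above concrete, here is a minimal sketch of the three WPS steps as a small Python driver (purely illustrative: the WPS path, the GFS file pattern and the Vtable choice are assumptions, and all the actual settings come from namelist.wps, which is assumed to be configured already):

import glob
import subprocess

WPS_DIR = "/path/to/WPS"   # assumed location of a compiled WPS

def run(cmd):
    # run a command inside the WPS directory and fail loudly on errors
    subprocess.run(cmd, cwd=WPS_DIR, check=True)

# step 1: link the GRIB files under the expected names, pick a matching Vtable
# and unpack them into the intermediate format (ungrib.exe)
run(["./link_grib.csh"] + sorted(glob.glob("/data/gfs/gfs.t00z.pgrb2*")))
run(["ln", "-sf", "ungrib/Variable_Tables/Vtable.GFS", "Vtable"])
run(["./ungrib.exe"])

# step 2: interpolate the static data (topography, land use, ...) to the model grid
run(["./geogrid.exe"])

# step 3: combine both and interpolate horizontally to the model grid
run(["./metgrid.exe"])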
A short note on the configuration options: all the options for WPS (and the same is true for WRF) are given in a file called namelist.wps, a Fortran namelist file consisting of named options grouped into named sections. That is per se not a bad choice; however, there is one thing that strikes me as rather odd: within the actual WRF source code, all options are accessed from one central dictionary, no matter which section they were in within the namelist file. That means that despite the different sections, no option name may occur twice, even across different sections. But I may be wrong about that conclusion (and it would only be a problem for people actually extending WRF by adding new options, not for the standard user).
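As an aside, the namelist files are easy to manipulate programmatically, which is handy for automated runs; a minimal sketch using the third-party f90nml Python package (file name and option values are just examples):

import f90nml

# read namelist.wps; the result behaves like a dict of dicts, one per namelist group
nml = f90nml.read("namelist.wps")

# adjust the simulation period in the &share group (example values for a single domain)
nml["share"]["start_date"] = "2015-02-19_00:00:00"
nml["share"]["end_date"] = "2015-02-20_00:00:00"
nml["share"]["interval_seconds"] = 21600

# write the modified namelist back (force=True overwrites the existing file)
nml.write("namelist.wps", force=True)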
The tutorial gave a very good introduction to WPS and how to prepare the data such that WRF can be run successfully. The exercises in the practice sessions were prepared thoroughly, with input data ready to be fed into WPS. Additionally, there was one exercise without given data or goal - a do-it-your-own-way task, which I really enjoyed. Feeding in the standard data (that would be GFS) is really not that hard and already gets you started.

Running WRF

Running the main WRF system is a two-step process. First, the input data as given by WPS needs to be interpolated vertically (remember, WPS only interpolates horizontally). This is a strict requirement, since the WRF model uses its own notion of vertical layers, which is quite different from standard pressure levels or height above ground/mean sea level. This is done by the program real.exe - as the name hints, this is for "real" data, in contrast to idealized cases, where WPS is not required and all initial and boundary conditions are created programmatically, but that is a different story I will not go into here. The real.exe program creates NetCDF files called wrfinput and wrfbdy (short for boundary) containing all the necessary input data to do the real work - run wrf.exe.
All the steps taken so far are not very computationally intensive, as they mainly consist of data unpacking and interpolation. But running WRF itself tends to be much more demanding, hence it should be run on more than one core. Luckily, WRF supports MPI out of the box, so no additional obstacle here: simply run wrf.exe on as many cores/machines as are available. As it turns out, MPI does not play well with NAT, so I could not find a way to run WRF in VMs/containers distributed over different physical machines (I tried this several months ago). Unless I am missing some non-obvious network configuration, it seems highly complicated or plainly impossible to do so. So the advice is: run WRF on dedicated machines, and all is well.
The tutorial covered a great deal of information on what wrf.exe actually does and what the numerous configuration options do. In short, the (partial) differential equations are solved by a Runge-Kutta method, i.e. the time integration is done by RK while the spatial derivatives are resolved by finite differences. One noteworthy detail is that the RK step is split to account for two different time scales (i.e. the system to be solved is something like a slow-fast system, although the time scales are not that far apart, maybe by a factor of 100-1000). I might have misunderstood some of the details, but the basic mechanism should be correct. The actual atmospheric model is described by a physics module which is highly customizable and takes various options. In particular, the models for the planetary boundary layer and for clouds are crucial, plus a lot more physical parameterizations that I have already forgotten.
The tutorial covered most of the options and the physics schemes behind them, but of course there was not enough time to go into all the details. I really missed some sort of "cheat sheet" for WRF: a short summary of all the options and a one-liner on what is important at which grid sizes. Another aspect that was missing from the tutorial (although it is not strictly in the scope of a general WRF tutorial) was tips and tricks for actually running WRF in a (semi-)productive way, i.e. hardware recommendations or what tools to use for job management.
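For completeness, the two-step run itself, again as a Python sketch (the run directory and core counts are assumptions; namelist.input is assumed to be configured and the metgrid output linked into the run directory):

import subprocess

WRF_RUN_DIR = "/path/to/WRF/run"   # assumed run directory

def run(cmd):
    subprocess.run(cmd, cwd=WRF_RUN_DIR, check=True)

# step 1: vertical interpolation, creates the wrfinput and wrfbdy files
run(["mpirun", "-np", "4", "./real.exe"])

# step 2: the actual model run, distributed over the available cores via MPI
run(["mpirun", "-np", "16", "./wrf.exe"])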


Postprocessing

In short, postprocessing WRF data is a disaster. The WRF output consists of NetCDF files, but there are two problems: the data is on a "staggered" grid, meaning that most variables are located at the centers of the grid cells while the winds are located at the edges (there is also data located at the vertical edges instead of the centers, but this is usually not that important for further processing), and the data is on WRF-specific vertical levels. So there are two things that need to be done: unstagger the grid and interpolate to pressure/height levels.
Unfortunately, there is no single usable way to do this in a clean and sane manner. Instead, there are a number of postprocessing tools, none of which works straight out-of-the-box. The tool that I finally managed to run in the practice session is called UPP (Unified Post Processor); it generates an obscene amount of support files and requires an insane configuration, but eventually it can generate standard GRIB(1) files, which is all I need. The NCAR script language NCL (which is also capable of doing graphics) should also be able to do this, but I didn't get that far. At least I did manage to use cdo (yes, I am very much in favour of using standard and specialized software for each job) for unstaggering the wind data; the vertical interpolation is still an open problem, but it might be possible with cdo, which would take away much of the pain of WRF postprocessing.
Again, it is beyond me why the decision was made not to support one single (and simple) postprocessing tool that converts the WRF output to some usable data format (or, even better, to provide a working cdo configuration) and lets the user decide what software to use from that point onwards, but instead to pseudo-support a number of more or less unusable and cryptic tools.
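To illustrate the unstaggering step, here is a minimal sketch using Python with netCDF4 and numpy (the file name is an assumption); for U and V, destaggering simply means averaging the two neighbouring staggered values onto the cell centers:

import numpy as np
from netCDF4 import Dataset

nc = Dataset("wrfout_d01_2015-02-19_00:00:00")   # assumed WRF output file

# U is staggered in x (west_east_stag), V is staggered in y (south_north_stag)
u_stag = nc.variables["U"][:]   # (Time, bottom_top, south_north, west_east_stag)
v_stag = nc.variables["V"][:]   # (Time, bottom_top, south_north_stag, west_east)

# average two neighbouring staggered values to get the value at the cell center
u = 0.5 * (u_stag[..., :-1] + u_stag[..., 1:])
v = 0.5 * (v_stag[..., :-1, :] + v_stag[..., 1:, :])

print(u.shape, v.shape)   # both are now on the mass grid (south_north, west_east)
nc.close()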

Part 2: The MET tutorial

Well, most participants actually attended because they were in Boulder for the WRF tutorial and decided to stay for two additional days. For me it was quite the opposite. Currently I am in the process of building a large model evaluation system, and I was curious what exactly the NCAR Model Evaluation Tool does (and what model evaluation is all about). The reason is simple: I don't have much (any) experience in model evaluation, and maybe I will include MET as an option in my own system. But after attending the tutorial I have to admit that this is not very likely; it simply does not meet my demands for performance, features and extensibility.

Preprocessing the data

Depending on the source data, some preprocessing is needed. Point observation data may come in one of the more or less obscure PrepBUFR or MADIS formats, or in a simple ASCII format, and is transformed into a NetCDF file. Gridded data does not need special preprocessing; it can be processed directly as GRIB1, GRIB2 or CF-compliant NetCDF. Of course, when verifying gridded data against gridded data, both data sets need to be on the same projection (the developers are thinking about including reprojection in the next MET version, which is, in my opinion, a big mistake and completely unnecessary, since there are standard tools that can handle the reprojection beforehand). Moreover, there is a custom tool to define a "polygon" (a list of lon/lat pairs, not a polygon in the GIS sense) that can be used as a mask for the gridded data. Again, why there is a custom tool and a custom "polygon" format instead of using, say, a well-known text (WKT) definition of standard GIS objects is beyond me.


Statistics

The main output of the MET system is statistics of the data, i.e. first and second order statistics, skill scores and confidence intervals. The tutorial covered a good portion of the background for this, but of course it was no replacement for a statistics 101 class. MET also provides tools for combining the statistics of multiple MET runs, which allows large data sets to be processed in chunks.
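Just to make that concrete, the basic continuous statistics are simple quantities; a toy sketch in Python/numpy (the numbers are made up, this is not MET code):

import numpy as np

fcst = np.array([2.0, 1.5, 0.0, 3.2, 4.1])   # made-up forecast values
obs = np.array([1.8, 2.0, 0.3, 2.9, 5.0])    # made-up observations

bias = np.mean(fcst - obs)                   # first order: mean error
rmse = np.sqrt(np.mean((fcst - obs) ** 2))   # second order: root mean square error
corr = np.corrcoef(fcst, obs)[0, 1]          # Pearson correlation

print(bias, rmse, corr)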

Advanced grid evaluation

The most interesting topic for me was the grid-versus-grid evaluation, which can be used to compare two different model runs or to verify model data against gridded observational data (e.g. radar or satellite). It appears that this type of comparison is a rather new topic, and unfortunately the most interesting methods present in MET were hardly explained (wavelet-based methods, optical-flow-based methods). Another approach was explained in a bit more detail: the object-based evaluation (called MODE in MET-speak). But that explanation was not really satisfying either, as it essentially consists of well-known methods from image processing (segmentation, classification, tracking), inconsistently renamed. Actually, this was the one occasion where I lost my good mood during the tutorial, because re-implementing existing methods and inventing a whole new nomenclature for them is simply evil, scientifically speaking. At least I had the opportunity to talk to one of the experts on spatial verification, and he pointed me to a list of resources to read through, which already helps a lot.


Summary

As I've already mentioned, the trip was certainly worth it. The WRF tutorial was really good: most lectures were clear and actually entertaining, the practice sessions were helpful, and in the end I felt able to understand and run a WRF system. Moreover, now I really want to run WRF myself, which is probably the biggest compliment for the instructors.
On the other hand, there was the MET tutorial, which was interesting in its own right. I have to admit that I was expecting a bit more, but at least the tutorial gave me some insights into the basics of model evaluation, and now I know that I am following the right track with my own evaluation system.
I have probably forgotten to write about a lot of things, but I would like to mention one more aspect of the tutorial: the participants. For the WRF tutorial there were roughly 50 people, maybe half (or two-thirds) from academia and the rest from industry, and from all over the world. The people I met there were all very nice, and I think we all had a really good time in Boulder!

Monday, November 24, 2014

Regridding spatial data

This post is about one of the most difficult tasks in processing spatial data, in particular gridded data: regridding. Regridding is all about "warping" a given data blob into a different grid. First of all, it is tremendously helpful to really know what grid the original data is in. Here are a few examples:
Regular lon/lat grid:
Given the corner coordinate as a longitude/latitude pair and the per-pixel offset in degrees, this is one of the easiest formats for gridded spatial data. In terms of projections, this can be expressed as a standard epsg:4326 projection (disregarding subtle differences in the meaning of longitude and latitude depending on the "datum" used, i.e. the model of the earth), meaning that all longitude/latitude coordinate pairs are simply "flattened" to the plane. Dealing with this kind of data is usually pleasant; gdal and other spatial libraries fully support it, no big deal. The one exception: the notion of longitude/latitude in commonly used "grib" files (a data format often used by numerical weather models) is slightly (usually negligibly) different from the standard WGS84 datum.
Rotated lon/lat grid:
The regular lon/lat grid has (at least) one major shortcoming: considering the whole globe, it is far too detailed near the poles and too coarse at the equator. The reason is simply that the number of grid points along each latitude circle is constant, while the circles themselves shrink towards the poles. So if the region of interest is a small area, the accuracy can be improved considerably by "virtually" rotating the pole such that the region of interest ends up near the rotated equator, where the grid spacing is most uniform. Although this rotation does not pose a huge problem computationally, not all libraries can properly deal with data files that contain data in this form (again, "grib" supports this). Also, it is usually a bit more tricky to formulate a simple projection definition for this kind of data.
Gaussian grid:
The Gaussian grid basically uses non-evenly spaced latitude values in combination with evenly spaced longitude values (per latitude the longitude values are evenly spaced; in a reduced Gaussian grid the number of longitude values may even differ between latitudes) - a short sketch for computing these latitudes follows after the grid overview. Again, this grid type is not terribly complicated, but it is also not commonly supported (no easy projection definition). Luckily, the errors introduced by treating a Gaussian grid like a regular grid are usually negligible (and mostly close to the poles anyway).
Triangular/hexagonal grid:
This is probably the most complicated grid type (apart from completely irregular grids). There is no projection definition, no support in most libraries, and dealing with this kind of grid usually boils down to handling it just like a completely irregular grid, meaning one needs to extract the coordinates of the cells (usually the cell centers).
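To illustrate the Gaussian grid mentioned above: its latitudes are the arcsines of the Gauss-Legendre quadrature nodes, so they are easy to compute; a minimal sketch in Python/numpy (N = 32 is just an example):

import numpy as np

N = 32   # example: a Gaussian grid with 32 latitudes
nodes, weights = np.polynomial.legendre.leggauss(N)   # roots of the Legendre polynomial P_N
lats = np.degrees(np.arcsin(nodes))   # the N Gaussian latitudes, non-evenly spaced

print(lats)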

So, in order to use the data, some kind of transformation is necessary. That can be either knowing the coordinates (in lon/lat) of each point, or warping the whole data blob onto some well-known grid. For simple grids, it is as easy as using the standard gdal tool gdalwarp. For more complicated grids, however, this is not enough. One solution that seems to work for transforming many kinds of grids back and forth is "cdo", which is also able to interpolate data from one grid to another. It appears that "cdo" is not very widely known, but if grid transformation is an issue, it is definitely worth a try!
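As a small example of what this looks like in practice, here is a sketch that remaps a NetCDF file onto a regular 1-degree lon/lat grid with cdo, driven from Python (the file names are assumptions; the grid description follows cdo's standard format):

import subprocess

# target grid: a global, regular 1x1 degree lon/lat grid in cdo's grid description format
grid = """gridtype = lonlat
xsize    = 360
ysize    = 180
xfirst   = -179.5
xinc     = 1.0
yfirst   = -89.5
yinc     = 1.0
"""
with open("target_grid.txt", "w") as f:
    f.write(grid)

# bilinear remapping of the input data onto the target grid
subprocess.run(["cdo", "remapbil,target_grid.txt", "input.nc", "output_regular.nc"],
               check=True)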

Update #1:
Although cdo is indeed capable of dealing with/regridding triangular or unstructured grids, unfortunately it has some issues when the input is a grib file (issues = segmentation faults). So the "easy" solution is to convert the grib file to a different file format (this, for some reason, is possible with cdo), such as NetCDF. Again, there is a "but": gribs with triangular meshes cannot be translated to NetCDF directly. Instead, the grib data needs to be transferred to a different format, along with a manual definition of the grid. The result can then be converted to NetCDF and finally warped to a different grid.

Update #2:
Luckily, cdo is under active development and the bug with grib files has been fixed (it will be in version 1.6.7)!