Out-of-core analysis of general circulation models

Transforming large gridded datasets into scientific results requires innovative approaches that combine statistical descriptions with physically motivated analyses. In practice, this means running fairly complex analysis tasks on gridded datasets.

The research software used for these analyses is being challenged by the ongoing evolution of geoscientific models and earth observing networks. Indeed, with the most high-end models being run on several tens of thousands of cores, even a two-dimensional slice of model output cannot be loaded in memory at one time. Model diagnostic tools and gridded data analysis tools should therefore be parallelized and run out-of-core, that is, process the data chunk by chunk without ever holding the full array in memory.

I believe that this is not merely a “technical” problem but a real challenge for our field of research. What we are facing is the practical form taken by one of the big data challenges in earth system science, and we need to address it at the community level.

Great tools have emerged in the Python ecosystem for tackling this question. I am thinking in particular of the xarray and dask packages. I am currently involved in a project that tries to leverage the potential of xarray and dask for the analysis of gridded datasets in earth system science.
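To give a flavour of what this looks like in practice, here is a minimal sketch of a lazy, chunked computation with xarray and dask. It is not code from the project itself: the array is synthetic random data standing in for model output, and the variable name `sst`, the dimension names and the chunk sizes are all assumptions made for illustration.

```python
import dask.array as da
import xarray as xr

# Synthetic stand-in for model output with dimensions (time, y, x).
# Sizes and chunking are arbitrary; the dask array is lazy, so no chunk
# is materialised in memory until compute() is called.
data = da.random.random((24, 100, 150), chunks=(6, 100, 150))
sst = xr.DataArray(data, dims=("time", "y", "x"), name="sst")

# This only builds a task graph; nothing is computed yet.
time_mean = sst.mean(dim="time")

# The reduction is now evaluated chunk by chunk, possibly in parallel.
result = time_mean.compute()
print(result.shape)
```

With real model output, the same pattern would typically start from files opened lazily, for instance with `xr.open_mfdataset` and its `chunks` argument, rather than from a random array.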

More about this project can be found on Read the Docs here and here. Contributions, comments and suggestions are welcome on the project page on GitHub.