Data Structures

  • xarray has two fundamental data structures:
    • DataArray, which holds a single multi-dimensional variable and its coordinates
    • Dataset, which holds multiple variables that potentially share the same coordinates

DataArray

The DataArray is xarray’s implementation of a labeled, multi-dimensional array. It has several key properties:

Attribute  Description
data       numpy.ndarray or dask.array holding the array’s values
dims       dimension names for each axis, for example (x, y, z) or (lat, lon, time)
coords     a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
attrs      an OrderedDict to hold arbitrary attributes/metadata (such as units)
name       an arbitrary name of the array
# Import packages
import numpy as np
import xarray as xr
# Create some sample data
data = 2 + 6 * np.random.exponential(size=(5, 3, 4))
data
array([[[ 4.43508091,  2.15124776,  2.10558373, 19.94490096],
        [ 2.60872261,  3.52042546, 18.63140782, 13.72228345],
        [27.30081072,  5.96095059,  2.91805875, 17.6037812 ]],

       [[ 8.74433669,  8.27582507,  7.03698317,  3.31496248],
        [ 4.87882522,  4.70457746,  3.64326797,  2.7801325 ],
        [ 8.78350864,  2.13838567,  2.85726807,  3.9849055 ]],

       [[ 8.6100285 ,  3.51406472, 18.56975917,  3.39210852],
        [ 3.36315056,  5.33039337,  6.57733997,  5.73891663],
        [14.24651928,  2.14877667, 16.73808201, 15.76765027]],

       [[ 5.64740648, 17.38358732,  6.21165503, 17.54573517],
        [12.23201627,  2.88654703,  2.78561898,  2.13348846],
        [ 2.38664008,  5.1335385 , 24.82225676,  7.95046948]],

       [[ 8.41788254, 16.13446971,  4.76249737,  3.26780217],
        [16.20042451,  6.25651099, 11.77345362,  2.22748438],
        [11.94589889,  3.79870376, 10.81441387,  8.1399644 ]]])

To create a basic DataArray, you can pass this NumPy array of random data to xr.DataArray:

prec = xr.DataArray(data)
prec
<xarray.DataArray (dim_0: 5, dim_1: 3, dim_2: 4)>
array([[[ 4.435081,  2.151248,  2.105584, 19.944901],
        [ 2.608723,  3.520425, 18.631408, 13.722283],
        [27.300811,  5.960951,  2.918059, 17.603781]],

       [[ 8.744337,  8.275825,  7.036983,  3.314962],
        [ 4.878825,  4.704577,  3.643268,  2.780133],
        [ 8.783509,  2.138386,  2.857268,  3.984906]],

       [[ 8.610028,  3.514065, 18.569759,  3.392109],
        [ 3.363151,  5.330393,  6.57734 ,  5.738917],
        [14.246519,  2.148777, 16.738082, 15.76765 ]],

       [[ 5.647406, 17.383587,  6.211655, 17.545735],
        [12.232016,  2.886547,  2.785619,  2.133488],
        [ 2.38664 ,  5.133539, 24.822257,  7.950469]],

       [[ 8.417883, 16.13447 ,  4.762497,  3.267802],
        [16.200425,  6.256511, 11.773454,  2.227484],
        [11.945899,  3.798704, 10.814414,  8.139964]]])
Dimensions without coordinates: dim_0, dim_1, dim_2

NOTE:

Because no dimension names were supplied, xarray automatically generated default ones for us (dim_0, dim_1, dim_2).
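As a small illustration of the default naming, the sketch below wraps a 2-D array of zeros (chosen only for this example) and checks the generated names. It also renames the dimensions afterwards with DataArray.rename; the names 'lat' and 'lon' are arbitrary choices for the example.

```python
import numpy as np
import xarray as xr

# Wrap a 2-D array without giving any dimension names
arr = xr.DataArray(np.zeros((2, 3)))

# xarray falls back to positional names: dim_0, dim_1, ...
print(arr.dims)   # ('dim_0', 'dim_1')
print(arr.shape)  # (2, 3)

# Default names can be replaced afterwards with a mapping of old -> new
renamed = arr.rename({'dim_0': 'lat', 'dim_1': 'lon'})
print(renamed.dims)  # ('lat', 'lon')
```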


You can also pass in your own dimension names and coordinate values:

# Use pandas to create an array of datetimes
import pandas as pd
times = pd.date_range('2019-04-01', periods=5)
times
DatetimeIndex(['2019-04-01', '2019-04-02', '2019-04-03', '2019-04-04',
               '2019-04-05'],
              dtype='datetime64[ns]', freq='D')
# Use numpy to create arrays of longitude and latitude values
lons = np.linspace(-150, -60, 4)
lats = np.linspace(10, 80, 3)
lons, lats
(array([-150., -120.,  -90.,  -60.]), array([10., 45., 80.]))
coords = {'time': times, 'lat': lats, 'lon': lons}
dims = ['time', 'lat', 'lon']
# Add name, coords, dims to our data
prec = xr.DataArray(data, dims=dims, coords=coords, name='prec')
prec
<xarray.DataArray 'prec' (time: 5, lat: 3, lon: 4)>
array([[[ 4.435081,  2.151248,  2.105584, 19.944901],
        [ 2.608723,  3.520425, 18.631408, 13.722283],
        [27.300811,  5.960951,  2.918059, 17.603781]],

       [[ 8.744337,  8.275825,  7.036983,  3.314962],
        [ 4.878825,  4.704577,  3.643268,  2.780133],
        [ 8.783509,  2.138386,  2.857268,  3.984906]],

       [[ 8.610028,  3.514065, 18.569759,  3.392109],
        [ 3.363151,  5.330393,  6.57734 ,  5.738917],
        [14.246519,  2.148777, 16.738082, 15.76765 ]],

       [[ 5.647406, 17.383587,  6.211655, 17.545735],
        [12.232016,  2.886547,  2.785619,  2.133488],
        [ 2.38664 ,  5.133539, 24.822257,  7.950469]],

       [[ 8.417883, 16.13447 ,  4.762497,  3.267802],
        [16.200425,  6.256511, 11.773454,  2.227484],
        [11.945899,  3.798704, 10.814414,  8.139964]]])
Coordinates:
  * time     (time) datetime64[ns] 2019-04-01 2019-04-02 ... 2019-04-05
  * lat      (lat) float64 10.0 45.0 80.0
  * lon      (lon) float64 -150.0 -120.0 -90.0 -60.0

This is already an improvement over the original NumPy array, because each dimension (or axis, in NumPy parlance) now has a name and labeled coordinate values.
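To make the benefit of named dimensions concrete, here is a minimal sketch comparing a positional NumPy reduction with the same reduction by name. It builds its own small stand-in array (the values and the variable names da, values, and row are made up for this example) rather than reusing prec.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in for the precipitation array; values are arbitrary
values = np.arange(24, dtype=float).reshape(2, 3, 4)
da = xr.DataArray(
    values,
    dims=['time', 'lat', 'lon'],
    coords={
        'time': pd.date_range('2019-04-01', periods=2),
        'lat': [10.0, 45.0, 80.0],
        'lon': [-150.0, -120.0, -90.0, -60.0],
    },
)

# NumPy: you have to remember that axis 0 is time
numpy_mean = values.mean(axis=0)

# xarray: reduce over the dimension by name
time_mean = da.mean(dim='time')

# Label-based selection instead of positional indexing
row = da.sel(lat=45.0)

print(time_mean.dims)  # ('lat', 'lon')
print(row.dims)        # ('time', 'lon')
```

Both reductions give the same numbers; the xarray version simply does not depend on knowing the axis order.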

We can also add attributes to an existing DataArray:

prec.attrs['units'] = 'mm'
prec.attrs['standard_name'] = 'precipitation'
prec
<xarray.DataArray 'prec' (time: 5, lat: 3, lon: 4)>
array([[[ 4.435081,  2.151248,  2.105584, 19.944901],
        [ 2.608723,  3.520425, 18.631408, 13.722283],
        [27.300811,  5.960951,  2.918059, 17.603781]],

       [[ 8.744337,  8.275825,  7.036983,  3.314962],
        [ 4.878825,  4.704577,  3.643268,  2.780133],
        [ 8.783509,  2.138386,  2.857268,  3.984906]],

       [[ 8.610028,  3.514065, 18.569759,  3.392109],
        [ 3.363151,  5.330393,  6.57734 ,  5.738917],
        [14.246519,  2.148777, 16.738082, 15.76765 ]],

       [[ 5.647406, 17.383587,  6.211655, 17.545735],
        [12.232016,  2.886547,  2.785619,  2.133488],
        [ 2.38664 ,  5.133539, 24.822257,  7.950469]],

       [[ 8.417883, 16.13447 ,  4.762497,  3.267802],
        [16.200425,  6.256511, 11.773454,  2.227484],
        [11.945899,  3.798704, 10.814414,  8.139964]]])
Coordinates:
  * time     (time) datetime64[ns] 2019-04-01 2019-04-02 ... 2019-04-05
  * lat      (lat) float64 10.0 45.0 80.0
  * lon      (lon) float64 -150.0 -120.0 -90.0 -60.0
Attributes:
    units:          mm
    standard_name:  precipitation
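
Each of the key properties listed earlier can be read back directly from a DataArray. The following is a minimal sketch on a small array built just for this example (the shape, coordinate values, and the variable name da are assumptions, not part of the tutorial data):

```python
import numpy as np
import xarray as xr

# A tiny DataArray with every key property set explicitly
da = xr.DataArray(
    np.ones((2, 3)),
    dims=['lat', 'lon'],
    coords={'lat': [10.0, 80.0], 'lon': [-150.0, -105.0, -60.0]},
    name='prec',
    attrs={'units': 'mm'},
)

print(da.name)            # prec
print(da.dims)            # ('lat', 'lon')
print(da.attrs['units'])  # mm
print(sorted(da.coords))  # ['lat', 'lon']

# .data is the underlying array (a numpy.ndarray in this case)
print(isinstance(da.data, np.ndarray))  # True
```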

Dataset

  • Xarray’s Dataset is a dict-like container of labeled arrays (DataArrays) with aligned dimensions.
    • It is designed as an in-memory representation of a netCDF dataset.
  • In addition to the dict-like interface of the dataset itself, which can be used to access any DataArray in a Dataset, datasets have the following key properties:
Attribute  Description
data_vars  OrderedDict of DataArray objects corresponding to data variables
dims       dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'lat': 6, 'lon': 6, 'time': 8})
coords     a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
attrs      OrderedDict to hold arbitrary metadata pertaining to the dataset
  • DataArray objects inside a Dataset may have any number of dimensions but are presumed to share a common coordinate system.
  • Coordinates can also have any number of dimensions but denote constant/independent quantities, unlike the varying/dependent quantities that belong in data.
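
The dict-like interface described above can be sketched on a small Dataset. The variable names 'a' and 'b', the dimension names, and the values are all hypothetical and chosen only for this example; note that the two variables have different numbers of dimensions but share the x coordinate.

```python
import numpy as np
import xarray as xr

# Two variables with different dimensionality sharing the 'x' coordinate
ds = xr.Dataset(
    {
        'a': (('x', 'y'), np.zeros((2, 3))),
        'b': (('x',), np.arange(2)),
    },
    coords={'x': [10, 20]},
)

# Dict-like access: membership test, key lookup, listing the keys
print('a' in ds)             # True
print(sorted(ds.data_vars))  # ['a', 'b']
print(ds['b'].dims)          # ('x',)

# Dimension lengths are stored once for the whole Dataset
print(ds.sizes['x'], ds.sizes['y'])  # 2 3
```

Indexing with a variable name returns a DataArray that carries the shared coordinates with it.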

To create a Dataset from scratch, we can supply dictionaries for the variables (data_vars), coordinates (coords) and attributes (attrs). Here only data_vars is passed, and the coordinates are taken from the DataArray itself:

dset = xr.Dataset({'precipitation' : prec})
dset
<xarray.Dataset>
Dimensions:        (lat: 3, lon: 4, time: 5)
Coordinates:
  * time           (time) datetime64[ns] 2019-04-01 2019-04-02 ... 2019-04-05
  * lat            (lat) float64 10.0 45.0 80.0
  * lon            (lon) float64 -150.0 -120.0 -90.0 -60.0
Data variables:
    precipitation  (time, lat, lon) float64 4.435 2.151 2.106 ... 10.81 8.14
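
For comparison, a Dataset can also be built with all three dictionaries passed explicitly. This is a minimal sketch with made-up values; the 'title' attribute and the two-dimensional shape are assumptions for the example. Each data variable is given as a (dims, values) tuple instead of a ready-made DataArray.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2019-04-01', periods=2)
lats = [10.0, 45.0, 80.0]

ds = xr.Dataset(
    # variable name -> (dimension names, values)
    data_vars={'precipitation': (('time', 'lat'), np.ones((2, 3)))},
    # coordinate name -> labels along that dimension
    coords={'time': times, 'lat': lats},
    # free-form metadata for the whole dataset
    attrs={'title': 'toy example'},
)

print(ds.attrs['title'])          # toy example
print(ds['precipitation'].shape)  # (2, 3)
```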

Let’s add a toy temperature DataArray to this existing dataset:

temp_data = 283 + 5 * np.random.randn(5, 3, 4)
temp = xr.DataArray(data=temp_data, dims=['time', 'lat', 'lon'], 
                    coords={'time': times, 'lat': lats, 'lon': lons},
                    name='temp',
                    attrs={'standard_name': 'air_temperature', 'units': 'kelvin'})
temp
<xarray.DataArray 'temp' (time: 5, lat: 3, lon: 4)>
array([[[291.067308, 282.59583 , 286.514387, 279.784996],
        [280.69826 , 277.181048, 276.998507, 283.234755],
        [288.075202, 288.162201, 282.123016, 279.179585]],

       [[284.365503, 285.320669, 277.804468, 292.489392],
        [281.912137, 271.326958, 275.402304, 283.102304],
        [283.385009, 285.068898, 282.416826, 286.847279]],

       [[284.225161, 286.364076, 279.136546, 283.171794],
        [288.33622 , 286.568405, 283.427888, 287.146929],
        [283.232354, 284.007581, 282.913762, 284.85474 ]],

       [[282.210031, 278.921674, 295.395575, 282.505313],
        [280.417009, 279.191678, 278.279562, 281.70787 ],
        [281.791588, 283.762458, 287.691885, 282.041789]],

       [[280.795983, 282.317272, 289.734355, 286.826475],
        [284.127492, 294.012677, 286.166998, 281.696582],
        [288.024786, 280.485435, 275.519706, 282.380892]]])
Coordinates:
  * time     (time) datetime64[ns] 2019-04-01 2019-04-02 ... 2019-04-05
  * lat      (lat) float64 10.0 45.0 80.0
  * lon      (lon) float64 -150.0 -120.0 -90.0 -60.0
Attributes:
    standard_name:  air_temperature
    units:          kelvin
# Now add this data array to our existing dataset
dset['temperature'] = temp
dset.attrs['history'] = 'Created for the xarray tutorial'
dset.attrs['author'] = 'foo and bar'
dset
<xarray.Dataset>
Dimensions:        (lat: 3, lon: 4, time: 5)
Coordinates:
  * time           (time) datetime64[ns] 2019-04-01 2019-04-02 ... 2019-04-05
  * lat            (lat) float64 10.0 45.0 80.0
  * lon            (lon) float64 -150.0 -120.0 -90.0 -60.0
Data variables:
    precipitation  (time, lat, lon) float64 4.435 2.151 2.106 ... 10.81 8.14
    temperature    (time, lat, lon) float64 291.1 282.6 286.5 ... 275.5 282.4
Attributes:
    history:  Created for the xarray tutorial
    author:   foo and bar
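
Because the variables in a Dataset share coordinates, label-based operations apply to all of them at once. The sketch below assumes a small two-variable Dataset with constant, made-up values (5.0 and 283.0) so the results are easy to check; it is not the dset built above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy coordinates and values; all numbers are made up for illustration
times = pd.date_range('2019-04-01', periods=2)
lats = [10.0, 45.0, 80.0]
dims = ('time', 'lat')

ds = xr.Dataset(
    {
        'precipitation': (dims, np.full((2, 3), 5.0)),
        'temperature': (dims, np.full((2, 3), 283.0)),
    },
    coords={'time': times, 'lat': lats},
)

# Selecting by label subsets every data variable at once
at_45 = ds.sel(lat=45.0)

# Reductions by dimension name also apply to every variable
time_mean = ds.mean(dim='time')

print(at_45['temperature'].dims)        # ('time',)
print(time_mean['precipitation'].dims)  # ('lat',)
```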

Going Further:

Xarray Documentation on Data Structures: http://xarray.pydata.org/en/latest/data-structures.html


%load_ext watermark
%watermark --iversion -g -m -v -u -d
pandas 0.24.2
numpy  1.16.3
xarray 0.12.1
last updated: 2019-05-17 

CPython 3.6.7
IPython 7.5.0

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 18.2.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   : d599de20052dd42e0a80c2b9b98b1b80fec5a0a5