GroupBy Operations

xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:

  • Split your data into multiple independent groups.
  • Apply some function to each group.
  • Combine your groups back into a single data object.
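
As a minimal sketch of these three steps, the snippet below builds a small synthetic daily series (the name toy and its values are made up for illustration and are not part of the tutorial data), computes monthly means by hand, and checks that groupby produces the same numbers:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: one value per day for 2013 and 2014.
time = pd.date_range("2013-01-01", "2014-12-31", freq="D")
toy = xr.DataArray(
    np.arange(time.size, dtype="float64"), coords={"time": time}, dims="time"
)

# Split and apply by hand: mask out everything but one calendar month,
# then average what is left (mean skips the masked NaN values).
month = toy["time"].dt.month
manual = np.array(
    [float(toy.where(month == m).mean("time")) for m in range(1, 13)]
)

# Split, apply and combine in a single call.
grouped = toy.groupby("time.month").mean("time")

print(np.allclose(manual, grouped.values))
```

The manual loop and the groupby call should agree; groupby additionally labels the result with a month coordinate.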

Group by operations work on both Dataset and DataArray objects. Most of the examples focus on grouping by a single one-dimensional variable, although grouping over a multi-dimensional variable is also supported:

  • Using groupby to calculate a monthly climatology:
import xarray as xr
da = xr.open_dataarray("../data/air_temperature.nc")
da_climatology = da.groupby('time.month').mean('time')

da_climatology
<xarray.DataArray 'air' (month: 12, lat: 25, lon: 53)>
array([[[246.34987, 246.38608, ..., 244.08795, 245.6467 ],
        [248.8576 , 248.90733, ..., 243.50865, 246.75471],
        ...,
        [296.5446 , 296.46982, ..., 295.0812 , 294.53006],
        [297.15417, 297.2383 , ..., 295.77554, 295.63647]],

       [[246.67715, 246.40576, ..., 243.0021 , 244.44383],
        [247.8001 , 247.75992, ..., 242.26633, 245.06662],
        ...,
        [296.78754, 296.63443, ..., 294.2178 , 293.70258],
        [297.2889 , 297.2165 , ..., 294.9558 , 294.87967]],

       ...,

       [[253.74484, 253.64487, ..., 243.9345 , 245.14209],
        [259.12967, 258.62927, ..., 243.07965, 245.46625],
        ...,
        [298.58783, 298.42026, ..., 298.19397, 297.9083 ],
        [298.81143, 298.8566 , ..., 298.7519 , 298.8189 ]],

       [[247.971  , 248.02118, ..., 241.02383, 242.62823],
        [249.73361, 250.16037, ..., 240.96469, 244.11626],
        ...,
        [297.46814, 297.38025, ..., 296.84668, 296.52133],
        [297.8809 , 297.9868 , ..., 297.5655 , 297.53763]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12

In this case, we group by what xarray refers to as a virtual variable (time.month). Other virtual variables available on a datetime coordinate include year, day, hour, minute, second, dayofyear, week, dayofweek, weekday, quarter and season. It is also possible to use another DataArray or pandas object as the grouper.
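
To illustrate the last point, here is a small sketch that groups by a separate DataArray of labels instead of a virtual variable. The arrays toy and labels, and the label values "a" and "b", are hypothetical; the only requirements assumed here are that the grouper shares the dimension being grouped and has a name:

```python
import pandas as pd
import xarray as xr

# Hypothetical toy data: six daily values.
time = pd.date_range("2013-01-01", periods=6, freq="D")
toy = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], coords={"time": time}, dims="time"
)

# A named DataArray along the same dimension acts as the grouper.
labels = xr.DataArray(
    ["a", "b", "a", "b", "a", "b"],
    coords={"time": time},
    dims="time",
    name="label",
)

# The result has a new "label" dimension with one entry per unique label.
by_label = toy.groupby(labels).mean("time")
print(by_label)
```

The name of the grouper becomes the name of the new dimension, just as month did in the climatology example.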

  • Using groupby to calculate a seasonal median:
da.groupby('time.season').median('time')
<xarray.DataArray 'air' (season: 4, lat: 25, lon: 53)>
array([[[246.2    , 246.54999, ..., 243.19499, 244.7    ],
        [247.7    , 248.     , ..., 242.5    , 245.44   ],
        ...,
        [297.     , 296.9    , ..., 295.345  , 294.845  ],
        [297.5    , 297.6    , ..., 296.     , 296.     ]],

       [[273.19998, 273.1    , ..., 266.79   , 268.55   ],
        [273.79   , 273.9    , ..., 266.5    , 269.     ],
        ...,
        [298.9    , 298.6    , ..., 297.29   , 297.1    ],
        [298.9    , 298.9    , ..., 297.745  , 297.9    ]],

       [[258.6    , 258.29   , ..., 249.945  , 250.89   ],
        [259.345  , 259.4    , ..., 250.04999, 252.09999],
        ...,
        [297.5    , 297.29   , ..., 295.4    , 295.     ],
        [298.     , 298.     , ..., 295.9    , 295.79   ]],

       [[264.245  , 263.6    , ..., 249.19998, 250.89   ],
        [270.5    , 270.55   , ..., 248.945  , 252.     ],
        ...,
        [299.19998, 299.     , ..., 299.     , 298.9    ],
        [299.19998, 299.29   , ..., 299.5    , 299.6    ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * season   (season) object 'DJF' 'JJA' 'MAM' 'SON'

Resampling Operations

xarray provides a resample convenience method for frequency conversion and resampling of time-series data. We start from the 6-hourly data loaded above:

da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
[3869000 values with dtype=float32]
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
  • Downsample our 6-hourly time-series data to quarterly data:
da1 = da.resample(time='QS').mean(dim='time')
da1
<xarray.DataArray 'air' (time: 8, lat: 25, lon: 53)>
array([[[244.61775, 244.4874 , ..., 243.6617 , 244.84286],
        [246.70831, 246.60774, ..., 243.09488, 245.42445],
        ...,
        [296.29684, 296.1032 , ..., 295.27814, 294.85345],
        [296.90457, 296.85693, ..., 295.94586, 295.85483]],

       [[266.05133, 265.95355, ..., 256.4855 , 258.0242 ],
        [266.68463, 266.89017, ..., 256.30783, 258.96777],
        ...,
        [297.93405, 297.69324, ..., 296.312  , 295.90536],
        [298.18423, 298.11508, ..., 296.65125, 296.639  ]],

       ...,

       [[272.5132 , 272.365  , ..., 262.73245, 264.31403],
        [273.85675, 274.0403 , ..., 262.66376, 265.30276],
        ...,
        [299.58566, 299.3528 , ..., 298.18146, 297.92966],
        [299.52676, 299.55106, ..., 298.55417, 298.6812 ]],

       [[254.5719 , 254.2065 , ..., 245.83794, 247.16304],
        [258.69034, 258.4616 , ..., 245.36269, 248.19038],
        ...,
        [298.82498, 298.7005 , ..., 298.20093, 297.89267],
        [299.03397, 299.12115, ..., 298.70218, 298.74704]]], dtype=float32)
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 2013-04-01 ... 2014-10-01
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
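  • Resample our 6-hourly time-series data onto a daily grid using linear interpolation (the same call upsamples when the target frequency is finer than the data):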
da.resample(time='1D').interpolate('linear')
<xarray.DataArray 'air' (time: 730, lat: 25, lon: 53)>
array([[[241.199997, 242.5     , ..., 235.5     , 238.599991],
        [243.799988, 244.5     , ..., 235.299988, 239.299988],
        ...,
        [295.899994, 296.199982, ..., 295.899994, 295.199982],
        [296.290009, 296.790009, ..., 296.790009, 296.600006]],

       [[243.199997, 243.099991, ..., 238.799988, 240.889999],
        [246.389999, 245.299988, ..., 234.889999, 237.199997],
        ...,
        [297.290009, 297.399994, ..., 296.5     , 296.290009],
        [297.790009, 298.100006, ..., 297.399994, 297.399994]],

       ...,

       [[253.299988, 254.299988, ..., 250.389999, 249.189987],
        [256.5     , 258.      , ..., 252.189987, 252.889999],
        ...,
        [297.889984, 297.889984, ..., 296.199982, 295.5     ],
        [298.889984, 298.790009, ..., 296.290009, 296.      ]],

       [[242.48999 , 242.389999, ..., 246.789993, 247.289993],
        [248.389999, 248.789993, ..., 241.98999 , 243.789993],
        ...,
        [296.98999 , 298.389984, ..., 295.589996, 294.790009],
        [298.290009, 299.290009, ..., 295.98999 , 295.48999 ]]])
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 2013-01-02 ... 2014-12-31
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
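
The difference between downsampling and upsampling is easier to see on a tiny series. The sketch below uses hypothetical daily values (toy is not part of the tutorial data). It avoids interpolate, which relies on SciPy, and instead uses asfreq and ffill to show where the new time steps appear and one simple way to fill them:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: four daily values.
time = pd.date_range("2013-01-01", periods=4, freq="D")
toy = xr.DataArray([0.0, 2.0, 4.0, 6.0], coords={"time": time}, dims="time")

# Downsample: aggregate pairs of days into 2-day means.
coarse = toy.resample(time="2D").mean()

# Upsample: asfreq inserts the new 12-hourly time steps as NaN ...
fine_gaps = toy.resample(time="12h").asfreq()

# ... while ffill carries the previous value forward into them.
fine_filled = toy.resample(time="12h").ffill()

print(coarse.values)
print(fine_gaps.values)
print(fine_filled.values)
```

Downsampling needs an aggregation such as mean because several values collapse into one; upsampling needs a filling rule such as ffill or interpolate because new time steps have no data of their own.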

Rolling Window Operations

xarray objects include a rolling method to support rolling window aggregations:

roller = da.rolling(time=3)
roller
DataArrayRolling [window->3,center->False,dim->time]
roller.mean()
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
array([[[      nan,       nan, ...,       nan,       nan],
        [      nan,       nan, ...,       nan,       nan],
        ...,
        [      nan,       nan, ...,       nan,       nan],
        [      nan,       nan, ...,       nan,       nan]],

       [[      nan,       nan, ...,       nan,       nan],
        [      nan,       nan, ...,       nan,       nan],
        ...,
        [      nan,       nan, ...,       nan,       nan],
        [      nan,       nan, ...,       nan,       nan]],

       ...,

       [[243.92358, 243.39023, ..., 245.0904 , 245.65619],
        [249.12335, 249.0236 , ..., 241.9232 , 243.5903 ],
        ...,
        [296.68994, 297.82315, ..., 295.38968, 294.6566 ],
        [298.08994, 298.9566 , ..., 295.75665, 295.4895 ]],

       [[244.79025, 244.02356, ..., 243.32373, 243.82286],
        [249.62335, 249.19028, ..., 241.35654, 242.8903 ],
        ...,
        [296.38995, 297.32315, ..., 295.42303, 294.78992],
        [297.88995, 298.55658, ..., 295.8233 , 295.55618]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
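
The leading NaN slices in the output above come from incomplete windows: with a trailing window of 3, the first two time steps do not have three values to average. The following sketch, on a hypothetical six-element array, shows how the center and min_periods arguments of rolling change which positions are NaN:

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: the values 0..5 along a "time" dimension.
toy = xr.DataArray(np.arange(6, dtype="float64"), dims="time")

# Default: trailing window, so the first two positions are incomplete.
trailing = toy.rolling(time=3).mean()

# center=True places the label in the middle of the window,
# so one position at each end is incomplete instead.
centered = toy.rolling(time=3, center=True).mean()

# min_periods=1 accepts partial windows and averages what is available.
partial = toy.rolling(time=3, min_periods=1).mean()

print(trailing.values)  # [nan nan  1.  2.  3.  4.]
print(centered.values)  # [nan  1.  2.  3.  4. nan]
print(partial.values)   # [0.  0.5 1.  2.  3.  4. ]
```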
  • We can also provide a custom function:
def sum_minus_273(values, axis):
    # reduce calls this with the windowed array and the axis to collapse
    return values.sum(axis=axis) - 273

roller.reduce(sum_minus_273)
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
array([[[      nan,       nan, ...,       nan,       nan],
        [      nan,       nan, ...,       nan,       nan],
        ...,
        [      nan,       nan, ...,       nan,       nan],
        [      nan,       nan, ...,       nan,       nan]],

       [[      nan,       nan, ...,       nan,       nan],
        [      nan,       nan, ...,       nan,       nan],
        ...,
        [      nan,       nan, ...,       nan,       nan],
        [      nan,       nan, ...,       nan,       nan]],

       ...,

       [[458.76996, 457.16998, ..., 462.26996, 463.96997],
        [474.37   , 474.06995, ..., 452.76996, 457.76996],
        ...,
        [617.07007, 620.47   , ..., 613.1699 , 610.97   ],
        [621.27   , 623.87   , ..., 614.27   , 613.47003]],

       [[461.37   , 459.06995, ..., 456.96997, 458.46997],
        [475.87   , 474.56995, ..., 451.06995, 455.66998],
        ...,
        [616.17004, 618.97   , ..., 613.26996, 611.37   ],
        [620.67   , 622.6699 , ..., 614.47003, 613.67   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
%load_ext watermark
%watermark --iversion -g -m -v -u -d
xarray 0.12.1
last updated: 2019-05-18 

CPython 3.6.7
IPython 7.5.0

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 18.2.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   : 83530e805423a8f36958a61783cbbd9fe388eace