# Benchmark time series data sets for PyTorch

PyTorch data sets for supervised time series classification and prediction problems, including:

- All UEA/UCR classification repository data sets
- PhysioNet Challenge 2012 (in-hospital mortality)
- PhysioNet Challenge 2019 (sepsis prediction)
- A binary prediction variant of the 2019 PhysioNet Challenge

## Why use `torchtime`?

- **Saves time.** You don’t have to write your own PyTorch data classes.
- **Better research.** Use common, reproducible implementations of data sets for a level playing field when evaluating models.

## Installation

```
$ pip install torchtime
```

## Getting started

Data classes have a common API. The `split` argument determines whether training (*"train"*), validation (*"val"*) or test (*"test"*) data are returned. The sizes of the splits are controlled with the `train_prop` and (optional) `val_prop` arguments.
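For instance, the split proportions can be reasoned about as follows. This is a minimal pure-Python sketch of the semantics described above (`split_sizes` is a hypothetical helper for illustration, not part of `torchtime`); it assumes that when `val_prop` is omitted the remainder of the data forms the validation set, and that when it is given the remainder forms a test set:

```python
def split_sizes(n, train_prop, val_prop=None):
    """Illustrative split-size calculation (assumption: proportions are
    taken over the whole data set; any remainder forms the test set)."""
    n_train = round(n * train_prop)
    if val_prop is None:
        n_val = n - n_train  # no test set: remainder is validation data
        n_test = 0
    else:
        n_val = round(n * val_prop)
        n_test = n - n_train - n_val
    return n_train, n_val, n_test

print(split_sizes(1000, 0.7))        # 70/30% training/validation
print(split_sizes(1000, 0.7, 0.2))   # 70/20/10% training/validation/test
```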

### PhysioNet data sets

Three PhysioNet data sets are currently supported:

- `torchtime.data.PhysioNet2012` returns the 2012 challenge (in-hospital mortality) [link].
- `torchtime.data.PhysioNet2019` returns the 2019 challenge (sepsis prediction) [link].
- `torchtime.data.PhysioNet2019Binary` returns a binary prediction variant of the 2019 challenge.

For example, to load training data for the 2012 challenge with a 70/30% training/validation split and create a DataLoader for model training:

```
from torch.utils.data import DataLoader
from torchtime.data import PhysioNet2012

physionet2012 = PhysioNet2012(
    split="train",
    train_prop=0.7,
)
dataloader = DataLoader(physionet2012, batch_size=32)
```

### UEA/UCR repository data sets

The `torchtime.data.UEA` class returns the UEA/UCR repository data set specified by the `dataset` argument, for example:

```
from torch.utils.data import DataLoader
from torchtime.data import UEA

arrowhead = UEA(
    dataset="ArrowHead",
    split="train",
    train_prop=0.7,
)
dataloader = DataLoader(arrowhead, batch_size=32)
```

### Using the DataLoader

Batches are dictionaries of tensors `X`, `y` and `length`:

- `X` are the time series data. The package follows the *batch first* convention, therefore `X` has shape (*n*, *s*, *c*) where *n* is the batch size, *s* is the (longest) trajectory length and *c* is the number of channels. By default, the first channel is a time stamp.
- `y` are one-hot encoded labels of shape (*n*, *l*) where *l* is the number of classes.
- `length` are the lengths of each trajectory (before padding if sequences are of irregular length), i.e. a tensor of shape (*n*).

For example, ArrowHead is a univariate time series, therefore `X` has two channels: the time stamp followed by the time series (*c* = 2). Each series has 251 observations (*s* = 251) and there are three classes (*l* = 3). For a batch size of 32:

```
next_batch = next(iter(dataloader))
next_batch["X"].shape # torch.Size([32, 251, 2])
next_batch["y"].shape # torch.Size([32, 3])
next_batch["length"].shape # torch.Size([32])
```

See Using DataLoaders for more information.
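Because sequences of irregular length are padded, the `length` tensor can be used with PyTorch's packed sequences when feeding batches to a recurrent model. A minimal sketch using randomly generated stand-in tensors with the ArrowHead shapes above (hypothetical data, not drawn from an actual data set):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Stand-in batch with the shapes described above: n=32, s=251, c=2, l=3.
batch = {
    "X": torch.randn(32, 251, 2),
    "y": torch.nn.functional.one_hot(torch.randint(3, (32,)), 3).float(),
    "length": torch.randint(1, 252, (32,)),
}

# `length` lets the RNN skip padded time steps via a packed sequence.
packed = pack_padded_sequence(
    batch["X"], batch["length"], batch_first=True, enforce_sorted=False
)
rnn = torch.nn.GRU(input_size=2, hidden_size=16, batch_first=True)
_, h = rnn(packed)
print(h.shape)  # final hidden state: (num_layers, n, hidden_size)
```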

## Advanced options

- Missing data can be imputed by setting `impute` to *mean* (replace with training data channel means) or *forward* (replace with the previous observation). Alternatively, a custom imputation function can be passed to the `impute` argument.
- A time stamp (added by default), a missing data mask and the time since the previous observation can be appended with the boolean arguments `time`, `mask` and `delta` respectively.
- Time series data are standardised using the `standardise` boolean argument.
- The location of cached data can be changed with the `path` argument, for example to share a single cache location across projects.
- For reproducibility, an optional random `seed` can be specified.
- Missing data can be simulated using the `missing` argument to drop data at random from UEA/UCR data sets.
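To illustrate what the *forward* option does, here is a sketch of forward imputation in plain Python (`forward_fill` is a hypothetical helper for illustration only; torchtime's actual implementation may differ, e.g. in how it handles leading missing values):

```python
import math

def forward_fill(series, fallback=0.0):
    """Replace each missing value (NaN) with the previous observation.
    Leading missing values fall back to `fallback` (an assumption made
    for this sketch)."""
    filled, last = [], fallback
    for x in series:
        if math.isnan(x):
            filled.append(last)
        else:
            filled.append(x)
            last = x
    return filled

print(forward_fill([1.0, float("nan"), 3.0, float("nan")]))
# [1.0, 1.0, 3.0, 3.0]
```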

## Other resources

If you’re looking for the TensorFlow equivalent for PhysioNet data sets try medical_ts_datasets.

## Acknowledgements

`torchtime` uses some of the data processing ideas in Kidger et al, 2020 [1] and Che et al, 2018 [2].

This work is supported by the Engineering and Physical Sciences Research Council, Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University (grant number EP/L015358/1).

## Citing `torchtime`

If you use this software, please cite the paper:

```
@software{darke_torchtime_2022,
  author    = {Darke, Philip and Missier, Paolo and Bacardit, Jaume},
  title     = {Benchmark time series data sets for {PyTorch} - the torchtime package},
  month     = jul,
  year      = 2022,
  publisher = {arXiv},
  doi       = {10.48550/arXiv.2207.12503},
  url       = {https://doi.org/10.48550/arXiv.2207.12503},
}
```

DOIs are also available for each version of the package here.

## References

1. Kidger, P, Morrill, J, Foster, J, *et al*. Neural Controlled Differential Equations for Irregular Time Series. *arXiv* 2005.08926 (2020). [arXiv]
2. Che, Z, Purushotham, S, Cho, K, *et al*. Recurrent Neural Networks for Multivariate Time Series with Missing Values. *Sci Rep* 8, 6085 (2018). [doi]
3. Silva, I, Moody, G, Scott, DJ, *et al*. Predicting In-Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012. *Comput Cardiol* 2012;39:245-248 (2010). [hdl]
4. Reyna, M, Josef, C, Jeter, R, *et al*. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge. *Critical Care Medicine* 48(2): 210-217 (2019). [doi]
5. Reyna, M, Josef, C, Jeter, R, *et al*. Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019 (version 1.0.0). *PhysioNet* (2019). [doi]
6. Goldberger, A, Amaral, L, Glass, L, *et al*. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101(23), pp. e215-e220 (2000). [doi]
7. Löning, M, Bagnall, A, Ganesh, S, *et al*. sktime: A Unified Interface for Machine Learning with Time Series. *Workshop on Systems for ML at NeurIPS 2019* (2019). [doi]
8. Löning, M, Bagnall, A, Middlehurst, M, *et al*. alan-turing-institute/sktime: v0.10.1 (v0.10.1). *Zenodo* (2022). [doi]

## License

Released under the MIT license.