Dataset#

The dataset module provides the TimeSeriesDataset class for loading and processing time series data in a format suitable for use with PyTorch's DataLoader.

TimeSeriesDataset#

A PyTorch Dataset class that handles time series data extraction using chunk specifications. It supports single or multiple DataFrames and automatically extracts chunks based on the provided specifications.

Purpose:

TimeSeriesDataset provides a PyTorch-compatible interface for loading time series data with chunk-based extraction. It automatically handles the extraction of encoding, decoding, and label chunks from your DataFrames based on chunk specifications.

Key Features:

  • PyTorch Compatible: Inherits from torch.utils.data.Dataset and works seamlessly with DataLoader

  • Multiple DataFrames: Supports both single DataFrame and list of DataFrames (useful for multiple time series)

  • Automatic Chunk Extraction: Uses ChunkExtractor internally to extract chunks based on specifications

  • Time Index Tracking: Optionally returns time indices for each chunk

Initialization Parameters:

  • ``data_frames`` (pd.DataFrame | list[pd.DataFrame]): Input data. Can be a single DataFrame or a list of DataFrames. If multiple DataFrames are provided, they are treated as separate time series.

  • ``chunk_specs`` (list[BaseChunkSpec]): List of chunk specifications defining what data to extract. Typically obtained from model.make_chunk_specs().

  • ``return_time_index`` (bool): If True, includes time index arrays in the output dictionary. Default is True. Time indices are useful for tracking which time steps correspond to predictions.

Properties:

  • ``data_frames`` (list[pd.DataFrame]): List of input DataFrames (always a list, even if a single DataFrame was provided).

  • ``chunk_specs`` (list[BaseChunkSpec]): Chunk specifications used for extraction.

  • ``return_time_index`` (bool): Whether to include time indices in outputs.

  • ``chunk_extractors`` (list[ChunkExtractor]): Internal list of chunk extractors, one per DataFrame.

  • ``lengths`` (list[int]): Number of samples contributed by each DataFrame.

  • ``min_start_time_index`` (int): Minimum valid start time index for extraction.
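
A minimal sketch of inspecting these properties, assuming a dataset constructed as in the Example section below:

# Sketch: inspecting the documented properties of a constructed dataset
print(len(dataset.data_frames))       # always a list, even if a single DataFrame was passed
print(dataset.lengths)                # number of samples contributed by each DataFrame
print(dataset.min_start_time_index)   # first valid start time index given the chunk specs
print(len(dataset.chunk_extractors))  # one ChunkExtractor per DataFrame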

Methods:

  • ``__len__()``: Returns the total number of samples in the dataset, calculated as the sum of the lengths of all DataFrames.

  • ``__getitem__(i)``: Returns the sample at index i. The index is mapped to the appropriate DataFrame and time position. Returns a dictionary with chunk tags as keys and tensors as values; if return_time_index=True, time index arrays are included as well (see the sketch after this list).

  • ``_preprocess()``: Internal method called automatically by __init__(). Creates a ChunkExtractor for each DataFrame and calculates the dataset lengths.

  • ``plot_chunks()``: Visualizes the chunk specifications as a horizontal bar chart. Useful for understanding the temporal structure of your model’s input/output windows. Requires matplotlib.
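
For example, samples can be pulled directly from the dataset; the exact keys depend on the chunk specifications, so this is only an illustrative sketch:

# Sketch: direct indexing (exact keys depend on your model's chunk specs)
print(len(dataset))      # total number of samples across all DataFrames
sample = dataset[0]      # dictionary mapping chunk tags to tensors (plus time indices if enabled)
print(sorted(sample.keys()))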

Typical Usage:

  1. Create a model (e.g., MLP) and generate chunk specifications from it via make_chunk_specs()

  2. Create a TimeSeriesDataset with your data and the chunk specifications

  3. Use with PyTorch DataLoader for batching

Example:

import pandas as pd
import numpy as np
from torch.utils.data import DataLoader
from deep_time_series.dataset import TimeSeriesDataset
from deep_time_series.model import MLP

# Create sample data
data = pd.DataFrame({
    'temperature': np.sin(np.arange(100)),
    'humidity': np.cos(np.arange(100)),
})

# Create model
model = MLP(
    hidden_size=64,
    encoding_length=10,
    decoding_length=5,
    target_names=['temperature'],
    nontarget_names=['humidity'],
    n_hidden_layers=2,
)

# Get chunk specifications from model
chunk_specs = model.make_chunk_specs()

# Create dataset
dataset = TimeSeriesDataset(
    data_frames=data,
    chunk_specs=chunk_specs,
    return_time_index=True,  # Include time indices in output
)

# Use with DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate over batches
for batch in dataloader:
    # batch is a dictionary with keys like:
    # - 'encoding.targets': tensor of shape (batch_size, 10, 1)
    # - 'label.targets': tensor of shape (batch_size, 5, 1)
    # - 'encoding.targets.time_index': tensor of shape (batch_size, 10)
    # - 'label.targets.time_index': tensor of shape (batch_size, 5)
    pass
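
The key names and shapes in the comments above are illustrative; the actual keys depend on the chunk specifications your model produces. Assuming the default collate function stacks every entry into a tensor, you can inspect them at runtime:

# Inspect the actual keys and shapes produced for your chunk specs
batch = next(iter(dataloader))
for key, value in batch.items():
    print(key, tuple(value.shape))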

Multiple DataFrames:

When working with multiple time series (e.g., different sensors or locations), pass a list of DataFrames:

# Multiple time series
data1 = pd.DataFrame({'temperature': np.sin(np.arange(100))})
data2 = pd.DataFrame({'temperature': np.cos(np.arange(100))})

dataset = TimeSeriesDataset(
    data_frames=[data1, data2],
    chunk_specs=chunk_specs,
)

# Samples from all series are combined into one dataset; each index is mapped back to the DataFrame it came from
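
Because __len__() is the sum of the per-DataFrame lengths, the following sketch should hold for the two-series dataset above:

# Each DataFrame contributes its own samples; indices are mapped to the right series internally
print(dataset.lengths)                      # one entry per DataFrame
assert len(dataset) == sum(dataset.lengths)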

Dataset Length:

The length of the dataset is determined by:

  • The length of each DataFrame

  • The chunk specifications (specifically, the minimum time index)

The formula is approximately len(df) - chunk_length + 1 for each DataFrame, where chunk_length is the total number of time steps spanned by the chunk specifications.
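
As an illustration, for the 100-row DataFrame in the Example section, the chunk specifications produced by the MLP span roughly encoding_length + decoding_length = 15 steps (an assumption here; the exact span depends on the specs), which gives about 100 - 15 + 1 = 86 samples:

# Illustrative arithmetic only; the exact length depends on the model's chunk specifications
approx_len = 100 - (10 + 5) + 1   # 86
print(approx_len, len(dataset))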

Visualization:

You can visualize the chunk specifications using the plot_chunks() method:

dataset.plot_chunks()  # Shows a horizontal bar chart of chunk windows
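
Assuming plot_chunks() draws on the active matplotlib figure (an assumption, not confirmed by the signature below), the chart can be saved like any other matplotlib plot:

import matplotlib.pyplot as plt

dataset.plot_chunks()             # assumption: draws on the current matplotlib figure
plt.savefig('chunk_windows.png')  # 'chunk_windows.png' is an arbitrary file name
plt.show()
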
class TimeSeriesDataset(data_frames, chunk_specs, return_time_index=True)[source]#

Bases: Dataset

plot_chunks()[source]#