Dataset#

The dataset module provides the TimeSeriesDataset class for loading and processing time series data in a format suitable for use with PyTorch's DataLoader.

TimeSeriesDataset#

A PyTorch Dataset class that handles time series data extraction using chunk specifications. It supports single or multiple DataFrames and automatically extracts chunks based on the provided specifications.

Purpose:

TimeSeriesDataset provides a PyTorch-compatible interface for loading time series data with chunk-based extraction. It automatically handles the extraction of encoding, decoding, and label chunks from your DataFrames based on chunk specifications.

Key Features:

  • PyTorch Compatible: Inherits from torch.utils.data.Dataset and works seamlessly with DataLoader

  • Multiple DataFrames: Supports both single DataFrame and list of DataFrames (useful for multiple time series)

  • Automatic Chunk Extraction: Uses ChunkExtractor internally to extract chunks based on specifications

  • Time Index Tracking: Optionally returns time indices for each chunk

Initialization Parameters:

  • ``data_frames`` (pd.DataFrame | list[pd.DataFrame]): Input data. Can be a single DataFrame or a list of DataFrames. If multiple DataFrames are provided, they are treated as separate time series.

  • ``chunk_specs`` (list[BaseChunkSpec]): List of chunk specifications defining what data to extract. Typically obtained from model.make_chunk_specs().

  • ``return_time_index`` (bool): If True, includes time index arrays in the output dictionary. Default is True. Time indices are useful for tracking which time steps correspond to predictions.

Properties:

  • ``data_frames`` (list[pd.DataFrame]): List of input DataFrames (always a list, even if a single DataFrame was provided).

  • ``chunk_specs`` (list[BaseChunkSpec]): Chunk specifications used for extraction.

  • ``return_time_index`` (bool): Whether to include time indices in outputs.

  • ``chunk_extractors`` (list[ChunkExtractor]): Internal list of chunk extractors, one per DataFrame.

  • ``lengths`` (list[int]): Number of samples contributed by each DataFrame.

  • ``min_start_time_index`` (int): Minimum valid start time index for extraction.
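
A minimal sketch of inspecting these properties, assuming a dataset constructed as in the Example section below:

# Sketch: inspecting the documented properties of a constructed dataset
print(len(dataset.data_frames))       # always a list, even if a single DataFrame was passed
print(dataset.lengths)                # number of samples contributed by each DataFrame
print(dataset.min_start_time_index)   # first valid start time index given the chunk specs
print(len(dataset.chunk_extractors))  # one ChunkExtractor per DataFrame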

Methods:

  • ``__len__()``: Returns the total number of samples in the dataset, calculated as the sum of the lengths of all DataFrames.

  • ``__getitem__(i)``: Returns the sample at index i. The index is mapped to the appropriate DataFrame and time position. Returns a dictionary with chunk tags as keys and tensors as values; if return_time_index=True, time index arrays are included as well (see the sketch after this list).

  • ``_preprocess()``: Internal method called automatically by __init__(). Creates a ChunkExtractor for each DataFrame and calculates the dataset lengths.

  • ``plot_chunks()``: Visualizes the chunk specifications as a horizontal bar chart. Useful for understanding the temporal structure of your model’s input/output windows. Requires matplotlib.
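
For example, samples can be pulled directly from the dataset; the exact keys depend on the chunk specifications, so this is only an illustrative sketch:

# Sketch: direct indexing (exact keys depend on your model's chunk specs)
print(len(dataset))      # total number of samples across all DataFrames
sample = dataset[0]      # dictionary mapping chunk tags to tensors (plus time indices if enabled)
print(sorted(sample.keys()))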

Typical Usage:

  1. Create a model (e.g., MLP) and generate chunk specifications from it via make_chunk_specs()

  2. Create a TimeSeriesDataset with your data and the chunk specifications

  3. Use with PyTorch DataLoader for batching

Example:

import pandas as pd
import numpy as np
from torch.utils.data import DataLoader
from deep_time_series.dataset import TimeSeriesDataset
from deep_time_series.model import MLP

# Create sample data
data = pd.DataFrame({
    'temperature': np.sin(np.arange(100)),
    'humidity': np.cos(np.arange(100)),
})

# Create model
model = MLP(
    hidden_size=64,
    encoding_length=10,
    decoding_length=5,
    target_names=['temperature'],
    nontarget_names=['humidity'],
    n_hidden_layers=2,
)

# Get chunk specifications from model
chunk_specs = model.make_chunk_specs()

# Create dataset
dataset = TimeSeriesDataset(
    data_frames=data,
    chunk_specs=chunk_specs,
    return_time_index=True,  # Include time indices in output
)

# Use with DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterate over batches
for batch in dataloader:
    # batch is a dictionary with keys like:
    # - 'encoding.targets': tensor of shape (batch_size, 10, 1)
    # - 'label.targets': tensor of shape (batch_size, 5, 1)
    # - 'encoding.targets.time_index': tensor of shape (batch_size, 10)
    # - 'label.targets.time_index': tensor of shape (batch_size, 5)
    pass
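
The key names and shapes in the comments above are illustrative; the actual keys depend on the chunk specifications your model produces. Assuming the default collate function stacks every entry into a tensor, you can inspect them at runtime:

# Inspect the actual keys and shapes produced for your chunk specs
batch = next(iter(dataloader))
for key, value in batch.items():
    print(key, tuple(value.shape))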

Multiple DataFrames:

When working with multiple time series (e.g., different sensors or locations), pass a list of DataFrames:

# Multiple time series
data1 = pd.DataFrame({'temperature': np.sin(np.arange(100))})
data2 = pd.DataFrame({'temperature': np.cos(np.arange(100))})

dataset = TimeSeriesDataset(
    data_frames=[data1, data2],
    chunk_specs=chunk_specs,
)

# Samples from all series are combined into one dataset; each index is mapped back to the DataFrame it came from
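
Because __len__() is the sum of the per-DataFrame lengths, the following sketch should hold for the two-series dataset above:

# Each DataFrame contributes its own samples; indices are mapped to the right series internally
print(dataset.lengths)                      # one entry per DataFrame
assert len(dataset) == sum(dataset.lengths)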

Dataset Length:

The length of the dataset is determined by:

  • The length of each DataFrame

  • The chunk specifications (specifically, the minimum time index)

The formula is approximately len(df) - chunk_length + 1 for each DataFrame, where chunk_length is the total number of time steps spanned by the chunk specifications.
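
As an illustration, for the 100-row DataFrame in the Example section, the chunk specifications produced by the MLP span roughly encoding_length + decoding_length = 15 steps (an assumption here; the exact span depends on the specs), which gives about 100 - 15 + 1 = 86 samples:

# Illustrative arithmetic only; the exact length depends on the model's chunk specifications
approx_len = 100 - (10 + 5) + 1   # 86
print(approx_len, len(dataset))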

Visualization:

You can visualize the chunk specifications using the plot_chunks() method:

dataset.plot_chunks()  # Shows a horizontal bar chart of chunk windows
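
Assuming plot_chunks() draws on the active matplotlib figure (an assumption, not confirmed by the signature below), the chart can be saved like any other matplotlib plot:

import matplotlib.pyplot as plt

dataset.plot_chunks()             # assumption: draws on the current matplotlib figure
plt.savefig('chunk_windows.png')  # 'chunk_windows.png' is an arbitrary file name
plt.show()
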
class TimeSeriesDataset(data_frames, chunk_specs, return_time_index=True)[source]#

Bases: Dataset

plot_chunks()[source]#