Dataset
=======

The dataset module provides the :class:`TimeSeriesDataset` class for loading and processing time series data in a format suitable for PyTorch DataLoader.

TimeSeriesDataset
-----------------

A PyTorch Dataset class that handles time series data extraction using chunk specifications. It supports single or multiple DataFrames and automatically extracts chunks based on the provided specifications.

**Purpose:**

``TimeSeriesDataset`` provides a PyTorch-compatible interface for loading time series data with chunk-based extraction. It automatically handles the extraction of encoding, decoding, and label chunks from your DataFrames based on chunk specifications.

**Key Features:**

- **PyTorch Compatible**: Inherits from ``torch.utils.data.Dataset``, works seamlessly with ``DataLoader``
- **Multiple DataFrames**: Supports both single DataFrame and list of DataFrames (useful for multiple time series)
- **Automatic Chunk Extraction**: Uses ``ChunkExtractor`` internally to extract chunks based on specifications
- **Time Index Tracking**: Optionally returns time indices for each chunk

**Initialization Parameters:**

- ``data_frames`` (pd.DataFrame | list[pd.DataFrame]): Input data. Can be a single DataFrame or a list of DataFrames. If multiple DataFrames are provided, they are treated as separate time series.

- ``chunk_specs`` (list[BaseChunkSpec]): List of chunk specifications defining what data to extract. Typically obtained from ``model.make_chunk_specs()``.

- ``return_time_index`` (bool): If ``True``, includes time index arrays in the output dictionary. Default is ``True``. Time indices are useful for tracking which time steps correspond to predictions.

**Properties:**

- **``data_frames``** (list[pd.DataFrame]): List of input DataFrames (always a list, even if single DataFrame was provided).

- **``chunk_specs``** (list[BaseChunkSpec]): Chunk specifications used for extraction.

- **``return_time_index``** (bool): Whether to include time indices in outputs.

- **``chunk_extractors``** (list[ChunkExtractor]): Internal list of chunk extractors, one per DataFrame.

- **``lengths``** (list[int]): List of dataset lengths for each DataFrame.

- **``min_start_time_index``** (int): Minimum valid start time index for extraction.

**Methods:**

- **``__len__()``**: Returns the total number of samples in the dataset. Calculated as the sum of lengths for all DataFrames.

- **``__getitem__(i)``**: Returns a sample at index ``i``. The index is mapped to the appropriate DataFrame and time position. Returns a dictionary with chunk tags as keys and tensors as values. If ``return_time_index=True``, also includes time index arrays.

- **``_preprocess()``**: Internal method called during initialization. Creates ``ChunkExtractor`` instances for each DataFrame and calculates dataset lengths. Automatically called by ``__init__()``.

- **``plot_chunks()``**: Visualizes the chunk specifications as a horizontal bar chart. Useful for understanding the temporal structure of your model's input/output windows. Requires matplotlib.

**Typical Usage:**

1. Create a model (e.g., ``MLP``) which generates chunk specifications via ``make_chunk_specs()``
2. Create a ``TimeSeriesDataset`` with your data and the chunk specifications
3. Use with PyTorch ``DataLoader`` for batching

**Example:**

.. code-block:: python

   import pandas as pd
   import numpy as np
   from torch.utils.data import DataLoader
   from deep_time_series.dataset import TimeSeriesDataset
   from deep_time_series.model import MLP

   # Create sample data
   data = pd.DataFrame({
       'temperature': np.sin(np.arange(100)),
       'humidity': np.cos(np.arange(100)),
   })

   # Create model
   model = MLP(
       hidden_size=64,
       encoding_length=10,
       decoding_length=5,
       target_names=['temperature'],
       nontarget_names=['humidity'],
       n_hidden_layers=2,
   )

   # Get chunk specifications from model
   chunk_specs = model.make_chunk_specs()

   # Create dataset
   dataset = TimeSeriesDataset(
       data_frames=data,
       chunk_specs=chunk_specs,
       return_time_index=True,  # Include time indices in output
   )

   # Use with DataLoader
   dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

   # Iterate over batches
   for batch in dataloader:
       # batch is a dictionary with keys like:
       # - 'encoding.targets': tensor of shape (batch_size, 10, 1)
       # - 'label.targets': tensor of shape (batch_size, 5, 1)
       # - 'encoding.targets.time_index': tensor of shape (batch_size, 10)
       # - 'label.targets.time_index': tensor of shape (batch_size, 5)
       pass

**Multiple DataFrames:**

When working with multiple time series (e.g., different sensors or locations), pass a list of DataFrames:

.. code-block:: python

   # Multiple time series
   data1 = pd.DataFrame({'temperature': np.sin(np.arange(100))})
   data2 = pd.DataFrame({'temperature': np.cos(np.arange(100))})

   dataset = TimeSeriesDataset(
       data_frames=[data1, data2],
       chunk_specs=chunk_specs,
   )

   # The dataset will concatenate all series and track which series each sample comes from

**Dataset Length:**

The length of the dataset is determined by:
- The length of each DataFrame
- The chunk specifications (specifically, the minimum time index)

The formula is approximately: ``len(df) - chunk_length + 1`` for each DataFrame.

**Visualization:**

You can visualize the chunk specifications using the ``plot_chunks()`` method:

.. code-block:: python

   dataset.plot_chunks()  # Shows a horizontal bar chart of chunk windows

.. automodule:: deep_time_series.dataset
   :members:
   :undoc-members:
   :show-inheritance:

.. autoclass:: deep_time_series.dataset.TimeSeriesDataset
   :members:
   :undoc-members:
   :show-inheritance: