Model Tutorials
===============

This tutorial covers the different models available in DeepTimeSeries and how
to use them. All models in DeepTimeSeries inherit from ``ForecastingModule``
and follow the same interface, making it easy to switch between different
architectures.

MLP Model
---------

The MLP (Multi-Layer Perceptron) model is a simple feedforward neural network
that flattens the encoding window and processes it through fully connected
layers.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

    import numpy as np
    import pandas as pd
    import pytorch_lightning as pl
    import torch
    from torch.utils.data import DataLoader
    from sklearn.preprocessing import StandardScaler

    import deep_time_series as dts
    from deep_time_series.model import MLP

    # Prepare data
    data = pd.DataFrame({
        'target': np.sin(np.arange(100)),
        'feature': np.cos(np.arange(100)),
    })

    # Preprocess
    transformer = dts.ColumnTransformer(
        transformer_tuples=[(StandardScaler(), ['target', 'feature'])]
    )
    data = transformer.fit_transform(data)

    # Create MLP model
    model = MLP(
        hidden_size=64,
        encoding_length=10,
        decoding_length=5,
        target_names=['target'],
        nontarget_names=['feature'],
        n_hidden_layers=2,
        activation=torch.nn.ELU,
        dropout_rate=0.1,
    )

    # Create dataset and train
    dataset = dts.TimeSeriesDataset(
        data_frames=data,
        chunk_specs=model.make_chunk_specs(),
    )
    dataloader = DataLoader(dataset, batch_size=32)

    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_dataloaders=dataloader)

Parameters
~~~~~~~~~~

- ``hidden_size``: Size of hidden layers
- ``encoding_length``: Length of encoding window
- ``decoding_length``: Length of decoding window
- ``target_names``: List of target feature names
- ``nontarget_names``: List of non-target feature names
- ``n_hidden_layers``: Number of hidden layers
- ``activation``: Activation function class (default: ``nn.ELU``)
- ``dropout_rate``: Dropout rate (default: 0.0)

RNN Models
----------

The RNN model supports vanilla RNN, LSTM, and GRU architectures. It uses a
recurrent encoder and a recurrent decoder for sequential processing; a
conceptual sketch of this flow follows the parameter list at the end of this
section.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

    import torch.nn as nn

    from deep_time_series.model import RNN

    # Create LSTM model
    model = RNN(
        hidden_size=128,
        encoding_length=20,
        decoding_length=10,
        target_names=['target'],
        nontarget_names=['feature'],
        n_layers=2,
        rnn_class=nn.LSTM,  # or nn.RNN, nn.GRU
        dropout_rate=0.1,
    )

    # Use the same way as MLP
    dataset = dts.TimeSeriesDataset(
        data_frames=data,
        chunk_specs=model.make_chunk_specs(),
    )
    dataloader = DataLoader(dataset, batch_size=32)

    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_dataloaders=dataloader)

RNN Variants
~~~~~~~~~~~~

You can use different RNN variants:

.. code-block:: python

    # Vanilla RNN
    model_rnn = RNN(..., rnn_class=nn.RNN)

    # LSTM
    model_lstm = RNN(..., rnn_class=nn.LSTM)

    # GRU
    model_gru = RNN(..., rnn_class=nn.GRU)

Parameters
~~~~~~~~~~

- ``hidden_size``: Hidden state size
- ``encoding_length``: Length of encoding window
- ``decoding_length``: Length of decoding window
- ``target_names``: List of target feature names
- ``nontarget_names``: List of non-target feature names
- ``n_layers``: Number of RNN layers
- ``rnn_class``: RNN class (``nn.RNN``, ``nn.LSTM``, or ``nn.GRU``)
- ``dropout_rate``: Dropout rate between RNN layers
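
To make the encoder-decoder flow concrete, here is a minimal, self-contained
PyTorch sketch. It is purely illustrative: the module layout, the seeding of
the first decoder input, and the tensor shapes are assumptions made for this
example, not the internals of the library's ``RNN`` class.

.. code-block:: python

    import torch
    import torch.nn as nn

    hidden_size, n_features, n_targets = 128, 2, 1
    encoding_length, decoding_length = 20, 10

    encoder = nn.LSTM(n_features, hidden_size, num_layers=2, batch_first=True)
    decoder = nn.LSTM(n_targets, hidden_size, num_layers=2, batch_first=True)
    head = nn.Linear(hidden_size, n_targets)

    # Encoding window: past targets and non-target features.
    x_enc = torch.randn(32, encoding_length, n_features)
    _, state = encoder(x_enc)  # summarize the past into the (h, c) state

    # The decoder rolls forward one step at a time, feeding back its output.
    y = x_enc[:, -1:, :n_targets]          # last observed target as the seed
    outputs = []
    for _ in range(decoding_length):
        out, state = decoder(y, state)
        y = head(out)                      # shape (32, 1, n_targets)
        outputs.append(y)

    forecast = torch.cat(outputs, dim=1)   # (32, decoding_length, n_targets)

DeepTimeSeries models additionally feed the known future values of the
non-target features to the decoder (see *Feature Support* below), but the
encode-then-roll-forward pattern is the same idea.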

Dilated CNN Model
-----------------

The Dilated CNN model uses dilated convolutions to capture long-range
dependencies in time series data. It's particularly effective for sequences
with periodic patterns.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

    import torch

    from deep_time_series.model import DilatedCNN

    model = DilatedCNN(
        hidden_size=64,
        encoding_length=30,
        decoding_length=10,
        target_names=['target'],
        nontarget_names=['feature'],
        dilation_base=2,
        kernel_size=3,
        activation=torch.nn.ELU,
        dropout_rate=0.1,
    )

    dataset = dts.TimeSeriesDataset(
        data_frames=data,
        chunk_specs=model.make_chunk_specs(),
    )
    dataloader = DataLoader(dataset, batch_size=32)

    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_dataloaders=dataloader)

How Dilated CNN Works
~~~~~~~~~~~~~~~~~~~~~

The model automatically calculates the number of layers needed based on:

- ``encoding_length``: The input sequence length
- ``dilation_base``: Base for exponential dilation (e.g., 2 means dilations
  1, 2, 4, 8, ...)
- ``kernel_size``: Size of the convolutional kernel

The dilation increases exponentially with each layer, allowing the model to
capture dependencies at different time scales.

Parameters
~~~~~~~~~~

- ``hidden_size``: Number of convolutional filters
- ``encoding_length``: Length of encoding window
- ``decoding_length``: Length of decoding window
- ``target_names``: List of target feature names
- ``nontarget_names``: List of non-target feature names
- ``dilation_base``: Base for exponential dilation (typically 2)
- ``kernel_size``: Size of convolutional kernel (must be >= ``dilation_base``)
- ``activation``: Activation function class (default: ``nn.ELU``)
- ``dropout_rate``: Dropout rate (default: 0.0)

Transformer Model
-----------------

The SingleShotTransformer model uses a transformer architecture with an
encoder-decoder structure. It's effective for capturing complex temporal
dependencies and long-range patterns.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

    from deep_time_series.model import SingleShotTransformer

    model = SingleShotTransformer(
        encoding_length=30,
        decoding_length=10,
        target_names=['target'],
        nontarget_names=['feature'],
        d_model=128,
        n_heads=8,
        n_layers=4,
        dim_feedforward=512,
        dropout_rate=0.1,
    )

    dataset = dts.TimeSeriesDataset(
        data_frames=data,
        chunk_specs=model.make_chunk_specs(),
    )
    dataloader = DataLoader(dataset, batch_size=32)

    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_dataloaders=dataloader)

Transformer Architecture
~~~~~~~~~~~~~~~~~~~~~~~~

The transformer uses:

- **Encoder**: Processes the encoding window with self-attention
- **Decoder**: Generates predictions using cross-attention to encoder outputs
- **Positional Encoding**: Adds positional information to inputs
- **Causal Masking**: Prevents the decoder from seeing future information
  during training (a sketch of such a mask follows the parameter list below)

Parameters
~~~~~~~~~~

- ``encoding_length``: Length of encoding window
- ``decoding_length``: Length of decoding window
- ``target_names``: List of target feature names
- ``nontarget_names``: List of non-target feature names
- ``d_model``: Dimension of the model (embedding size)
- ``n_heads``: Number of attention heads
- ``n_layers``: Number of encoder/decoder layers
- ``dim_feedforward``: Dimension of the feedforward network (default: ``4 * d_model``)
- ``dropout_rate``: Dropout rate (default: 0.0)
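
The causal-masking bullet above can be made concrete with a short PyTorch
sketch of a square subsequent mask, the standard form used by
``nn.Transformer``-style attention. Whether ``SingleShotTransformer`` builds
its mask in exactly this way is an assumption; the sketch only illustrates the
idea.

.. code-block:: python

    import torch

    decoding_length = 10

    # Boolean causal mask: True marks positions that may NOT be attended to,
    # i.e. position i is blocked from seeing positions j > i.
    causal_mask = torch.triu(
        torch.ones(decoding_length, decoding_length, dtype=torch.bool),
        diagonal=1,
    )

    # Additive float form (0.0 = visible, -inf = masked), the convention
    # expected by torch.nn.MultiheadAttention / nn.Transformer. PyTorch also
    # provides nn.Transformer.generate_square_subsequent_mask for this form.
    float_mask = torch.zeros(decoding_length, decoding_length)
    float_mask.masked_fill_(causal_mask, float('-inf'))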

Model Comparison
----------------

Choosing the Right Model
~~~~~~~~~~~~~~~~~~~~~~~~

**MLP**

- Simple and fast
- Good for short sequences and simple patterns
- No explicit temporal modeling

**RNN (LSTM/GRU)**

- Good for sequential dependencies
- Can handle variable-length sequences
- May struggle with very long sequences

**Dilated CNN**

- Efficient for long sequences
- Good for periodic patterns
- Parallel processing (faster than RNN)

**Transformer**

- Best for complex patterns and long-range dependencies
- Most flexible but computationally expensive
- Requires more data to train effectively

Feature Support
~~~~~~~~~~~~~~~

All models support:

- **Target features**: Variables to predict
- **Non-target features**: Additional features known at prediction time
- **Deterministic forecasting**: Point predictions
- **Probabilistic forecasting**: Distribution predictions (with ``DistributionHead``)

Example: Using Non-Target Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All models can use non-target features:

.. code-block:: python

    # Model with non-target features
    model = MLP(
        hidden_size=64,
        encoding_length=10,
        decoding_length=5,
        target_names=['target'],
        nontarget_names=['feature1', 'feature2'],  # Multiple features
        n_hidden_layers=2,
    )

The non-target features are:

- Used during encoding (along with the target features)
- Available during decoding (future values must be known)

Customizing Models
------------------

All models support custom heads, loss functions, and optimizers:

.. code-block:: python

    import torch.nn as nn

    from deep_time_series.core import Head

    # Custom head with L1 loss
    custom_head = Head(
        tag='targets',
        output_module=nn.Linear(64, 1),
        loss_fn=nn.L1Loss(),
        loss_weight=1.0,
    )

    model = MLP(
        hidden_size=64,
        encoding_length=10,
        decoding_length=5,
        target_names=['target'],
        nontarget_names=[],
        n_hidden_layers=2,
        head=custom_head,  # Use custom head
    )

For more advanced customization, see :doc:`advanced`.
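
As a closing note, the examples above train on the full series; in practice
you will usually hold out a chronological validation split. The sketch below
shows one way to wire that up with the objects already introduced. It assumes
``data`` is still a ``DataFrame`` after preprocessing (as in the MLP example)
and that ``ForecastingModule`` provides a validation step; if it does not,
Lightning will warn and skip the validation loop.

.. code-block:: python

    import pytorch_lightning as pl
    from torch.utils.data import DataLoader

    import deep_time_series as dts

    # `data` and `model` are the objects created in the MLP example above.
    # Chronological split: first 80% for training, last 20% for validation.
    n_train = int(len(data) * 0.8)
    train_df = data.iloc[:n_train]
    val_df = data.iloc[n_train:]

    train_dataset = dts.TimeSeriesDataset(
        data_frames=train_df,
        chunk_specs=model.make_chunk_specs(),
    )
    val_dataset = dts.TimeSeriesDataset(
        data_frames=val_df,
        chunk_specs=model.make_chunk_specs(),
    )

    train_loader = DataLoader(train_dataset, batch_size=32)
    val_loader = DataLoader(val_dataset, batch_size=32)

    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)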