2. Starting out with vanilla PyTorch
Here is a high-level overview of the model training process:

- Objective: Classify handwritten digits (0-9)
- Model: Simple neural network using PyTorch
- Evaluation metric: Accuracy
- Dataset: MNIST
We’ll start with a basic PyTorch implementation to establish a baseline before moving on to more advanced techniques. This will give us a good foundation for understanding the benefits of hyperparameter tuning and distributed training in later sections.
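The training loop below relies on a build_data_loader helper that isn't shown in this section. As a point of reference, here is a minimal sketch of what such a helper might look like for MNIST; the normalization constants, data path, and shuffle setting are assumptions:

```python
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import Compose, Normalize, ToTensor


def build_data_loader(batch_size: int) -> DataLoader:
    # Convert images to tensors and normalize with MNIST's usual mean/std.
    transform = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
    train_data = MNIST(root="./data", train=True, download=True, transform=transform)
    return DataLoader(train_data, batch_size=batch_size, shuffle=True)
```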
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torchvision.models import resnet18

device = "cuda" if torch.cuda.is_available() else "cpu"


def train_loop_torch(num_epochs: int = 2, batch_size: int = 128, lr: float = 1e-5):
    criterion = CrossEntropyLoss()

    model = resnet18()
    # MNIST images are grayscale, so replace the default 3-channel input layer.
    model.conv1 = torch.nn.Conv2d(
        1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
    )
    model.to(device)

    data_loader = build_data_loader(batch_size)
    optimizer = Adam(model.parameters(), lr=lr)

    for epoch in range(num_epochs):
        for images, labels in data_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Report the metrics
        print(f"Epoch {epoch}, Loss: {loss}")
We fit the model by submitting the training function to a GPU node as a Ray Core remote task. We call ray.get here only so the call blocks while we walk through the results; the remote training task also runs fine asynchronously.
import ray


# Request GPU resources for the remote task; Ray schedules it on a node with enough GPUs.
@ray.remote(num_gpus=2)
def fit_model():
    train_loop_torch(num_epochs=2)


ray.get(fit_model.remote())
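Because fit_model.remote() returns an object reference immediately, nothing forces us to block right away: we can launch training, keep doing other work on the driver, and only call ray.get when we actually need the result. A minimal sketch of that pattern:

```python
# Launch training without blocking; .remote() immediately returns an ObjectRef.
result_ref = fit_model.remote()

# ... the driver is free to do other work here while the GPU task runs ...

# Block only when the result is actually needed.
ray.get(result_ref)
```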
Can we do any better? Let's see if we can tune the hyperparameters of our model to get a lower loss. Hyperparameter tuning is computationally expensive, however, and running trials sequentially would take a long time. Ray Tune is a distributed hyperparameter tuning library that can help us speed up the process!
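As a rough preview only (the actual setup is covered in the next section, and the search space and resource request below are illustrative assumptions), a Ray Tune run could wrap our training function in a Tuner and sample the learning rate:

```python
from ray import tune


def objective(config):
    # Reuse the PyTorch training loop with a sampled learning rate.
    train_loop_torch(num_epochs=2, lr=config["lr"])
    # A real setup would also report the loss back to Tune so trials can be
    # compared; how to do that is shown in the next section.


tuner = tune.Tuner(
    tune.with_resources(objective, {"gpu": 1}),
    param_space={"lr": tune.loguniform(1e-5, 1e-2)},
)
tuner.fit()
```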