2. Starting out with vanilla PyTorch
Here is a high-level overview of the model training process:

- Objective: Classify handwritten digits (0-9)
- Model: Simple neural network using PyTorch
- Evaluation metric: Accuracy
- Dataset: MNIST
We’ll start with a basic PyTorch implementation to establish a baseline before moving on to more advanced techniques. This will give us a good foundation for understanding the benefits of hyperparameter tuning and distributed training in later sections.
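The training loop below relies on a build_data_loader helper that isn't shown in this section. As a point of reference, here is a minimal sketch of what such a helper might look like for MNIST; the normalization constants, data path, and shuffle setting are assumptions:

```python
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import Compose, Normalize, ToTensor


def build_data_loader(batch_size: int) -> DataLoader:
    # Convert images to tensors and normalize with MNIST's usual mean/std.
    transform = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
    train_data = MNIST(root="./data", train=True, download=True, transform=transform)
    return DataLoader(train_data, batch_size=batch_size, shuffle=True)
```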
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import Adam
from torchvision.models import resnet18

device = "cuda" if torch.cuda.is_available() else "cpu"


def train_loop_torch(num_epochs: int = 2, batch_size: int = 128, lr: float = 1e-5):
    criterion = CrossEntropyLoss()

    model = resnet18()
    # MNIST images are grayscale, so replace the default 3-channel input layer.
    model.conv1 = torch.nn.Conv2d(
        1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
    )
    model.to(device)

    data_loader = build_data_loader(batch_size)
    optimizer = Adam(model.parameters(), lr=lr)

    for epoch in range(num_epochs):
        for images, labels in data_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Report the metrics
        print(f"Epoch {epoch}, Loss: {loss}")
We fit the model by submitting the training function to a GPU node as a Ray Core remote task. We call ray.get here only so the call blocks while we walk through the results; the remote training task also runs fine asynchronously.
import ray


# Request GPU resources for the remote task; Ray schedules it on a node with enough GPUs.
@ray.remote(num_gpus=2)
def fit_model():
    train_loop_torch(num_epochs=2)


ray.get(fit_model.remote())
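Because fit_model.remote() returns an object reference immediately, nothing forces us to block right away: we can launch training, keep doing other work on the driver, and only call ray.get when we actually need the result. A minimal sketch of that pattern:

```python
# Launch training without blocking; .remote() immediately returns an ObjectRef.
result_ref = fit_model.remote()

# ... the driver is free to do other work here while the GPU task runs ...

# Block only when the result is actually needed.
ray.get(result_ref)
```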
Can we do any better? Let's see if we can tune the hyperparameters of our model to get a lower loss. Hyperparameter tuning is computationally expensive, however, and running trials sequentially would take a long time. Ray Tune is a distributed hyperparameter tuning library that can help us speed up the process!
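As a rough preview only (the actual setup is covered in the next section, and the search space and resource request below are illustrative assumptions), a Ray Tune run could wrap our training function in a Tuner and sample the learning rate:

```python
from ray import tune


def objective(config):
    # Reuse the PyTorch training loop with a sampled learning rate.
    train_loop_torch(num_epochs=2, lr=config["lr"])
    # A real setup would also report the loss back to Tune so trials can be
    # compared; how to do that is shown in the next section.


tuner = tune.Tuner(
    tune.with_resources(objective, {"gpu": 1}),
    param_space={"lr": tune.loguniform(1e-5, 1e-2)},
)
tuner.fit()
```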