Custom Vision Models with PyTorch: From Idea to Production
When Do You Need a Custom Vision Model?
Pre-trained models like YOLO or ResNet solve many standard tasks. But for specialized use cases, such as quality control in manufacturing or classifying company-specific products, there is no way around a custom model.
At innFactory, we develop custom vision models for various industries. In this article, we share our proven practices.
The Computer Vision Development Process
1. Data Collection & Annotation
Data quality determines success. Our recommendations:
Minimum Data Amount:
- Classification: 100-500 images per class
- Object Detection: 500-2000 annotated bounding boxes per class
- Segmentation: 200-500 pixel-accurate masks per class
Annotation Tools:
- Label Studio - Open Source, flexible
- CVAT - powerful, especially for video annotation
- Roboflow - All-in-one with augmentation
Best Practices:
```
# Data structure for classification
data/
├── train/
│   ├── class_a/
│   │   ├── img_001.jpg
│   │   └── ...
│   └── class_b/
├── val/
└── test/
```
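With this layout, torchvision's `ImageFolder` derives class labels directly from the folder names; a minimal loading sketch (batch size and worker count are illustrative starting values):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder transform; see the augmentation section below
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder maps each subdirectory (class_a, class_b, ...) to a label index
train_dataset = datasets.ImageFolder('data/train', transform=transform)
val_dataset = datasets.ImageFolder('data/val', transform=transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)
```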
2. Transfer Learning Setup
Never start from scratch. Pre-trained backbones significantly accelerate training:
```python
import torch
import torchvision.models as models

# ResNet50 with ImageNet weights
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace the last layer to match our own classes
num_classes = 5
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# Freeze the backbone (optional, helps with small datasets)
for param in model.parameters():
    param.requires_grad = False

# Train only the new classifier head
for param in model.fc.parameters():
    param.requires_grad = True
```
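A common follow-up once the new head has converged, not shown above: unfreeze the backbone and fine-tune the whole network with a much smaller backbone learning rate. A sketch, with the learning rates as assumed starting points rather than fixed values:

```python
# Unfreeze everything for fine-tuning
for param in model.parameters():
    param.requires_grad = True

# The pre-trained backbone gets a much smaller learning rate than the new head
optimizer = torch.optim.AdamW([
    {"params": model.fc.parameters(), "lr": 1e-4},
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
     "lr": 1e-5},
], weight_decay=1e-4)
```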
3. Data Augmentation
Augmentation artificially multiplies your training data and is essential, especially with small datasets:
```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(
        brightness=0.2,
        contrast=0.2,
        saturation=0.2,
    ),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],  # ImageNet statistics
        std=[0.229, 0.224, 0.225],
    ),
])
```
For production pipelines, Albumentations is significantly faster:

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    # (height, width); older Albumentations versions use height=/width= instead of size=
    A.RandomResizedCrop(size=(224, 224)),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(p=0.5),
    A.Normalize(),  # ImageNet mean/std by default
    ToTensorV2(),
])
```
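Note that Albumentations operates on NumPy arrays and is called with keyword arguments, so a small `Dataset` wrapper is needed; a minimal sketch (the class and its fields are illustrative):

```python
import cv2
from torch.utils.data import Dataset

class AlbumentationsDataset(Dataset):
    """Wraps (image_path, label) pairs for an Albumentations pipeline."""

    def __init__(self, samples, transform):
        self.samples = samples  # list of (image_path, label) tuples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = cv2.imread(path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
        # Albumentations expects keyword arguments and returns a dict
        image = self.transform(image=image)["image"]
        return image, label
```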
4. Training Loop
A production-ready training loop:
```python
import torch
from torch.cuda.amp import GradScaler, autocast
from tqdm import tqdm

def train_epoch(model, loader, optimizer, criterion, scaler, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for images, labels in tqdm(loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()

        # Mixed precision for faster training
        with autocast():
            outputs = model(images)
            loss = criterion(outputs, labels)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item()
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)

    return total_loss / len(loader), correct / total
```
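The outer loop below also calls a `validate` helper that the original snippet does not define; a minimal sketch that mirrors `train_epoch` without gradient updates:

```python
@torch.no_grad()
def validate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)

        total_loss += loss.item()
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)

    return total_loss / len(loader), correct / total
```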
The outer loop ties everything together with learning rate scheduling and early stopping:

```python
# Training setup (starting values follow the table in section 5)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
scaler = GradScaler()

num_epochs = 100
patience = 10
patience_counter = 0
best_val_loss = float('inf')

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(
        model, train_loader, optimizer, criterion, scaler, device
    )
    val_loss, val_acc = validate(model, val_loader, criterion, device)

    # Learning rate scheduling
    scheduler.step(val_loss)

    # Early stopping: keep the best checkpoint, stop when validation stalls
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pth')
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break
```
5. Hyperparameter Tuning
Important hyperparameters and starting values:
| Parameter | Recommendation | Range |
|---|---|---|
| Learning Rate | 1e-4 (Transfer) | 1e-5 - 1e-3 |
| Batch Size | 32 | 16-128 |
| Weight Decay | 1e-4 | 1e-5 - 1e-3 |
| Epochs | 50-100 | - |
| Image Size | 224 | 224-512 |
For systematic tuning: Optuna or Weights & Biases Sweeps
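As an illustration, a minimal Optuna sketch over the ranges above; `build_model` and `run_training` are hypothetical placeholders for your own setup and training code:

```python
import optuna

def objective(trial):
    # Sample from the ranges recommended in the table above
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])

    # Hypothetical helpers: build model/loaders, train, return validation accuracy
    model, train_loader, val_loader = build_model(batch_size)
    val_acc = run_training(model, train_loader, val_loader, lr, weight_decay)
    return val_acc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```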
6. Model Export & Optimization
Optimize the model for production:
```python
# TorchScript for deployment
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save('model_scripted.pt')

# ONNX for cross-platform inference
dummy_input = torch.randn(1, 3, 224, 224).to(device)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    opset_version=13,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}}
)

# Dynamic quantization for edge devices
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
7. Deployment Options
| Option | Latency | Cost | Use Case |
|---|---|---|---|
| AWS SageMaker | ~100ms | €€€ | Enterprise |
| Google Vertex AI | ~100ms | €€€ | GCP Environment |
| TorchServe | ~50ms | €€ | Self-hosted |
| ONNX Runtime | ~20ms | € | Edge/On-Premise |
| TensorRT | ~10ms | € | NVIDIA GPUs |
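For the edge and on-premise rows, serving the exported `model.onnx` stays lightweight; a minimal ONNX Runtime sketch (the preprocessing must match the training normalization, and 'input' matches the name chosen in the export above):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('model.onnx', providers=['CPUExecutionProvider'])

# Dummy input; in practice a resized, normalized image as float32 NCHW
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

logits = session.run(None, {'input': x})[0]
predicted_class = int(np.argmax(logits, axis=1)[0])
```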
Practical Example: Quality Control
A typical use case at innFactory:
Task: Automatic detection of product defects
Solution:
- 2,000 annotated images covering 4 defect classes
- EfficientNet-B3 as backbone
- Augmentation with Albumentations
- Training on AWS p3.2xlarge (1x V100)
- Deployment as ONNX on an edge device
Result:
- 98.5% accuracy on test data
- <50ms inference time per image
- ROI: Savings of 2 FTE in manual inspection
Conclusion
Custom vision models require:
- High-quality training data - Invest in annotation
- Transfer Learning - Use pre-trained models
- Systematic experimentation - Hyperparameter tuning is crucial
- Production-ready architecture - Export, optimization, monitoring
At innFactory, we guide you through the entire process - from data analysis to the production-ready vision system.
Planning a Computer Vision project? Contact us for a no-obligation consultation.

Tobias Jonas


