This tutorial provides step-by-step instructions for using the native AMP (automatic mixed precision) support introduced in PyTorch 1.6. It is often worth trying things out on simple examples, especially when gradient updates are involved: mixed precision needs careful handling and proper test cases, because a single misstep can lead to model divergence or unexpected errors. This tutorial takes a simple 1x1 linear layer, converts its FP32 training loop to mixed precision, and prints the weights and gradients at every stage to verify correctness.

PyTorch official documentation

import torch
print('torch version', torch.__version__)

!nvidia-smi
!cat /usr/local/cuda/version.txt
torch version 1.6.0
Wed Aug  5 02:29:28 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P0    28W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000002:00:00.0 Off |                    0 |
| N/A   30C    P0    23W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
CUDA Version 10.1.243

Install PyTorch

Open a terminal and run:

conda env list
conda activate azureml_py36_pytorch
conda install pytorch=1.6 torchvision cudatoolkit=10.1 -c pytorch
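A quick sanity check after the install (optional) confirms that the expected build is picked up and a GPU is visible from the environment:

import torch
print(torch.__version__)          # expect 1.6.0
print(torch.version.cuda)         # expect 10.1
print(torch.cuda.is_available())  # expect True on a GPU VM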

Create a dummy model

torch.manual_seed(47)


class MyModel(torch.nn.Module):
    def __init__(self, input_size=1):
        super().__init__()
        self.linear = torch.nn.Linear(input_size, 1)
        
    def forward(self, x):
        return self.linear(x)

model = MyModel()
model
MyModel(
  (linear): Linear(in_features=1, out_features=1, bias=True)
)
# Print parameter values and gradients
def print_model_params(model):
    for name, param in model.named_parameters():
        print( 'Param Name = {}, value = {}, gradient = {}'
              .format(name , param.data, param.grad))
        
print_model_params(model)
Param Name = linear.weight, value = tensor([[-0.8939]]), gradient = None
Param Name = linear.bias, value = tensor([-0.9002]), gradient = None
# A single random input sample
x = torch.randn(1,1)
x
tensor([[-0.0591]])

Train model in FP32


optimizer = torch.optim.SGD(model.parameters(), lr = 1)

def train_step(model, x):
    print('\nRunning forward pass, input = ',x)
    output = model(x)
    print('output = ', output)
    
    print('\nRunning backward pass')
    output.backward()
    print('\nAfter backward pass')
    print_model_params(model)
    
    print('\nAfter optimizer step')
    optimizer.step()
    print('\nAfter updating model weights')
    
    print_model_params(model)
    optimizer.zero_grad()
    
    print('\nAfter setting gradients to zero')  
    print_model_params(model)
    
train_step(model, x)
Running forward pass, input =  tensor([[-0.0591]])
output =  tensor([[-0.8473]], grad_fn=<AddmmBackward>)

Running backward pass

After backward pass
Param Name = linear.weight, value = tensor([[-0.8939]]), gradient = tensor([[-0.0591]])
Param Name = linear.bias, value = tensor([-0.9002]), gradient = tensor([1.])

After optimizer step

After updating model weights
Param Name = linear.weight, value = tensor([[-0.8348]]), gradient = tensor([[-0.0591]])
Param Name = linear.bias, value = tensor([-1.9002]), gradient = tensor([1.])

After setting gradients to zero
Param Name = linear.weight, value = tensor([[-0.8348]]), gradient = tensor([[0.]])
Param Name = linear.bias, value = tensor([-1.9002]), gradient = tensor([0.])
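These numbers can be checked by hand. For output = w*x + b, the gradient of the output with respect to the weight is x and with respect to the bias is 1, and with lr = 1 plain SGD subtracts the gradient directly. A small check using the (rounded) values printed above:

# Hand-check of the SGD update with lr = 1: param_new = param_old - lr * grad
w_old, b_old = -0.8939, -0.9002   # values printed before the step
grad_w, grad_b = -0.0591, 1.0     # d(output)/dw = x, d(output)/db = 1
print(w_old - 1 * grad_w)         # ≈ -0.8348, matches linear.weight above
print(b_old - 1 * grad_b)         # ≈ -1.9002, matches linear.bias above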

Train model with AMP

from torch.cuda.amp import autocast, GradScaler

# autocast and GradScaler require a CUDA device, so move the model and input to the GPU
model = model.to('cuda:0')
x = x.to('cuda:0')

optimizer = torch.optim.SGD(model.parameters(), lr = 1)
scaler = GradScaler(init_scale=4096) 


def train_step_amp(model, x):
    with autocast():
        print('\nRunning forward pass, input = ',x)
        output = model(x)
        print('output = ', output)
    
    print('\nRunning backward pass')
    scaler.scale(output).backward()
    print('\nAfter backward pass')
    print_model_params(model)
    
#     scaler.unscale_(optimizer)  # optional: unscale gradients in place to inspect or clip them
#     print('\nAfter Unscaling')
#     print_model_params(model)
    
    scaler.step(optimizer)  # do not call optimizer.step() directly; scaler.step() unscales the gradients and skips the update if they contain inf or nan values
    print('\nAfter updating model weights')
    
    print_model_params(model)
    optimizer.zero_grad()
    
    print('\nAfter setting gradients to zero')  
    print_model_params(model)
    
    scaler.update()  # adjust the scale factor for the next iteration
    
train_step_amp(model, x)
Running forward pass, input =  tensor([[-0.0591]], device='cuda:0')
output =  tensor([[-1.8506]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward>)

Running backward pass

After backward pass
Param Name = linear.weight, value = tensor([[-0.8348]], device='cuda:0'), gradient = tensor([[-242.2500]], device='cuda:0')
Param Name = linear.bias, value = tensor([-1.9002], device='cuda:0'), gradient = tensor([4096.], device='cuda:0')

After updating model weights
Param Name = linear.weight, value = tensor([[-0.7756]], device='cuda:0'), gradient = tensor([[-0.0591]], device='cuda:0')
Param Name = linear.bias, value = tensor([-2.9002], device='cuda:0'), gradient = tensor([1.], device='cuda:0')

After setting gradients to zero
Param Name = linear.weight, value = tensor([[-0.7756]], device='cuda:0'), gradient = tensor([[0.]], device='cuda:0')
Param Name = linear.bias, value = tensor([-2.9002], device='cuda:0'), gradient = tensor([0.], device='cuda:0')
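The scale factor is visible in the raw gradients: with init_scale=4096, the bias gradient appears as 4096 * 1 = 4096 and the weight gradient as roughly 4096 * x ≈ -242.25 (the exact value reflects FP16 arithmetic and the fact that x is printed rounded to four decimals). scaler.step() unscales the gradients before handing them to the optimizer, which is why the values printed after the update are back to -0.0591 and 1. A rough check with the printed values:

# Rough check of the unscaling (values copied from the printouts above)
scale = 4096.0
print(-242.25 / scale)   # ≈ -0.0591, the unscaled weight gradient (equals the input x)
print(4096.0 / scale)    # 1.0, the unscaled bias gradient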

The gradients are scaled and unscaled correctly, and both the forward and backward passes run in mixed precision. Comparing the wall-clock time of the two runs would confirm the speedup from mixed precision training; that will be showcased in a future blog.

Native AMP support makes fast experimentation easy without pulling in the Apex dependency.
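For reference, the pattern above collapses to a few lines in a real training loop. Below is a minimal sketch of the usual recipe, assuming a generic model, optimizer, loss_fn, and data_loader (none of which are defined in this toy example):

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in data_loader:           # assumed: your own DataLoader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # mixed precision forward pass
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)      # assumed: your own loss function
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscales, skips the step on inf/nan
    scaler.update()                           # adjusts the scale for the next iteration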