set_seed(42)
plt.style.use("ggplot")Augmentation
Adapted from:
- https://youtu.be/nlVOG2Nzc3k?si=8a4dKXqkibFS8aHh&t=5063
- https://www.youtube.com/watch?v=ItyO8s48zdc&t=18s
Let’s redefine the training loop for clarity.
train
train (model, lr=0.01, n_epochs=2, dls=<slowai.learner.DataLoaders object at 0x7f52478d40d0>, extra_cbs=())
Going wider
Can we get a better result by increasing the width of our network?
We didn’t spend much time designing the Residual CNN from the previous notebook. We simply replaced the Conv blocks with Residual Conv blocks, doubling the number of parameters.
In principle, ResNets are more stable than their plain CNN counterparts, so we should be able to make the network wider as well as deeper.
ResNet
ResNet (nfs:Sequence[int]=[16, 32, 64, 128, 256, 512], n_outputs=10)
Arbitrarily wide and deep residual neural network
m = ResNet.kaiming()
train(m)

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.831 | 0.474 | 0 | train |
| 0.870 | 0.378 | 0 | eval |
| 0.911 | 0.241 | 1 | train |
| 0.914 | 0.233 | 1 | eval |

Let’s create a quick utility to view the shape of the model and check for areas of improvement.
summarize
summarize (m, mods, dls=None, xb_=None)
hooks
hooks (mods, f)
flops
flops (x, w, h)
Estimate flops
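summarize, hooks, and flops only appear above by their signatures. As a mental model, here is a hypothetical sketch of how forward hooks plus a rough multiply-add count could produce tables like the ones below; the names and details are assumptions, not the notebook’s actual code.

import torch

def flops(x, w, h):
    # Rough multiply-add count for a weight tensor `x` producing a w x h output map
    if x.dim() < 3:
        return x.numel()          # linear/batchnorm weight: one MAC per element
    if x.dim() == 4:
        return x.numel() * w * h  # conv kernel is applied at every output position
    return 0

def summarize_sketch(model, mods, xb):
    rows = []

    def hook(mod, inp, out):
        n_params = sum(p.numel() for p in mod.parameters())
        *_, h, w = out.shape
        mflops = sum(flops(p, w, h) for p in mod.parameters()) / 1e6
        rows.append((type(mod).__name__, tuple(inp[0].shape), tuple(out.shape), n_params, mflops))

    handles = [mod.register_forward_hook(hook) for mod in mods]
    with torch.no_grad():
        model(xb)
    for handle in handles:
        handle.remove()
    return rows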
summarize(ResNet(), "ResidualConvBlock")| Type | Input | Output | N. params | MFlops |
|---|---|---|---|---|
| ResidualConvBlock | (8, 1, 28, 28) | (8, 16, 28, 28) | 6,896 | 5.3 |
| ResidualConvBlock | (8, 16, 28, 28) | (8, 32, 14, 14) | 14,496 | 2.8 |
| ResidualConvBlock | (8, 32, 14, 14) | (8, 64, 7, 7) | 57,664 | 2.8 |
| ResidualConvBlock | (8, 64, 7, 7) | (8, 128, 4, 4) | 230,016 | 3.7 |
| ResidualConvBlock | (8, 128, 4, 4) | (8, 256, 2, 2) | 918,784 | 3.7 |
| ResidualConvBlock | (8, 256, 2, 2) | (8, 512, 1, 1) | 3,672,576 | 3.7 |
| ResidualConvBlock | (8, 512, 1, 1) | (8, 10, 1, 1) | 52,150 | 0.1 |
| Total | 4,952,582 |
One of the important constraints of our model here is that the strides must be configured to downsample the image to exactly bs x c x 1 x 1. We can make this more flexible by taking the final feature map (regardless of its height and width) and averaging it.
GlobalAveragePooling
GlobalAveragePooling (*args, **kwargs)
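Conceptually the module is tiny: a minimal sketch of global average pooling (an assumption about the implementation, not the notebook’s exact code) is just a spatial mean.

import torch.nn as nn

class GlobalAveragePooling(nn.Module):
    # Average each feature map over its spatial dims: (bs, c, h, w) -> (bs, c)
    def forward(self, x):
        return x.mean(dim=(2, 3))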
ResNetWithGlobalPooling
ResNetWithGlobalPooling (nfs:Sequence[int]=[16, 32, 64, 128, 256, 512], n_outputs=10)
Arbitrarily wide and deep residual neural network
nfs = [
    16,
    32,
    64,
    128,
    256,  # 👈 notice that this leaves the feature map at 2x2...
]
m = ResNetWithGlobalPooling.kaiming(nfs)
summarize(m, "ResidualConvBlock|Linear")| Type | Input | Output | N. params | MFlops |
|---|---|---|---|---|
| ResidualConvBlock | (8, 1, 28, 28) | (8, 16, 28, 28) | 6,896 | 5.3 |
| ResidualConvBlock | (8, 16, 28, 28) | (8, 32, 14, 14) | 14,496 | 2.8 |
| ResidualConvBlock | (8, 32, 14, 14) | (8, 64, 7, 7) | 57,664 | 2.8 |
| ResidualConvBlock | (8, 64, 7, 7) | (8, 128, 4, 4) | 230,016 | 3.7 |
| ResidualConvBlock | (8, 128, 4, 4) | (8, 256, 2, 2) | 918,784 | 3.7 |
| Linear | (8, 256) | (8, 10) | 2,570 | 0.0 |
| Total | 1,230,426 |
# ...but it still works!
train(m)

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.848 | 0.581 | 0 | train |
| 0.840 | 0.477 | 0 | eval |
| 0.913 | 0.287 | 1 | train |
| 0.916 | 0.271 | 1 | eval |

Can we reduce the number of parameters to save memory? Indeed. One place to focus is the first ResidualConvBlock, which costs the most MFlops because it applies its 16 kernels at every pixel of the full 28×28 input. We can try replacing it with a plain Conv.
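Here is a hypothetical sketch of that change; the Conv and ResidualConvBlock signatures are assumed, not taken from the notebook (though the 432 stem parameters in the table below are consistent with a 5×5 conv plus a batch norm).

def make_initial_conv_layers(nfs, ks=5):
    # Hypothetical helper: a cheap plain-conv stem, then the usual stride-2 residual blocks
    layers = [Conv(1, nfs[0], ks=ks, stride=1)]           # assumed Conv signature
    layers += [
        ResidualConvBlock(c_in, c_out, stride=2)          # assumed signature
        for c_in, c_out in zip(nfs, nfs[1:])
    ]
    return nn.Sequential(*layers)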
ResNetWithGlobalPoolingInitialConv
ResNetWithGlobalPoolingInitialConv (nfs:Sequence[int]=[16, 32, 64, 128, 256, 512], n_outputs=10)
Arbitrarily wide and deep residual neural network
m = ResNetWithGlobalPoolingInitialConv.kaiming()
summarize(m, [*m.layers, m.lin, m.norm])

| Type | Input | Output | N. params | MFlops |
|---|---|---|---|---|
| Conv | (8, 1, 28, 28) | (8, 16, 28, 28) | 432 | 0.3 |
| ResidualConvBlock | (8, 16, 28, 28) | (8, 32, 14, 14) | 14,496 | 2.8 |
| ResidualConvBlock | (8, 32, 14, 14) | (8, 64, 7, 7) | 57,664 | 2.8 |
| ResidualConvBlock | (8, 64, 7, 7) | (8, 128, 4, 4) | 230,016 | 3.7 |
| ResidualConvBlock | (8, 128, 4, 4) | (8, 256, 2, 2) | 918,784 | 3.7 |
| ResidualConvBlock | (8, 256, 2, 2) | (8, 512, 1, 1) | 3,672,576 | 3.7 |
| Linear | (8, 512) | (8, 10) | 5,130 | 0.0 |
| BatchNorm1d | (8, 10) | (8, 10) | 20 | 0.0 |
| Total | 4,899,118 |
train(m)

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.850 | 0.565 | 0 | train |
| 0.843 | 0.466 | 0 | eval |
| 0.913 | 0.281 | 1 | train |
| 0.916 | 0.265 | 1 | eval |

This approach yields a small, flexible, and competitive model. What happens if we train for a while?
train(ResNetWithGlobalPoolingInitialConv.kaiming(), n_epochs=20)

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.846 | 0.662 | 0 | train |
| 0.876 | 0.527 | 0 | eval |
| 0.898 | 0.456 | 1 | train |
| 0.888 | 0.411 | 1 | eval |
| 0.907 | 0.353 | 2 | train |
| 0.888 | 0.367 | 2 | eval |
| 0.912 | 0.288 | 3 | train |
| 0.838 | 0.506 | 3 | eval |
| 0.918 | 0.250 | 4 | train |
| 0.856 | 0.418 | 4 | eval |
| 0.925 | 0.221 | 5 | train |
| 0.878 | 0.360 | 5 | eval |
| 0.935 | 0.192 | 6 | train |
| 0.901 | 0.287 | 6 | eval |
| 0.943 | 0.167 | 7 | train |
| 0.904 | 0.289 | 7 | eval |
| 0.950 | 0.148 | 8 | train |
| 0.902 | 0.301 | 8 | eval |
| 0.955 | 0.131 | 9 | train |
| 0.910 | 0.283 | 9 | eval |
| 0.959 | 0.115 | 10 | train |
| 0.910 | 0.296 | 10 | eval |
| 0.966 | 0.099 | 11 | train |
| 0.910 | 0.289 | 11 | eval |
| 0.974 | 0.077 | 12 | train |
| 0.913 | 0.317 | 12 | eval |
| 0.978 | 0.063 | 13 | train |
| 0.914 | 0.299 | 13 | eval |
| 0.984 | 0.048 | 14 | train |
| 0.919 | 0.289 | 14 | eval |
| 0.991 | 0.031 | 15 | train |
| 0.922 | 0.296 | 15 | eval |
| 0.996 | 0.017 | 16 | train |
| 0.927 | 0.293 | 16 | eval |
| 0.999 | 0.008 | 17 | train |
| 0.929 | 0.293 | 17 | eval |
| 1.000 | 0.006 | 18 | train |
| 0.928 | 0.293 | 18 | eval |
| 1.000 | 0.005 | 19 | train |
| 0.928 | 0.293 | 19 | eval |

The near-perfect training accuracy indicates that the model is simply memorizing the dataset and failing to generalize.
We’ve discussed weight decay as a regularization technique. Could it help generalization here?
We’ve posited that weight decay, as a regularizer, prevents memorization. However, batch norm has its own per-channel scale coefficients on the layer output, and the normalization makes that output invariant to the overall scale of the preceding weights. Weight decay therefore only shrinks the weight scale without constraining the function, so the model can “cheat.” Jeremy’s advice is to skip weight decay here and rely on a learning-rate scheduler instead.
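To make the “cheat” concrete, here is a quick check (a sketch, not taken from the notebook): shrinking the weights of a conv that feeds a batch norm cuts the weight-decay penalty a hundredfold while leaving the batch-norm output essentially unchanged.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16, 14, 14)
conv = nn.Conv2d(16, 16, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(16)  # training mode: normalizes with batch statistics

y1 = bn(conv(x))
with torch.no_grad():
    conv.weight *= 0.1                        # 100x smaller L2 penalty...
y2 = bn(conv(x))
print(torch.allclose(y1, y2, atol=1e-2))      # ...but (almost) the same function: True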
Instead, let’s try “Augmentation” to create pseudo-new data that the model must learn to account for.
Augmentation
Recall that we implemented the with_transforms method on the DataLoaders class in the Learner notebook.
tfms = [
    transforms.RandomCrop(28, padding=1),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.26], [0.35]),
]
tfmsc = transforms.Compose(tfms)
dls = fashion_mnist(512).with_transforms(
    {"image": batchify(tfmsc)}, lazy=True, splits=["train"]
)
xb, _ = dls.peek()
show_images(xb[:8, ...])
pixels = xb.view(-1)
pixels.mean(), pixels.std()
(tensor(0.0645), tensor(1.0079))
m_with_augmentation = ResNetWithGlobalPoolingInitialConv.kaiming()
train(m_with_augmentation, dls=dls, n_epochs=20)

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.804 | 0.762 | 0 | train |
| 0.845 | 0.568 | 0 | eval |
| 0.875 | 0.515 | 1 | train |
| 0.869 | 0.457 | 1 | eval |
| 0.885 | 0.407 | 2 | train |
| 0.881 | 0.369 | 2 | eval |
| 0.893 | 0.339 | 3 | train |
| 0.882 | 0.352 | 3 | eval |
| 0.900 | 0.298 | 4 | train |
| 0.856 | 0.388 | 4 | eval |
| 0.907 | 0.268 | 5 | train |
| 0.906 | 0.272 | 5 | eval |
| 0.913 | 0.246 | 6 | train |
| 0.868 | 0.380 | 6 | eval |
| 0.918 | 0.229 | 7 | train |
| 0.913 | 0.242 | 7 | eval |
| 0.923 | 0.215 | 8 | train |
| 0.891 | 0.306 | 8 | eval |
| 0.928 | 0.202 | 9 | train |
| 0.922 | 0.218 | 9 | eval |
| 0.933 | 0.187 | 10 | train |
| 0.925 | 0.215 | 10 | eval |
| 0.938 | 0.174 | 11 | train |
| 0.927 | 0.205 | 11 | eval |
| 0.943 | 0.160 | 12 | train |
| 0.927 | 0.206 | 12 | eval |
| 0.947 | 0.149 | 13 | train |
| 0.927 | 0.210 | 13 | eval |
| 0.951 | 0.139 | 14 | train |
| 0.931 | 0.199 | 14 | eval |
| 0.956 | 0.124 | 15 | train |
| 0.934 | 0.192 | 15 | eval |
| 0.962 | 0.110 | 16 | train |
| 0.940 | 0.178 | 16 | eval |
| 0.967 | 0.096 | 17 | train |
| 0.940 | 0.177 | 17 | eval |
| 0.971 | 0.086 | 18 | train |
| 0.943 | 0.177 | 18 | eval |
| 0.973 | 0.080 | 19 | train |
| 0.942 | 0.177 | 19 | eval |

Test Time Augmentation
Giving the model multiple opportunities to see the input can further improve the output.
xbf = torch.flip(xb, dims=(3,))
show_images([xb[0, ...], xbf[0, ...]])
def accuracy(model_predict_f, model, device=def_device):
    dls = fashion_mnist(512)
    n, n_correct = 0, 0
    for xb, yb in dls["test"]:
        xb = xb.to(device)
        yb = yb.to(device)
        yp = model_predict_f(xb, model)
        n += len(yb)
        n_correct += (yp == yb).float().sum().item()
    return n_correct / n
def pred_normal(xb, m):
    return m(xb).argmax(axis=1)

Let’s check the normal accuracy.
accuracy(pred_normal, m_with_augmentation)
0.9415
Now, we can compare that to averaging the outputs when looking at both flips
def pred_with_test_time_augmentation(xb, m):
    yp = m(xb)
    xbf = torch.flip(xb, dims=(3,))
    ypf = m(xbf)
    return (yp + ypf).argmax(axis=1)
accuracy(pred_with_test_time_augmentation, m_with_augmentation)
0.9448
This is a slight improvement!
RandCopy
Another thing to try is creating new-ish images by cutting and pasting segments of the image into different locations. A benefit of this approach is that the image retains its pixel brightness distribution. (Compare to adding black patches, for example, which would push the distribution downwards.) A sketch of the idea follows the class summary below.
RandCopy
RandCopy (pct=0.2, max_num=4)
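The summary above only shows RandCopy’s signature, so here is a hypothetical sketch of the idea; the exact semantics of pct and max_num are assumptions, not the notebook’s code.

import random
import torch.nn as nn

class RandCopySketch(nn.Module):
    def __init__(self, pct=0.2, max_num=4):
        super().__init__()
        self.pct, self.max_num = pct, max_num

    def forward(self, x):
        *_, h, w = x.shape
        ph, pw = int(h * self.pct), int(w * self.pct)
        for _ in range(random.randint(1, self.max_num)):
            # pick a random source patch and paste it at a random destination
            sy, sx = random.randint(0, h - ph), random.randint(0, w - pw)
            dy, dx = random.randint(0, h - ph), random.randint(0, w - pw)
            x[..., dy:dy + ph, dx:dx + pw] = x[..., sy:sy + ph, sx:sx + pw].clone()
        return x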
tfmsc2 = transforms.Compose([*tfms, RandCopy()])
dls2 = fashion_mnist(512).with_transforms(
    {"image": batchify(tfmsc2)},
    lazy=True,
    splits=["train"],
)
xb, _ = dls2.peek()
show_images(xb[:8, ...])
m_with_more_augmentation = ResNetWithGlobalPoolingInitialConv.kaiming()
train(m_with_more_augmentation, dls=dls2, n_epochs=20)

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.782 | 0.817 | 0 | train |
| 0.838 | 0.569 | 0 | eval |
| 0.850 | 0.576 | 1 | train |
| 0.858 | 0.459 | 1 | eval |
| 0.865 | 0.456 | 2 | train |
| 0.852 | 0.426 | 2 | eval |
| 0.873 | 0.395 | 3 | train |
| 0.877 | 0.349 | 3 | eval |
| 0.880 | 0.348 | 4 | train |
| 0.856 | 0.410 | 4 | eval |
| 0.889 | 0.317 | 5 | train |
| 0.903 | 0.273 | 5 | eval |
| 0.896 | 0.292 | 6 | train |
| 0.894 | 0.293 | 6 | eval |
| 0.902 | 0.273 | 7 | train |
| 0.888 | 0.297 | 7 | eval |
| 0.907 | 0.256 | 8 | train |
| 0.890 | 0.301 | 8 | eval |
| 0.913 | 0.242 | 9 | train |
| 0.921 | 0.228 | 9 | eval |
| 0.917 | 0.232 | 10 | train |
| 0.922 | 0.221 | 10 | eval |
| 0.920 | 0.220 | 11 | train |
| 0.924 | 0.208 | 11 | eval |
| 0.927 | 0.205 | 12 | train |
| 0.933 | 0.193 | 12 | eval |
| 0.931 | 0.192 | 13 | train |
| 0.925 | 0.210 | 13 | eval |
| 0.935 | 0.180 | 14 | train |
| 0.930 | 0.197 | 14 | eval |
| 0.940 | 0.169 | 15 | train |
| 0.937 | 0.182 | 15 | eval |
| 0.943 | 0.158 | 16 | train |
| 0.937 | 0.178 | 16 | eval |
| 0.947 | 0.147 | 17 | train |
| 0.937 | 0.173 | 17 | eval |
| 0.949 | 0.141 | 18 | train |
| 0.939 | 0.174 | 18 | eval |
| 0.951 | 0.137 | 19 | train |
| 0.938 | 0.174 | 19 | eval |

accuracy(pred_normal, m_with_more_augmentation)
0.9383
accuracy(pred_with_test_time_augmentation, m_with_more_augmentation)
0.9414
We’re so close to Jeremy’s 94.6% accuracy!
Homework: Beat Jeremy
f = RandCopy()

def pred_with_test_time_augmentation_02(xb, m):
    ys = m(xb)
    ys += m(torch.flip(xb, dims=(3,)))
    for _ in range(6):
        ys += m(f(xb))
    return ys.argmax(axis=1)
accuracy(pred_with_test_time_augmentation_02, m_with_more_augmentation)
0.9402
Unfortunately, additional test time augmentation does not seem to improve the results.
Let’s try making it deeper.
mz = ResNetWithGlobalPoolingInitialConv.kaiming(nfs=[32, 64, 128, 256, 512, 512])
summarize(mz, [*mz.layers, mz.lin, mz.norm])
train(mz, dls=dls2, n_epochs=20)

| Type | Input | Output | N. params | MFlops |
|---|---|---|---|---|
| Conv | (8, 1, 28, 28) | (8, 32, 28, 28) | 864 | 0.6 |
| ResidualConvBlock | (8, 32, 28, 28) | (8, 64, 14, 14) | 57,664 | 11.2 |
| ResidualConvBlock | (8, 64, 14, 14) | (8, 128, 7, 7) | 230,016 | 11.2 |
| ResidualConvBlock | (8, 128, 7, 7) | (8, 256, 4, 4) | 918,784 | 14.7 |
| ResidualConvBlock | (8, 256, 4, 4) | (8, 512, 2, 2) | 3,672,576 | 14.7 |
| ResidualConvBlock | (8, 512, 2, 2) | (8, 512, 1, 1) | 4,983,296 | 5.0 |
| Linear | (8, 512) | (8, 10) | 5,130 | 0.0 |
| BatchNorm1d | (8, 10) | (8, 10) | 20 | 0.0 |
| Total | 9,868,350 |

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.801 | 0.767 | 0 | train |
| 0.836 | 0.539 | 0 | eval |
| 0.861 | 0.544 | 1 | train |
| 0.880 | 0.399 | 1 | eval |
| 0.871 | 0.441 | 2 | train |
| 0.867 | 0.400 | 2 | eval |
| 0.879 | 0.375 | 3 | train |
| 0.886 | 0.346 | 3 | eval |
| 0.889 | 0.330 | 4 | train |
| 0.847 | 0.399 | 4 | eval |
| 0.894 | 0.301 | 5 | train |
| 0.907 | 0.259 | 5 | eval |
| 0.901 | 0.277 | 6 | train |
| 0.900 | 0.276 | 6 | eval |
| 0.909 | 0.255 | 7 | train |
| 0.900 | 0.280 | 7 | eval |
| 0.913 | 0.241 | 8 | train |
| 0.920 | 0.240 | 8 | eval |
| 0.918 | 0.228 | 9 | train |
| 0.902 | 0.275 | 9 | eval |
| 0.922 | 0.217 | 10 | train |
| 0.920 | 0.226 | 10 | eval |
| 0.927 | 0.204 | 11 | train |
| 0.924 | 0.215 | 11 | eval |
| 0.932 | 0.190 | 12 | train |
| 0.931 | 0.191 | 12 | eval |
| 0.935 | 0.179 | 13 | train |
| 0.934 | 0.187 | 13 | eval |
| 0.941 | 0.164 | 14 | train |
| 0.932 | 0.195 | 14 | eval |
| 0.946 | 0.151 | 15 | train |
| 0.941 | 0.172 | 15 | eval |
| 0.950 | 0.138 | 16 | train |
| 0.944 | 0.163 | 16 | eval |
| 0.955 | 0.129 | 17 | train |
| 0.943 | 0.166 | 17 | eval |
| 0.957 | 0.121 | 18 | train |
| 0.943 | 0.163 | 18 | eval |
| 0.958 | 0.117 | 19 | train |
| 0.943 | 0.163 | 19 | eval |

accuracy(pred_normal, mz)
0.945
accuracy(pred_with_test_time_augmentation, mz)
0.947
Oh, that is just barely better than Jeremy.
I noticed a bug: the initialization is NOT incorporating the GeneralRelu leak parameter. Let’s see if fixing it helps.
init_leaky_weights??
Signature: init_leaky_weights(module, leak=0.0)
Docstring: <no docstring>
Source:
def init_leaky_weights(module, leak=0.0):
    if isinstance(module, (nn.Conv2d,)):
        init.kaiming_normal_(module.weight, a=leak)  # 👈 weirdly, called `a` here
File: ~/Desktop/SlowAI/nbs/slowai/initializations.py
Type: function
ResNetWithGlobalPoolingInitialConv().layers[0].act.a
0.1
Let’s fix that and see if we can improve the performance.
def init_leaky_weights_fixed(m):
    if isinstance(m, Conv):
        if m.act is None or not m.act.a:
            init.kaiming_normal_(m.weight)
        else:
            init.kaiming_normal_(m.weight, a=m.act.a)

class ResNetWithGlobalPoolingInitialConv2(ResNetWithGlobalPoolingInitialConv):
    @classmethod
    def kaiming(cls, *args, **kwargs):
        model = cls(*args, **kwargs)
        model.apply(init_leaky_weights_fixed)
        return model
mz2 = ResNetWithGlobalPoolingInitialConv2.kaiming(nfs=[32, 64, 128, 256, 512, 512])
train(mz2, dls=dls2, n_epochs=20)

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.795 | 0.783 | 0 | train |
| 0.859 | 0.528 | 0 | eval |
| 0.857 | 0.555 | 1 | train |
| 0.866 | 0.463 | 1 | eval |
| 0.867 | 0.450 | 2 | train |
| 0.850 | 0.445 | 2 | eval |
| 0.877 | 0.379 | 3 | train |
| 0.894 | 0.315 | 3 | eval |
| 0.885 | 0.334 | 4 | train |
| 0.895 | 0.298 | 4 | eval |
| 0.896 | 0.301 | 5 | train |
| 0.888 | 0.295 | 5 | eval |
| 0.902 | 0.278 | 6 | train |
| 0.901 | 0.273 | 6 | eval |
| 0.907 | 0.261 | 7 | train |
| 0.916 | 0.237 | 7 | eval |
| 0.913 | 0.243 | 8 | train |
| 0.919 | 0.227 | 8 | eval |
| 0.916 | 0.233 | 9 | train |
| 0.926 | 0.210 | 9 | eval |
| 0.921 | 0.218 | 10 | train |
| 0.925 | 0.206 | 10 | eval |
| 0.924 | 0.207 | 11 | train |
| 0.923 | 0.214 | 11 | eval |
| 0.929 | 0.197 | 12 | train |
| 0.927 | 0.198 | 12 | eval |
| 0.934 | 0.181 | 13 | train |
| 0.927 | 0.195 | 13 | eval |
| 0.939 | 0.169 | 14 | train |
| 0.936 | 0.183 | 14 | eval |
| 0.943 | 0.158 | 15 | train |
| 0.938 | 0.176 | 15 | eval |
| 0.948 | 0.145 | 16 | train |
| 0.943 | 0.164 | 16 | eval |
| 0.952 | 0.133 | 17 | train |
| 0.943 | 0.161 | 17 | eval |
| 0.955 | 0.125 | 18 | train |
| 0.944 | 0.160 | 18 | eval |
| 0.957 | 0.122 | 19 | train |
| 0.945 | 0.161 | 19 | eval |

accuracy(pred_normal, mz2)
0.9446
accuracy(pred_with_test_time_augmentation, mz2)
0.9473
Sadly, it’s slightly worse, for whatever reason.
Let’s try a Fixup initialization.
Fixup initialization
Fixup initialization (Zhang et al., 2019) rescales the residual branches at initialization so that deep residual networks can train stably without normalization layers.
class FixupResBlock(nn.Module):
    def __init__(self, c_in, c_out, ks=3, stride=2):
        super(FixupResBlock, self).__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, ks, 1, padding=ks // 2, bias=False)
        self.conv2 = nn.Conv2d(c_out, c_out, ks, stride, padding=ks // 2, bias=False)
        self.id_conv = nn.Conv2d(c_in, c_out, stride=1, kernel_size=1)
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x_orig):
        x = self.conv1(x_orig)
        x = F.relu(x)
        x = self.conv2(x) * self.scale
        if self.conv2.stride == (2, 2):
            x_orig = F.avg_pool2d(x_orig, kernel_size=2, ceil_mode=True)
        x = F.relu(x + self.id_conv(x_orig))
        return x
class FixupResNet(nn.Module):
    def __init__(self, nfs, num_classes=10):
        super(FixupResNet, self).__init__()
        self.conv = nn.Conv2d(1, nfs[0], 5, stride=2, padding=2, bias=False)
        layers = []
        for c_in, c_out in zip(nfs, nfs[1:]):
            layers.append(FixupResBlock(c_in, c_out))
        self.layers = nn.Sequential(*layers)
        self.fc = nn.Linear(nfs[-1], num_classes)

    def forward(self, x):
        x = self.conv(x)
        x = self.layers(x)
        bs, c, h, w = range(4)
        x = x.mean((h, w))  # Global Average Pooling
        x = self.fc(x)
        return x

    @torch.no_grad()
    def init_weights(self):
        init.kaiming_normal_(self.conv.weight)
        n_layers = len(self.layers)
        for layer in self.layers:
            (c_out, c_in, ksa, ksb) = layer.conv1.weight.shape
            nn.init.normal_(
                layer.conv1.weight,
                mean=0,
                std=sqrt(2 / (c_out * ksa * ksb)) * n_layers ** (-0.5),
            )
            nn.init.constant_(layer.conv2.weight, 0)
        nn.init.constant_(self.fc.weight, 0)
        nn.init.constant_(self.fc.bias, 0)

    @classmethod
    def random(cls, *args, **kwargs):
        m = cls(*args, **kwargs)
        m.init_weights()
        return m

m = FixupResNet.random([8, 16, 32, 64, 128, 256, 512])
stats = StoreModuleStatsCB(m.layers)
train(m, extra_cbs=[stats])

| MulticlassAccuracy | loss | epoch | train |
|---|---|---|---|
| 0.333 | 1.663 | 0 | train |
| 0.643 | 0.853 | 0 | eval |
| 0.766 | 0.584 | 1 | train |
| 0.806 | 0.507 | 1 | eval |

stats.mean_std_plot()
Okay, fixup doesn’t look too promising.
On the forums, some things that were successful:
- Dropout (and test time dropout augmentation)
- Curriculum learning
- Mish activation (see the sketch below)
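For reference, Mish is x · tanh(softplus(x)) and ships in recent PyTorch as nn.Mish. A minimal sketch, in case you want to try swapping it in for the GeneralRelu activations (not something this notebook does):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # mish(x) = x * tanh(softplus(x))
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))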
This is how you would implement dropout:
distributions.binomial.Binomial?
Init signature: distributions.binomial.Binomial(total_count=1, probs=None, logits=None, validate_args=None)
Docstring:
Creates a Binomial distribution parameterized by :attr:`total_count` and
either :attr:`probs` or :attr:`logits` (but not both). :attr:`total_count` must be
broadcastable with :attr:`probs`/:attr:`logits`.

Example::

    >>> # xdoctest: +IGNORE_WANT("non-deterministic")
    >>> m = Binomial(100, torch.tensor([0., .2, .8, 1.]))
    >>> x = m.sample()
    tensor([  0.,  22.,  71., 100.])
    >>> m = Binomial(torch.tensor([[5.], [10.]]), torch.tensor([0.5, 0.8]))
    >>> x = m.sample()
    tensor([[ 4.,  5.],
            [ 7.,  6.]])

Args:
    total_count (int or Tensor): number of Bernoulli trials
    probs (Tensor): Event probabilities
    logits (Tensor): Event log-odds
File: ~/micromamba/envs/slowai/lib/python3.11/site-packages/torch/distributions/binomial.py
Type: type
class Dropout(nn.Module):
    def __init__(self, p=0.9):
        super().__init__()
        self.p = p  # probability of zeroing an activation

    def forward(self, x):
        if not self.training:  # dropout is only applied during training
            return x
        # keep each activation with probability 1 - p, then rescale so the
        # expected value of the output matches the input
        dist = distributions.binomial.Binomial(1, probs=1 - self.p)
        return x * dist.sample(x.shape).to(x.device) / (1 - self.p)

The difference between Dropout and Dropout2d is that Dropout2d drops entire feature maps: the mask is sampled once per (sample, channel) and broadcast across the height and width dimensions.
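For comparison, here is a hypothetical sketch of that 2d behaviour: one Bernoulli draw per (sample, channel), broadcast over the spatial dimensions, so whole feature maps are dropped together.

class Dropout2dSketch(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training:
            return x
        bs, c, *_ = x.shape
        dist = distributions.binomial.Binomial(1, probs=1 - self.p)
        mask = dist.sample((bs, c, 1, 1)).to(x.device)  # broadcasts over h and w
        return x * mask / (1 - self.p)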