set_seed(42)
plt.style.use("ggplot")
Augmentation
Adapted from:
- https://youtu.be/nlVOG2Nzc3k?si=8a4dKXqkibFS8aHh&t=5063
- https://www.youtube.com/watch?v=ItyO8s48zdc&t=18s
Let’s redefine the training loop for clarity.
train
train (model, lr=0.01, n_epochs=2, dls=<slowai.learner.DataLoaders object at 0x7f52478d40d0>, extra_cbs=())
Going wider
Can we get a better result by increasing the width of our network?
We didn’t spend much time designing the Residual CNN from the previous notebook. We simply replaced the Conv blocks with Residual Conv blocks, doubling the number of parameters.
In principle, ResNets are more stable than their plain CNN counterparts, so we should be able to make the network wider as well as deeper.
ResNet
ResNet (nfs:Sequence[int]=[16, 32, 64, 128, 256, 512], n_outputs=10)
Arbitrarily wide and deep residual neural network
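The class definition isn't reproduced here, but a minimal sketch of how such a network can be assembled from a list of widths might look like the following. The `ResidualConvBlock` signature and the strides are assumptions inferred from the summary table below, not the slowai implementation.
import torch.nn as nn

# Hypothetical sketch: stack residual blocks according to `nfs`, halving the
# feature map at each widening step and ending with one value per class.
def make_resnet_sketch(nfs=(16, 32, 64, 128, 256, 512), n_outputs=10):
    blocks = [ResidualConvBlock(1, nfs[0], stride=1)]  # 28x28 -> 28x28
    blocks += [
        ResidualConvBlock(c_in, c_out, stride=2)  # halve H and W each time
        for c_in, c_out in zip(nfs, nfs[1:])
    ]
    blocks += [ResidualConvBlock(nfs[-1], n_outputs, stride=1)]  # 1x1 map per class
    return nn.Sequential(*blocks, nn.Flatten())  # (bs, n_outputs, 1, 1) -> (bs, n_outputs)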
m = ResNet.kaiming()
train(m)
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.831 | 0.474 | 0 | train |
0.870 | 0.378 | 0 | eval |
0.911 | 0.241 | 1 | train |
0.914 | 0.233 | 1 | eval |
Let’s create a quick utility to view the shapes, parameter counts, and flops of the model’s layers so we can check for areas of improvement.
summarize
summarize (m, mods, dls=None, xb_=None)
hooks
hooks (mods, f)
flops
flops (x, w, h)
Estimate flops
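The per-block MFlops in the tables below are roughly consistent with counting each weight element once per output position. A plausible sketch of such an estimate (an assumption, not necessarily the slowai implementation):
def flops(x, w, h):
    "Estimate flops for a weight tensor `x` producing a `w` x `h` output."
    if x.dim() < 3:
        return x.numel()  # e.g. a Linear weight is applied once per example
    return x.numel() * w * h  # conv weights are applied at every output position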
"ResidualConvBlock") summarize(ResNet(),
Type | Input | Output | N. params | MFlops |
---|---|---|---|---|
ResidualConvBlock | (8, 1, 28, 28) | (8, 16, 28, 28) | 6,896 | 5.3 |
ResidualConvBlock | (8, 16, 28, 28) | (8, 32, 14, 14) | 14,496 | 2.8 |
ResidualConvBlock | (8, 32, 14, 14) | (8, 64, 7, 7) | 57,664 | 2.8 |
ResidualConvBlock | (8, 64, 7, 7) | (8, 128, 4, 4) | 230,016 | 3.7 |
ResidualConvBlock | (8, 128, 4, 4) | (8, 256, 2, 2) | 918,784 | 3.7 |
ResidualConvBlock | (8, 256, 2, 2) | (8, 512, 1, 1) | 3,672,576 | 3.7 |
ResidualConvBlock | (8, 512, 1, 1) | (8, 10, 1, 1) | 52,150 | 0.1 |
Total | 4,952,582 |
One important constraint of this model is that the strides must be configured to downsample the image to exactly bs x c x 1 x 1. We can make the architecture more flexible by taking the final feature map (regardless of its height and width) and averaging it.
GlobalAveragePooling
GlobalAveragePooling (*args, **kwargs)
Averages the final feature map over its height and width, so the classifier no longer depends on the spatial size of the input.
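A minimal sketch of such a layer (the slowai version may differ in details):
import torch.nn as nn

class GlobalAveragePoolingSketch(nn.Module):
    def forward(self, x):
        # (N, C, H, W) -> (N, C): average each feature map over its spatial dims
        return x.mean((-2, -1))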
ResNetWithGlobalPooling
ResNetWithGlobalPooling (nfs:Sequence[int]=[16, 32, 64, 128, 256, 512], n_outputs=10)
Arbitrarily wide and deep residual neural network
nfs = [
    16,
    32,
    64,
    128,
    256,  # 👈 notice that this leaves the feature map at 2x2...
]
m = ResNetWithGlobalPooling.kaiming(nfs)
summarize(m, "ResidualConvBlock|Linear")
Type | Input | Output | N. params | MFlops |
---|---|---|---|---|
ResidualConvBlock | (8, 1, 28, 28) | (8, 16, 28, 28) | 6,896 | 5.3 |
ResidualConvBlock | (8, 16, 28, 28) | (8, 32, 14, 14) | 14,496 | 2.8 |
ResidualConvBlock | (8, 32, 14, 14) | (8, 64, 7, 7) | 57,664 | 2.8 |
ResidualConvBlock | (8, 64, 7, 7) | (8, 128, 4, 4) | 230,016 | 3.7 |
ResidualConvBlock | (8, 128, 4, 4) | (8, 256, 2, 2) | 918,784 | 3.7 |
Linear | (8, 256) | (8, 10) | 2,570 | 0.0 |
Total | 1,230,426 |
# ...but it still works!
train(m)
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.848 | 0.581 | 0 | train |
0.840 | 0.477 | 0 | eval |
0.913 | 0.287 | 1 | train |
0.916 | 0.271 | 1 | eval |
Can we reduce the number of parameters to save memory? Indeed. One place to focus is the first ResidualConvBlock, which costs the most MFlops because it applies its 16 kernels at every pixel of the full-resolution image. We can try replacing it with a plain Conv.
ResNetWithGlobalPoolingInitialConv
ResNetWithGlobalPoolingInitialConv (nfs:Sequence[int]=[16, 32, 64, 128, 256, 512], n_outputs=10)
Arbitrarily wide and deep residual neural network
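The change amounts to swapping only the stem. A hypothetical sketch of the constructor's first lines (the Conv signature is assumed, not the slowai code):
# Only the stem changes: one cheap Conv block at full resolution,
# followed by the same residual stack and global pooling as before.
layers = [Conv(1, nfs[0], stride=1)]
layers += [ResidualConvBlock(c_in, c_out, stride=2) for c_in, c_out in zip(nfs, nfs[1:])]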
m = ResNetWithGlobalPoolingInitialConv.kaiming()
summarize(m, [*m.layers, m.lin, m.norm])
Type | Input | Output | N. params | MFlops |
---|---|---|---|---|
Conv | (8, 1, 28, 28) | (8, 16, 28, 28) | 432 | 0.3 |
ResidualConvBlock | (8, 16, 28, 28) | (8, 32, 14, 14) | 14,496 | 2.8 |
ResidualConvBlock | (8, 32, 14, 14) | (8, 64, 7, 7) | 57,664 | 2.8 |
ResidualConvBlock | (8, 64, 7, 7) | (8, 128, 4, 4) | 230,016 | 3.7 |
ResidualConvBlock | (8, 128, 4, 4) | (8, 256, 2, 2) | 918,784 | 3.7 |
ResidualConvBlock | (8, 256, 2, 2) | (8, 512, 1, 1) | 3,672,576 | 3.7 |
Linear | (8, 512) | (8, 10) | 5,130 | 0.0 |
BatchNorm1d | (8, 10) | (8, 10) | 20 | 0.0 |
Total | 4,899,118 |
train(m)
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.850 | 0.565 | 0 | train |
0.843 | 0.466 | 0 | eval |
0.913 | 0.281 | 1 | train |
0.916 | 0.265 | 1 | eval |
This approach yields a small, flexible, and competitive model. What happens if we train for a while?
train(ResNetWithGlobalPoolingInitialConv.kaiming(), n_epochs=20)
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.846 | 0.662 | 0 | train |
0.876 | 0.527 | 0 | eval |
0.898 | 0.456 | 1 | train |
0.888 | 0.411 | 1 | eval |
0.907 | 0.353 | 2 | train |
0.888 | 0.367 | 2 | eval |
0.912 | 0.288 | 3 | train |
0.838 | 0.506 | 3 | eval |
0.918 | 0.250 | 4 | train |
0.856 | 0.418 | 4 | eval |
0.925 | 0.221 | 5 | train |
0.878 | 0.360 | 5 | eval |
0.935 | 0.192 | 6 | train |
0.901 | 0.287 | 6 | eval |
0.943 | 0.167 | 7 | train |
0.904 | 0.289 | 7 | eval |
0.950 | 0.148 | 8 | train |
0.902 | 0.301 | 8 | eval |
0.955 | 0.131 | 9 | train |
0.910 | 0.283 | 9 | eval |
0.959 | 0.115 | 10 | train |
0.910 | 0.296 | 10 | eval |
0.966 | 0.099 | 11 | train |
0.910 | 0.289 | 11 | eval |
0.974 | 0.077 | 12 | train |
0.913 | 0.317 | 12 | eval |
0.978 | 0.063 | 13 | train |
0.914 | 0.299 | 13 | eval |
0.984 | 0.048 | 14 | train |
0.919 | 0.289 | 14 | eval |
0.991 | 0.031 | 15 | train |
0.922 | 0.296 | 15 | eval |
0.996 | 0.017 | 16 | train |
0.927 | 0.293 | 16 | eval |
0.999 | 0.008 | 17 | train |
0.929 | 0.293 | 17 | eval |
1.000 | 0.006 | 18 | train |
0.928 | 0.293 | 18 | eval |
1.000 | 0.005 | 19 | train |
0.928 | 0.293 | 19 | eval |
The near perfect training accuracy indicates that the model is simply memorizing the dataset and failing to generalize.
We’ve discussed weight decay as a regularization technique. Could this help generalization?
We’ve posited that weight decay, as a form of regularization, prevents memorization. However, each batch norm layer has its own coefficients that rescale the layer’s output, so even when weight decay shrinks the convolution weights, batch norm can simply scale the activations back up; the model can “cheat,” and the penalty does little to constrain it. Jeremy’s advice is to skip weight decay here and rely on a learning-rate scheduler instead.
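A small sketch of why the penalty is toothless: with batch norm in training mode, rescaling the preceding conv's weights leaves the block's output unchanged. (The modules below are plain PyTorch, not slowai's blocks.)
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 16, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(16)  # training mode: normalizes with batch statistics
x = torch.randn(8, 1, 28, 28)

y1 = bn(conv(x))
with torch.no_grad():
    conv.weight *= 10  # rescale the weights (weight decay pushes the other way)
y2 = bn(conv(x))
print(torch.allclose(y1, y2, atol=1e-4))  # True: the normalized output is unchanged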
Instead, let’s try “Augmentation” to create pseudo-new data that the model must learn to account for.
Augmentation
Recall, we implemented the with_transforms
method on the Dataloaders
class in the Learner
notebook.
tfms = [
    transforms.RandomCrop(28, padding=1),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.26], [0.35]),
]
tfmsc = transforms.Compose(tfms)

dls = fashion_mnist(512).with_transforms(
    {"image": batchify(tfmsc)}, lazy=True, splits=["train"]
)
xb, _ = dls.peek()
show_images(xb[:8, ...])
pixels = xb.view(-1)
pixels.mean(), pixels.std()
(tensor(0.0645), tensor(1.0079))
m_with_augmentation = ResNetWithGlobalPoolingInitialConv.kaiming()
train(m_with_augmentation, dls=dls, n_epochs=20)
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.804 | 0.762 | 0 | train |
0.845 | 0.568 | 0 | eval |
0.875 | 0.515 | 1 | train |
0.869 | 0.457 | 1 | eval |
0.885 | 0.407 | 2 | train |
0.881 | 0.369 | 2 | eval |
0.893 | 0.339 | 3 | train |
0.882 | 0.352 | 3 | eval |
0.900 | 0.298 | 4 | train |
0.856 | 0.388 | 4 | eval |
0.907 | 0.268 | 5 | train |
0.906 | 0.272 | 5 | eval |
0.913 | 0.246 | 6 | train |
0.868 | 0.380 | 6 | eval |
0.918 | 0.229 | 7 | train |
0.913 | 0.242 | 7 | eval |
0.923 | 0.215 | 8 | train |
0.891 | 0.306 | 8 | eval |
0.928 | 0.202 | 9 | train |
0.922 | 0.218 | 9 | eval |
0.933 | 0.187 | 10 | train |
0.925 | 0.215 | 10 | eval |
0.938 | 0.174 | 11 | train |
0.927 | 0.205 | 11 | eval |
0.943 | 0.160 | 12 | train |
0.927 | 0.206 | 12 | eval |
0.947 | 0.149 | 13 | train |
0.927 | 0.210 | 13 | eval |
0.951 | 0.139 | 14 | train |
0.931 | 0.199 | 14 | eval |
0.956 | 0.124 | 15 | train |
0.934 | 0.192 | 15 | eval |
0.962 | 0.110 | 16 | train |
0.940 | 0.178 | 16 | eval |
0.967 | 0.096 | 17 | train |
0.940 | 0.177 | 17 | eval |
0.971 | 0.086 | 18 | train |
0.943 | 0.177 | 18 | eval |
0.973 | 0.080 | 19 | train |
0.942 | 0.177 | 19 | eval |
Test Time Augmentation
Giving the model multiple opportunities to see the input can further improve the output.
xbf = torch.flip(xb, dims=(3,))
show_images([xb[0, ...], xbf[0, ...]])
def accuracy(model_predict_f, model, device=def_device):
    dls = fashion_mnist(512)
    n, n_correct = 0, 0
    for xb, yb in dls["test"]:
        xb = xb.to(device)
        yb = yb.to(device)
        yp = model_predict_f(xb, model)
        n += len(yb)
        n_correct += (yp == yb).float().sum().item()
    return n_correct / n


def pred_normal(xb, m):
    return m(xb).argmax(axis=1)
Let’s check the normal accuracy
accuracy(pred_normal, m_with_augmentation)
0.9415
Now, we can compare that to averaging the outputs when looking at both flips
def pred_with_test_time_augmentation(xb, m):
    yp = m(xb)
    xbf = torch.flip(xb, dims=(3,))
    ypf = m(xbf)
    return (yp + ypf).argmax(axis=1)
accuracy(pred_with_test_time_augmentation, m_with_augmentation)
0.9448
This is a slight improvement!
RandCopy
Another thing to try is creating new-ish images by cutting and pasting segments of the image onto different locations. A benefit of this approach is that the image retains its pixel-brightness distribution. (Compare this to, for example, padding with black, which pushes the distribution downwards.)
RandCopy
RandCopy (pct=0.2, max_num=4)
Randomly copies patches of the image onto other locations within the same image.
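The implementation isn't shown above, but the idea can be sketched as follows (a hypothetical RandCopySketch, not the slowai code; pct is assumed to be the patch size as a fraction of each side and max_num the maximum number of patches):
import random

import torch.nn as nn


class RandCopySketch(nn.Module):
    def __init__(self, pct=0.2, max_num=4):
        super().__init__()
        self.pct, self.max_num = pct, max_num

    def forward(self, x):
        *_, h, w = x.shape
        ph, pw = int(h * self.pct), int(w * self.pct)
        for _ in range(random.randint(1, self.max_num)):
            # pick a source patch and a destination corner, both fully in bounds
            sy, sx = random.randint(0, h - ph), random.randint(0, w - pw)
            dy, dx = random.randint(0, h - ph), random.randint(0, w - pw)
            x[..., dy:dy + ph, dx:dx + pw] = x[..., sy:sy + ph, sx:sx + pw].clone()
        return x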
tfmsc2 = transforms.Compose([*tfms, RandCopy()])

dls2 = fashion_mnist(512).with_transforms(
    {"image": batchify(tfmsc2)},
    lazy=True,
    splits=["train"],
)
xb, _ = dls2.peek()
show_images(xb[:8, ...])
m_with_more_augmentation = ResNetWithGlobalPoolingInitialConv.kaiming()
train(m_with_more_augmentation, dls=dls2, n_epochs=20)
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.782 | 0.817 | 0 | train |
0.838 | 0.569 | 0 | eval |
0.850 | 0.576 | 1 | train |
0.858 | 0.459 | 1 | eval |
0.865 | 0.456 | 2 | train |
0.852 | 0.426 | 2 | eval |
0.873 | 0.395 | 3 | train |
0.877 | 0.349 | 3 | eval |
0.880 | 0.348 | 4 | train |
0.856 | 0.410 | 4 | eval |
0.889 | 0.317 | 5 | train |
0.903 | 0.273 | 5 | eval |
0.896 | 0.292 | 6 | train |
0.894 | 0.293 | 6 | eval |
0.902 | 0.273 | 7 | train |
0.888 | 0.297 | 7 | eval |
0.907 | 0.256 | 8 | train |
0.890 | 0.301 | 8 | eval |
0.913 | 0.242 | 9 | train |
0.921 | 0.228 | 9 | eval |
0.917 | 0.232 | 10 | train |
0.922 | 0.221 | 10 | eval |
0.920 | 0.220 | 11 | train |
0.924 | 0.208 | 11 | eval |
0.927 | 0.205 | 12 | train |
0.933 | 0.193 | 12 | eval |
0.931 | 0.192 | 13 | train |
0.925 | 0.210 | 13 | eval |
0.935 | 0.180 | 14 | train |
0.930 | 0.197 | 14 | eval |
0.940 | 0.169 | 15 | train |
0.937 | 0.182 | 15 | eval |
0.943 | 0.158 | 16 | train |
0.937 | 0.178 | 16 | eval |
0.947 | 0.147 | 17 | train |
0.937 | 0.173 | 17 | eval |
0.949 | 0.141 | 18 | train |
0.939 | 0.174 | 18 | eval |
0.951 | 0.137 | 19 | train |
0.938 | 0.174 | 19 | eval |
accuracy(pred_normal, m_with_more_augmentation)
0.9383
accuracy(pred_with_test_time_augmentation, m_with_more_augmentation)
0.9414
We’re so close to Jeremy’s 94.6% accuracy.
Homework: Beat Jeremy
f = RandCopy()


def pred_with_test_time_augmentation_02(xb, m):
    ys = m(xb)
    ys += m(torch.flip(xb, dims=(3,)))
    for _ in range(6):
        ys += m(f(xb))
    return ys.argmax(axis=1)
accuracy(pred_with_test_time_augmentation_02, m_with_more_augmentation)
0.9402
Unfortunately, additional test time augmentation does not seem to improve the results.
Let’s try making it deeper.
mz = ResNetWithGlobalPoolingInitialConv.kaiming(nfs=[32, 64, 128, 256, 512, 512])
summarize(mz, [*mz.layers, mz.lin, mz.norm])
train(mz, dls=dls2, n_epochs=20)
Type | Input | Output | N. params | MFlops |
---|---|---|---|---|
Conv | (8, 1, 28, 28) | (8, 32, 28, 28) | 864 | 0.6 |
ResidualConvBlock | (8, 32, 28, 28) | (8, 64, 14, 14) | 57,664 | 11.2 |
ResidualConvBlock | (8, 64, 14, 14) | (8, 128, 7, 7) | 230,016 | 11.2 |
ResidualConvBlock | (8, 128, 7, 7) | (8, 256, 4, 4) | 918,784 | 14.7 |
ResidualConvBlock | (8, 256, 4, 4) | (8, 512, 2, 2) | 3,672,576 | 14.7 |
ResidualConvBlock | (8, 512, 2, 2) | (8, 512, 1, 1) | 4,983,296 | 5.0 |
Linear | (8, 512) | (8, 10) | 5,130 | 0.0 |
BatchNorm1d | (8, 10) | (8, 10) | 20 | 0.0 |
Total | 9,868,350 |
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.801 | 0.767 | 0 | train |
0.836 | 0.539 | 0 | eval |
0.861 | 0.544 | 1 | train |
0.880 | 0.399 | 1 | eval |
0.871 | 0.441 | 2 | train |
0.867 | 0.400 | 2 | eval |
0.879 | 0.375 | 3 | train |
0.886 | 0.346 | 3 | eval |
0.889 | 0.330 | 4 | train |
0.847 | 0.399 | 4 | eval |
0.894 | 0.301 | 5 | train |
0.907 | 0.259 | 5 | eval |
0.901 | 0.277 | 6 | train |
0.900 | 0.276 | 6 | eval |
0.909 | 0.255 | 7 | train |
0.900 | 0.280 | 7 | eval |
0.913 | 0.241 | 8 | train |
0.920 | 0.240 | 8 | eval |
0.918 | 0.228 | 9 | train |
0.902 | 0.275 | 9 | eval |
0.922 | 0.217 | 10 | train |
0.920 | 0.226 | 10 | eval |
0.927 | 0.204 | 11 | train |
0.924 | 0.215 | 11 | eval |
0.932 | 0.190 | 12 | train |
0.931 | 0.191 | 12 | eval |
0.935 | 0.179 | 13 | train |
0.934 | 0.187 | 13 | eval |
0.941 | 0.164 | 14 | train |
0.932 | 0.195 | 14 | eval |
0.946 | 0.151 | 15 | train |
0.941 | 0.172 | 15 | eval |
0.950 | 0.138 | 16 | train |
0.944 | 0.163 | 16 | eval |
0.955 | 0.129 | 17 | train |
0.943 | 0.166 | 17 | eval |
0.957 | 0.121 | 18 | train |
0.943 | 0.163 | 18 | eval |
0.958 | 0.117 | 19 | train |
0.943 | 0.163 | 19 | eval |
accuracy(pred_normal, mz)
0.945
accuracy(pred_with_test_time_augmentation, mz)
0.947
Oh, that is just barely better than Jeremy.
I noticed a bug where the initialization does NOT incorporate the GeneralRelu leak parameter. Let’s see if fixing it helps.
init_leaky_weights??
Signature: init_leaky_weights(module, leak=0.0)
Docstring: <no docstring>
Source:
def init_leaky_weights(module, leak=0.0):
    if isinstance(module, (nn.Conv2d,)):
        init.kaiming_normal_(module.weight, a=leak)  # 👈 weirdly, called `a` here
File:      ~/Desktop/SlowAI/nbs/slowai/initializations.py
Type:      function
ResNetWithGlobalPoolingInitialConv().layers[0].act.a
0.1
Let’s fix that and see if we can improve the performance.
def init_leaky_weights_fixed(m):
    if isinstance(m, Conv):
        if m.act is None or not m.act.a:
            init.kaiming_normal_(m.weight)
        else:
            init.kaiming_normal_(m.weight, a=m.act.a)


class ResNetWithGlobalPoolingInitialConv2(ResNetWithGlobalPoolingInitialConv):
    @classmethod
    def kaiming(cls, *args, **kwargs):
        model = cls(*args, **kwargs)
        model.apply(init_leaky_weights_fixed)
        return model


mz2 = ResNetWithGlobalPoolingInitialConv2.kaiming(nfs=[32, 64, 128, 256, 512, 512])
train(mz2, dls=dls2, n_epochs=20)
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.795 | 0.783 | 0 | train |
0.859 | 0.528 | 0 | eval |
0.857 | 0.555 | 1 | train |
0.866 | 0.463 | 1 | eval |
0.867 | 0.450 | 2 | train |
0.850 | 0.445 | 2 | eval |
0.877 | 0.379 | 3 | train |
0.894 | 0.315 | 3 | eval |
0.885 | 0.334 | 4 | train |
0.895 | 0.298 | 4 | eval |
0.896 | 0.301 | 5 | train |
0.888 | 0.295 | 5 | eval |
0.902 | 0.278 | 6 | train |
0.901 | 0.273 | 6 | eval |
0.907 | 0.261 | 7 | train |
0.916 | 0.237 | 7 | eval |
0.913 | 0.243 | 8 | train |
0.919 | 0.227 | 8 | eval |
0.916 | 0.233 | 9 | train |
0.926 | 0.210 | 9 | eval |
0.921 | 0.218 | 10 | train |
0.925 | 0.206 | 10 | eval |
0.924 | 0.207 | 11 | train |
0.923 | 0.214 | 11 | eval |
0.929 | 0.197 | 12 | train |
0.927 | 0.198 | 12 | eval |
0.934 | 0.181 | 13 | train |
0.927 | 0.195 | 13 | eval |
0.939 | 0.169 | 14 | train |
0.936 | 0.183 | 14 | eval |
0.943 | 0.158 | 15 | train |
0.938 | 0.176 | 15 | eval |
0.948 | 0.145 | 16 | train |
0.943 | 0.164 | 16 | eval |
0.952 | 0.133 | 17 | train |
0.943 | 0.161 | 17 | eval |
0.955 | 0.125 | 18 | train |
0.944 | 0.160 | 18 | eval |
0.957 | 0.122 | 19 | train |
0.945 | 0.161 | 19 | eval |
accuracy(pred_normal, mz2)
0.9446
accuracy(pred_with_test_time_augmentation, mz2)
0.9473
Sadly, slightly worse for whatever reason.
Let’s try Fixup initialization, which is designed to let deep residual networks train without normalization layers: each block’s first conv is shrunk by a depth-dependent factor, and its last conv (along with the final classifier) is zero-initialized.
Fixup initialization
class FixupResBlock(nn.Module):
    def __init__(self, c_in, c_out, ks=3, stride=2):
        super(FixupResBlock, self).__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, ks, 1, padding=ks // 2, bias=False)
        self.conv2 = nn.Conv2d(c_out, c_out, ks, stride, padding=ks // 2, bias=False)
        self.id_conv = nn.Conv2d(c_in, c_out, stride=1, kernel_size=1)
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x_orig):
        x = self.conv1(x_orig)
        x = F.relu(x)
        x = self.conv2(x) * self.scale
        if self.conv2.stride == (2, 2):
            x_orig = F.avg_pool2d(x_orig, kernel_size=2, ceil_mode=True)
        x = F.relu(x + self.id_conv(x_orig))
        return x


class FixupResNet(nn.Module):
    def __init__(self, nfs, num_classes=10):
        super(FixupResNet, self).__init__()
        self.conv = nn.Conv2d(1, nfs[0], 5, stride=2, padding=2, bias=False)
        layers = []
        for c_in, c_out in zip(nfs, nfs[1:]):
            layers.append(FixupResBlock(c_in, c_out))
        self.layers = nn.Sequential(*layers)
        self.fc = nn.Linear(nfs[-1], num_classes)

    def forward(self, x):
        x = self.conv(x)
        x = self.layers(x)
        bs, c, h, w = range(4)
        x = x.mean((h, w))  # Global Average Pooling
        x = self.fc(x)
        return x

    @torch.no_grad()
    def init_weights(self):
        init.kaiming_normal_(self.conv.weight)
        n_layers = len(self.layers)
        for layer in self.layers:
            (c_out, c_in, ksa, ksb) = layer.conv1.weight.shape
            nn.init.normal_(
                layer.conv1.weight,
                mean=0,
                std=sqrt(2 / (c_out * ksa * ksb)) * n_layers ** (-0.5),
            )
            nn.init.constant_(layer.conv2.weight, 0)
        nn.init.constant_(self.fc.weight, 0)
        nn.init.constant_(self.fc.bias, 0)

    @classmethod
    def random(cls, *args, **kwargs):
        m = cls(*args, **kwargs)
        m.init_weights()
        return m


m = FixupResNet.random([8, 16, 32, 64, 128, 256, 512])
stats = StoreModuleStatsCB(m.layers)
train(m, extra_cbs=[stats])
MulticlassAccuracy | loss | epoch | train |
---|---|---|---|
0.333 | 1.663 | 0 | train |
0.643 | 0.853 | 0 | eval |
0.766 | 0.584 | 1 | train |
0.806 | 0.507 | 1 | eval |
stats.mean_std_plot()
Okay, fixup doesn’t look too promising.
On the forums, some things that were successful:
- Dropout (and test time dropout augmentation)
- Curriculum learning
- Mish activation
This is how you would implement dropout
distributions.binomial.Binomial?
Init signature:
distributions.binomial.Binomial(
    total_count=1,
    probs=None,
    logits=None,
    validate_args=None,
)
Docstring:
Creates a Binomial distribution parameterized by :attr:`total_count` and
either :attr:`probs` or :attr:`logits` (but not both). :attr:`total_count` must be
broadcastable with :attr:`probs`/:attr:`logits`.

Example::

    >>> # xdoctest: +IGNORE_WANT("non-deterinistic")
    >>> m = Binomial(100, torch.tensor([0 , .2, .8, 1]))
    >>> x = m.sample()
    tensor([   0.,   22.,   71.,  100.])

    >>> m = Binomial(torch.tensor([[5.], [10.]]), torch.tensor([0.5, 0.8]))
    >>> x = m.sample()
    tensor([[ 4.,  5.],
            [ 7.,  6.]])

Args:
    total_count (int or Tensor): number of Bernoulli trials
    probs (Tensor): Event probabilities
    logits (Tensor): Event log-odds
File:        ~/micromamba/envs/slowai/lib/python3.11/site-packages/torch/distributions/binomial.py
Type:        type
Subclasses:
class Dropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p  # probability of dropping an activation

    def forward(self, x):
        if not self.training:
            return x  # identity at eval time
        # Keep each activation with probability 1 - p, then rescale so the
        # expected activation is unchanged
        dist = distributions.binomial.Binomial(1, probs=1 - self.p)
        return x * dist.sample(x.shape) / (1 - self.p)
The difference between Dropout
and Dropout2d
is that Dropout2d
makes one keep-or-drop decision per channel, zeroing entire feature maps (across their full width and height) rather than individual activations.
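For contrast, a minimal sketch of the 2D variant (assuming NCHW inputs; this mirrors the Dropout above rather than PyTorch's internals):
import torch.distributions as distributions
import torch.nn as nn


class Dropout2dSketch(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p  # probability of dropping a whole channel

    def forward(self, x):
        if not self.training:
            return x
        n, c = x.shape[:2]
        dist = distributions.binomial.Binomial(1, probs=1 - self.p)
        mask = dist.sample((n, c, 1, 1))  # one 0/1 value per feature map
        return x * mask / (1 - self.p)  # broadcasts over H and W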