(Paper Reading) Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better
Published by Google Research.
1. Main challenges in deep learning areas
- Sustainable Server-Side Scaling. Deploying a model and running inference over a long period of time can still turn out to be expensive in terms of server-side RAM, CPU, etc.
- Enabling On-Device Deployment. Certain deep learning applications need to run real time on IoT and smart devices (where the model inference happens directly on the device), for a multitude of reasons (privacy, connectivity, responsiveness).
- Privacy & Data Sensitivity. Being able to use as little data as possible for training is critical when the user-data might be sensitive.
- New Applications. Certain new applications impose new constraints (around model quality or footprint) that existing off-the-shelf models might not be able to address.
- Explosion of Models. While a single model might work well, training and/or deploying multiple models on the same infrastructure (colocation) for different applications might end up exhausting the available resources. (Multi-task learning?)
Specifically, the common theme around the above challenges is efficiency, which can be categorized into two aspects:
- Inference Efficiency
- Training Efficiency
Our goal is to train and deploy Pareto-optimal models with respect to model quality and footprint, i.e., models for which no alternative is better on one axis without being worse on the other.
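To make "Pareto-optimal" concrete, here is a minimal sketch that filters a list of candidate models down to the Pareto front. The candidate names, accuracies, and sizes are made up for illustration, and using accuracy plus model size as the two axes is my own assumption; the survey speaks more generally of quality and footprint.

```python
from typing import List, Tuple

# Hypothetical candidates: (name, accuracy in %, model size in MB).
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates.

    A dominates B if A is at least as accurate and at least as small,
    and strictly better on at least one of the two.
    """
    front = []
    for name, acc, size in candidates:
        dominated = any(
            (o_acc >= acc and o_size <= size) and (o_acc > acc or o_size < size)
            for _, o_acc, o_size in candidates
        )
        if not dominated:
            front.append((name, acc, size))
    return front


models = [
    ("A", 76.0, 100.0),
    ("B", 75.0, 25.0),
    ("C", 70.0, 30.0),  # dominated by B: less accurate and larger
    ("D", 68.0, 5.0),
]
front_names = [name for name, _, _ in pareto_front(models)]
```

Every model left on the front is a reasonable choice for some budget; picking among them depends on how much footprint the deployment can afford.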
2. A mental model
- Compression Techniques. Pruning, quantization etc.
- Learning Techniques. Distillation etc.
- Automation. HPO, NAS etc.
- Efficient Architectures. Some fundamental blocks, such as conv, attention.
- Infrastructure. TF, PyTorch etc.
3. Landscape of Efficient Deep Learning
3.1 Compression Techniques
If the model is over-parameterized, these techniques can in some cases also improve model generalization.
- Pruning, prefer structured pruning
- Quantization,
- Weight quantization: map the minimum weight value $x_{min}$ in a weight matrix to 0 and the maximum value $x_{max}$ to $2^{b} - 1$ (where $b$ is the number of bits of precision, and $b < 32$). XNOR-Net, Binarized Neural Networks and others use $b = 1$, so their weight matrices have just two possible values (a single bit, 0 or 1, typically standing for $-1$ or $+1$), and the quantization function is simply $\operatorname{sign}(x)$ (assuming the weights are symmetrically distributed around 0). Binary quantization still needs support from the underlying hardware.
- Activation quantization: all intermediate layer inputs and outputs are also in fixed-point, so there is no need to dequantize the weight matrices; they can be used directly along with the inputs.
- Quantization-aware training (QAT): post-training quantization can cause quality loss at inference. QAT counters this with fake quantization: training happens in floating-point, but the forward pass simulates the quantization behavior seen during inference.
QAT is good, but tools like TF Lite have made it easy to rely on post-training quantization. For performance reasons, it is best to consider the common operations that follow a typical layer, such as Batch-Norm and Activation, and ‘fold’ them into the quantization operations.
- Other Compression Techniques. There are other compression techniques like Low-Rank Matrix Factorization, K-Means Clustering, Weight-Sharing etc. which are also actively being used for model compression and might be suitable for further compressing hotspots in a model.
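The weight quantization mapping described above (minimum value to 0, maximum value to $2^{b} - 1$) can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function names `quantize` and `dequantize` are hypothetical, and real toolchains such as TF Lite handle zero-points, per-channel scales, and integer kernels that are not shown here.

```python
from typing import List, Tuple


def quantize(weights: List[float], bits: int = 8) -> Tuple[List[int], float, float]:
    """Map floats to integers in [0, 2**bits - 1]: x_min -> 0, x_max -> 2**bits - 1."""
    x_min, x_max = min(weights), max(weights)
    levels = 2 ** bits - 1
    # Guard against a constant tensor, where x_max == x_min.
    scale = (x_max - x_min) / levels if x_max > x_min else 1.0
    q = [round((w - x_min) / scale) for w in weights]
    return q, scale, x_min


def dequantize(q: List[int], scale: float, x_min: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [x_min + scale * v for v in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, x_min = quantize(weights, bits=8)
restored = dequantize(q, scale, x_min)
# The round-trip error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Fake quantization in QAT amounts to applying this quantize-then-dequantize round trip inside the forward pass, so the model sees the rounding error during training while the weights themselves stay in floating-point.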
3.2 Learning Techniques
- Distillation: transfer what ensembled models have learned, in a weakly supervised setting, to a smaller model (2006). In knowledge distillation, in my opinion, the large model provides informative relations among classes. Strategies for intermediate-layer distillation have also been shown to be effective for complex networks.
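To illustrate the "informative relations among classes" point, here is a small sketch of one common formulation of the distillation loss: a temperature-softened softmax on the teacher logits, combined with the usual hard-label loss. The temperature, the weight `alpha`, the $T^2$ scaling, and all logits below are assumptions chosen for illustration, not values taken from the survey.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax with temperature; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target: List[float], predicted: List[float]) -> float:
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      label: int, temperature: float = 4.0, alpha: float = 0.5) -> float:
    """Weighted sum of the hard-label loss and the soft-label (teacher) loss."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # The T**2 factor keeps the soft-loss gradients on a comparable scale
    # (an assumed convention, not something stated in these notes).
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


teacher = [5.0, 2.0, 0.5]  # hypothetical logits for three classes
hard_targets = softmax(teacher, temperature=1.0)
soft_targets = softmax(teacher, temperature=4.0)
loss_good = distillation_loss([4.0, 1.5, 0.2], teacher, label=0)
loss_bad = distillation_loss([0.2, 1.5, 4.0], teacher, label=0)
```

With the higher temperature, the teacher assigns visibly more probability to the second class than to the third, which is exactly the inter-class relation a one-hot label cannot express.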
Data augmentation (training efficiency), various transformations:
- Label-invariant transformations, e.g., flipping, cropping, rotations.
- Label-mixing transformations, e.g., Mixup. The intuition is that the model should be able to extract features that are relevant for both classes.
- Data-dependent transformations. Transformations are chosen such that they maximize the loss for that example [56], or are adversarially chosen so as to fool the classifier.
- Synthetic sampling, e.g., SMOTE, GAN.
- Composition of transformations, combining the above methods.
Auto-Augment sounds practical…
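As a concrete instance of a label-mixing transformation, the sketch below implements Mixup on plain Python lists: two examples and their one-hot labels are blended with a coefficient drawn from a Beta distribution. The value `alpha=0.2` and the toy feature vectors are assumptions for illustration only.

```python
import random
from typing import List, Optional, Tuple


def mixup(x1: List[float], y1: List[float], x2: List[float], y2: List[float],
          alpha: float = 0.2,
          rng: Optional[random.Random] = None) -> Tuple[List[float], List[float], float]:
    """Blend two (features, one-hot label) pairs with lam ~ Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy examples from different classes (hypothetical values).
x_cat, y_cat = [0.9, 0.1, 0.4], [1.0, 0.0]
x_dog, y_dog = [0.2, 0.8, 0.6], [0.0, 1.0]

mixed_x, mixed_y, lam = mixup(x_cat, y_cat, x_dog, y_dog, rng=random.Random(0))
```

The mixed label is no longer one-hot, so the model is pushed to account for features of both classes in proportion to `lam`.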
- Self-Supervised Learning (SSL): fine-tuning models pre-trained with self-supervised learning is data-efficient (they converge faster and attain better quality for the same amount of labeled data when compared to training from scratch, etc.). Contrastive learning is effective. SSL provides a good pre-trained model for data-limited scenarios.
3.3 Automation
The trade-off is that these methods might require large computational resources, and hence have to be carefully applied.
- Hyper-Parameter Optimization: Grid Search, Random Search (may not be effective in high-dimensional spaces), Bayesian Optimization (it makes the search sequential; although it is possible to run multiple trials in parallel, overall this leads to some wasted trials), Population Based Training (similar to evolutionary approaches), multi-armed bandit (MAB) algorithms.
- NAS: search space, search algorithm & state, evaluation strategy. The objective has moved from accuracy as a single target to multiple goals, such as latency.
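Random Search is the simplest of the HPO methods listed above, so here is a minimal sketch of it. The `validation_loss` function is a made-up stand-in for "train a model and measure validation loss", and the search ranges are my own assumptions; in practice each trial would be a full training run, which is where the computational cost comes from.

```python
import math
import random
from typing import Dict, Tuple


def validation_loss(learning_rate: float, dropout: float) -> float:
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (math.log10(learning_rate) + 3.0) ** 2 + (dropout - 0.2) ** 2


def random_search(num_trials: int, seed: int = 0) -> Tuple[Dict[str, float], float]:
    """Sample configurations independently and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_loss = {}, float("inf")
    for _ in range(num_trials):
        config = {
            # Sample the learning rate log-uniformly in [1e-5, 1e-1].
            "learning_rate": 10 ** rng.uniform(-5.0, -1.0),
            "dropout": rng.uniform(0.0, 0.5),
        }
        loss = validation_loss(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(num_trials=50)
```

Because the trials are independent, they can all run in parallel, unlike Bayesian Optimization, where each new trial depends on the results of earlier ones.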
3.4 Efficient Architectures
- Depthwise-Separable Convolution, classical
- Attention Mechanism & Transformer Family, very hot!
- Random Projection Layers & Models. The core benefit of the projection operation compared to embedding tables is that it requires $O(T)$ space instead of $O(V \cdot d)$ ($T$ seeds are required for $T$ hash functions). On the other hand, the random-projection computation is also $O(T)$, vs. $O(1)$ for an embedding table lookup. Hence, the projection layer is clearly useful when model size is the primary focus of optimization: it reduces memory while increasing latency.
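The space/time trade-off of the projection operation can be seen in a toy sketch: instead of storing a $V \times d$ embedding table, we store only $T$ seeds and compute $T$ hashes per token. The use of SHA-256 as the hash function and the $-1/+1$ output encoding are my own simplifications for illustration; the projection models covered by the survey use their own hashing schemes.

```python
import hashlib
from typing import List


def project(token: str, seeds: List[int]) -> List[int]:
    """Map a token to a len(seeds)-dimensional vector of -1/+1 values.

    Each seed defines one hash function, so the only stored state is the
    list of T seeds (O(T) space), and the work per token is T hashes (O(T) time).
    """
    features = []
    for seed in seeds:
        digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
        # Use the lowest bit of the first byte as this hash function's output.
        features.append(1 if digest[0] & 1 else -1)
    return features


seeds = list(range(16))  # T = 16 hypothetical seeds
vec_a = project("efficient", seeds)
vec_b = project("efficient", seeds)
vec_c = project("learning", seeds)
```

The same token always maps to the same vector without any per-token storage, so the vocabulary can grow without growing the model.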
3.5 Infrastructure
- TF Lite
- PyTorch Mobile