Network Pruning: Removing less important weights or neurons from the network, resulting in a smaller network with similar performance.
Quantization: Converting the network's parameters and activations from floating-point numbers to lower-precision fixed-point numbers.
Factorization: Decomposing the weight tensors in the network into smaller tensors, thereby reducing the number of parameters (see the sketch after this list).
Knowledge Distillation: Transferring knowledge from a large pre-trained model to a smaller model, where the large model acts as a teacher and the smaller model as a student.
Model Architecture: Choosing a smaller network architecture with fewer parameters, such as MobileNet or ShuffleNet.
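As a concrete illustration of the factorization idea, here is a minimal sketch (assuming PyTorch; the layer sizes and the `rank` value are hypothetical choices for illustration) that replaces one fully-connected layer with two smaller ones via truncated SVD:

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate one Linear layer with two smaller ones via truncated SVD.

    A smaller rank means fewer parameters but a coarser approximation
    of the original weight matrix.
    """
    W = layer.weight.data                              # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]

    # First layer maps in_features -> rank, second maps rank -> out_features.
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = torch.diag(S) @ Vh             # (rank, in_features)
    second.weight.data = U                              # (out_features, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# Example: a 1024x1024 layer (~1.05M weights) becomes two layers with
# roughly 2 * 1024 * 64 = 131k weights at rank 64.
compressed = factorize_linear(nn.Linear(1024, 1024), rank=64)
```

The same idea applies to convolutional layers, where the kernel tensor is decomposed into smaller factors; the trade-off between rank and accuracy has to be found empirically.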
Downsizing a pre-trained CNN model without losing much performance can be achieved with several techniques, including:
Pruning: This involves removing neurons and connections in the model that have the least impact on performance. This can reduce the size of the model without sacrificing much accuracy (see the sketch after this list).
Quantization: This involves converting the weights and activations in the model from floating-point to integer representations. This reduces the size of the model and can also speed it up on hardware with limited memory.
Low-rank approximation: This involves approximating the weight matrices in the model with products of smaller, lower-rank matrices. This reduces the size of the model without sacrificing much accuracy.
Architecture search: This involves using algorithms to find the optimal architecture for a given dataset and computational budget. This can result in a smaller model with improved accuracy.
Transfer learning: This involves using a pre-trained model as a starting point and fine-tuning it on a smaller dataset. Starting from a compact pre-trained model in particular can give you a small model that is specifically tailored to the new dataset.
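To make the pruning item concrete, here is a minimal sketch using PyTorch's built-in `torch.nn.utils.prune` utilities; the ResNet-18 model and the 30% sparsity level are arbitrary choices for illustration, not recommendations:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision import models

# Load a pretrained model (ResNet-18 used purely as an example).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Zero out 30% of the smallest-magnitude weights in every Conv2d layer.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# Make the pruning permanent (removes the mask and rewrites the weights).
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.remove(module, "weight")
```

Note that unstructured pruning like this only zeroes weights: the tensors keep their original shapes, so the saved model does not shrink unless it is stored in a sparse format, and a short fine-tuning pass is usually needed to recover accuracy.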
It's important to note that downsizing a pre-trained model may come with trade-offs in terms of performance, and the best approach will depend on the specific requirements of the application. It may also require multiple iterations of fine-tuning and experimentation to achieve the desired balance between size and accuracy.
This question can be answered in two parts. Firstly, (A) DOWNSIZING A PRETRAINED MODEL is different from (B) CREATING A SMALLER CNN THAT MIMICS THE LARGER ONE. (A) can be done by pruning or quantization, while (B) can be done by knowledge distillation.
For (A):
Pruning: This removes less important weights or neurons from the CNN (or any other type of model) to reduce its size without sacrificing much accuracy. However, pruning does not necessarily translate to higher inference speed, since most hardware gains little from unstructured sparsity.
Quantization: This involves converting the weights in the model from floating-point (FP32) to FP16 or integer (INT8) representations. This reduces the size of the model and can also speed it up. For instance, an FP16 model may be faster than FP32 on a GPU, while an INT8 model is typically faster than both FP32 and FP16 on a CPU (a small sketch follows).
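Here is a minimal sketch of post-training quantization using PyTorch's dynamic quantization; the toy model and layer sizes are illustrative only, and other toolkits expose equivalent options:

```python
import copy
import torch
import torch.nn as nn

# Toy FP32 model; the classifier head of a pretrained CNN behaves the same way.
fp32_model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# FP16: store the parameters in half precision (mainly a benefit on GPUs).
fp16_model = copy.deepcopy(fp32_model).half()

# INT8 dynamic quantization: weights of the listed layer types are stored as
# 8-bit integers and activations are quantized on the fly at inference time.
# Convolutional layers need static quantization with calibration data instead.
int8_model = torch.ao.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(fp32_model(x).shape, int8_model(x).shape)  # both: torch.Size([1, 10])
```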
For (B):
Knowledge Distillation: This is the process of transferring knowledge from a large pretrained model to a smaller one: the large pretrained model acts as the teacher, while the smaller model acts as the student.
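A minimal sketch of the standard distillation loss (assuming PyTorch; the teacher and student are any two models with the same number of output classes, and the temperature T and weight alpha are hypothetical hyperparameters):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Combine the soft-target loss (teacher) with the usual hard-label loss."""
    # Soften both distributions with temperature T, then match them with KL divergence.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Inside the training loop (teacher frozen, student trainable):
#   with torch.no_grad():
#       teacher_logits = teacher(images)
#   loss = distillation_loss(student(images), teacher_logits, labels)
#   loss.backward()
```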
In practice, method (A) is easier to implement. There are various tools, such as Intel's OpenVINO and Google's TensorFlow Lite, that can perform pruning and quantization automatically with little to no effort: you only need to pass the pretrained model to one of these toolkits to downsize it. On the other hand, method (B) requires more effort; you can search online for the limitations of knowledge distillation.
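For example, with TensorFlow Lite, post-training quantization of an exported model is roughly this much code (the SavedModel directory and output filename are placeholders):

```python
import tensorflow as tf

# "saved_model_dir" is a placeholder path to an exported TensorFlow SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Enable the default post-training optimization (dynamic-range weight quantization).
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```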