# How I am learning AI
I am currently teaching myself AI, and here's a little sneak peek into how I learn. The intent of this post is to show my approach to learning any new topic. The tools and methods differ based on what I want to learn, but the structure remains the same.
*The rest of the post is written in a narrative structure, because this is exactly how my mind works when I'm learning something.*
**I have come across the terms "sparse architecture" / "sparse model" a few times in reference to AI models. What does that mean?**
> Can I figure out anything from the words themselves?
> - In AI, "architecture" refers to the structural design of a neural network, i.e. the structure that defines how the model processes and learns from data, e.g. RNNs or Transformers.
> - In English, I believe "sparse" means "not too many", e.g. "the details are sparse about ...".
Okay, let's do some research. I look up what sparse architecture means and come across this:
> "A sparse architecture is a neural network where many of the connections between layers are inactive (set to zero)."
**What does setting something to zero have to do with the word sparse?**
> In mathematics and computing, a **sparse matrix** is one where most elements are zero, while a **dense matrix** has mostly non-zero values.
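To make that concrete, here's a tiny NumPy sketch (the matrix values are made up purely for illustration):

```python
import numpy as np

# A toy 4x4 weight matrix where most elements are zero: a sparse matrix.
w = np.array([
    [0.0, 0.0, 1.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
])

# Sparsity is the fraction of elements that are zero.
print(f"sparsity: {np.mean(w == 0):.2%}")  # 81.25%
```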
Ah. Cool, now I understand where the word comes from. **What does it mean to set a connection to zero?**
> A neural network is essentially a series of weight matrices; together, the weights encode the patterns learned from the training data. To set a connection to zero means to set the corresponding element of a weight matrix to zero.
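As a rough sketch of the mechanics (the layer size and mask here are hypothetical, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense weight matrix for a toy layer: 4 inputs -> 3 outputs.
weights = rng.normal(size=(3, 4))

# A binary mask: 0 marks a pruned (inactive) connection, 1 keeps it.
mask = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
])

# "Setting connections to zero" is just an element-wise multiply.
pruned_weights = weights * mask
print(pruned_weights)
```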
**If we set a bunch of elements to zero, won't we just lose information encoded in the weights of the matrix?**
> Not necessarily. There is always some noise and redundancy, which can be pruned. Pruning often works more like compression than simple information deletion. The network can learn to encode similar information through fewer parameters, redistributing important features across the remaining weights.
> However, if pruning is too aggressive or done without proper methodology, you can indeed lose important information and capabilities. The key is finding the right balance and using techniques that can identify truly dispensable parameters while preserving the model's essential functional structure.
Nice, okay. That makes sense. **But how does one decide which elements to actually set to zero in the matrix?**
> There are a few approaches to this (a small sketch of the first approach follows this list).
> - Magnitude-based: The idea is that the connections with the smallest weights (i.e., those with little influence on the network's output) can be set to zero.
> - Gradient-based: By looking at the gradients, we can estimate how much the removal of a connection will affect the loss function. Weights with gradients close to zero can be set to zero without impacting model performance too much.
> - Activation-based: This method considers the neuron activations during training. If certain neurons rarely activate or contribute minimally to the output, their corresponding weights can be pruned.
> - Regularisation-based: Regularisation techniques modify the loss function to include an extra term that penalises certain weight configurations. The idea is to encourage sparsity, so that unnecessary weights shrink to zero and can be removed without significantly affecting performance.
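To make the first approach concrete, here is a minimal magnitude-based pruning sketch in NumPy. The 50% pruning fraction is an arbitrary choice for illustration, not a recommendation:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, fraction: float) -> np.ndarray:
    """Zero out the given fraction of weights with the smallest magnitudes."""
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return weights.copy()
    # The k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(42)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, fraction=0.5)  # prune the smallest 50% of weights
print(f"sparsity after pruning: {np.mean(pruned == 0):.0%}")
```

For the regularisation-based approach, a common concrete choice (as far as I understand) is an L1 penalty: adding the sum of absolute weight values to the loss, which pushes unimportant weights toward zero during training.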
Cool! I understand that. This is very similar to lossy media compression: you try to ensure that the most important information is still present and eliminate unnecessary detail. E.g. in audio, sounds that are inaudible to most human ears are removed, which makes the files much smaller to store. **Is that also true for models? Is that why we do it? Or is there another reason?**
> - Reduced Complexity: By reducing the number of active connections, the model has fewer parameters to compute, making the training and inference processes faster and less resource-intensive. This can be crucial in real-time or resource-constrained environments.
> - Memory Efficiency: Sparse models require less memory to store and process, making them scalable for deployment in devices with limited storage or processing capabilities.
Hmm. True. The real cost driver for models is compute, not memory. And if some of the elements of the weight matrix are set to zero, that means fewer multiplications to perform. So the greater the sparsity of the model, the cheaper each inference should be to compute.
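A quick way to see the memory side of this is SciPy's sparse matrix formats, which store only the non-zero entries. This is just an illustrative sketch; real inference speedups depend on the hardware and kernels actually exploiting the sparsity:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# A 1000x1000 weight matrix with roughly 90% of its entries zeroed out.
dense = rng.normal(size=(1000, 1000))
dense[rng.random(dense.shape) < 0.9] = 0.0

# CSR format stores only the non-zero values (plus two index arrays).
csr = sparse.csr_matrix(dense)
print(f"dense bytes:  {dense.nbytes}")
print(f"sparse bytes: {csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes}")

# A matrix-vector product skips the zero entries entirely, giving the
# same result as dense @ x with ~10x fewer multiplications here.
x = rng.normal(size=1000)
y = csr @ x
```

(One caveat I came across: unstructured sparsity like this mostly saves memory; getting real compute speedups on GPUs usually requires structured sparsity patterns that the hardware supports.)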
So, summarising it all:
- Sparse architecture in AI refers to neural networks where many connections between layers are inactive (set to zero).
- Pruning techniques (magnitude-based, gradient-based, activation-based, and regularization) help identify and remove less important connections while preserving model performance.
- Benefits of sparsity include faster training/inference, lower computational costs, and improved memory efficiency, making models more scalable for real-world applications.
------------
This method of learning is neither novel nor original; it is just how I have learnt to learn over the years. Learning anything is just doing this over and over again for a long time. I have pages and pages of notes in Obsidian that look basically like this, on all sorts of subjects.
*Published: 17/02/2025*