Hyperparameters

Understanding Hyperparameters in AI and Machine Learning

This guide covers the key hyperparameters and settings that influence the performance of AI models, including context window size, embedding size, quantization, and temperature.


Context Window Size

Context window size is the maximum number of tokens the model can process at once. In chat-style models this budget typically covers both the input and the text generated so far. In other words, it is how much of the conversation or input history the model can consider when making its predictions. For example, if you are having a conversation with the AI, the context window determines how many of the previous messages the model can “remember” and use to generate a coherent response.

A larger context window lets the model take more of the previous conversation into account, which leads to more contextually aware responses, but it also requires more memory and processing power, which can slow down performance. The window size itself is fixed for a given model. What grows as the conversation goes on is the amount of history that has to fit inside it. Once the history exceeds the window, the oldest content has to be dropped or summarized, and the model can no longer see it.
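To make the idea of a limited window concrete, here is a minimal sketch of one common way to fit a conversation into a fixed token budget: keep the newest messages and drop the oldest. The names `count_tokens` and `fit_to_context` are hypothetical, and counting whitespace-separated words is only a stand-in for a real tokenizer.

```python
def count_tokens(text):
    # Assumption: one token per whitespace-separated word.
    # Real tokenizers split text differently and usually produce more tokens.
    return len(text.split())


def fit_to_context(messages, max_tokens):
    """Keep the most recent messages whose combined token count fits in max_tokens."""
    kept = []
    used = 0
    # Walk backwards so the newest messages are considered first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older falls outside the window
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "My name is Ada.",
    "I like chess and hiking.",
    "What is my name?",
]

print(fit_to_context(history, max_tokens=9))
print(fit_to_context(history, max_tokens=20))
```

With a budget of 9 tokens the first message no longer fits, so a model seeing only the kept messages could not answer the question. With a budget of 20 the whole history is kept.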

Analogy with Humans

A rough human analogy for the context window is short-term memory. When we talk with someone, we hold the recent conversation in short-term memory, which lets us recall what was just said and respond coherently. The context window size plays the role of the capacity of that memory: the maximum amount of information that can be held and used to generate a response. With a small context window, the AI model can only remember a few words or sentences from the conversation history, which limits its ability to generate coherent responses. With a large context window, the model can remember much more of the conversation history, which helps it generate more coherent responses. However, it may require more computational resources, which can slow down performance.

Embedding Size

The embedding size is the length of the vector used to represent each token in the input text. It is a hyperparameter that affects the model’s ability to learn and represent the relationships between words. A larger embedding size allows the model to capture more nuanced relationships between words, which can lead to more accurate and contextually relevant responses. However, increasing the embedding size also increases the memory and computational cost of training and running the model. The embedding size is fixed when the model’s architecture is defined, so it is an essential parameter to consider when designing a model or choosing a pre-trained one to fine-tune for specific tasks or applications.

To understand it better we can take an example of a word embedding. A word embedding is a vector representation of a word in a high-dimensional space. Each word in the vocabulary is represented by a unique vector, and the distance between vectors reflects the semantic similarity between words. For example, words with similar meanings will have similar vector representations, while words with different meanings will have dissimilar vector representations. The embedding size determines the dimensionality of the word vectors, with larger embedding sizes capturing more complex relationships between words. By adjusting the embedding size, you can control the model’s ability to understand and generate text, balancing between accuracy and computational efficiency.
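The cost side of this trade-off can be estimated with simple arithmetic. The sketch below assumes a lookup table that stores one vector per vocabulary entry. The function name `embedding_table_bytes`, the 50,000-word vocabulary and the 4-byte (32-bit float) values are illustrative assumptions, not figures from any particular model.

```python
def embedding_table_bytes(vocab_size, embedding_size, bytes_per_value=4):
    """Memory needed to store one embedding vector per vocabulary entry."""
    return vocab_size * embedding_size * bytes_per_value


VOCAB_SIZE = 50_000  # hypothetical vocabulary size

for size in (128, 768, 4096):
    mib = embedding_table_bytes(VOCAB_SIZE, size) / (1024 ** 2)
    print(f"embedding size {size:>4}: {mib:8.1f} MiB")
```

The table grows linearly with the embedding size, and this counts only the embedding table. The rest of the model usually grows with the embedding size as well.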

Quantization

Quantization is a technique used to reduce the size of a neural network model by reducing the precision of its weights, for example by storing them as 8-bit or 4-bit integers instead of 16-bit or 32-bit floating-point numbers. This can result in faster processing and lower memory usage, especially on devices with limited computational resources. It is used in various machine learning models, including the large language models that tools like Ollama run locally. Quantization can be done using various techniques, such as post-training quantization (PTQ), which converts an already trained model, and quantization-aware training (QAT), which simulates the reduced precision during training or fine-tuning so the model can adapt to it. However, quantization can also affect the model’s accuracy, so it is essential to balance the trade-off between efficiency and accuracy when using this technique.
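As an illustration of what reducing precision means, here is a minimal sketch of symmetric 8-bit round-to-nearest quantization applied to a handful of weights. It is a simplified stand-in, not the scheme used by any specific tool. The function names and the sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.05, -1.27, 0.33]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("integers:", q)
print("largest rounding error:", max(errors))
```

Each weight now needs one byte instead of four, at the price of a rounding error of at most half the scale factor per weight. That error is the source of the accuracy loss mentioned above.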

Temperature

Temperature is a hyperparameter used in language models to control the randomness of the generated text. It works by scaling the model’s scores for each candidate token before they are turned into probabilities. A higher temperature value results in more diverse and creative responses, while a lower temperature value produces more conservative and predictable responses. By adjusting the temperature, you can control the balance between creativity and coherence in the model’s output. For example, a high temperature value can lead to more imaginative and unexpected responses, while a low temperature value tends to produce more focused and repeatable answers. Temperature is set at generation time rather than learned during training, so it is a crucial parameter to tune when applying a language model to specific tasks or applications.
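The effect of temperature can be seen in a small sketch of the usual formulation: the model’s raw scores (logits) are divided by the temperature and then passed through a softmax to obtain probabilities. The function name and the three example logits are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}:", [round(p, 3) for p in probs])
```

At a low temperature most of the probability goes to the top-scoring token, so sampling is predictable. At a high temperature the probabilities flatten out, so less likely tokens are chosen more often.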


How Comprehension Works in AI Models

When a prompt, such as a question or a statement, is given to an AI model, the model processes it by analyzing the structure and meaning of the text. These models do not “understand” language in the human sense but learn patterns based on statistical relationships between words and phrases in vast datasets. These patterns are not hand-written rules; they are picked up during training. For example, a model can learn to tell different tenses apart because each tense appears in consistently different contexts in the training data.

Modern AI models, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT-3 (Generative Pre-trained Transformer 3), rely on deep learning techniques to predict the most likely response based on context. These models use embeddings—numerical representations of words—allowing them to capture semantic relationships. For example, if the embedding size is 100, each word is represented as a 100-dimensional vector.

Handling ambiguity or abstract concepts is one of the key challenges in AI. Models like BERT consider bidirectional context, meaning they analyze words in relation to the words before and after them. Transformer-based models process entire sentences simultaneously, making them more effective at grasping meaning compared to earlier AI models that processed text sequentially.

The choice of embedding size and model complexity affects performance. A larger embedding size can capture more linguistic nuances but requires more computational power and memory. Striking the right balance is crucial for building efficient AI models that generate accurate and meaningful responses.


Vector Example

Suppose we have two words, king and queen. If the embedding size is small, say 2 dimensions, their vectors might look like this:

king: [0.5, 0.3]
queen: [0.4, 0.2]

But with a larger embedding size, like 5 dimensions, they might be:

king: [0.7, 0.2, 0.1, 0.8, 0.4]
queen: [0.6, 0.3, 0.2, 0.7, 0.5]

In the larger space, the vectors have more room to capture nuanced relationships, because each additional dimension can encode another aspect of meaning. The values above are made up for illustration; real embeddings are learned during training. But how is similarity between two vectors measured? A common choice is cosine similarity, which is based on the angle between the vectors: the smaller the angle, the more similar the words. So in higher dimensions, the model can better capture the subtle differences and similarities between words.

Calculating Cosine Similarity

Cosine similarity is the cosine of the angle between two vectors. A value close to 1 means the vectors point in nearly the same direction, which indicates that the words are similar. For vectors A and B:

$$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where:

  • A · B = Dot product of vectors A and B

  • ||A|| = Magnitude (length) of vector A

  • ||B|| = Magnitude (length) of vector B

For “king” and “queen” with 5-dimensional vectors:

$$
\begin{aligned}
A \cdot B &= (0.7 \times 0.6) + (0.2 \times 0.3) + (0.1 \times 0.2) + (0.8 \times 0.7) + (0.4 \times 0.5) \\
&= 0.42 + 0.06 + 0.02 + 0.56 + 0.20 = 1.26 \\
\|A\| &= \sqrt{0.7^2 + 0.2^2 + 0.1^2 + 0.8^2 + 0.4^2} = \sqrt{0.49 + 0.04 + 0.01 + 0.64 + 0.16} = \sqrt{1.34} \approx 1.158 \\
\|B\| &= \sqrt{0.6^2 + 0.3^2 + 0.2^2 + 0.7^2 + 0.5^2} = \sqrt{0.36 + 0.09 + 0.04 + 0.49 + 0.25} = \sqrt{1.23} \approx 1.109 \\
\text{Cosine Similarity} &= \frac{1.26}{1.158 \times 1.109} \approx \frac{1.26}{1.284} \approx 0.981
\end{aligned}
$$

This high similarity score indicates that “king” and “queen” are closely related in this larger embedding space.
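The same calculation can be checked with a few lines of code. This is a minimal sketch using only the standard library. The function name `cosine_similarity` is chosen here for illustration, and the vectors are the made-up 5-dimensional values from the example above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king = [0.7, 0.2, 0.1, 0.8, 0.4]
queen = [0.6, 0.3, 0.2, 0.7, 0.5]

print(round(cosine_similarity(king, queen), 3))  # prints 0.981
```

Carrying full precision through the calculation gives roughly 0.981. Rounding the two magnitudes to two decimals before dividing shifts the result slightly in the third decimal place.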


Conclusion

Imagine each word as a point in a multi-dimensional space. The more dimensions (larger embedding size), the more precisely each point can be located. This precision helps the model understand context better. For example, in a sentence like “The bank can’t handle the crisis,” the word “bank” could refer to a financial institution or the side of a river. A larger embedding size might help the model distinguish between these meanings based on surrounding words.

But if the embedding size is too large, the model might overfit to the training data, meaning it performs well on seen data but poorly on new data. So finding the right balance is key. It’s like choosing the right resolution for an image—too high, and it’s unnecessarily detailed and bulky; too low, and it loses important features.

Embedding size relates to other hyperparameters like context window size and quantization. Quantization reduces the precision of these vectors, which can make the model faster but less accurate. So if you have a large embedding size, quantization might help manage the computational load without completely sacrificing performance.

In summary, embedding size is crucial because it determines how detailed and accurate the model’s understanding of language can be. However, it’s all about finding the right size that balances accuracy with computational efficiency. Too small, and the model can’t capture nuances; too large, and it becomes slow and resource-heavy.