Marcel Cremer | Tech, Digitalisation and Entrepreneurship
July 23, 2024

Navigating the AI Jungle - A Beginner's Guide to AI Models and Terminology


(This text was translated by ChatGPT from the German version.)

Nowadays, everyone is talking about artificial intelligence. However, most of the time, they are only referring to ChatGPT or corresponding wrappers around its API. When you first start dealing with the many different models, you are likely to feel as overwhelmed as I did at the beginning. For this reason, I want to explain the most important terms in this article, so you can navigate the AI jungle.

What exactly are the traits of artificial intelligence?

Even though you generally only hear “artificial intelligence”:

AI is not just AI. And a computer algorithm that only uses if-else statements is not “Artificial Intelligence” (AI). AI is always characterized by making human-like decisions, such as

- understanding and producing natural language
- recognizing objects, faces, or patterns in images
- drawing conclusions from incomplete or noisy data

…and so on. To make this possible, neural networks (more or less mathematical models whose units can take on states similar to human neurons) are usually fed with a lot of data. Each neuron carries certain values (its weights). Initially, these values can all be the same, or even chosen randomly. Then the network is fed with data, outputs are produced through mathematical activation functions (e.g., Sigmoid, ReLU, Tanh), and checked against the expected result. If they largely match, the values stay as they are; otherwise, they are adjusted in small steps towards the target value. In the end, these neurons have “learned” certain things, and you get a model.
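The learning loop just described can be sketched for a single artificial neuron. This is a toy illustration with made-up numbers, not a real training setup:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# One artificial "neuron": a single weight, nudged in small steps
# until its output largely matches the expected result.
weight = 0.0            # initial value (could also be chosen randomly)
x, target = 1.0, 0.8    # training input and expected output

for _ in range(1000):
    output = sigmoid(weight * x)
    error = output - target
    weight -= 0.1 * error          # small step towards the target value

print(round(sigmoid(weight * x), 2))  # has "learned" to output ~0.8
```

Real networks do the same with millions or billions of such values at once, using backpropagation to decide the direction of each small step.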

Types of AI Models

Since there are different applications of artificial intelligence, there are, of course, different models. Here is an incomplete list to get an impression of the versatility:

Large Language Models (LLMs)

Large Language Models (LLMs) like GPT, Llama, Claude, MistralAI, WizardLM, … specialize in performing natural language tasks. They are trained on huge, normalized text corpora (e.g., websites, short stories, …) in various formats (assistant, instruction, chat, …). They can solve tasks such as:

- answering questions and holding conversations
- summarizing and analyzing text
- translating between languages
- generating new text

LLMs are versatile and are often used in chatbots, text analysis tools, and translation services.

Image Generation

Image generation models, like Stable Diffusion, are trained to produce high-quality images from noise. Their applications include:

- creating artwork and illustrations from text descriptions
- producing design drafts and concept art
- editing or extending existing images

They work by using large amounts of tagged images (e.g., cat, dog, pretty, ugly, deformed, …) for training a model. Once you have a model, you can specify what you want during image generation (A cat in front of the Empire State Building at night) and what you don’t want (deformed, too_many_limbs, …).

Computer Vision Models

Computer Vision Models specialize in analyzing and interpreting visual data. Examples include:

- Convolutional Neural Networks (CNNs) for image classification
- object detection and face recognition
- medical image analysis

CNNs are also used in modern OCR solutions to convert scanned images into text or recognize handwriting.

Other Types of Models

In addition to the ones mentioned, there are many other models. Examples include:

- speech recognition and text-to-speech models
- recommendation systems
- models for tabular or time-series data

It’s also exciting that multi-purpose models can be trained to perform multiple tasks (e.g., generating text and interpreting Excel files).

Where Can I Get Such Models?

Hugging Face has established itself as the de facto standard for AI models. The platform offers an extensive library of pre-trained models that are easily accessible and can be used in a variety of applications. Additionally, there is a convenient CLI for downloading such models and a VRAM calculator to see whether a particular model fits into the graphics card memory (VRAM).

So Many Terms, Help!

You just want to try a model, but which one should you choose? This is probably the most difficult question in the whole topic. New models are emerging hourly, and if you’re not interested in mathematical benchmarks, often the only way to decide is to try them out. To make a first selection, however, it is useful to learn a bit of AI terminology:

7b, 13b, 32b, …

Numbers like 7b or 1.5T indicate how many parameters a model uses (b = billion, T = trillion). Parameters are the numerical values (weights and biases) that the network learns during training, but going deeper would be beyond this article. As a rule of thumb: the more parameters a model has, the more complex the tasks it can solve, and the more VRAM it needs.
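As a rough back-of-the-envelope sketch of the VRAM rule of thumb (this only counts the weights themselves; context, activations, and runtime overhead come on top):

```python
# Rule-of-thumb sketch: memory needed just to hold a model's weights.
# Actual VRAM usage is higher (context, activations, framework overhead).
def weight_memory_gb(params_billion, bits_per_param):
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1024**3

print(round(weight_memory_gb(7, 16), 1))   # 7B at 16 bits: ~13 GB
print(round(weight_memory_gb(7, 4), 1))    # 7B at 4 bits:  ~3.3 GB
```

This also hints at why quantization (covered below) matters so much for running models on consumer GPUs.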

Context Sizes

Most models indicate the context sizes they were trained with (e.g., 2048 or 4096). You need to know that AI does not work directly with words, pixels, or similar, but with so-called tokens. A token could be, for example, a syllable of a word, or even just a letter, or an 8-pixel group in an image. Everything an AI “knows” to fulfill its task must fit into this context, including, for example, the chat history or additional information about the user. You can think of it as short-term memory: what doesn’t fit into the context, the AI doesn’t know. Larger is better, but it uses more VRAM and increases computation time.

Quantization: Q3_K_S vs Q4_K_S

If a model is quantized, the number of bits used to represent a parameter has been reduced. This can significantly improve the efficiency and speed of a model, but the accuracy of the model also decreases.

The Q… designations therefore indicate which quantization has been applied to the model. For example:

- Q3_K_S: 3-bit k-quantization, “small” variant (smallest files, largest quality loss)
- Q4_K_M: 4-bit k-quantization, “medium” variant (a good balance between size and quality)
- Q8_0: 8-bit quantization (close to the original quality, but much larger files)

Usually, the model description specifies which quantization has what effect. In many cases, it also describes how the quantized model compares to the original model. If you’re unsure where to start, I would recommend a Q4_K_M model and then work your way up or down depending on what fits into the VRAM or produces good enough results for your application.
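A toy version of quantization, sketching how fewer bits per value trade precision for size (the weight values are made up; real schemes like the K-quants are considerably more sophisticated):

```python
# Toy quantization: store weights with fewer bits and see the precision loss.
def quantize_dequantize(weights, bits):
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    ints = [round((w - lo) / scale) for w in weights]   # small integers on disk
    return [lo + i * scale for i in ints]               # values used at runtime

weights = [0.013, -0.872, 0.440, 0.991, -0.205]
print(quantize_dequantize(weights, 4))  # coarser, but needs only 4 bits each
print(quantize_dequantize(weights, 8))  # much closer to the originals
```

The pattern is the same as in real quantized models: lower bit counts shrink the file and speed things up, at the cost of rounding every weight to a coarser grid.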

GGUF (successor of the GGML format)

GGUF is a format for representing models and their structures, developed in the llama.cpp ecosystem as the successor to GGML. It has become the de facto standard for using many types of models. It supports a wide range of model structures and data formats, making it significantly easier to try out a model without having to convert or quantize it yourself. If possible, I would always prefer a model in GGUF format for simplicity.

LoRA (Low-Rank Adaptation)

LoRA stands for Low-Rank Adaptation and is a technique for adapting pre-trained models. LoRAs are NOT standalone models. Instead, they extend existing general-purpose models for a specific purpose. This works by training a small set of additional weights on top of the frozen, pre-trained general model, rather than retraining all of its parameters.

You can think of it a bit like having a database table with users and an age. If you now also want age groups (e.g., 0-17, 18-34, etc.) but don’t want to modify the original table, you could add a second table containing information about the age classes:

| AgeClassID | AgeRange | AgeClass |
|------------|----------|----------|
| 1          | 0-17     | Child    |
| 2          | 18-34    | Youth    |
| 3          | 35-49    | Adult    |
| 4          | 50+      | Senior   |

This way, you can add additional information without having to adjust the original data, saving a lot of computing and training time for AI models. The user table corresponds to the general-purpose model, and the age classes table to the LoRA. This should also clarify why LoRAs alone are not particularly useful.
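In code, the size difference can be sketched like this (d and r are illustrative values; real hidden sizes and adapter ranks vary per model):

```python
# LoRA sketch: the base weight matrix W (d x d) stays frozen; only two
# small matrices B (d x r) and A (r x d) are trained, and W + B @ A
# is used at inference time.
d, r = 4096, 8                  # hidden size and (much smaller) adapter rank
full_params = d * d             # what a full fine-tune would have to update
lora_params = 2 * d * r         # what the LoRA adapter actually trains
print(full_params, lora_params, f"{lora_params / full_params:.2%}")
```

Training well under one percent of the weights is what makes LoRA fine-tunes so cheap, and it is also why a LoRA file is useless without its base model.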

Model Distillation

Sometimes you read about a “distilled” model. This is a method to train “smart” small models. For example, you could train a huge 260B model that can solve extremely complex issues very well. Now you go ahead and train a much smaller model, say a 13B model, not with the original data (which would mean a much poorer parameter fit) but directly with the results of the large model. You “distill” the “knowledge” (i.e., the parameters) that the large teacher model has acquired into a much smaller model. This works best when you want to create a specialized model (e.g., the teacher model knows everything about philosophy, science, etc., and the student model learns mainly about science, perhaps losing some philosophical aspects).

Model Merging

Model merging is a technique to combine different models. This way, the various strengths of the individual models are balanced and combined.

For example: I have a model that is particularly good at physics. I have another model that is particularly good at programming. I need a model that translates physical concepts into simulatable code. Various techniques can be used to combine them:

By model merging, you can create very powerful models by compensating for the weaknesses of existing models with the strengths of others.

Merged models are indicated by designations such as 4x8B, meaning that a new model was created from 4 (number) 8B-parameter models. If the merging is done well, such a model can be on par with, or even surpass, generalized 32B models in many benchmarks while only consulting a fraction of its parameters per token.

Frequently, the config of the merged model is also specified, i.e., which expert models are addressed by which kinds of prompts and how they are balanced. For example:

base_model: Model 1
gate_mode: hidden
dtype: bfloat16
experts_per_token: 2
experts:
  - source_model: Model 1
    positive_prompts:
        - "chat"
        - "conversation"
  - source_model: Model 2
    positive_prompts:
        - "science"
        - "physics"
        - "chemistry"
  - source_model: Model 3
    positive_prompts:
        - "language translation"
        - "foreign language"
    negative_prompts:
        - "programming language"
  - source_model: Model 4
    positive_prompts:
        - "Haskell programming language"
        - "SQL query language"
        - "CSS markup styling language"

Here, a base model (base_model), Model 1 with conversational skills, was taken and combined with a science, a translation, and a programming expert model. For each token, at most 2 expert models (experts_per_token) are consulted, and a new value is determined from them. This procedure is also called MoE (Mixture of Experts). More information can be found at Mergekit.
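A toy sketch of the routing idea behind such a config, with keyword matching standing in for a learned gating network (expert names and keywords are made up):

```python
# MoE sketch: a gate scores every expert for the input, and only the
# top `experts_per_token` experts are consulted.
EXPERTS_PER_TOKEN = 2

experts = {
    "chat":    ["hello", "conversation", "chat"],
    "science": ["physics", "chemistry", "atom"],
    "code":    ["sql", "haskell", "css"],
}

def pick_experts(prompt):
    # stand-in gate: keyword overlap instead of a learned routing network
    scores = {name: sum(word in prompt for word in words)
              for name, words in experts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:EXPERTS_PER_TOKEN]

print(pick_experts("explain the physics of an atom"))
```

In a real MoE, the gate is itself a small trained network that scores experts per token, but the top-k selection works just like this sketch.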

Prompt Format

As I mentioned at the beginning, AI models are trained with normalized data. To get the best results with your models, it is very important to use the same input format that was used when training the AI. Although AIs also understand other phrasings (the explanation would go too far here), you use the model much more efficiently if you speak its language (similar to people when someone states the time unusually, like “quarter past 11” or “quarter to 12”). Simplified: if you phrase the question in the expected format, less performance is spent on understanding it.

Prompt formats are often indicated by something like “chat”, “instruct”, “chat-instruct” or “Alpaca”, “ChatML”, … Just make sure that your client is properly configured (or your script uses the correct format).
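For illustration, here is how the same question might be wrapped in two of these formats. This is a simplified sketch; always check your model's card for the exact template:

```python
# The same question wrapped in two common prompt formats:
# ChatML-style and Alpaca-style.
question = "What is a token?"

chatml = (
    "<|im_start|>user\n"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

alpaca = (
    "### Instruction:\n"
    f"{question}\n\n"
    "### Response:\n"
)
print(chatml)
print(alpaca)
```

Most clients assemble these wrappers for you, which is exactly why picking the right format setting matters.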

Conclusion

I hope this summary helps you get fit in the AI field much faster than I did. Of course, this is just basic information, but hopefully you now know what you don’t know and can quickly read up on the missing knowledge.

Follow me

I work on the SaaS platform MOBIKO, build teams, and sometimes give talks.