Fine-Tuning Llama3 on 1 RTX 3060

Overview

In this post, I'll break down how I prepared a dataset and used Unsloth's notebooks to fine-tune Llama3. I'll also go over how I'm running the resulting model with llama.cpp.

For some background: when Llama3 was first released, the earliest users raved about how impressive it was, but there was confusion around fine-tuning it. Early fine-tunes were reportedly far worse than the base model, supposedly because Llama3 uses "special tokens" that existing tools didn't yet support.

Regardless, I was excited to see what was already out there, so I set out to find the right way to fine-tune Llama3. After a lot of back and forth, I decided to try Unsloth's notebooks for training Llama3, and I'm very glad I did.


Running Google Colab w/ Local Hardware

Unsloth's notebooks are typically hosted on Colab, but you can run the Colab runtime locally using this guide. Connecting my GPU and RAM to my Colab notebook has been a game-changer, allowing me to run the fine-tuning process on my desktop with minimal effort.
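
For reference, the setup in that guide boils down to a few commands (these are from Google's local-runtime docs at the time of writing; check the guide for the current steps):

pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook \
  --NotebookApp.allow_origin='https://colab.research.google.com' \
  --port=8888 \
  --NotebookApp.port_retries=0

Then, in Colab, choose "Connect to a local runtime" and paste the URL (with token) that Jupyter prints, and the notebook's cells execute on your own hardware.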

Preparing the Dataset

For dataset preparation, I stick to simple file operations with a few small Python scripts. Here's how the final JSON dataset needed to be formatted in my case:

{"conversations": [{"from": "system", "value": "Write a tweet in all capital letters."}, {"from": "human", "value": "The topic is: Rap music"}, {"from": "gpt", "value": "I LOVE RAP MUSIIIIIIIIC"}]}
{"conversations": [{"from": "system", "value": "Write a tweet in all capital letters."}, {"from": "human", "value": "The topic is: Pokemon"}, {"from": "gpt", "value": "SNORLAX. THAT'S THE TWEET."}]}

If you need the code for this, ask GPT-4; it can provide better code than what I'm using! What's important is one JSON object per line, each with a "conversations" key containing a properly labeled conversation.

After a few tries, I've found that this strategy works best for me:

  1. Find data online
  2. Scrape
  3. Save to CSV File
  4. Remove trash data
  5. Run a script that injects the system prompt + human prompt + one CSV row at a time and saves the result as a JSONL file (a sketch of this script follows the list)
  6. Combine multiple JSONL files into a massive dataset
  7. Upload to Colab
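
For step 5, here's a minimal sketch of the conversion script. The "topic" and "tweet" column names and the file names are placeholders for whatever your scraped CSV actually contains:

import csv
import json

SYSTEM_PROMPT = "Write a tweet in all capital letters."

# placeholder file names -- swap in your own scraped CSV and output path
with open("scraped.csv", newline="", encoding="utf-8") as f_in, \
     open("tweets.jsonl", "w", encoding="utf-8") as f_out:
    for row in csv.DictReader(f_in):
        record = {
            "conversations": [
                {"from": "system", "value": SYSTEM_PROMPT},
                {"from": "human", "value": "The topic is: " + row["topic"]},
                {"from": "gpt", "value": row["tweet"]},
            ]
        }
        f_out.write(json.dumps(record) + "\n")  # one JSON object per line

Step 6 is then just concatenating the per-source JSONL files into one big file (e.g. cat *.jsonl > full.jsonl), which becomes the full.jsonl loaded below.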

After preparing around 400 lines of data, I imported it into the Colab space and adjusted the dataset code accordingly:

from datasets import load_dataset
-dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
+dataset = load_dataset("json", data_files={"train": "./full.jsonl"}, split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
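
For reference, formatting_prompts_func is defined earlier in the Unsloth notebook. In the ShareGPT-style notebooks it essentially runs each conversation through the tokenizer's chat template and returns a "text" column; roughly speaking, it looks like this (the tokenizer is the one the notebook has already wrapped with the llama-3 chat template):

def formatting_prompts_func(examples):
    # "conversations" holds the list of {"from": ..., "value": ...} turns per example
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}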

Run the Fine-Tune

Using Unsloth's Colab notebook, I successfully ran the entire fine-tuning job locally on an RTX 3060 GPU with 12GB of VRAM. After the fine-tuning process, I exported the model (near the end of the notebook):

-if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q5_k_m")
+if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q5_k_m")

This produced a model-unsloth.gguf file in my directory. Downloading it was exhilarating.


Out of Memory?

If you run out of memory (OOM) while saving the model, you can kill the Python process used by the Colab container (use nvidia-smi to find the PID).

To pick up where you left off, add these lines right before the save_pretrained_gguf call from above:

# run only this cell of the notebook
from unsloth import FastLanguageModel
# reload the fine-tuned LoRA weights saved earlier in the notebook ("lora_model")
model, tokenizer = FastLanguageModel.from_pretrained("lora_model")

if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q5_k_m")

Inferencing

Fine-tuning was only half the battle; running Llama3 was another can of worms. Initially I tried Ollama, but it hadn't been updated for Llama3 yet. vLLM had been updated, but I used llama.cpp on my development machine because I expected it to have a nice UI.

./server --chat-template llama3 -m ./models/model-unsloth.Q5_K_M.gguf --port 8080

Accessing http://localhost:8080 shows a retro page where system and user prompts can be entered for a response. I had to put everything into the system prompt for it to work as expected. This is fine, because my model is being used for one-shot tasks rather than conversations.

(Screenshot: llama.cpp's retro web UI)

The next step was to integrate an app/bot with the REST API, which presented very little challenge. I simply inspected the network requests in the browser DevTools while using the llama.cpp web app and replicated the POST request.

POST http://127.0.0.1:8080/completion HTTP/1.1
Content-Type: application/json

{
    "stream": false,
    "n_predict": 128,
    "temperature": 0.8,
    "stop": [
        "</s>",
        "Llama:",
        "User:"
    ],
    "repeat_last_n": 256,
    "repeat_penalty": 1.18,
    "penalize_nl": false,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
    "tfs_z": 1,
    "typical_p": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "grammar": "",
    "n_probs": 0,
    "min_keep": 0,
    "image_data": [],
    "cache_prompt": true,
    "api_key": "",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nWrite a tweet in all capital letters.\nThe topic is: New York<|eot_id|><|start_header_id|>user<|end_header_id|>\n<|eot_id|><|start_header_id|>gpt<|end_header_id|>"
}

(By the way, if you're looking for a Postman alternative, check out the REST Client VS Code extension.)

The responses contain a content field that holds the model's prediction. Easy. To adjust the parameters, simply change the POST data. I typically focus on "n_predict" and "temperature": a higher "temperature" makes the output more creative, and a larger "n_predict" generates more tokens before stopping.
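
If you'd rather hit the endpoint from Python than from a REST client, here's a minimal sketch using the requests library; it sends the same fields as the request above, trimmed down to the ones I actually change:

import requests

# llama.cpp server started with the ./server command above
URL = "http://127.0.0.1:8080/completion"

payload = {
    "stream": False,
    "n_predict": 128,
    "temperature": 0.8,
    "stop": ["</s>", "Llama:", "User:"],
    "cache_prompt": True,
    "prompt": (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        "Write a tweet in all capital letters.\nThe topic is: New York<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n<|eot_id|>"
        "<|start_header_id|>gpt<|end_header_id|>"
    ),
}

response = requests.post(URL, json=payload)
print(response.json()["content"])  # the "content" field holds the prediction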

The "prompt" field above uses a chat template.1 This format is specific to the base model and to your fine-tuning data. If the input doesn't match the expected chat template, you should expect degraded output. You can read more about Llama3's special tokens and chat templating on Meta's site.
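
Since that prompt string is easy to get wrong by hand, I find it helpful to assemble it from the same role/value turns used in the training data. A small sketch (the "gpt" role name and single-newline separators here mirror my dataset and the request above, not necessarily the stock template):

def build_llama3_prompt(turns):
    # turns is a list of {"from": ..., "value": ...} dicts, as in the training data
    prompt = "<|begin_of_text|>"
    for turn in turns:
        prompt += f"<|start_header_id|>{turn['from']}<|end_header_id|>\n{turn['value']}<|eot_id|>"
    # leave the last header open so the model writes the assistant turn
    return prompt + "<|start_header_id|>gpt<|end_header_id|>"

# reproduces the "prompt" field from the request above
prompt = build_llama3_prompt([
    {"from": "system", "value": "Write a tweet in all capital letters.\nThe topic is: New York"},
    {"from": "user", "value": ""},
])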

Lessons Learned

  • Learned a lot about chat templates for Llama3.
  • Learned that instruct, chat, and completion models require different fine-tuning steps.
  • Learned that you shouldn't reinvent the wheel when it comes to preparing your dataset. People have already published the code; just find a notebook where you only need to change a line or two.
  • Run everything in Docker containers for reproducibility and isolation.
  • Unsloth delivers on the whole efficiency thing.
  • Join Discord servers for open-source projects if you're struggling. Lots of helpful minds.
  • Save the run log after training. It can be used to track improvement across your fine-tunes.
  • Using nvidia-smi and kill -9 <pid> can free up VRAM.

Footnotes

  1. Hugging Face blog on Chat Templates