Finetuning LLMs
Finetune Llama, Mistral, Mixtral and more on one or more GPUs
LLMs are pre-trained on massive datasets of text and code. This makes them versatile for various tasks, but they may not perform optimally on your specific domain or data.
Finetuning allows you to train these models on your data, enhancing their performance and tailoring them to your unique requirements.
Fine-tuning with TrueFoundry allows you to bring your own data and fine-tune popular open-source LLMs such as Llama 2, Mistral, Zephyr, Mixtral, and more. To make this easy, we provide pre-configured resource options and apply well-tested training techniques by default. You can fine-tune using either Jobs or Notebooks, and you can track the progress of fine-tuning through ML Repositories.
Supported Architectures and Sizes
We support model sizes of up to 70B for the following model architectures
The following architectures are supported on a best-effort basis
QLoRA
For fine-tuning, TrueFoundry uses the QLoRA technique, which balances capability and efficiency. QLoRA keeps the base model frozen in a quantized 4-bit form and trains only a small set of low-rank adapter (LoRA) weights on top of it. This keeps memory usage low, so you can fine-tune on smaller hardware (even just one GPU), saving time, money, and resources while staying close to the quality of full fine-tuning.
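To give a rough sense of why this helps, QLoRA stores the frozen base model weights in 4-bit precision instead of 16-bit. The sketch below estimates the memory needed for the weights alone at both precisions. It is back-of-the-envelope arithmetic, not a TrueFoundry API: the function name `weight_memory_gb` is hypothetical, and the estimate ignores quantization constants, activations, adapter weights, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Compare 16-bit and 4-bit storage for two illustrative model sizes.
for name, params in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(params, 16)
    four_bit = weight_memory_gb(params, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 4-bit ~{four_bit:.1f} GB")
# 7B: 16-bit ~14.0 GB, 4-bit ~3.5 GB
# 70B: 16-bit ~140.0 GB, 4-bit ~35.0 GB
```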
Pre-requisites
Before you begin, ensure you have the following:
- Workspace:
To deploy your LLM, you'll need a workspace. If you don't have one, you can create it using this guide: Create a Workspace or seek assistance from your cluster administrator.
Setting up the Training Data
We support two different data formats:
Chat
Data needs to be in jsonl format, with each line containing a whole conversation in OpenAI Chat format.
Each line contains a key called messages. Each messages key contains a list of messages, where each message is a dictionary with role and content keys. The role key can be user, assistant, or system, and the content key contains the message content.
Example:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris"}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "William Shakespeare"}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "384,400 kilometers"}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
...
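Before uploading, it can help to check that every line follows the structure described above. The following is a minimal sketch using only the Python standard library. The function names `validate_chat_line` and `validate_chat_file` are hypothetical helpers, not part of TrueFoundry, and the checks cover only the rules stated in this section.

```python
import json
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_line(line: str) -> List[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has invalid role: {role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} 'content' must be a string")
    return problems


def validate_chat_file(path: str) -> Dict[int, List[str]]:
    """Map 1-based line numbers to problems, skipping blank lines."""
    errors = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if line.strip():
                problems = validate_chat_line(line)
                if problems:
                    errors[line_no] = problems
    return errors
```

Calling `validate_chat_file("train.jsonl")` (the file name is a placeholder) returns an empty dictionary when every line passes these checks.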
Completion
Data needs to be in jsonl format, with each line containing a JSON-encoded object with two keys: prompt and completion.
Example:
{"prompt": "What is 2 + 2?", "completion": "The answer to 2 + 2 is 4"}
{"prompt": "Flip a coin", "completion": "I flipped a coin and the result is heads!"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...
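If your prompts and completions live in Python, one way to produce this file is to serialize each pair with `json.dumps`, which takes care of escaping quotes and newlines so that every record stays on a single line. This is a sketch only: `to_completion_line` and `write_completion_jsonl` are hypothetical helper names, and the output path is whatever you choose.

```python
import json
from typing import Iterable, Tuple


def to_completion_line(prompt: str, completion: str) -> str:
    """Serialize one prompt/completion pair as a single JSON line."""
    return json.dumps(
        {"prompt": prompt, "completion": completion}, ensure_ascii=False
    )


def write_completion_jsonl(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    """Write (prompt, completion) pairs to a jsonl file, one record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, completion in pairs:
            f.write(to_completion_line(prompt, completion) + "\n")
```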
You can further split your data into training data and evaluation data.
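A simple way to do the split is to shuffle the lines with a fixed seed and hold out a fraction for evaluation. The sketch below works for either data format because it treats each line as opaque. The helper names, the 10% default, and the seed are illustrative assumptions, not platform requirements.

```python
import random
from typing import List, Sequence, Tuple


def split_lines(
    lines: Sequence[str], eval_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle lines reproducibly and return (train_lines, eval_lines)."""
    if not 0.0 <= eval_fraction < 1.0:
        raise ValueError("eval_fraction must be in [0, 1)")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def split_jsonl(
    in_path: str, train_path: str, eval_path: str, eval_fraction: float = 0.1
) -> None:
    """Read a jsonl file and write separate train and eval jsonl files."""
    with open(in_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    train, evaluation = split_lines(lines, eval_fraction)
    for path, subset in ((train_path, train), (eval_path, evaluation)):
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in subset)
```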
Once your data is prepared, choose where to store it:
- TrueFoundry Artifact: Upload it as a TrueFoundry artifact for easy access.
- Cloud Storage: Upload it to a cloud storage service.
- Local Machine: Save it directly on your computer.
Upload to a TrueFoundry Artifact
If you prefer to upload your training data directly to TrueFoundry as an artifact, follow the Add Artifacts via UI guide and upload your .jsonl training data file.
Upload to a cloud storage
You can upload your data to an S3 bucket using the following command:
aws s3 cp "path-to-training-data-file-locally" s3://bucket-name/dir-to-store-file
Once done you can generate a pre-signed URL of the S3 Object using the following command:
aws s3 presign s3://bucket-name/path-to-training-data-file-in-s3
Now you can use this pre-signed URL in the fine-tuning job / notebook.
Similarly, you can also upload to Azure Blob Storage or Google Cloud Storage (GCS).
Fine-Tuning an LLM
Now that your data is prepared, you can start fine-tuning your LLM. You have two options: deploying a fine-tuning notebook for experimentation or launching a dedicated fine-tuning job.
- Notebooks: Experimentation Playground
Notebooks offer an ideal setup for explorative and iterative fine-tuning. You can experiment on a small subset of data, trying different hyperparameters to figure out the ideal configuration for the best performance. Thanks to the interactive setup, you can analyze the intermediate results to gain deeper insights into the LLM's behavior and response to different training parameters.
Therefore, notebooks are strongly recommended for early-stage exploration and hyperparameter tuning.
- Jobs: Reliable and Scalable
Once you've identified the optimal hyperparameters and configuration through experimentation, transitioning to a deployment job lets you fine-tune on the whole dataset with fast and reliable training. Jobs give you consistent and reproducible training runs, and built-in retry mechanisms automatically handle transient failures without manual intervention.
Consequently, deployment jobs are the preferred choice for large-scale LLM finetuning, particularly when the optimal configuration has been established through prior experimentation.
Hyperparameters
Fine-tuning an LLM requires adjusting key parameters to optimize its performance on your specific task. Here are some crucial hyperparameters to consider:
- Epochs: This determines the number of times the model iterates through the entire training dataset.
Too many epochs can lead to overfitting, and too few might leave the model undertrained. Start with a moderate number and increase it until the validation performance starts dropping.
- Learning Rate: This defines how quickly the model updates its weights based on errors.
Too high can cause instability and poor performance, and too low can lead to slow learning. Start small and gradually increase if the fine-tuning is slow.
- Batch Size: This controls how many data points the model processes before adjusting its internal parameters.
Choose a size based on memory constraints and desired training speed. Too high can strain resources, and too low might lead to unstable updates.
- LoRA Alpha and R: R is the rank of the low-rank adapter matrices that LoRA trains, and Alpha is a scaling factor applied to the adapter update (the update is scaled by Alpha / R).
A higher R adds trainable capacity at the cost of memory and compute. Values that are too high might lead to instability or overfitting, and values that are too low might limit potential performance.
- Max Length: This defines the maximum sequence length the model can process at once.
Choose based on your task's typical input and output lengths. Too short can truncate context, and too long can strain resources and memory.
The optimal values for these hyperparameters depend on your specific LLM, task, and dataset. Be prepared to experiment and iteratively refine your settings for optimal performance.
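To make a few of these quantities concrete, the sketch below works through the arithmetic behind them. In standard LoRA, the update added to a weight matrix of shape d_out x d_in is the product of a d_out x r matrix and an r x d_in matrix, scaled by alpha / r. The helper names are hypothetical, the example numbers are illustrative only, and the step count assumes the last partial batch of each epoch is kept, which depends on the trainer's settings.

```python
import math


def lora_scaling(alpha: float, r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


def optimizer_steps(
    num_examples: int, batch_size: int, epochs: int, grad_accum: int = 1
) -> int:
    """Total optimizer steps, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


print(lora_scaling(alpha=32, r=16))             # 2.0
print(lora_params_per_layer(4096, 4096, r=16))  # 131072
print(optimizer_steps(1000, batch_size=8, epochs=3))  # 375
```

For comparison, the full 4096 x 4096 matrix has about 16.8 million parameters, so at R = 16 the adapter trains well under 1% of that layer's weights.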
Fine-Tuning using a Notebook
Fine-Tuning using a Job
Before you start, you will first need to create an ML Repo (this will be used to store your training metrics and artifacts, such as your checkpoints and models) and give your workspace access to the ML Repo. You can read more about ML Repos here.
Now that your ML Repo is set up, you can create the fine-tuning job.
Deploying the Fine-Tuned Model
Once your fine-tuning is complete, the next step is to deploy the fine-tuned LLM.
You can learn more about how to send requests to your deployed LLM using the following guide
Advanced: Merging LoRA Adapters and Uploading the Merged Model
Currently, the fine-tuning Job/Notebook only converts the best checkpoint to a merged model, to save GPU compute time. Sometimes, however, you might want to pick an intermediate checkpoint, merge it, and re-upload it as a model.
Upcoming Improvements
We are working on building this feature as part of the platform. Until then, the code in this section can be run locally or in a Notebook on the platform.
First, make sure the following requirements are installed:
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.3.0+cu121
tokenizers==0.19.1
transformers==4.42.3
accelerate==0.31.0
peft==0.11.1
truefoundry[ml]>=0.2.8,<1.0.0
Make sure you are logged in to TrueFoundry:
tfy login --host <Your TrueFoundry Platform URL>
Next, save this script as merge_and_upload.py:
import argparse
import json
import math
import os
import re
import shutil
from typing import Any, Dict, Optional

import numpy as np
from huggingface_hub import scan_cache_dir
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from truefoundry import ml as mlfoundry
from truefoundry.ml import get_client


def get_or_create_run(
    ml_repo: str, run_name: str, auto_end: bool = False, create_ml_repo: bool = False
):
    client = mlfoundry.get_client()
    if create_ml_repo:
        client.create_ml_repo(ml_repo=ml_repo)
    try:
        # Reuse the run if it already exists
        run = client.get_run_by_name(ml_repo=ml_repo, run_name=run_name)
    except Exception as e:
        # Only a missing run is expected here; re-raise anything else
        if "RESOURCE_DOES_NOT_EXIST" not in str(e):
            raise
        run = client.create_run(ml_repo=ml_repo, run_name=run_name, auto_end=auto_end)
    return run


def log_model_to_mlfoundry(
    run: mlfoundry.MlFoundryRun,
    model_name: str,
    model_dir: str,
    hf_hub_model_id: str,
    metadata: Optional[Dict[str, Any]] = None,
    step: int = 0,
):
    # Copy so the caller's dictionary is not modified
    metadata = dict(metadata or {})
    print("Uploading Model...")
    # Collect custom modeling code (.py files) shipped with the base model
    hf_cache_info = scan_cache_dir()
    files_to_save = []
    for repo in hf_cache_info.repos:
        if repo.repo_id == hf_hub_model_id:
            for revision in repo.revisions:
                for file in revision.files:
                    if file.file_path.name.endswith(".py"):
                        files_to_save.append(file.file_path)
            break

    # copy the files to output_dir of pipeline
    for file_path in files_to_save:
        match = re.match(r".*snapshots\/[^\/]+\/(.*)", str(file_path))
        if match:
            relative_path = match.group(1)
            destination_path = os.path.join(model_dir, relative_path)
            os.makedirs(os.path.dirname(destination_path), exist_ok=True)
            shutil.copy(str(file_path), destination_path)
        else:
            print("Python file in hf model cache in unknown path:", file_path)

    metadata.update(
        {
            "pipeline_tag": "text-generation",
            "library_name": "transformers",
            "base_model": hf_hub_model_id,
            "huggingface_model_url": f"https://huggingface.co/{hf_hub_model_id}",
        }
    )
    # Drop non-finite floats (NaN/inf) but keep everything else. The original
    # filter kept only numeric values, which also discarded the string entries
    # added above and the "checkpoint" entry passed in by the caller.
    metadata = {
        k: v
        for k, v in metadata.items()
        if not isinstance(v, (float, np.floating)) or math.isfinite(v)
    }
    run.log_model(
        name=model_name,
        model_file_or_folder=model_dir,
        framework=mlfoundry.ModelFramework.TRANSFORMERS,
        metadata=metadata,
        step=step,
    )
    print(f"You can view the model at {run.dashboard_link}?tab=models")


def merge_and_upload(
    hf_hub_model_id: str,
    ml_repo: str,
    run_name: str,
    artifact_version_fqn: str,
    saved_model_name: str,
    dtype: str = "bfloat16",
    device_map: str = "auto",
):
    import torch

    client = get_client()

    # Allow device_map to be passed as a JSON dictionary on the command line
    if device_map.startswith("{"):
        device_map = json.loads(device_map)

    # Download the LoRA adapter checkpoint
    artifact_version = client.get_artifact_version_by_fqn(artifact_version_fqn)
    lora_model_path = artifact_version.download()

    # Load the base model, attach the adapter, and merge it into the base weights
    tokenizer = AutoTokenizer.from_pretrained(hf_hub_model_id)
    model = AutoModelForCausalLM.from_pretrained(
        hf_hub_model_id, device_map=device_map, torch_dtype=getattr(torch, dtype)
    )
    model = PeftModel.from_pretrained(model, lora_model_path)
    model = model.merge_and_unload(progressbar=True)

    # Save the merged model and tokenizer locally
    merged_model_dir = os.path.abspath("./merged")
    os.makedirs(merged_model_dir, exist_ok=True)
    tokenizer.save_pretrained(merged_model_dir)
    model.save_pretrained(merged_model_dir)

    # Upload the merged model to the ML Repo
    run = get_or_create_run(
        ml_repo=ml_repo,
        run_name=run_name,
        auto_end=False,
        create_ml_repo=False,
    )
    log_model_to_mlfoundry(
        run=run,
        model_name=saved_model_name,
        model_dir=merged_model_dir,
        hf_hub_model_id=hf_hub_model_id,
        metadata={
            "checkpoint": artifact_version_fqn,
        },
        step=artifact_version.step,
    )


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--hf_hub_model_id", type=str, required=True)
    parser.add_argument("--ml_repo", type=str, required=True)
    parser.add_argument("--run_name", type=str, required=True)
    parser.add_argument("--artifact_version_fqn", type=str, required=True)
    parser.add_argument("--saved_model_name", type=str, required=True)
    parser.add_argument(
        "--dtype",
        type=str,
        default="bfloat16",
        choices=["bfloat16", "float16", "float32"],
    )
    parser.add_argument("--device_map", type=str, default="auto")
    args = parser.parse_args()
    merge_and_upload(
        hf_hub_model_id=args.hf_hub_model_id,
        ml_repo=args.ml_repo,
        run_name=args.run_name,
        artifact_version_fqn=args.artifact_version_fqn,
        saved_model_name=args.saved_model_name,
        dtype=args.dtype,
        device_map=args.device_map,
    )


if __name__ == "__main__":
    main()
Finally, run this script. Run it with --help to see the available options:
usage: merge_and_upload.py [-h] --hf_hub_model_id HF_HUB_MODEL_ID --ml_repo ML_REPO --run_name RUN_NAME --artifact_version_fqn ARTIFACT_VERSION_FQN
                           --saved_model_name SAVED_MODEL_NAME [--dtype {bfloat16,float16,float32}] [--device_map DEVICE_MAP]

options:
  -h, --help            show this help message and exit
  --hf_hub_model_id HF_HUB_MODEL_ID
                        HuggingFace Hub model id to merge the LoRa adapter with. E.g. `stas/tiny-random-llama-2`
  --ml_repo ML_REPO     ML repo to log the merged model to
  --run_name RUN_NAME   Name of the run to log the merged model to. If the run does not exist, it will be created.
  --artifact_version_fqn ARTIFACT_VERSION_FQN
                        Artifact version FQN of the LoRa adapter to merge with the HF model
  --saved_model_name SAVED_MODEL_NAME
                        Name of the model to log to MLFoundry
  --dtype {bfloat16,float16,float32}
                        Data type to load the base model
  --device_map DEVICE_MAP
                        device_map to use when loading the model. auto, cpu, or a dictionary of device_map
For example, the following command merges the checkpoint artifact:truefoundry/llm-experiments/ckpt-finetune-2024-03-14T05-00-55:7 with its base model stas/tiny-random-llama-2 and uploads it as finetuned-tiny-random-llama-checkpoint-7 to the run finetune-2024-03-14T05-00-55 in the ML Repo llm-experiments:
python merge_and_upload.py \
--hf_hub_model_id stas/tiny-random-llama-2 \
--ml_repo llm-experiments \
--run_name finetune-2024-03-14T05-00-55 \
--artifact_version_fqn artifact:truefoundry/llm-experiments/ckpt-finetune-2024-03-14T05-00-55:7 \
--saved_model_name finetuned-tiny-random-llama-checkpoint-7 \
--dtype bfloat16 \
--device_map auto