Unsloth

03 May, 2024 · Python

What is Unsloth?

Unsloth is an open-source framework that speeds up finetuning of Large Language Models.


Unsloth Showcase

Finetune Mistral and Llama 2-5x faster with 50% less memory!

| Llama 7b | Mistral 7b | CodeLlama 34b | Llama 7b Kaggle 2x T4 |
| --- | --- | --- | --- |
| 2.2x faster, 43% less VRAM | 2.2x faster, 62% less VRAM | 1.9x faster, 27% less VRAM | 5.5x faster, 44% less VRAM |
| ⭐ Llama free Colab, 2x faster | ⭐ Mistral free Colab, 2x faster | CodeLlama A100 Colab notebook | ⭐ Kaggle free Alpaca notebook |
| Llama A100 Colab notebook | Mistral A100 Colab notebook | 50+ more examples below! | ⭐ Kaggle free Slim Orca notebook |

| 1 A100 40GB | 🤗 Hugging Face | Flash Attention | 🦥 Unsloth Open Source | 🦥 Unsloth Pro |
| --- | --- | --- | --- | --- |
| Alpaca | 1x | 1.04x | 1.98x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 14.82x |

Install Unsloth

Installation Instructions - Conda

Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1.

```bash
conda install cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=12.1 \
  -c pytorch -c nvidia -c xformers -c conda-forge -y
pip install "unsloth[conda] @ git+https://github.com/unslothai/unsloth.git"
```

Installation Instructions - Pip

Do NOT use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.

1. Find your CUDA version via:

```python
import torch; torch.version.cuda
```
2. For PyTorch 2.1.0: You can update PyTorch via pip (interchange cu121 / cu118 below). Go to https://pytorch.org/ to learn more. Select cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100, etc.), use the "ampere" path. (A helper sketch for choosing the right extras tag automatically follows these steps.)
```bash
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
  --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_ampere] @ git+https://github.com/unslothai/unsloth.git"
```
3. For PyTorch 2.1.1: Use the "ampere" path for newer RTX 30xx GPUs or higher.
```bash
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
  --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[cu118_torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118_ampere_torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_ampere_torch211] @ git+https://github.com/unslothai/unsloth.git"
```
4. We’re working on PyTorch 2.1.2 support.

5. If you get errors, try the below first, then go back to step 1:

```bash
pip install --upgrade pip
```
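
The CUDA-version and GPU-generation checks in steps 1-2 can be combined into a small helper that prints the matching install command. This is only an illustrative sketch; `pick_unsloth_tag` is not part of the unsloth package, and it assumes a CUDA build of PyTorch:

```python
# Illustrative helper for choosing an extras tag (not part of the unsloth package)
import torch

def pick_unsloth_tag(torch_version: str = "2.1.0") -> str:
    cuda = torch.version.cuda                 # e.g. "11.8" or "12.1"; assumes a CUDA build
    major, _minor = torch.cuda.get_device_capability()
    is_ampere_or_newer = major >= 8           # RTX 30xx, A100, H100, ...

    tag = "cu118" if cuda.startswith("11") else "cu121"
    if is_ampere_or_newer:
        tag += "_ampere"
    if torch_version.startswith("2.1.1"):
        tag += "_torch211"
    return tag

print(f'pip install "unsloth[{pick_unsloth_tag()}] @ git+https://github.com/unslothai/unsloth.git"')
```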

Documentation

We support Huggingface’s TRL, Trainer, Seq2SeqTrainer or even Pytorch code!

We’re in 🤗 Huggingface’s official docs! We’re on the SFT docs and the DPO docs!

```python
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support - 4x faster downloading!
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
]

# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
```
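
Once `trainer.train()` finishes, the LoRA adapters can be saved and the finetuned model queried with standard Hugging Face / PEFT calls. The snippet below is a minimal follow-up sketch, not part of the original example; the prompt text and the `lora_model` output directory are placeholders:

```python
# Minimal follow-up sketch (assumes the training script above has just run on a CUDA GPU)
model.save_pretrained("lora_model")       # saves only the LoRA adapter weights
tokenizer.save_pretrained("lora_model")

prompt = "### Instruction:\nName three facts about sloths.\n\n### Response:\n"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
```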

DPO (Direct Preference Optimization) Support

DPO, PPO, and reward modelling all seem to work, as per third-party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4 here: notebook.

We’re in 🤗 Huggingface’s official docs! We’re on the SFT docs and the DPO docs!

```python
from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048 # as in the SFT example above

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()
```
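
`YOUR_DATASET_HERE` must be a preference dataset with `prompt`, `chosen` and `rejected` columns, which is the format TRL's `DPOTrainer` expects. Below is a minimal sketch of building such a dataset; the example rows are placeholders, not from the original README:

```python
# Minimal sketch of the preference-pair format expected by DPOTrainer
# (the example rows are placeholders, not from the original README)
from datasets import Dataset

pairs = {
    "prompt":   ["What is a sloth?"],
    "chosen":   ["A sloth is a slow-moving, tree-dwelling mammal found in Central and South America."],
    "rejected": ["A sloth is a type of fish."],
}
preference_dataset = Dataset.from_dict(pairs)
# pass it as train_dataset = preference_dataset in the DPOTrainer call above
```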

Performance comparisons on 1 Tesla T4 GPU:

Time taken for 1 epoch

One Tesla T4 on Google Colab

bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |

Peak Memory Usage

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
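
For reference, the benchmark settings listed above map roughly onto `TrainingArguments` as follows. This is a hedged reconstruction: reading `schedule_steps = 10` as warmup steps is an assumption, not something stated explicitly.

```python
# Hedged reconstruction of the benchmark hyperparameters listed above
from transformers import TrainingArguments

benchmark_args = TrainingArguments(
    per_device_train_batch_size = 2,   # bsz = 2
    gradient_accumulation_steps = 4,   # ga = 4
    max_grad_norm = 0.3,
    num_train_epochs = 1,
    seed = 3047,
    learning_rate = 2e-4,
    weight_decay = 0.01,
    optim = "adamw_8bit",
    lr_scheduler_type = "linear",
    warmup_steps = 10,                 # assumption: "schedule_steps = 10" read as warmup steps
    output_dir = "outputs",
)
```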

Performance comparisons on 2 Tesla T4 GPUs via DDP:

Time taken for 1 epoch

Two Tesla T4s on Kaggle

bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |

Peak Memory Usage on a Multi GPU System (2 GPUs)

| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 2 T4 | 8.4GB | 6GB | 7.2GB | 5.3GB |
| Unsloth Pro | 2 T4 | 7.7GB | 4.9GB | 7.5GB | 4.9GB |
| Unsloth Max | 2 T4 | 10.5GB | 5GB | 10.6GB | 5GB |

\* Slim Orca uses bsz=1 for all benchmarks since bsz=2 OOMs. We can handle bsz=2, but we benchmark with bsz=1 for consistency.

Llama-Factory 3rd party benchmarking

| Method | Bits | TGS | GRAM | Speed |
| --- | --- | --- | --- | --- |
| HF | 16 | 2392 | 18GB | 100% |
| HF+FA2 | 16 | 2954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4007 | 16GB | 168% |
| HF | 4 | 2415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3726 | 7GB | 160% |

Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.
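
TGS is simply throughput normalised by GPU count; an illustrative one-liner (not from Llama-Factory's code):

```python
# TGS = tokens processed / (number of GPUs * wall-clock seconds) - illustrative only
def tokens_per_gpu_per_second(total_tokens: int, num_gpus: int, seconds: float) -> float:
    return total_tokens / (num_gpus * seconds)
```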


Full benchmarking tables

Click “Code” for a fully reproducible example.

“Unsloth Equal” is a preview of our PRO version, with the code stripped out. All settings and the loss curve remain identical.

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | | |
| % saved | | 15.74 | 47.18 | 53.25 | | |

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| code | Code | Code | Code | Code | | |
| seconds | 581 | 631 | 361 | 315 | 82 | 28 |
| memory MB | 7763 | 8047 | 7763 | 6441 | | |
| % saved | | -3.66 | 0.00 | 17.03 | | |

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| code | Code | Code | Code | Code | | |
| seconds | 1852 | 1558 | 852 | 696 | 367 | 125 |
| memory MB | 26431 | 16565 | 12267 | 11223 | | |
| % saved | | 37.33 | 53.59 | 57.54 | | |

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1824 | 1545 | 821 | 691 | 362 | 123 |
| memory MB | 24557 | 15681 | 10595 | 9007 | | |
| % saved | | 36.14 | 56.86 | 63.32 | | |

Mistral 7b

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| % saved | | 40.99 | 62.06 | 68.74 | | |

CodeLlama 34b

| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | Code | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| % saved | | 16.96 | 31.47 | 44.60 | | |

1 Tesla T4

| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | Code | Code | Code | Code | | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | | |
| % saved | | 1.94 | 10.28 | 24.39 | | |

| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| LAION Chip2 | 1x | 0.99x | 1.80x | 1.75x | 4.15x | 11.75x |
| code | Code | Code | Code | Code | | |
| seconds | 952 | 955 | 529 | 543 | 229 | 81 |
| memory MB | 6037 | 6033 | 5797 | 4855 | | |
| % saved | | 0.07 | 3.98 | 19.58 | | |

| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| OASST | 1x | 1.19x | 1.95x | 1.86x | 2.58x | 7.3x |
| code | Code | Code | Code | Code | | |
| seconds | 2640 | 2222 | 1355 | 1421 | 1024 | 362 |
| memory MB | 14827 | 10391 | 8413 | 7031 | | |
| % saved | | 29.92 | 43.26 | 52.58 | | |

| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Slim Orca | 1x | 1.21x | 1.77x | 1.85x | 2.71x | 7.67x |
| code | Code | Code | Code | Code | | |
| seconds | 2735 | 2262 | 1545 | 1478 | 1009 | 356 |
| memory MB | 13933 | 10489 | 7661 | 6563 | | |
| % saved | | 24.72 | 45.02 | 52.90 | | |

2 Tesla T4s via DDP

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | Code | Code | Code | | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | | |
| % saved | | 0.52 | 24.76 | 26.09 | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| LAION Chip2 | 1x | 1.12x | 5.28x | 4.21x | 10.01x | 28.32x |
| code | Code | Code | Code | | | |
| seconds | 5418 | 4854 | 1027 | 1286 | 541 | 191 |
| memory MB | 7316 | 7316 | 5732 | 5934 | | |
| % saved | | 0.00 | 21.65 | 18.89 | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| OASST (bsz=1) | 1x | 1.14x | 5.56x | 5.09x | 5.64x | 15.97x |
| code | Code | Code | Code | | | |
| seconds | 4503 | 3955 | 811 | 885 | 798 | 282 |
| memory MB | 11896 | 11628 | 6616 | 7105 | | |
| % saved | | 2.25 | 44.38 | 40.27 | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Slim Orca (bsz=1) | 1x | 0.97x | 5.54x | 4.68x | 6.88x | 19.46x |
| code | Code | Code | Code | | | |
| seconds | 4042 | 4158 | 729 | 863 | 588 | 208 |
| memory MB | 11010 | 11042 | 6492 | 7410 | | |
| % saved | | -0.29 | 41.04 | 32.70 | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| OASST (bsz=2) | OOM ❌ | OOM ❌ | | | | |
| code | Code | Code | Code | | | |
| seconds | OOM | OOM | 2719 | 3391 | 2794 | 987 |
| memory MB | OOM | OOM | 8134 | 9600 | | |
| % saved | OOM | OOM | | | | |

| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Slim Orca (bsz=2) | OOM ❌ | OOM ❌ | | | | |
| code | Code | Code | Code | | | |
| seconds | OOM | OOM | 2990 | 3444 | 2351 | 831 |
| memory MB | OOM | OOM | 7594 | 8881 | | |
| % saved | OOM | OOM | | | | |