Fine-Tuning LLMs: Parameter-Efficient Techniques (LoRA, QLoRA, PEFT)

Simor Consulting | 03 Oct 2025 | 5 min read

Fine-tuning a 70B parameter model costs $50K+ and requires weeks of training on expensive hardware. This is the reality for teams building domain-specific language models. Traditional full-parameter fine-tuning is often impractical: the GPU requirements alone (2TB+ of memory for gradients, optimizer states, and activations) exceed most teams’ budgets. The choice seems to be between generic models that lack domain knowledge and expensive custom models that drain the budget.

Parameter-efficient fine-tuning (PEFT) solves this by modifying only a small subset of parameters while keeping the base model frozen.

The Problem with Full Fine-Tuning

Fine-tuning a 70B parameter model requires:

  • 8x A100 80GB GPUs minimum ($15,000/month each on cloud)
  • Weeks of training time
  • Massive storage for checkpoints
  • Engineering effort for distributed training

Memory breakdown for a single fine-tuning run:

# Traditional fine-tuning memory calculation
model_params = 70_000_000_000  # 70B parameters
bytes_per_param = 4  # FP32
model_memory = model_params * bytes_per_param / 1e9  # 280 GB

# Training memory requirements:
# - Gradients: 280 GB
# - Optimizer states (Adam): 560 GB
# - Activations: ~1 TB for reasonable batch size
# Total: ~2 TB of GPU memory needed

Deployment is equally painful: each fine-tuned model is a full copy (280GB), requiring expensive inference infrastructure.
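To see how much LoRA shrinks the deployment footprint, here is a back-of-the-envelope sketch with assumed (but typical) 70B-class dimensions: 80 layers, hidden size 8192, and rank-16 adapters on the four attention projections.

```python
# Storage comparison: LoRA adapter deltas vs. a full fine-tuned copy.
num_layers, targets_per_layer = 80, 4
hidden_size, rank = 8192, 16

# Each adapted projection stores A (rank x hidden) and B (hidden x rank)
adapter_params = num_layers * targets_per_layer * 2 * rank * hidden_size
adapter_mb = adapter_params * 2 / 1e6    # FP16 adapter weights, 2 bytes each

full_copy_gb = 70e9 * 4 / 1e9            # FP32 full copy: the 280 GB above

print(f"Adapter: ~{adapter_mb:.0f} MB vs. full copy: {full_copy_gb:.0f} GB")
```

A few hundred megabytes per task instead of hundreds of gigabytes: dozens of adapters can share one base model on the same inference hardware.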

LoRA: Low-Rank Adaptation

LoRA is based on the insight that weight updates during fine-tuning have low intrinsic rank. Instead of updating a weight matrix W directly, LoRA decomposes the update into two smaller matrices.

# Traditional fine-tuning
W_new = W_pretrained + ΔW  # ΔW is full rank (huge)

# LoRA approach
W_new = W_pretrained + BA  # B and A are low rank (tiny)
# Where B ∈ R^(d×r) and A ∈ R^(r×k) with r << min(d,k)

Implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank

        # Frozen pretrained weights (not trained)
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False

        # LoRA matrices (trained)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        # Initialize A with Gaussian, B with zeros
        nn.init.normal_(self.lora_A, std=1/rank)

    def forward(self, x):
        # Original computation
        original = F.linear(x, self.weight)

        # LoRA adaptation
        lora = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

        return original + lora
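The savings are easy to verify for a single projection. Using assumed shapes d = k = 4096 and rank r = 16, compare a dense update ΔW against the LoRA factors A and B:

```python
# Parameter count for one adapted projection: dense update vs. LoRA factors.
d, k, r = 4096, 4096, 16

full_update_params = d * k          # dense ΔW
lora_params = r * k + d * r         # A (r x k) + B (d x r)

print(full_update_params)                          # 16777216
print(lora_params)                                 # 131072
print(f"{lora_params / full_update_params:.2%}")   # 0.78%
```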

Applying LoRA to transformer attention layers:

class LoRATransformer:
    def __init__(self, base_model, lora_config):
        self.base_model = base_model
        self.lora_config = lora_config
        self.lora_modules = {}

        # Freeze base model
        for param in self.base_model.parameters():
            param.requires_grad = False

        # Add LoRA to attention layers
        self.inject_lora()

    def inject_lora(self):
        for name, module in self.base_model.named_modules():
            if isinstance(module, nn.Linear) and any(
                key in name for key in ['q_proj', 'v_proj', 'k_proj', 'o_proj']
            ):
                # Replace with LoRA-enhanced version
                lora_module = LoRALayer(
                    module.in_features,
                    module.out_features,
                    rank=self.lora_config.rank,
                    alpha=self.lora_config.alpha
                )

                # Copy pretrained weights
                lora_module.weight.data = module.weight.data.clone()

                # Store reference
                self.lora_modules[name] = lora_module

                # Replace in model
                parent = self.get_parent_module(name)
                setattr(parent, name.split('.')[-1], lora_module)

    def get_parent_module(self, name):
        # Walk the dotted module path to find the direct parent module
        parent = self.base_model
        for part in name.split('.')[:-1]:
            parent = getattr(parent, part)
        return parent

Results comparison:


  • Parameter Reduction: From 70B to 40M trainable (99.94% reduction)
  • Memory Savings: From 2TB to 80GB (96% reduction)
  • Cost Reduction: From $50K to $2K per run (96% reduction)
  • Training Speed: From weeks to days (5-10x faster)
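One configuration that lands near the ~40M trainable-parameter figure, with all values assumed (80 layers, hidden size 8192, rank-8 LoRA on the four attention projections; exact counts depend on the architecture):

```python
# Reproducing the headline trainable-parameter figure from assumed dims.
layers, hidden, rank, targets = 80, 8192, 8, 4

trainable = layers * targets * 2 * rank * hidden   # A and B per projection
print(trainable)                      # 41943040 -- roughly 40M

reduction = 1 - trainable / 70e9
print(f"{reduction:.2%}")             # 99.94%
```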

QLoRA: Quantized LoRA

QLoRA combines LoRA with quantization to enable fine-tuning on consumer GPUs. Key innovations:

  1. 4-bit NormalFloat Quantization: Information-theoretically optimal data type
  2. Double Quantization: Quantizing the quantization constants
  3. Paged Optimizers: Managing memory spikes during training
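The memory arithmetic behind these innovations is straightforward (weights only; activations and LoRA gradients add more). The block sizes below are the ones reported in the QLoRA paper:

```python
# 4-bit NF4 stores each weight in half a byte.
params = 70e9

fp16_gb = params * 2 / 1e9        # 140 GB in FP16
nf4_gb = params * 0.5 / 1e9       # 35 GB in 4-bit NF4

# Quantization constants: one FP32 scale per 64-weight block costs
# 0.5 bits/param; double-quantizing the scales (8-bit, blocks of 256)
# cuts that overhead to about 0.127 bits/param.
block = 64
plain_overhead = 32 / block                       # 0.5 bits per parameter
dq_overhead = 8 / block + 32 / (block * 256)      # ~0.127 bits per parameter

print(fp16_gb, nf4_gb)        # 140.0 35.0
print(round(dq_overhead, 3))  # 0.127
```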

Implementation:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import (LoraConfig, TaskType, get_peft_model,
                  prepare_model_for_kbit_training)

class QLoRAModel:
    def __init__(self, model_name):
        # BitsAndBytes configuration for 4-bit loading
        self.bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True
        )

        # Load model in 4-bit
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=self.bnb_config,
            device_map="auto",
            trust_remote_code=True
        )

        # Prepare for LoRA
        self.model = prepare_model_for_kbit_training(self.model)

    def add_qlora_adapters(self, lora_config):
        # Configure LoRA
        peft_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,
            inference_mode=False,
            r=lora_config.rank,
            lora_alpha=lora_config.alpha,
            lora_dropout=lora_config.dropout,
            target_modules=[
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"
            ]
        )

        # Add adapters
        self.model = get_peft_model(self.model, peft_config)

        # Enable gradient checkpointing
        self.model.enable_input_require_grads()
        self.model.gradient_checkpointing_enable()

        return self.model

Memory-efficient training configuration:

from transformers import (Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)

class QLoRATrainer:
    def __init__(self, model, tokenizer, dataset):
        self.model = model
        self.tokenizer = tokenizer
        self.dataset = dataset

        # Optimized training arguments
        self.training_args = TrainingArguments(
            output_dir="./qlora-legal",
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            warmup_steps=100,
            max_steps=10000,
            learning_rate=2e-4,
            bf16=True,  # match the 4-bit compute dtype (bfloat16) above
            logging_steps=25,
            optim="paged_adamw_32bit",  # Paged optimizer
            gradient_checkpointing=True,  # Save memory
            report_to="wandb"
        )

    def train(self):
        # Custom data collator for padding
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=self.tokenizer,
            mlm=False,
            pad_to_multiple_of=8  # Efficient for GPU
        )

        # Initialize trainer
        trainer = Trainer(
            model=self.model,
            train_dataset=self.dataset,
            args=self.training_args,
            data_collator=data_collator,
            callbacks=[SavePeftModelCallback]  # Save only adapters
        )

        # Mixed precision is handled by the Trainer via bf16=True
        trainer.train()

QLoRA results:

  • Fine-tuned 70B parameter models on a single A100 40GB GPU
  • Reduced memory usage to 35GB during training
  • Maintained 97% of full fine-tuning performance
  • Enabled experimentation on consumer hardware

Other PEFT Techniques

Prefix Tuning

Instead of modifying weights, prefix tuning adds trainable tokens to the input:

class PrefixTuningModel(nn.Module):
    def __init__(self, base_model, prefix_length=20):
        super().__init__()
        self.base_model = base_model
        self.prefix_length = prefix_length

        # Trainable prefix embeddings
        self.prefix_embeddings = nn.Parameter(
            torch.randn(prefix_length, base_model.config.hidden_size)
        )

        # Freeze base model
        for param in self.base_model.parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask=None):
        batch_size = input_ids.shape[0]

        # Expand prefix for batch
        prefix = self.prefix_embeddings.unsqueeze(0).expand(
            batch_size, -1, -1
        )

        # Concatenate prefix with input embeddings
        inputs_embeds = self.base_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([prefix, inputs_embeds], dim=1)

        # Adjust attention mask
        if attention_mask is not None:
            prefix_mask = torch.ones(
                batch_size, self.prefix_length,
                device=attention_mask.device
            )
            attention_mask = torch.cat([prefix_mask, attention_mask], dim=1)

        # Forward through model
        outputs = self.base_model(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask
        )

        return outputs

Adapter Layers

Small neural networks inserted between transformer layers:

class AdapterLayer(nn.Module):
    def __init__(self, hidden_size, adapter_size=64):
        super().__init__()
        self.down_project = nn.Linear(hidden_size, adapter_size)
        self.activation = nn.ReLU()
        self.up_project = nn.Linear(adapter_size, hidden_size)

        # Initialize with near-identity
        nn.init.normal_(self.down_project.weight, std=1e-3)
        nn.init.zeros_(self.down_project.bias)
        nn.init.normal_(self.up_project.weight, std=1e-3)
        nn.init.zeros_(self.up_project.bias)

    def forward(self, x):
        residual = x
        x = self.down_project(x)
        x = self.activation(x)
        x = self.up_project(x)
        return x + residual  # Residual connection
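A quick sanity check of the near-identity initialization: with tiny projection weights and zero biases, the adapter's output barely deviates from its input at step zero, so inserting adapters leaves the pretrained model's behavior intact before training. (Dimensions below are assumed for illustration.)

```python
import torch
import torch.nn as nn

# Build the same down/up projections with the near-identity init above.
hidden, bottleneck = 768, 64
down, up = nn.Linear(hidden, bottleneck), nn.Linear(bottleneck, hidden)
nn.init.normal_(down.weight, std=1e-3)
nn.init.zeros_(down.bias)
nn.init.normal_(up.weight, std=1e-3)
nn.init.zeros_(up.bias)

x = torch.randn(4, hidden)
out = x + up(torch.relu(down(x)))   # adapter forward with residual

print(torch.allclose(out, x, atol=1e-2))  # True
```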

IA³ (Infused Adapter by Inhibiting and Amplifying)

Multiplicative adaptation instead of additive:

class IA3Layer(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # Learnable scaling vectors
        self.scaling = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x, weight):
        # IA³ rescales the projection's output activations element-wise
        # (e.g. keys and values), rather than adding a low-rank update
        return F.linear(x, weight) * self.scaling
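IA³'s footprint is even smaller than LoRA's: one learned scaling vector per adapted projection versus two low-rank matrices. With an assumed hidden size of 4096 and rank 16:

```python
# Parameter footprint: IA3 scaling vector vs. LoRA factors per projection.
hidden, rank = 4096, 16

ia3_params = hidden               # one scaling vector
lora_params = 2 * rank * hidden   # A and B factors

print(ia3_params, lora_params)    # 4096 131072
print(lora_params // ia3_params)  # 32 -- 32x fewer parameters for IA3
```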

Practical Patterns

Multi-Task Learning with LoRA

class MultiTaskLoRA:
    def __init__(self, base_model):
        self.base_model = base_model
        self.lora_modules = {}

    def add_task(self, task_name, lora_config):
        """Add a new task-specific LoRA module (LoRAModule stands in for
        a LoRA block such as the LoRALayer defined earlier)"""
        lora_module = LoRAModule(lora_config)
        self.lora_modules[task_name] = lora_module

    def forward(self, x, task_name):
        """Forward pass with task-specific adaptation"""
        # Get base model output
        base_output = self.base_model(x)

        # Apply task-specific LoRA
        if task_name in self.lora_modules:
            adapted_output = self.lora_modules[task_name](base_output)
            return adapted_output

        return base_output

    def merge_tasks(self, task_weights):
        """Merge multiple LoRA modules with weights"""
        merged_lora = {}

        for task, weight in task_weights.items():
            if task in self.lora_modules:
                for param_name, param in self.lora_modules[task].named_parameters():
                    if param_name not in merged_lora:
                        merged_lora[param_name] = weight * param
                    else:
                        merged_lora[param_name] += weight * param

        return merged_lora

Hyperparameter Search

Sweeping rank, alpha, dropout, and target modules to balance quality against trainable-parameter count:

class LoRAHyperparameterSearch:
    def __init__(self, base_model, train_dataset, eval_dataset):
        self.base_model = base_model
        self.train_dataset = train_dataset
        self.eval_dataset = eval_dataset

    def search(self):
        search_space = {
            'rank': [4, 8, 16, 32, 64],
            'alpha': [16, 32, 64, 128],
            'dropout': [0.0, 0.05, 0.1],
            'target_modules': [
                ['q_proj', 'v_proj'],
                ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
                ['all_linear']
            ]
        }

        results = []

        for rank in search_space['rank']:
            for alpha in search_space['alpha']:
                for dropout in search_space['dropout']:
                    for modules in search_space['target_modules']:
                        config = LoraConfig(
                            r=rank,
                            lora_alpha=alpha,
                            lora_dropout=dropout,
                            target_modules=modules
                        )

                        # Train and evaluate
                        score = self.train_and_evaluate(config)

                        results.append({
                            'config': config,
                            'score': score,
                            'params': rank * len(modules) * 2048
                        })

        return self.get_pareto_optimal(results)
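The search ends by filtering to the Pareto frontier. `get_pareto_optimal` is not shown above, so here is one possible implementation, assuming a higher score and fewer trainable parameters are both better:

```python
def get_pareto_optimal(results):
    """Keep configs no other config beats on both score and params."""
    frontier = []
    for r in results:
        dominated = any(
            other['score'] >= r['score'] and other['params'] <= r['params']
            and (other['score'] > r['score'] or other['params'] < r['params'])
            for other in results
        )
        if not dominated:
            frontier.append(r)
    return frontier

# Example with illustrative scores: r8 dominates r16 (better score, fewer
# params), so only r8 and r32 survive.
results = [
    {'config': 'r8',  'score': 0.90, 'params': 32_768},
    {'config': 'r16', 'score': 0.89, 'params': 65_536},
    {'config': 'r32', 'score': 0.93, 'params': 131_072},
]
print([r['config'] for r in get_pareto_optimal(results)])  # ['r8', 'r32']
```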

Production Deployment

import torch
from cachetools import LRUCache
from transformers import AutoModelForCausalLM

class LoRAModelServer:
    def __init__(self, base_model_path):
        # Load base model once
        self.base_model = AutoModelForCausalLM.from_pretrained(
            base_model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )

        # Cache for loaded adapters
        self.adapter_cache = LRUCache(maxsize=10)

    def load_adapter(self, adapter_path):
        """Load and cache LoRA adapter"""
        if adapter_path in self.adapter_cache:
            return self.adapter_cache[adapter_path]

        # Load adapter weights (LoRAAdapter is a simplified stand-in for an
        # adapter-loading utility such as peft.PeftModel.from_pretrained)
        adapter = LoRAAdapter.from_pretrained(adapter_path)
        self.adapter_cache[adapter_path] = adapter

        return adapter

    def inference(self, text, adapter_path):
        """Run inference with specific adapter"""
        # Load adapter
        adapter = self.load_adapter(adapter_path)

        # Apply adapter to base model
        with adapter.apply_to(self.base_model):
            outputs = self.base_model.generate(
                text,
                max_length=512,
                temperature=0.7
            )

        return outputs

Technique Selection

def select_peft_method(requirements):
    """Guide for selecting PEFT technique"""

    if requirements['model_size'] > 30e9 and requirements['gpu_memory'] < 48:
        return "QLoRA"  # Extreme memory constraints

    elif requirements['latency_critical']:
        return "IA3"  # Minimal inference overhead

    elif requirements['multi_task']:
        return "LoRA"  # Easy task switching

    elif requirements['few_shot_learning']:
        return "Prefix-Tuning"  # Good for prompting

    else:
        return "LoRA"  # Good default choice

Decision Rules

Adopt parameter-efficient fine-tuning when:

  • Full fine-tuning exceeds your GPU budget
  • Multiple specialized models are needed
  • Fast iteration is required
  • Model versioning for different tasks is needed

Stick with full fine-tuning when:

  • PEFT performance is insufficient for your use case
  • You have dedicated GPU clusters and budget
  • Maximum model performance is critical

Key principles:

  • Start with LoRA: mature, well-supported, effective
  • Use QLoRA for extreme memory constraints
  • Data quality matters more with fewer parameters to tune: 1,000 expert-curated examples beat 100,000 noisy ones
  • Low costs enable experimentation: use this advantage

Ready to Implement These AI Data Engineering Solutions?

Get a comprehensive AI Readiness Assessment to determine the best approach for your organization's data infrastructure and AI implementation needs.
