
ML Research Notes

Working notes and explorations in machine learning.

Transformer Architecture

Self-Attention Mechanism

The core of transformer models is the self-attention mechanism:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Where:

  • Q (Query), K (Key), V (Value) are learned projections
  • d_k is the dimension of the key vectors
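The formula above can be sketched directly in NumPy. This is a minimal single-head illustration (the shapes and random inputs are chosen only for demonstration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 positions, d_k = 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note the scaling by sqrt(d_k): without it, dot products grow with dimension and push the softmax into regions with vanishing gradients.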

Key Observations

  1. Multi-head attention allows the model to attend to different representation subspaces
  2. Positional encoding is critical for sequence understanding
  3. Layer normalization placement (pre-norm vs post-norm) significantly affects training stability
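Observation 3 comes down to where the normalization sits relative to the residual connection. A minimal sketch of the two orderings, with `sublayer` standing in for attention or the feed-forward network (names and sizes here are illustrative, not from any particular library):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer ordering: normalize after the residual add
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm ordering: normalize the sublayer input; the residual
    # path stays an identity, which tends to stabilize deep stacks
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16))
W = rng.standard_normal((16, 16))
sublayer = lambda h: h @ W  # stand-in for attention / FFN

y_post = post_norm_block(x, sublayer)
y_pre = pre_norm_block(x, sublayer)
```

In the pre-norm variant the identity path from input to output is never normalized away, which is one common explanation for its more stable training at depth.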

Fine-Tuning Strategies

LoRA (Low-Rank Adaptation)

```python
# LoRA reduces trainable parameters dramatically
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, config)
```

Results Comparison

| Method         | Parameters | Accuracy | Training Time |
|----------------|------------|----------|---------------|
| Full Fine-tune | 7B         | 94.2%    | 48h           |
| LoRA (r=16)    | 4.2M       | 93.8%    | 6h            |
| LoRA (r=8)     | 2.1M       | 93.1%    | 4h            |

LoRA achieves accuracy close to full fine-tuning while training only a small fraction of the parameters, making it well suited to resource-constrained environments.
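The parameter savings follow directly from the low-rank factorization: instead of updating a full d × d weight matrix, LoRA trains two thin matrices B (d × r) and A (r × d). A back-of-the-envelope check (d = 4096 is a hypothetical hidden size chosen for illustration):

```python
# Hypothetical sizes for illustration: one d x d projection matrix
d, r = 4096, 16

full_params = d * d          # full fine-tuning updates all d^2 weights
lora_params = d * r + r * d  # LoRA trains only B (d x r) and A (r x d)

print(full_params)                 # 16777216
print(lora_params)                 # 131072
print(full_params // lora_params)  # 128x fewer trainable weights
```

The ratio is d / (2r), so savings grow with model width; the figures in the table above come from applying adapters like this to only the q_proj and v_proj matrices of each layer.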

Reading List

  • Attention Is All You Need — Vaswani et al.
  • LoRA: Low-Rank Adaptation of Large Language Models — Hu et al.
  • Scaling Laws for Neural Language Models — Kaplan et al.