# ML Research Notes

Working notes and explorations in machine learning.
## Transformer Architecture
### Self-Attention Mechanism
The core of transformer models is the self-attention mechanism:
```
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
```

where:

- Q (Query), K (Key), V (Value) are learned projections
- d_k is the dimension of the key vectors
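The formula above can be sketched directly with NumPy; this is a minimal single-head version with toy shapes, without masking or batching:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = attention(Q, K, V)  # shape (3, 4); each row is a weighted mix of V's rows
```

The `sqrt(d_k)` scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.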
### Key Observations
- Multi-head attention allows the model to attend to different representation subspaces
- Positional encoding is critical for sequence understanding
- Layer normalization placement (pre-norm vs post-norm) significantly affects training stability
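The pre-norm vs post-norm distinction from the last bullet comes down to where the normalization sits relative to the residual connection. A minimal sketch (the `tanh` sublayer is a stand-in for attention or the FFN, and this layer norm omits the learned scale/shift):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean, unit variance (no learned gain/bias here)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer ordering: normalize AFTER the residual add
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm ordering: normalize the sublayer input; the residual path stays untouched
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(1).standard_normal((3, 8))
ff = lambda h: np.tanh(h)  # placeholder sublayer
y_post = post_norm_block(x, ff)
y_pre = pre_norm_block(x, ff)
```

Because pre-norm leaves the residual stream unnormalized, gradients flow through an identity path across depth, which is one common explanation for its more stable training in deep stacks.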
## Fine-Tuning Strategies
### LoRA (Low-Rank Adaptation)
```python
# LoRA reduces trainable parameters dramatically
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # which projections to adapt
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, config)
```

### Results Comparison
| Method | Trainable Parameters | Accuracy | Training Time |
|---|---|---|---|
| Full Fine-tune | 7B | 94.2% | 48h |
| LoRA (r=16) | 4.2M | 93.8% | 6h |
| LoRA (r=8) | 2.1M | 93.1% | 4h |
LoRA achieves near full fine-tuning performance with a fraction of the trainable parameters — making it ideal for resource-constrained environments.
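The parameter counts in the table scale linearly with the rank r, since LoRA replaces the weight update ΔW with a product B·A (B is d_out×r, A is r×d_in). A quick sanity check, using hypothetical 7B-class dimensions (hidden size 4096, 32 layers, q_proj and v_proj adapted as in the config above; exact totals depend on the actual model):

```python
def lora_params(d_in, d_out, r):
    # Trainable params for one adapted matrix: B (d_out x r) plus A (r x d_in)
    return d_out * r + r * d_in

# Hypothetical 7B-class dims: hidden size 4096, 32 layers,
# two adapted (4096 x 4096) projections per layer
d, layers, adapted_per_layer = 4096, 32, 2
total_r16 = layers * adapted_per_layer * lora_params(d, d, r=16)
total_r8 = layers * adapted_per_layer * lora_params(d, d, r=8)
```

Halving r halves the trainable-parameter count, matching the 2:1 ratio between the r=16 and r=8 rows; either way the total is millions, not billions.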
## Reading List
- Attention Is All You Need — Vaswani et al.
- LoRA: Low-Rank Adaptation of Large Language Models — Hu et al.
- Scaling Laws for Neural Language Models — Kaplan et al.