
ML Research Notes

Working notes and explorations in machine learning.

Transformer Architecture

Self-Attention Mechanism

The core of transformer models is the self-attention mechanism:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Where:

  • Q (Query), K (Key), V (Value) are learned projections
  • d_k is the dimension of the key vectors
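The formula above can be sketched directly in NumPy. This is a minimal single-head illustration (the shapes and random inputs are chosen only for demonstration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 positions, d_k = 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note the scaling by sqrt(d_k): without it, dot products grow with dimension and push the softmax into regions with vanishing gradients.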

Key Observations

  1. Multi-head attention allows the model to attend to different representation subspaces
  2. Positional encoding is critical for sequence understanding
  3. Layer normalization placement (pre-norm vs post-norm) significantly affects training stability
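Observation 3 comes down to where the normalization sits relative to the residual connection. A minimal sketch of the two orderings, with `sublayer` standing in for attention or the feed-forward network (names and sizes here are illustrative, not from any particular library):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer ordering: normalize after the residual add
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm ordering: normalize the sublayer input; the residual
    # path stays an identity, which tends to stabilize deep stacks
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16))
W = rng.standard_normal((16, 16))
sublayer = lambda h: h @ W  # stand-in for attention / FFN

y_post = post_norm_block(x, sublayer)
y_pre = pre_norm_block(x, sublayer)
```

In the pre-norm variant the identity path from input to output is never normalized away, which is one common explanation for its more stable training at depth.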

Fine-Tuning Strategies

LoRA (Low-Rank Adaptation)

```python
# LoRA reduces trainable parameters dramatically
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, config)
```

Results Comparison

| Method         | Parameters | Accuracy | Training Time |
|----------------|------------|----------|---------------|
| Full Fine-tune | 7B         | 94.2%    | 48h           |
| LoRA (r=16)    | 4.2M       | 93.8%    | 6h            |
| LoRA (r=8)     | 2.1M       | 93.1%    | 4h            |

LoRA achieves accuracy close to full fine-tuning while training only a small fraction of the parameters, making it well suited to resource-constrained environments.
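The parameter savings follow directly from the low-rank factorization: instead of updating a full d × d weight matrix, LoRA trains two thin matrices B (d × r) and A (r × d). A back-of-the-envelope check (d = 4096 is a hypothetical hidden size chosen for illustration):

```python
# Hypothetical sizes for illustration: one d x d projection matrix
d, r = 4096, 16

full_params = d * d          # full fine-tuning updates all d^2 weights
lora_params = d * r + r * d  # LoRA trains only B (d x r) and A (r x d)

print(full_params)                 # 16777216
print(lora_params)                 # 131072
print(full_params // lora_params)  # 128x fewer trainable weights
```

The ratio is d / (2r), so savings grow with model width; the figures in the table above come from applying adapters like this to only the q_proj and v_proj matrices of each layer.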

Reading List

  • Attention Is All You Need — Vaswani et al.
  • LoRA: Low-Rank Adaptation of Large Language Models — Hu et al.
  • Scaling Laws for Neural Language Models — Kaplan et al.