WIP
Activation Tracking:
Weight Update Tracking:
- Kaplan vs Chinchilla
- SP vs muP vs layerwise SP
- AdamW weight decay
- infinite lr scheduler
- batch_size vs lr (sqrt BS law)
- agd-muP (spectral initialization vs classic muP)
- adam-atan2
- data-dependent lr transfer
- embedding lr transfer
- u-muP
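
As a placeholder for the "batch_size vs lr (sqrt BS law)" item above, here is a minimal sketch of the square-root scaling rule, under which the learning rate grows with the square root of the batch-size ratio. The function name `sqrt_scaled_lr` and its parameters are hypothetical, chosen only for illustration; they are not taken from the experiments listed here.

```python
import math


def sqrt_scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Scale a learning rate by sqrt(batch_size / base_batch_size).

    Assumption: base_lr was tuned at base_batch_size, and the optimizer
    follows the square-root rule (often used with Adam-style optimizers).
    """
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * math.sqrt(batch_size / base_batch_size)


# Example: a 4x larger batch gives a 2x larger learning rate under this rule.
lr = sqrt_scaled_lr(base_lr=1e-3, base_batch_size=256, batch_size=1024)
```

Whether this rule actually holds for a given setup is exactly what the experiment is meant to check, so treat the sketch as a baseline to compare against rather than a recommendation.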
Thanks to Fal.ai for providing compute to run these experiments.