In the context of T377941, and while identifying friction points in ML-Lab usage, it became clear that we need to address a few issues:
- Disk space on the lab machines is an issue: models tend to be multi-GB binaries, and the ROCm driver libraries that PyTorch bundles alone push a venv to 16+ GB unless we address this
- Getting data to/from the lab machines, especially datasets that come from statboxes or HDFS, is very difficult and may accidentally expose PII (e.g. when copying via untrusted hosts)
- Sharing the HuggingFace cache between users would be desirable, for reasons similar to the disk-space point above
To address these points, the plan of action is as follows:
- Move one lab machine (ml-lab1002) into the analytics subnet and reinstall it.
- Change the lab-machine Puppet role (for that machine) to install Kerberos, the HDFS clients, etc. as needed, extracting this functionality from the statbox Puppet role (a sketch of the resulting direct HDFS access follows after this list)
- Once this is confirmed to work for normal users, switch /home to a CephFS volume (also see T378735), check that Puppet can still populate it as usual, and set up quota tracking (see the quota sketch after this list)
- Reinstall the machine from scratch to ensure reproducibility
- Migrate ml-lab1001 to the analytics subnet and reinstall it with the newly created role above.
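As a rough illustration of what the Kerberos/HDFS client setup above should enable: datasets could be read directly from HDFS on the lab machine instead of being copied via intermediate hosts. This is only a sketch; it assumes pyarrow built with libhdfs is available and a valid Kerberos ticket exists (kinit), and the dataset path below is purely hypothetical.

```python
# Sketch: read a dataset straight from HDFS on the lab machine, avoiding
# copies via untrusted intermediate hosts. Assumes a valid Kerberos ticket
# (kinit) and pyarrow with libhdfs; the path below is hypothetical.
import pyarrow.parquet as pq
from pyarrow import fs

# "default" picks up fs.defaultFS from the Hadoop client config that the
# Puppet role would install.
hdfs = fs.HadoopFileSystem("default")

# Hypothetical dataset location; replace with the real HDFS path.
table = pq.read_table("/user/someuser/datasets/example.parquet", filesystem=hdfs)
print(table.num_rows, "rows read directly from HDFS")
```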
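For the quota tracking mentioned above: CephFS exposes per-directory quotas and recursive usage as extended attributes, so a first iteration could simply read those xattrs for each home directory. A minimal sketch, assuming /home is the CephFS-backed mount; the reporting format is illustrative only.

```python
# Sketch: report CephFS quota usage per home directory by reading the
# ceph.quota.max_bytes and ceph.dir.rbytes extended attributes.
# Assumes /home is the CephFS-backed mount; output format is illustrative.
import os

def ceph_xattr(path: str, name: str) -> int:
    """Return a CephFS xattr as an integer, or 0 if it is not set."""
    try:
        return int(os.getxattr(path, name))
    except OSError:
        return 0

for entry in sorted(os.scandir("/home"), key=lambda e: e.name):
    if not entry.is_dir():
        continue
    quota = ceph_xattr(entry.path, "ceph.quota.max_bytes")  # 0 means "no quota set"
    used = ceph_xattr(entry.path, "ceph.dir.rbytes")        # recursive byte count
    pct = f"{used / quota:6.1%}" if quota else "   n/a"
    print(f"{entry.name:20s} {used / 2**30:8.2f} GiB used  {pct} of quota")
```

Setting the quota itself uses the same attribute (setfattr -n ceph.quota.max_bytes on the user's directory), which Puppet could manage per user.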
Most of this work will be done by @klausman, with help from Data Platform SRE (Ben and Brian). It will also serve as a trial balloon for the Bookworm variants of some of the statbox config/roles, since those machines are still on Bullseye.
In parallel, we will still get the storage expansion mentioned in T377941, so we can use the extra space as a local (fast) backing store for the PyTorch-ROCm venv and the shared HuggingFace cache, or for other future use cases (e.g. a fast tmpdir for model training data).
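As a sketch of how the shared cache could be consumed once that storage is in place: the HuggingFace libraries honour the standard cache environment variables (HF_HOME, HF_HUB_CACHE), so pointing users at a shared directory is mostly environment configuration. The path /srv/hf-cache below is a placeholder, not a decision.

```python
# Sketch: point the HuggingFace libraries at a shared, fast local cache
# directory so downloaded models are stored (and reused) once per machine.
# /srv/hf-cache is a placeholder path, not a decision.
import os

os.environ.setdefault("HF_HOME", "/srv/hf-cache")           # top-level cache/config dir
os.environ.setdefault("HF_HUB_CACHE", "/srv/hf-cache/hub")  # model/dataset blobs

# Must be set before the libraries are imported, since they read the
# environment at import time.
from huggingface_hub import snapshot_download

path = snapshot_download("bert-base-uncased")  # lands in the shared cache
print("model cached at", path)
```

In practice, making the cache truly multi-user also needs group-writable permissions (or an appropriate umask) on that directory.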