We are planning to upgrade HDFS from 2.6 to 2.8.5 (CDH -> Bigtop 1.4) and then to 2.10 (Bigtop 1.4 -> 1.5). To avoid bitter surprises after the upgrade (for example, data completely lost due to file system corruption), we must back up all data that cannot be recovered from other sources.
This task is about listing which data sources need to be backed up and their estimated total size (to inform how big the Hadoop backup cluster should be).
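One way to collect these sizes is to call `hdfs dfs -du -s` for each folder and convert the byte count to TB. The sketch below assumes the `hdfs` client is on the PATH and that the first output column is the logical (pre-replication) size in bytes. The helper names `parse_du_line`, `hdfs_size_bytes`, and `to_tb` are hypothetical, and the parsing assumes paths without spaces.

```python
import subprocess


def parse_du_line(line):
    """Parse one line of `hdfs dfs -du -s` output into (size_bytes, path).

    The first field is taken as the size in bytes and the last field as the
    path. Some Hadoop versions print an extra column (space consumed across
    replicas) in between, which is ignored here.
    """
    fields = line.split()
    return int(fields[0]), fields[-1]


def hdfs_size_bytes(path):
    """Return the size in bytes that `hdfs dfs -du -s` reports for path."""
    result = subprocess.run(
        ["hdfs", "dfs", "-du", "-s", path],
        check=True,
        capture_output=True,
        text=True,
    )
    size, _ = parse_du_line(result.stdout.strip().splitlines()[-1])
    return size


def to_tb(size_bytes):
    """Convert bytes to TB, using binary units (1 TB = 1024**4 bytes)."""
    return size_bytes / 1024 ** 4
```

For example, `to_tb(hdfs_size_bytes("/wmf/data/wmf/aqs"))` would give the size of the aqs folder in TB when run on a host with HDFS access.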
Some data sizes:
Dataset | HDFS folder | Data size | Data size in TB |
aqs | /wmf/data/wmf/aqs | 296 GB | 0.3 TB |
banner_activity (no data, empty _SUCCESS files only) | /wmf/data/wmf/banner_activity | 0 MB | 0 TB |
browser-general | /wmf/data/wmf/browser | 5 MB | 0 TB |
data_quality_stats | /wmf/data/wmf/data_quality_stats | 1.5 MB | 0 TB |
edit_hourly (1 snapshot) | /wmf/data/wmf/edit/hourly/snapshot=XXXX | 5 GB | 0.005 TB |
interlanguage | /wmf/data/wmf/interlanguage | 158 MB | 0 TB |
mediacounts | /wmf/data/wmf/mediacounts | 14.3 TB | 14.3 TB |
mediarequests | /wmf/data/wmf/mediarequests | 8.6 TB | 8.6 TB |
mediawiki-history (1 snapshot) | /wmf/data/wmf/mediawiki/history/snapshot=XXX | 1.2 TB | 1.2 TB |
mediawiki-user-history (1 snapshot) | /wmf/data/wmf/mediawiki/page_history/snapshot=XXX | 22 GB | 0.022 TB |
mediawiki-page-history (1 snapshot) | /wmf/data/wmf/mediawiki/user_history/snapshot=XXX | 81 GB | 0.08 TB |
Geoeditors (details TBD) | /wmf/data/wmf/mediawiki_private | 667 MB | 0 TB |
mobile-apps sessions | /wmf/data/wmf/mobile_apps | 223 KB | 0 TB |
netflow | /wmf/data/wmf/netflow | 1.9 TB | 1.9 TB |
pageview (TBD: Historical) | /wmf/data/wmf/pageview/hourly | 22.2 TB | 22.2 TB |
projectview | /wmf/data/wmf/projectview | 8.1 GB | 0.008 TB |
unique-devices | /wmf/data/wmf/unique_devices | 706 MB | 0.001 TB |
virtualpageview | /wmf/data/wmf/virtualpageview | 2.1 TB | 2.1 TB |
Wikidata-entity (1 snapshot) | /wmf/data/wmf/wikidata/entity/snapshot=XXX | 108.9 GB | 0.11 TB |
Wikidata-item-page-link (1 snapshot) | /wmf/data/wmf/wikidata/item_page_link/snapshot=XXX | 4.5 GB | 0.005 TB |
Total size: about 51 TB (the listed sizes sum to roughly 50.8 TB)
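The total can be turned into a rough raw-capacity figure for the backup cluster. The sketch below sums the TB column of the table and multiplies by a replication factor. It assumes the listed sizes are logical (pre-replication) sizes, that the backup cluster uses the HDFS default replication factor of 3, and that 20% headroom is wanted. All three are assumptions to adjust, and the names `SIZES_TB`, `total_tb`, and `raw_capacity_tb` are hypothetical.

```python
# Sizes in TB, copied from the "Data size in TB" column of the table.
SIZES_TB = {
    "aqs": 0.3,
    "banner_activity": 0.0,
    "browser-general": 0.0,
    "data_quality_stats": 0.0,
    "edit_hourly": 0.005,
    "interlanguage": 0.0,
    "mediacounts": 14.3,
    "mediarequests": 8.6,
    "mediawiki-history": 1.2,
    "mediawiki-user-history": 0.022,
    "mediawiki-page-history": 0.08,
    "geoeditors": 0.0,
    "mobile-apps sessions": 0.0,
    "netflow": 1.9,
    "pageview": 22.2,
    "projectview": 0.008,
    "unique-devices": 0.001,
    "virtualpageview": 2.1,
    "wikidata-entity": 0.11,
    "wikidata-item-page-link": 0.005,
}


def total_tb(sizes):
    """Sum the per-dataset sizes (in TB)."""
    return sum(sizes.values())


def raw_capacity_tb(logical_tb, replication=3, headroom=0.2):
    """Estimate raw disk needed: logical size x replication, plus headroom.

    replication=3 is the HDFS default; headroom=0.2 is an assumed safety
    margin, not a figure from this task.
    """
    return logical_tb * replication * (1 + headroom)


if __name__ == "__main__":
    logical = total_tb(SIZES_TB)
    print(f"logical size: {logical:.1f} TB")
    print(f"raw capacity estimate: {raw_capacity_tb(logical):.1f} TB")
```

Lowering the replication factor on the backup cluster would shrink the raw estimate proportionally, at the cost of less redundancy for the backup copy.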