

Establish what data must be backed up before the HDFS upgrade
Closed, Resolved (Public)

Description

We are planning to upgrade HDFS from 2.6 to 2.8.5 (CDH -> Bigtop 1.4) and then to 2.10 (Bigtop 1.4 -> 1.5), and to avoid bitter surprises after the upgrade (data completely lost due to file system corruption, etc.) we must back up all the data that cannot be recovered from other sources.

This task is about listing which data sources need to be backed up, and their estimated total size (to inform how big the Hadoop backup cluster should be).

Some data sizes:

Dataset | HDFS folder | Data size | Data size in TB
aqs | /wmf/data/wmf/aqs | 296 GB | 0.3 TB
banner_activity (no data, empty _SUCCESS files only) | /wmf/data/wmf/banner_activity | 0 MB | 0 TB
browser-general | /wmf/data/wmf/browser | 5 MB | 0 TB
data_quality_stats | /wmf/data/wmf/data_quality_stats | 1.5 MB | 0 TB
edit_hourly (1 snapshot) | /wmf/data/wmf/edit/hourly/snapshot=XXXX | 5 GB | 0.005 TB
interlanguage | /wmf/data/wmf/interlanguage | 158 MB | 0 TB
mediacounts | /wmf/data/wmf/mediacounts | 14.3 TB | 14.3 TB
mediarequests | /wmf/data/wmf/mediarequests | 8.6 TB | 8.6 TB
mediawiki-history (1 snapshot) | /wmf/data/wmf/mediawiki/history/snapshot=XXX | 1.2 TB | 1.2 TB
mediawiki-user-history (1 snapshot) | /wmf/data/wmf/mediawiki/page_history/snapshot=XXX | 22 GB | 0.022 TB
mediawiki-page-history (1 snapshot) | /wmf/data/wmf/mediawiki/user_history/snapshot=XXX | 81 GB | 0.08 TB
Geoeditors (details TBD) | /wmf/data/wmf/mediawiki_private | 667 MB | 0 TB
mobile-apps sessions | /wmf/data/wmf/mobile_apps | 223 KB | 0 TB
netflow | /wmf/data/wmf/netflow | 1.9 TB | 1.9 TB
pageview (TBD: historical) | /wmf/data/wmf/pageview/hourly | 22.2 TB | 22.2 TB
projectview | /wmf/data/wmf/projectview | 8.1 GB | 0.008 TB
unique-devices | /wmf/data/wmf/unique_devices | 706 MB | 0.001 TB
virtualpageview | /wmf/data/wmf/virtualpageview | 2.1 TB | 2.1 TB
Wikidata-entity (1 snapshot) | /wmf/data/wmf/wikidata/entity/snapshot=XXX | 108.9 GB | 0.11 TB
Wikidata-item-page-link (1 snapshot) | /wmf/data/wmf/wikidata/item_page_link/snapshot=XXX | 4.5 GB | 0.005 TB

Total size in TB: 51+ TB
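For reference, a minimal sketch of how the per-folder numbers above can be rechecked, assuming shell access to a Hadoop client node (the snapshot=XXX placeholders need a real snapshot value; only a few folders are listed here as an example):

    # Print the logical (non-replicated) size of each candidate folder.
    for dir in /wmf/data/wmf/aqs /wmf/data/wmf/mediacounts /wmf/data/wmf/mediarequests \
               /wmf/data/wmf/netflow /wmf/data/wmf/pageview/hourly /wmf/data/wmf/virtualpageview; do
        hdfs dfs -du -s -h "$dir"
    done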

Event Timeline

IMO we should add what is not re-computable. Here is what comes to mind:

  • aqs (stats data of AQS usage)
  • browser-general
  • events (the ones that are not deleted, and 90 days of all the rest?)
  • interlanguage data
  • mediacounts
  • mediarequests
  • mobile-apps sessions
  • mobile-apps uniques
  • pageview
  • unique-devices (all 4 subfolders)
  • virtualpageview

Things I have not listed: reportupdater-generated data (I only went through Oozie-generated data).

Milimetric subscribed.

(We can do a quick checksum to compare the main cluster data against the little backup cluster data.)
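A rough sketch of that quick-checksum idea, assuming both clusters are reachable from one client and use the same block size and checksum settings (otherwise hdfs dfs -checksum values are not comparable across clusters); NAMENODE_PROD / NAMENODE_BACKUP and the dataset path are placeholders:

    # Dump per-file checksums of one dataset on each cluster, then diff the two dumps.
    # One hdfs call per file: slow, but fine for a spot check.
    DATASET=/wmf/data/wmf/mediacounts
    for nn in NAMENODE_PROD NAMENODE_BACKUP; do
        hdfs dfs -ls -R "hdfs://${nn}:8020${DATASET}" \
          | awk '$1 ~ /^-/ {print $NF}' \
          | xargs -n 1 hdfs dfs -checksum \
          | sed "s#hdfs://${nn}:8020##" > "checksums.${nn}.txt"
    done
    diff checksums.NAMENODE_PROD.txt checksums.NAMENODE_BACKUP.txt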

Ping @mforns - Could you please review the reportupdater data and tell us which of it should be backed up (if not already backed up somewhere else)?

Other things not to forget:

  • event data that we don't want to lose (for instance mediawiki events) - @Ottomata do we have an easy way to get this list?
  • druid indexed data, when it has been through sanitization - This actually leads to an interesting question: is druid the last backup of some sanitized data, or do we always have the sanitized events, in which case we don't need the druid data?

@JAllemandou, responding while Marcel is gone. I looked through and found that a few of the reports rely on data that's changing (so if lost, they would be hard/impossible to recompute). Why don't we just back up everything? The RU output is teeny tiny.

Why don't we just back up everything? The RU output is teeny tiny.

Happy with that solution - Let's start documenting sizes in the task body.

After discussion, we chose to back up all data except logs, raw data (unprocessed webrequest, events, and dumps), 2 months of webrequest, and processed wikitext (heavy).
Here is the sizing I came up with (using the useful size, not the replicated one):

  • Total used size:
    hdfs dfs -du -s -h /                                                  -> 670 TB
  • Folders not to back up:
    hdfs dfs -du -s -h /var/log                                           -> 34 TB
    hdfs dfs -du -s -h /wmf/data/raw                                      -> 140 TB
    hdfs dfs -du -s -h /wmf/data/wmf/webrequest/*/year=2020/month=[78]    -> 83 TB
    hdfs dfs -du -s -h /wmf/data/wmf/mediawiki/wikitext/history           -> 53 TB

Totals

670 - (34 + 140 + 83 + 53) = 360 TB

With a replication factor of 2 that is 720 TB of actual disk space, plus some room for the machines to breathe.
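As a quick sanity check of the arithmetic (using the rounded TB figures from this comment, not fresh measurements):

    # To-backup size and on-disk footprint at replication factor 2.
    total=670
    excluded=$((34 + 140 + 83 + 53))     # logs, raw, 2 months of webrequest, wikitext history
    to_backup=$((total - excluded))      # 360 TB of useful data
    echo "to back up: ${to_backup} TB; with replication 2: $((to_backup * 2)) TB"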

@elukey: You told me this morning that we have ~700 TB, and I can probably reclaim a few TB from some user folders by asking gently. Looks like it fits!

edited for clarity

Some notes about the amount of space that we could get with the backup cluster. We have the following hardware to use:

  • 16 nodes to decommission (analytics1042 -> analytics1057). Total space on each node is ~48 TB (some hosts might have less since a disk failed, etc.)
  • 24 new nodes that still need to be racked/configured/etc. Total space on each node is ~48 TB (hopefully with all disks working :D)

We have to keep in mind that a new cluster needs two masters to function properly, so:

  1. we could use the 16 nodes to be decommissioned in a 2+14 fashion, but we'd only get up to 48 TB * 14 = 672 TB of total HDFS space to use (not sufficient)
  2. we could create a couple of beefy VMs and use all 16 nodes, getting up to 48 TB * 16 = 768 TB of total HDFS space to use (sufficient but close to the limit). Creating beefy Ganeti VMs is also not advisable; I'm not really sure how much RAM the namenodes will need, but surely something like 12-16 GB for each VM (bare minimum).
  3. we could decommission the 16 old nodes and directly use the 24 that we are getting for expansion. For example, we could add 4 nodes to the Hadoop cluster to get more space, leaving 20 for the backup cluster. We'd use them in a 2+18 fashion, getting up to 48 TB * 18 = 864 TB of total HDFS space (more than needed, but with extra space in case we want to back up more). After the two HDFS upgrades (since we'll have two versions of Bigtop to test/upgrade to), we could reimage all the backup cluster nodes and add them to the Hadoop cluster.
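A quick check of the capacity figures in the three options, assuming ~48 TB of usable disk per worker and not counting the 2 master nodes:

    # Usable HDFS capacity for 14, 16 and 18 worker nodes at ~48 TB each.
    for workers in 14 16 18; do
        echo "${workers} workers: $((workers * 48)) TB"
    done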

Option 3) is appealing, since we'd also have a smaller cluster to upgrade (fewer nodes == a quicker procedure, etc.). My main concern though is long term, namely when we'll have to do other upgrades with more data (for example, Hadoop 3!). In that case we'll probably not have a backup cluster available, so what do we do then? I haven't heard anybody doing HDFS upgrades mention backups in talks/on the internet/etc., I guess because of this problem. Worth discussing, in my opinion.

Joseph and I had a chat about possible solutions/scenarios and we came up with this:

  • we start decommissioning analytics1042 -> analytics1057; it should take a couple of weeks (one node per day)
  • we work with DC ops to rack/configure the new 24 nodes asap

Whichever of the two gets to a good size first (say 10-12 nodes) becomes the backup cluster, running Bigtop, that we'll use to test and to start copying data from our prod cluster. The idea is to start this copy/test process asap, since there will surely be some things to follow up on.

After discussion, we chose to back up all data except logs, raw data (unprocessed webrequest, events, and dumps), 2 months of webrequest, and processed wikitext (heavy).

You mean "unprocessed events"? Because we need to back up both the sanitized and unsanitized versions.

You mean "unprocessed events"? Because we need to back up both the sanitized and unsanitized versions.

"unprocessed" as in raw-from-camus (the /wmf/data/raw doesn't get backed-up). /wmf/data/event and /wmf/data/event_sanitized are both planed to be backed-up.

OK, after a quick chat with Dan it appears that my way of presenting this might have been confusing. Here is some clarification (hopefully):
We wish to back up everything except:

  • raw data, meaning all of /wmf/data/raw. Data not backed up includes raw data from Camus (webrequest, events, netflow), dumps copied from labstore, and sqooped data.
  • the 2 oldest months of refined webrequest, as in /wmf/data/wmf/webrequest/*/year=2020/month=[78]
  • processed wikitext, as it is huge and can be regenerated (even if that takes long)

@Milimetric - Can you please confirm this is clearer?
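Not part of the decision above, just a hedged sketch of what the copy could look like with distcp once the backup cluster exists, using the -filters option of newer distcp versions to exclude the not-to-backup paths (the filters file name and namenode addresses are hypothetical; -filters expects one regex per line, matched against source paths):

    # Exclusion patterns: raw data, the 2 oldest months of refined webrequest,
    # and processed wikitext history. /var/log sits outside /wmf/data, so it is
    # simply not part of this copy.
    printf '%s\n' \
        '.*/wmf/data/raw.*' \
        '.*/wmf/data/wmf/webrequest/.*/year=2020/month=[78].*' \
        '.*/wmf/data/wmf/mediawiki/wikitext/history.*' > /tmp/backup-exclude.regex

    # Copy the main data tree from the prod cluster to the backup cluster
    # (other top-level directories would be separate distcp runs).
    hadoop distcp -update \
        -filters /tmp/backup-exclude.regex \
        hdfs://PROD_NAMENODE:8020/wmf/data \
        hdfs://BACKUP_NAMENODE:8020/wmf/data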

razzi triaged this task as High priority.
razzi raised the priority of this task from High to Needs Triage.
razzi triaged this task as High priority.
razzi added a project: Analytics-Kanban.
razzi moved this task from Incoming to Operational Excellence on the Analytics board.