

Establish what data must be backed up before the HDFS upgrade
Closed, Resolved (Public)

Description

We are planning to upgrade HDFS from 2.6 to 2.8.5 (CDH -> Bigtop 1.4) and then to 2.10 (Bigtop 1.4 -> 1.5), and to avoid bitter surprises after the upgrade (data completely lost due to file system corruption, etc.) we must back up all the data that cannot be recovered from other sources.

This task is about listing which data sources need to be backed up, and their estimated total size (to inform how big the Hadoop backup cluster should be).

Some data sizes:

Dataset | HDFS folder | Data size | Data size in TB
aqs | /wmf/data/wmf/aqs | 296 GB | 0.3 TB
banner_activity (no data, empty _SUCCESS files only) | /wmf/data/wmf/banner_activity | 0 MB | 0 TB
browser-general | /wmf/data/wmf/browser | 5 MB | 0 TB
data_quality_stats | /wmf/data/wmf/data_quality_stats | 1.5 MB | 0 TB
edit_hourly (1 snapshot) | /wmf/data/wmf/edit/hourly/snapshot=XXXX | 5 GB | 0.005 TB
interlanguage | /wmf/data/wmf/interlanguage | 158 MB | 0 TB
mediacounts | /wmf/data/wmf/mediacounts | 14.3 TB | 14.3 TB
mediarequests | /wmf/data/wmf/mediarequests | 8.6 TB | 8.6 TB
mediawiki-history (1 snapshot) | /wmf/data/wmf/mediawiki/history/snapshot=XXX | 1.2 TB | 1.2 TB
mediawiki-user-history (1 snapshot) | /wmf/data/wmf/mediawiki/page_history/snapshot=XXX | 22 GB | 0.022 TB
mediawiki-page-history (1 snapshot) | /wmf/data/wmf/mediawiki/user_history/snapshot=XXX | 81 GB | 0.08 TB
Geoeditors (details TBD) | /wmf/data/wmf/mediawiki_private | 667 MB | 0 TB
mobile-apps sessions | /wmf/data/wmf/mobile_apps | 223 KB | 0 TB
netflow | /wmf/data/wmf/netflow | 1.9 TB | 1.9 TB
pageview (TBD: historical) | /wmf/data/wmf/pageview/hourly | 22.2 TB | 22.2 TB
projectview | /wmf/data/wmf/projectview | 8.1 GB | 0.008 TB
unique-devices | /wmf/data/wmf/unique_devices | 706 MB | 0.001 TB
virtualpageview | /wmf/data/wmf/virtualpageview | 2.1 TB | 2.1 TB
Wikidata-entity (1 snapshot) | /wmf/data/wmf/wikidata/entity/snapshot=XXX | 108.9 GB | 0.11 TB
Wikidata-item-page-link (1 snapshot) | /wmf/data/wmf/wikidata/item_page_link/snapshot=XXX | 4.5 GB | 0.005 TB

Total size in TB: 51+ TB
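For reference, a minimal sketch of how the per-folder numbers above can be rechecked, assuming shell access to a Hadoop client node (the snapshot=XXX placeholders need a real snapshot value; only a few folders are listed here as an example):

    # Print the logical (non-replicated) size of each candidate folder.
    for dir in /wmf/data/wmf/aqs /wmf/data/wmf/mediacounts /wmf/data/wmf/mediarequests \
               /wmf/data/wmf/netflow /wmf/data/wmf/pageview/hourly /wmf/data/wmf/virtualpageview; do
        hdfs dfs -du -s -h "$dir"
    done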

Event Timeline

IMO we should add what is not re-computable. Here is what comes to mind:

  • aqs (stats data of AQS usage)
  • browser-general
  • events (the ones that are not deleted, and 90 days of all the rest?)
  • interlanguage data
  • mediacounts
  • mediarequests
  • mobile-apps sessions
  • mobile-apps uniques
  • pageview
  • unique-devices (all 4 subfolders)
  • virtualpageview

Things I have not listed: reportupdater-generated data (I only went through Oozie-generated data).

Milimetric subscribed.

(We can do a quick checksum to compare the main cluster data against the little backup cluster data.)
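A rough sketch of that quick-checksum idea, assuming both clusters are reachable from one client and use the same block size and checksum settings (otherwise hdfs dfs -checksum values are not comparable across clusters); NAMENODE_PROD / NAMENODE_BACKUP and the dataset path are placeholders:

    # Dump per-file checksums of one dataset on each cluster, then diff the two dumps.
    # One hdfs call per file: slow, but fine for a spot check.
    DATASET=/wmf/data/wmf/mediacounts
    for nn in NAMENODE_PROD NAMENODE_BACKUP; do
        hdfs dfs -ls -R "hdfs://${nn}:8020${DATASET}" \
          | awk '$1 ~ /^-/ {print $NF}' \
          | xargs -n 1 hdfs dfs -checksum \
          | sed "s#hdfs://${nn}:8020##" > "checksums.${nn}.txt"
    done
    diff checksums.NAMENODE_PROD.txt checksums.NAMENODE_BACKUP.txt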

Ping @mforns - Could you please review the reportupdater data and tell us which of it should be backed up (if not already backed up somewhere else)?

Other things not to forget:

  • event data that we don't want to lose (for instance mediawiki events) - @Ottomata do we have an easy way to get this list?
  • druid indexed data, when it has been through sanitization - This actually leads to an interesting question: is druid the last backup of some sanitized data, or do we always have the sanitized events, in which case we don't need the druid data?

@JAllemandou, responding while Marcel is gone. I looked through and found that a few of the reports rely on data that's changing (so if lost, they would be hard/impossible to recompute). Why don't we just back up everything? The RU output is teeny tiny.

Why don't we just back up everything? The RU output is teeny tiny.

Happy with that solution - Let's start documenting sizes in the task body.

After discussion, we chose to back up all data except logs, raw data (unprocessed webrequest, events, and dumps), 2 months of webrequest, and processed wikitext (heavy).
Here is the sizing I came up with (using the useful size, not the replicated one):

  • Total used size:
    hdfs dfs -du -s -h /                                                  -> 670 TB
  • Folders not to back up:
    hdfs dfs -du -s -h /var/log                                           -> 34 TB
    hdfs dfs -du -s -h /wmf/data/raw                                      -> 140 TB
    hdfs dfs -du -s -h /wmf/data/wmf/webrequest/*/year=2020/month=[78]    -> 83 TB
    hdfs dfs -du -s -h /wmf/data/wmf/mediawiki/wikitext/history           -> 53 TB

Totals

670 - (34 + 140 + 83 + 53) = 360 TB

With a replication factor of 2 that is 720 TB of actual disk space, plus some room for the machines to breathe.
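As a quick sanity check of the arithmetic (using the rounded TB figures from this comment, not fresh measurements):

    # To-backup size and on-disk footprint at replication factor 2.
    total=670
    excluded=$((34 + 140 + 83 + 53))     # logs, raw, 2 months of webrequest, wikitext history
    to_backup=$((total - excluded))      # 360 TB of useful data
    echo "to back up: ${to_backup} TB; with replication 2: $((to_backup * 2)) TB"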

@elukey: You told me this morning that we have ~700 TB, and I can probably reclaim a few TB from some user folders by asking gently. Looks like it fits!

edited for clarity

Some notes about the amount of space that we could get with the backup cluster. We have the following hardware to use:

  • 16 nodes to decommission (analytics1042 -> analytics1057). Total space on each node is ~48 TB (some hosts might have less since a disk failed, etc.)
  • 24 new nodes that still need to be racked/configured/etc. Total space on each node is ~48 TB (hopefully with all disks working :D)

We have to keep in mind that a new cluster needs two masters to function properly, so:

  1. we could use the 16 nodes to be decommissioned in a 2+14 fashion, but we'd only get up to 48 TB * 14 = 672 TB of total HDFS space to use (not sufficient)
  2. we could create a couple of beefy VMs and use all 16 nodes, getting up to 48 TB * 16 = 768 TB of total HDFS space to use (sufficient but close to the limit). Creating beefy Ganeti VMs is also not advisable; I'm not really sure how much RAM the namenodes will need, but surely something like 12-16 GB for each VM (bare minimum).
  3. we could decommission the 16 old nodes and directly use the 24 that we are getting for expansion. For example, we could add 4 nodes to the Hadoop cluster to get more space, leaving 20 for the backup cluster. We'd use them in a 2+18 fashion, getting up to 48 TB * 18 = 864 TB of total HDFS space (more than needed, but with extra space in case we want to back up more). After the two HDFS upgrades (since we'll have two versions of Bigtop to test/upgrade to), we could reimage all the backup cluster nodes and add them to the Hadoop cluster.
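A quick check of the capacity figures in the three options, assuming ~48 TB of usable disk per worker and not counting the 2 master nodes:

    # Usable HDFS capacity for 14, 16 and 18 worker nodes at ~48 TB each.
    for workers in 14 16 18; do
        echo "${workers} workers: $((workers * 48)) TB"
    done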

Option 3) is appealing, since we'd also have a smaller cluster to upgrade (fewer nodes == a quicker procedure, etc.). My main concern though is long term, namely when we'll have to do other upgrades with more data (for example, Hadoop 3!). In that case we'll probably not have a backup cluster available, so what do we do then? I haven't heard anybody doing HDFS upgrades mention backups in talks/on the internet/etc., I guess because of this problem. Worth discussing, in my opinion.

Joseph and I had a chat about possible solutions/scenarios and we came up with this:

  • we start decommissioning analytics1042 -> analytics1057; it should take a couple of weeks (one node per day)
  • we work with DC ops to rack/configure the new 24 nodes asap

Whichever of the two gets to a good size first (say 10-12 nodes) becomes the backup cluster, running Bigtop, that we'll use to test and to start copying data from our prod cluster. The idea is to start this copy/test process asap, since there will surely be some things to follow up on.

After discussion, we chose to back up all data except logs, raw data (unprocessed webrequest, events, and dumps), 2 months of webrequest, and processed wikitext (heavy).

You mean "unprocessed events"? Because we need to back up both the sanitized and unsanitized versions.

You mean "unprocessed events"? Because we need to back up both the sanitized and unsanitized versions.

"unprocessed" as in raw-from-camus (the /wmf/data/raw doesn't get backed-up). /wmf/data/event and /wmf/data/event_sanitized are both planed to be backed-up.

OK, after a quick chat with Dan it appears that my way of presenting this might have been confusing. Here is some clarification (hopefully):
We wish to back up everything except:

  • raw data, meaning all of /wmf/data/raw. Data not backed up includes raw data from Camus (webrequest, events, netflow), dumps copied from labstore, and sqooped data.
  • the 2 oldest months of refined webrequest, as in /wmf/data/wmf/webrequest/*/year=2020/month=[78]
  • processed wikitext, as it is huge and can be regenerated (even if that takes long)

@Milimetric - Can you please confirm this is clearer?
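Not part of the decision above, just a hedged sketch of what the copy could look like with distcp once the backup cluster exists, using the -filters option of newer distcp versions to exclude the not-to-backup paths (the filters file name and namenode addresses are hypothetical; -filters expects one regex per line, matched against source paths):

    # Exclusion patterns: raw data, the 2 oldest months of refined webrequest,
    # and processed wikitext history. /var/log sits outside /wmf/data, so it is
    # simply not part of this copy.
    printf '%s\n' \
        '.*/wmf/data/raw.*' \
        '.*/wmf/data/wmf/webrequest/.*/year=2020/month=[78].*' \
        '.*/wmf/data/wmf/mediawiki/wikitext/history.*' > /tmp/backup-exclude.regex

    # Copy the main data tree from the prod cluster to the backup cluster
    # (other top-level directories would be separate distcp runs).
    hadoop distcp -update \
        -filters /tmp/backup-exclude.regex \
        hdfs://PROD_NAMENODE:8020/wmf/data \
        hdfs://BACKUP_NAMENODE:8020/wmf/data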

razzi triaged this task as High priority.
razzi raised the priority of this task from High to Needs Triage.
razzi triaged this task as High priority.
razzi added a project: Analytics-Kanban.
razzi moved this task from Incoming to Operational Excellence on the Analytics board.