Managing data

Joachim Jacob 8 and 15 November 2013

Bioinformatics data
Historically, bioinformatics has always used text files to store data.
PDB file excerpt

GenBank record

HMM profile

NGS data
The NGS machines spit out a lot of data, stored in plain text files. These files are multiple gigabytes in size.

Tips for managing NGS data

1. When you move the data, do it in its smallest form. Compress the data.
2. When you unpack the data, leave it where it is. Symbolic links point to the data in different folders.
3. Provide enough storage for your data. Choose your file system type wisely.

Compression: tools in Linux

And some more exist...

Widely used compression tools:
- GNU zip (gzip)
- Block Sorting compression (bzip2)
Typically, compression tools work on one file. How to compress directories and their contents?

Without compression

Tar (Tape Archive) is a tool for bundling a set of files or directories into a single archive. The resulting file is called a tarball.
Syntax to create a tarball:
$ tar -cf archive.tar file1 file2
Syntax to extract:
$ tar -xvf /path/to/archive.tar
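The create and extract syntax above can be tried safely in a throwaway directory. This is a minimal sketch: the file names (reads_a.fasta, reads_b.fasta) and the unpacked/ folder are made up for illustration, and the -t (list) and -C (extract into another directory) options are additions that the slide does not mention.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a temporary directory so nothing in your home folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two small example files (hypothetical names).
printf '>seq1\nACGT\n' > reads_a.fasta
printf '>seq2\nTTGA\n' > reads_b.fasta

# c = create, f = use the archive file named next.
tar -cf archive.tar reads_a.fasta reads_b.fasta

# t = list the contents of the archive without extracting anything.
tar -tf archive.tar

# x = extract; -C extracts into another directory instead of the current one.
mkdir unpacked
tar -xf archive.tar -C unpacked
ls unpacked
```

Listing with -t before extracting is a cheap way to see whether a tarball will unpack into its own folder or scatter files into the current one.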

Compression: a typical case

Archiving and compression mostly occur together. The most used formats are tar.gz or tar.bz2. These files are the result of two processes:

- Archiving (tar)
- Compressing (gzip or bzip2)

Compression: on your desktop

Compression: on the command line

Tar is the tool for creating .tar archives, but it can compress in one go, with the z or j option.
Creating a compressed tar archive:
$ tar cvfz mytararchive.tar.gz docs/
$ tar cvfj ... docs/
(c = create; z or j = compression technique)

Decompressing a compressed tar archive:
$ tar xvfz mytararchive.tar.gz
$ tar xvfj ...
(x = extract; v = verbose; f = file)
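A round trip through a compressed tarball might look like the sketch below. The docs/ folder, its content and the archive name docs.tar.gz are invented for the example. The bzip2 variant is only attempted when bzip2 is installed, and the archive name docs.tar.bz2 is an assumption, not a name taken from the slides.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# A small example directory (hypothetical content).
mkdir docs
printf 'sample line\n' > docs/notes.txt

# c = create, z = compress with gzip, f = archive file name.
tar czf docs.tar.gz docs/

# t = list what is inside the compressed archive.
tar tzf docs.tar.gz

# x = extract; -C restores into a separate directory.
mkdir restored
tar xzf docs.tar.gz -C restored

# The j option does the same with bzip2, if that tool is available.
if command -v bzip2 >/dev/null 2>&1; then
    tar cjf docs.tar.bz2 docs/
    ls -l docs.tar.gz docs.tar.bz2
fi
```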

To compress one or more files:
$ gzip [options] file
$ bzip2 [options] file
To decompress one or more files:
$ gunzip [options] file(s)
$ bunzip2 [options] file(s)

Many compression tools on the command line allow reading compressed files (instead of first unpacking, then reading).

$ zcat file(s)
$ bzcat file(s)
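As an illustration of reading a compressed file without unpacking it first, the sketch below counts the reads in a gzipped FASTQ file. The file sample.fastq and its two reads are made up, and the count assumes the usual four lines per FASTQ record.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# A tiny FASTQ file with two reads (hypothetical data).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > sample.fastq

# gzip replaces sample.fastq with sample.fastq.gz.
gzip sample.fastq

# zcat streams the decompressed content to the next tool;
# the .gz file on disk stays compressed.
line_count=$(zcat sample.fastq.gz | wc -l)
echo "reads: $((line_count / 4))"
```

If zcat is not available on a system, gzip -dc does the same job.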

Compression is always a balance between time and compression ratio. Gzip is faster, bzip2 compresses harder. If compression is important to you: benchmark!
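A benchmark does not need to be elaborate. The sketch below compresses the same generated test file with each tool and prints elapsed seconds and resulting size. The test file is synthetic and very repetitive, so the numbers only show the method; run it on a sample of your own data to get meaningful figures. Tools that are not installed are skipped.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Generate a synthetic text file (hypothetical content, about 0.5 MB).
for i in $(seq 1 20000); do
    printf 'ACGTACGTTTGACCA read %d\n' "$i"
done > test.txt

for tool in gzip bzip2; do
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool is not installed, skipping"
        continue
    fi
    start=$(date +%s)
    # -c writes to standard output, so test.txt itself is kept.
    "$tool" -c test.txt > "test.txt.$tool"
    end=$(date +%s)
    size=$(wc -c < "test.txt.$tool")
    echo "$tool: $((end - start)) s, $size bytes"
done
```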

A little compression exercise.

Pay attention. Something very convenient! A symbolic link (or symlink) is a file which points to the location of the linked-to file. You can do anything with the symlink that you can do on the original file. If you move the original file from its location, the symlink is "dead".

[Figure: directory tree with Downloads/, Annotation/, Rice/, Butterfly/ and Sequences/, which holds alignment.sam]

To create a symlink, move to the folder where the symlink must be created, and execute ln.

~/Projects $ cd Butterfly
~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam link_to_alignment.sam

The symlink is created. You can check with ls. To delete a symlink, use unlink.

~/Projects $ cd Butterfly
~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam link_to_alignment.sam
~/Butterfly $ ls -lh link_to_alignment.sam
lrwxrwxrwx 1 joachim joachim 44 Oct 22 14:47 link_to_alignment.sam -> ../Sequences/alignment.sam

[Figure: the same directory tree, with link_to_alignment.sam now present in Butterfly/]
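The whole life cycle of a symlink (create, use, break, delete) can be replayed in a scratch directory. The sketch below rebuilds a small Rice/Sequences and Butterfly layout under a temporary folder; that layout and the file content are assumptions made for the example, and readlink is an extra command not shown on the slides.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Recreate a small version of the project layout (hypothetical content).
mkdir -p Rice/Sequences Butterfly
printf 'example alignment\n' > Rice/Sequences/alignment.sam

# Create the link from inside the folder where it should live.
cd Butterfly
ln -s ../Rice/Sequences/alignment.sam link_to_alignment.sam

# The link behaves like the original file.
ls -l link_to_alignment.sam
cat link_to_alignment.sam

# readlink prints the path stored inside the link.
readlink link_to_alignment.sam

# Moving the original leaves a "dead" link: the link still exists (-L)
# but the file it points to can no longer be found (! -e).
mv ../Rice/Sequences/alignment.sam ../Rice/alignment.sam
if [ -L link_to_alignment.sam ] && [ ! -e link_to_alignment.sam ]; then
    echo "link_to_alignment.sam is dead"
fi

# unlink removes the link only; the moved original is untouched.
unlink link_to_alignment.sam
```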

A little symlink exercise

Disks and storage

If you dive into bioinformatics, you will have to manage disks and storage. Two types of disks:
- solid state disks: low capacity, high speed, random writes
- spinning hard disks: high capacity, "normal" speed, sequential writes
http://en.wikipedia.org/wiki/Solid-state_drive
http://en.wikipedia.org/wiki/Hard_disk


A disk is a device
Via the terminal, show the disks using

$ sudo fdisk -l
[sudo] password for joachim:
Disk /dev/sda: 13.4 GB, 13408141312 bytes
...
Disk /dev/sdb: 3997 MB, 3997171712 bytes
...

A disk is divided into partitions

A disk can be divided in parts, called partitions. An internal disk which runs an operating system is usually divided in partitions, one for each function. An external disk is usually not divided in partitions.

Check out the disk utility tool

The system disk

- Name of the disk
- Name of the currently highlighted partition
- Place in the directory structure where the partition can be accessed

An example of a USB disk

- Place in the directory structure where the partition can be accessed. The USB disk is "mounted" automatically on the directory tree under /media.
- This is the type of file system on the partition. The partition is said to be formatted in FAT32 (in this case).

File system formats

By default, many USB flash disks are formatted in FAT32. Other types are NTFS, ext4, XFS.
- FAT32: max 4GB files
- NTFS: maximum portability (also for use under Windows)
- ext4: default file system in Linux

http://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systems
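The file system type of a partition can also be read from the terminal. The sketch below asks df for the type of the file system that holds the current directory. It assumes GNU df as shipped with Linux (the -T option is not available everywhere), and the variable name fs_type is made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# -T adds a "Type" column, -P keeps every file system on a single line,
# and "." restricts the report to the file system of the current directory.
df -PT .

# The type is the second column of the second line.
fs_type=$(df -PT . | awk 'NR == 2 { print $2 }')
echo "file system type: $fs_type"
```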

An example of a USB disk

First unmount the device. Next, choose to format the device.

Format disks with disk utility

Choose the type of file system you want to be on that device.


You don't want to know all the commands that run behind the gnome-disk-utility for you. But if you do:
- mount
- umount
- fdisk
- mkfs

You can read the man pages and search for guides on the internet if you want to get to know these (out of scope for this course).

Checking storage space

By default: "disk usage analyzer".

Bonus: K4DirStat. Not installed by default.

K4DirStat is a KDE package

Rehearsal: what is KDE?

Bonus: what happens when you install this package on our system?

Space left on disks

With df

To check the storage that is used on the different disks.

~/ $ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda1   12G   5.3G  5.7G   49%   /
udev        490M  4.0K  490M   1%    /dev
tmpfs       200M  920K  199M   1%    /run
none        5.0M  0     5.0M   0%    /run/lock
none        498M  76K   498M   1%    /run/shm
/dev/sdb1   3.8G  20M   3.7G   1%    /media/test

~/ $ df -h .
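When only one number is needed, for instance to check whether there is room before unpacking a large archive, the df output can be filtered. This is a small sketch: the -P option (POSIX output format, one line per file system) and the awk filter are additions to what the slide shows, and the variable name usage is made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Human-readable report for the file system that holds the current directory.
df -h .

# With -P the fifth column of the second line is the percentage in use.
usage=$(df -P . | awk 'NR == 2 { print $5 }')
echo "the current partition is $usage full"
```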

The size of directories

To check the size of files or directories.
~/ $ du -sh *
520K  bin
281M  Compression_exercise
4.0K  Desktop
4.0K  Documents
5.0M  Downloads
4.0K  Music
4.0K  Pictures
4.0K  Public
373M  Rice Example
4.0K  Templates
4.0K  test
17M   test.img
114M  ugene-1.12.2
4.0K  Videos
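With many entries it helps to sort the du output by size. The sketch below does this in a scratch directory; the folders small/ and big/ and their content are invented, and sort -h (sort on human-readable sizes) assumes GNU sort.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Two example directories of different size (hypothetical content).
mkdir small big
printf 'x\n' > small/a.txt
head -c 1048576 /dev/urandom > big/b.bin   # 1 MiB of random bytes

# Human-readable sizes, smallest first.
du -sh -- * | sort -h

# The same in plain kilobytes, keeping only the name of the largest entry.
largest=$(du -s -- * | sort -n | tail -n 1 | cut -f 2)
echo "largest entry: $largest"
```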

Wildcards on the command line

Wildcards are used to describe the names of files and directories.
- [ ] On that position, the character may be one of the characters between [ ]. E.g. saniti[sz]ation matches: sanitisation and sanitization
- ? On that position, any character is allowed. E.g. saniti?ation matches: sanitisation, sanitiration, ...
- * On that position, any length of string is allowed. E.g. s* matches: san, sdd, sanitisation, sam.alignment, ...

Many tools that require an argument to point to files or directories accept these wildcards.
~/ $ du -sh Do*
4.0K  Documents
20G   Downloads

~/ $ ls *.fastq
ERR148552_1.fastq
ERR148552_1_prinseq_good_...
ERR148552_2.fastq
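The three wildcards can be tested without risk by creating empty files with the example names and letting echo show what each pattern expands to. The file names come from the examples above, plus one extra file (other) that no pattern should match; the variable names are made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Empty files carrying the example names.
touch sanitisation sanitization sanitiration san sdd sam.alignment other

# [sz]: exactly one character, which must be s or z.
bracket_matches=$(echo saniti[sz]ation)
echo "saniti[sz]ation -> $bracket_matches"

# ?: exactly one character, any character.
question_matches=$(echo saniti?ation)
echo "saniti?ation    -> $question_matches"

# *: any string of any length, including nothing.
star_matches=$(echo s*)
echo "s*              -> $star_matches"
```

The shell expands the pattern before the tool runs, which is why du, ls and most other commands accept wildcards without any special support.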

Keywords
- Compression
- Archive
- Symbolic link
- mounting
- File system format
- partition
- Recursively
- df
- du
- unlink
Write in your own words what the terms mean


