DOI: 10.1145/3332186.3333156
research-article

CHURP: A Lightweight CLI Framework to Enable Novice Users to Analyze Sequencing Datasets in Parallel

Published: 28 July 2019

Abstract

Progressive decreases in the cost of DNA sequencing have contributed to a decades-long exponential increase in the production of new sequencing datasets. The processing of these datasets has in turn led biology, a field that has traditionally relied on local "lab" servers to address its computational needs, to become increasingly reliant on High Performance Computing (HPC) resources. Though many operations on sequencing datasets are trivially parallelizable on multiple levels, the lack of an HPC tradition in biological research has hampered fully parallelized deployments.
Here we present a lightweight, flexible framework for performing parallelized processing of raw gene expression data. The framework uses a Python 3-based frontend for specifying analysis options, data paths, and reference datasets. This frontend sanitizes and resolves the options, providing verbose error checking before writing a human-readable configuration file and basic scripts for batch submission. The submission scripts leverage the scheduler to implement a scatter-gather approach, submitting potentially hundreds of individual jobs via a job array, each small enough to take advantage of backfill in a high-contention HPC environment. The gather component is handled through a script submitted with an "after-okay" dependency.
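
The scatter-gather submission described above can be illustrated with a minimal sketch. It is not CHURP's actual code: the abstract does not name the scheduler, so Slurm's `sbatch` with its `--array`, `--parsable`, and `--dependency=afterok:` options is assumed here, and the function names and script names (`single_sample.sh`, `summarize.sh`) are hypothetical. The sketch only builds the command lines; it does not submit anything.

```python
from typing import List


def scatter_command(n_samples: int, worker_script: str) -> List[str]:
    """Build the sbatch call that submits one array task per sample."""
    if n_samples < 1:
        raise ValueError("at least one sample is required")
    # --parsable makes sbatch print the job ID (optionally followed by ';cluster').
    return ["sbatch", "--parsable", f"--array=0-{n_samples - 1}", worker_script]


def parse_job_id(sbatch_output: str) -> str:
    """Extract the job ID from `sbatch --parsable` output."""
    return sbatch_output.strip().split(";")[0]


def gather_command(array_job_id: str, gather_script: str) -> List[str]:
    """Build the sbatch call for the gather step.

    The afterok dependency holds the gather job until the jobs it
    names have finished with exit status 0.
    """
    return ["sbatch", f"--dependency=afterok:{array_job_id}", gather_script]


if __name__ == "__main__":
    # Hypothetical script names; the job ID would normally come from
    # running the scatter command and reading its standard output.
    print(" ".join(scatter_command(200, "single_sample.sh")))
    print(" ".join(gather_command(parse_job_id("123456\n"), "summarize.sh")))
```

Keeping each array task to a single sample is what makes the jobs small enough for the scheduler to backfill, while the dependency on the array's job ID ensures the summary step runs only once, after the per-sample work has succeeded.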

      Published In

      PEARC '19: Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (learning)
      July 2019
      775 pages
      ISBN: 9781450372275
      DOI: 10.1145/3332186
      General Chair: Tom Furlani
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      PEARC '19

      Acceptance Rates

      Overall Acceptance Rate 133 of 202 submissions, 66%

      Article Metrics

      • Downloads (last 12 months): 42
      • Downloads (last 6 weeks): 4
      Reflects downloads up to 26 Nov 2024

      Cited By

      • (2024) C/EBPβ deletion in macrophages impairs mammary gland alveolar budding during the estrous cycle. Life Science Alliance 7:10 (e202302516). DOI: 10.26508/lsa.202302516. Online publication date: 18-Jul-2024
      • (2024) Sterile production of interferons in the thymus affects T cell repertoire selection. Science Immunology 9:97. DOI: 10.1126/sciimmunol.adp1139. Online publication date: 26-Jul-2024
      • (2024) Multiomic analyses reveal new targets of polycomb repressor complex 2 in Schwann lineage cells and malignant peripheral nerve sheath tumors. Neuro-Oncology Advances 6:1. DOI: 10.1093/noajnl/vdae188. Online publication date: 9-Nov-2024
      • (2024) Divergent immune microenvironments in two tumor nodules from a patient with mismatch repair-deficient prostate cancer. npj Genomic Medicine 9:1. DOI: 10.1038/s41525-024-00392-1. Online publication date: 22-Jan-2024
      • (2024) Multi-omic and multispecies analysis of right ventricular dysfunction. The Journal of Heart and Lung Transplantation 43:2 (303-313). DOI: 10.1016/j.healun.2023.09.020. Online publication date: Feb-2024
      • (2023) Saracatinib synergizes with enzalutamide to downregulate AR activity in CRPC. Frontiers in Oncology 13. DOI: 10.3389/fonc.2023.1210487. Online publication date: 30-Jun-2023
      • (2023) Depleting CD103+ resident memory T cells in vivo reveals immunostimulatory functions in oral mucosa. Journal of Experimental Medicine 220:7. DOI: 10.1084/jem.20221853. Online publication date: 25-Apr-2023
      • (2023) Type III interferon drives thymic B cell activation and regulatory T cell generation. Proceedings of the National Academy of Sciences 120:9. DOI: 10.1073/pnas.2220120120. Online publication date: 21-Feb-2023
      • (2023) Targeting the NF-κB pathway enhances responsiveness of mammary tumors to JAK inhibitors. Scientific Reports 13:1. DOI: 10.1038/s41598-023-32321-0. Online publication date: 1-Apr-2023
      • (2022) Systemic lipolysis promotes physiological fitness in Drosophila melanogaster. Aging 14:16 (6481-6506). DOI: 10.18632/aging.204251. Online publication date: 30-Aug-2022
