De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks

Hiba Khodji¹ (ORCID: orcid.org/0000-0002-5525-4863),
Pierre Collet¹,
Julie D. Thompson¹ &
Anne Jeannin-Girardon¹

Abstract

The widespread use of high-throughput genome sequencing technologies has significantly increased the number of available sequences, creating new challenges for genome annotation and the prediction of protein-coding genes in terms of error detection and quality control. Multiple Sequence Alignments (MSAs) of the predicted protein sequences provide important contextual information that can be used to distinguish errors (caused by artifacts in the raw genome data, badly predicted gene sequences, or the alignment methods themselves) from true biological events. This distinction can be made either by human expertise or by statistical analysis of the sequence data. Here, we propose a new approach that uses visual representations of MSAs as inputs for Convolutional Neural Networks (CNNs) to classify MSAs into erroneous and non-erroneous categories. The MSAs are extracted from a unique in-house dataset in which errors are carefully identified. Our model, called De-MISTED (Deep learning for MultIple Sequence alignmenTs Error Detection), identifies MSAs containing erroneous sequences with high accuracy (87%) and sensitivity (92%). Visual explanation techniques show that our model correctly identifies the position of multiple errors of different types (insertions, deletions and mismatches). Close examination of the data showed that our model can also identify errors that were not previously annotated in the data. The De-MISTED method thus contributes to a more robust exploitation of genome data.
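To make the idea of a "visual representation" of an MSA concrete, the following is a minimal sketch of one way aligned sequences could be turned into a pixel grid suitable for an image classifier. It is not the encoding used by De-MISTED, which the abstract does not specify: the residue classes, the colour palette, and the function name `msa_to_pixels` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical palette: one colour per coarse residue class.
# These groupings and colours are illustrative assumptions only.
CLASS_COLOURS: Dict[str, RGB] = {
    "hydrophobic": (0, 0, 255),
    "polar": (0, 255, 0),
    "positive": (255, 0, 0),
    "negative": (255, 0, 255),
    "special": (255, 165, 0),
}
CLASS_LETTERS: Dict[str, str] = {
    "hydrophobic": "AVLIMFWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "CGP",
}
# Invert the class table into a per-residue colour lookup.
RESIDUE_COLOUR: Dict[str, RGB] = {
    aa: CLASS_COLOURS[cls] for cls, letters in CLASS_LETTERS.items() for aa in letters
}
GAP_COLOUR: RGB = (255, 255, 255)      # gaps are drawn white
UNKNOWN_COLOUR: RGB = (128, 128, 128)  # any other symbol is drawn grey


def msa_to_pixels(rows: List[str]) -> List[List[RGB]]:
    """Map each aligned sequence to a row of pixels, one pixel per column."""
    if not rows:
        raise ValueError("alignment is empty")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    image: List[List[RGB]] = []
    for row in rows:
        pixels: List[RGB] = []
        for symbol in row.upper():
            if symbol in "-.":
                pixels.append(GAP_COLOUR)
            else:
                pixels.append(RESIDUE_COLOUR.get(symbol, UNKNOWN_COLOUR))
        image.append(pixels)
    return image
```

In a grid of this kind, an insertion or deletion in one sequence shows up as a run of gap-coloured pixels that breaks an otherwise uniform column pattern, which is the sort of local visual feature a CNN can plausibly pick up.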

Data Availability

The dataset generated for the current study is available in the Zenodo repository: https://doi.org/10.5281/zenodo.6637475.

Notes

The filtering protocol is a simple in-house program that takes an erroneous MSA in XML format as input and filters out the erroneous sequences, which are identified by specific start and end tags.
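The in-house program itself is not published on this page, so the following is only a rough sketch of what such a filter might look like. The XML layout is an assumption: the element names `alignment`, `sequence`, `error`, `start` and `end` are hypothetical, and the rule applied here (drop any sequence that carries an `error` element) is one possible reading of the note.

```python
import xml.etree.ElementTree as ET


def filter_erroneous(xml_text: str) -> str:
    """Return the alignment XML with every annotated erroneous sequence removed.

    Assumed (hypothetical) schema: each <sequence> may contain one or more
    <error> children, each holding <start> and <end> positions.
    """
    root = ET.fromstring(xml_text)
    # Materialise the traversal first so that removing children is safe.
    for parent in list(root.iter()):
        for seq in list(parent.findall("sequence")):
            if seq.find("error") is not None:
                parent.remove(seq)
    return ET.tostring(root, encoding="unicode")
```

A short usage example under the same assumed schema:

```python
example = (
    "<alignment>"
    "<sequence id='a'><data>MKV</data></sequence>"
    "<sequence id='b'><data>MKX</data>"
    "<error><start>3</start><end>3</end></error></sequence>"
    "</alignment>"
)
cleaned = filter_erroneous(example)
kept_ids = [s.get("id") for s in ET.fromstring(cleaned).findall("sequence")]
```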

References

Aoki G, Sakakibara Y (2018) Convolutional neural networks for classification of alignments of non-coding RNA sequences. Bioinformatics 34:i237–i244
Carroll H, Beckstead W, O’Connor T et al (2007) DNA reference alignment benchmarks based on tertiary structure of encoded proteins. Bioinformatics 23:2648–9. https://doi.org/10.1093/bioinformatics/btm389
Chatzou M, Magis C, Chang JM et al (2015) Multiple sequence alignment modeling: methods and applications. Brief Bioinform 2015. https://doi.org/10.1093/bib/bbv099
Chiner-Oms A, González-Candelas F (2016) EvalMSA: A program to evaluate multiple sequence alignments and detect outliers. Evol Bioinform 12:EBO.S40583. https://doi.org/10.4137/EBO.S40583
The UniProt Consortium (2018) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47(D1):D506–D515. https://doi.org/10.1093/nar/gky1049
Corpet F, Servant F, Gouzy J et al (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res 28:267–9. https://doi.org/10.1093/nar/28.1.267
DeBlasio DF, Kececioglu J (2018) Adaptive local realignment of protein sequences. J Comput Biol 25(7):780–793
Dragan MA, Moghul I, Priyam A et al (2016) GeneValidator: Identify problems with protein-coding gene predictions. Bioinformatics 32. https://doi.org/10.1093/bioinformatics/btw015
Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16(3):368–73
Finn RD, Bateman A, Clements J et al (2014) Pfam: the protein families database. Nucleic Acids Res 42(D1):D222–D230. https://doi.org/10.1093/nar/gkt1223
Gibbs R, Rogers J, Katze M et al (2007) Evolutionary and biomedical insights from the rhesus macaque genome. Science 316:222–34
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Jafari R, Javidi M, Kuchaki Rafsanjani M (2019) Using deep reinforcement learning approach for solving the multiple sequence alignment problem. SN Appl Sci 1. https://doi.org/10.1007/s42452-019-0611-4
Jehl P, Sievers F, Higgins D (2015) OD-seq: Outlier detection in multiple sequence alignments. BMC Bioinformatics 16:269. https://doi.org/10.1186/s12859-015-0702-1
Kanz C, Aldebert P, Althorpe N et al (2005) The EMBL nucleotide sequence database. Nucleic Acids Res 33:D29–33. https://doi.org/10.1093/nar/gki098
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol 30:772–780. https://doi.org/10.1093/molbev/mst010
Katoh K, Misawa K, Kuma K et al (2002) MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–66
Khenoussi W, Vanhoutreve R, Poch O et al (2014) SIBIS: A Bayesian model for inconsistent protein sequence estimation. Bioinformatics 30. https://doi.org/10.1093/bioinformatics/btu329
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Commun ACM 60:84–90
Larkin M, Blackshields G, Brown N et al (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23:2947–2948
Meyer C, Scalzitti N, Jeannin-Girardon A et al (2020) Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes. BMC Bioinformatics 21
Mircea IG, Bocicor I, Czibula G (2018a) A reinforcement learning based approach to multiple sequence alignment. In: Balas VE, Jain LC, Balas MM (eds) Soft computing applications. Springer International Publishing, Cham, pp 54–70
Mircea I-G, Bocicor M-I (2014) On reinforcement learning based multiple sequence alignment
Nagy A, Patthy L (2013) MisPred: A resource for identification of erroneous protein sequences in public databases. Database 2013:bat053. https://doi.org/10.1093/database/bat053
Nagy A, Patthy L (2014) FixPred: a resource for correction of erroneous protein sequences. Database
Nagy A, Hegyi H, Farkas K et al (2008) Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics 9:353. https://doi.org/10.1186/1471-2105-9-353
Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–17
O’Leary NA, Wright MW, Brister JR et al (2015) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44(D1):D733–D745. https://doi.org/10.1093/nar/gkv1189
Pearson W (2004) Finding protein and nucleotide similarities with FASTA. Curr Protoc Bioinformatics, Chapter 3. https://doi.org/10.1002/0471250953.bi0309s04
Prosdocimi F, Linard B, Pontarotti P et al (2011) Controversies in modern evolutionary biology: the imperative for error detection and quality control. BMC Genomics 13:5
Rajpurkar P, Irvin J, Zhu K et al (2017) CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv:1711.05225
Russakovsky O, Deng J, Su H et al (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115:211–252
Scalzitti N, Jeannin-Girardon A, Collet P et al (2020) A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics 21:293. https://doi.org/10.1186/s12864-020-6707-9
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Srivastava N, Hinton GE, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958
Szegedy C, Liu W, Jia Y et al (2015) Going deeper with convolutions. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp 1–9. https://doi.org/10.1109/CVPR.2015.7298594
Szegedy C, Vanhoucke V, Ioffe S et al (2016) Rethinking the inception architecture for computer vision. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 2818–2826
Tamura K, Stecher G, Peterson D et al (2013) MEGA6: Molecular evolutionary genetics analysis version 6.0. Mol Biol Evol 30. https://doi.org/10.1093/molbev/mst197
Thompson J, Higgins D, Gibson T (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673–80
Thompson J, Plewniak F, Poch O (1999) BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15:87–8. https://doi.org/10.1093/bioinformatics/15.1.87
Thompson J, Plewniak F, Ripp R et al (2001) Towards a reliable objective function for multiple sequence alignments. J Mol Biol 314:937–951. https://doi.org/10.1006/jmbi.2001.5187
Thompson J, Thierry JC, Poch O (2003) RASCAL: Rapid scanning and correction of multiple sequence alignments. Bioinformatics 19:1155–61. https://doi.org/10.1093/bioinformatics/btg133
Thompson JD (2016) Statistics for bioinformatics: methods for multiple sequence alignment. ISTE Press
Thompson JD, Linard B, Lecompte O et al (2011) A comprehensive benchmark study of multiple sequence alignment methods: Current challenges and future perspectives. PLoS ONE 6
Tong J, Pei J, Otwinowski Z et al (2014) Refinement by shifting secondary structure elements improves sequence alignments. Proteins Struct Funct Bioinform 83. https://doi.org/10.1002/prot.24746
Vanhoutreve R, Kress A, Legrand B et al (2016) LEON-BIS: Multiple alignment evaluation of sequence neighbours using a Bayesian inference system. BMC Bioinformatics 17. https://doi.org/10.1186/s12859-016-1146-y
Wang H, Wang Z, Du M et al (2020) Score-CAM: Score-weighted visual explanations for convolutional neural networks. In: 2020 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW), pp 111–119
Wang Y, Wu H, Cai Y (2018) A benchmark study of sequence alignment methods for protein clustering. BMC Bioinformatics 19. https://doi.org/10.1186/s12859-018-2524-4
Warnow T (2021) Revisiting evaluation of multiple sequence alignment methods. In: Methods in Molecular Biology. Humana Press Inc., pp 299–317. https://doi.org/10.1007/978-1-0716-1036-7_17
Xuyu X, Dafan Z, Qin J et al (2010) Ant colony with genetic algorithm based on planar graph for multiple sequence alignment. Inf Technol J 9. https://doi.org/10.3923/itj.2010.274.281
Yosinski J, Clune J, Bengio Y et al (2014) How transferable are features in deep neural networks? In: Ghahramani Z, Welling M, Cortes C (eds) Advances in neural information processing systems, vol 27. Curran Associates Inc. https://proceedings.neurips.cc/paper/2014/file/375c71349b295fbe2dcdca9206f20a06-Paper.pdf
Zaal D, Nota B (2015) ADOMA: A command line tool to modify ClustalW multiple alignment output. Mol Inform 35. https://doi.org/10.1002/minf.201500083
Zhang C, Zheng W, Mortuza S et al (2019) DeepMSA: Constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics 36. https://doi.org/10.1093/bioinformatics/btz863

Acknowledgements

The authors would like to thank the BiGEst bioinformatics platform for technical support. This work was supported by the French Infrastructure Institut Français de Bioinformatique (IFB) ANR-11-INBS-0013, ANR ArtIC ANR-20-THIA-0006 and Institute funds from the French Centre National de la Recherche Scientifique and the University of Strasbourg.

Author information

Authors and Affiliations

University of Strasbourg, ICube Laboratory UMR7357, 1 Rue Eugène Boeckel, 67000, Strasbourg, France
Hiba Khodji, Pierre Collet, Julie D. Thompson & Anne Jeannin-Girardon

Corresponding author

Correspondence to Hiba Khodji.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

We include supplementary material providing additional Score-CAM [47] results obtained by our proposed models (A) and (B).

Pierre Collet, Julie D. Thompson and Anne Jeannin-Girardon contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Khodji, H., Collet, P., Thompson, J.D. et al. De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks. Appl Intell 53, 18806–18820 (2023). https://doi.org/10.1007/s10489-022-04390-7

Accepted: 05 December 2022
Published: 09 February 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s10489-022-04390-7
