Both the time used and the lack of any results for most of the sequences look very strange. I have therefore tried to reproduce your efforts and downloaded the SINTAX_COI_v5.1.0ref.fasta file from the https://github.com/terrimporter/CO1Classifier repository.
The problem appears to be related to masking of the sequences in the database. By default, vsearch applies "soft masking" to the database sequences: all lower-case letters are masked and are not used during the initial stage of sequence comparison. This is described in the manual, but it is not mentioned for the sintax command, so we need to improve the documentation. Perhaps masking should not even be applied by default for this command. Since this database seems to contain only lower-case letters for the nucleotide symbols, every sequence is fully masked, leaving no results.
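One way to check whether a database would be affected by soft masking is to count its lower-case and upper-case sequence symbols. The following is only a minimal sketch; `count_case` is a hypothetical helper name, not part of vsearch, and it assumes a plain (uncompressed) FASTA file with header lines starting with `>`.

```bash
# Hypothetical helper (not part of vsearch): count lower- and upper-case
# sequence symbols in a FASTA file. Lower-case symbols are the ones that
# default soft masking would hide.
count_case() {
    fasta="$1"
    # Drop header lines, then count letters. gsub() returns the number of
    # replacements it made, which is the number of matching characters.
    # LC_ALL=C keeps the [a-z] and [A-Z] ranges strictly ASCII.
    grep -v '^>' "$fasta" | LC_ALL=C awk '
        {
            lower += gsub(/[a-z]/, "")
            upper += gsub(/[A-Z]/, "")
        }
        END { printf "lower=%d upper=%d\n", lower, upper }
    '
}
```

Running `count_case SINTAX_COI_v5.1.0ref.fasta` and seeing `upper=0` would be consistent with the explanation above: every position is lower case, so everything is masked.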
I am sorry that you have wasted 13 days of computation time (times 40 CPUs) on this. The good news is that the problem can be easily resolved by including the `--dbmask none` option on the command line. When I did this with 10710 randomly subsampled sequences from the same database, the whole run completed in under 10 minutes using 8 threads and less than 6 GB of memory on my MacBook, and the results looked reasonable.
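As a sketch of the suggested fix, the command from the original script can be wrapped with `--dbmask none` added. The wrapper name `run_sintax` and its positional arguments are assumptions made for illustration; the vsearch options themselves are the ones used in this thread.

```bash
# Hypothetical wrapper: the sintax command from the original script,
# with --dbmask none added so lower-case database sequences are not masked.
# Arguments: query FASTA, database FASTA, output file, thread count (default 8).
run_sintax() {
    query="$1"
    db="$2"
    out="$3"
    threads="${4:-8}"
    vsearch --sintax "$query" \
        --db "$db" \
        --dbmask none \
        --sintax_random \
        --sintax_cutoff 0.8 \
        --strand both \
        --threads "$threads" \
        --tabbedout "$out"
}
```

With the file names from the original script, the call would look like `run_sintax ASVs_Malaise_traps_DADA2.fasta SINTAX_COI_v5.1.0ref.fasta rdp_sintax_unoise3_COI.txt 40`.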
Thank you very much for your quick response, your explanation, and for running the test. It was very insightful! I'm glad this is such an easy fix, because I was planning to use vsearch with sintax for all my databases.
Hello,
I tried to run the new version of vsearch with sintax on a computing cluster, but the processing was extremely slow despite the large amount of computing resources requested (4775 MB per core) and threading (40 cores). The input ASV fasta file is 4.6MB for 10,710 ASVs, and the reference database is the complete Eukaryote COI BOLD database (1.7GB, 2216285 sequences).
vsearch ran for 13 days, but only output a 72.7 KB one-column file, which seems to indicate that only 6236 ASVs were processed. Below is the head and tail of the output file:
ASV_7
ASV_20
ASV_16
ASV_17
ASV_10
ASV_19
ASV_12
ASV_34
ASV_35
ASV_9
...
ASV_6228
ASV_6229
ASV_6230
ASV_6231
ASV_6232
ASV_6233
ASV_6234
ASV_6235
ASV_6236
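A quick way to compare the number of query sequences with the number of lines written so far is sketched below. `sintax_progress` is a hypothetical helper name, and the sketch assumes one output line per processed query.

```bash
# Hypothetical helper: compare the number of sequences in the query FASTA
# with the number of non-empty lines in the sintax output file.
sintax_progress() {
    query_fasta="$1"
    tabbed_out="$2"
    # Each FASTA record starts with a '>' header line.
    total=$(grep -c '^>' "$query_fasta")
    # Count non-empty lines in the output file.
    written=$(grep -c . "$tabbed_out")
    echo "$written of $total query sequences have an output line"
}
```

For the run described here, the call would be `sintax_progress ASVs_Malaise_traps_DADA2.fasta rdp_sintax_unoise3_COI.txt`.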
Here is the script for the .sh file used to run vsearch:
```bash
#!/bin/bash
#SBATCH --mem-per-cpu=4775M
#SBATCH --cpus-per-task=40
#SBATCH --time=48:00:00
#SBATCH --account=def-mcristes
#SBATCH --mail-user=mathilde.salamon@mcgill.ca
#SBATCH --mail-type=ALL

module load StdEnv/2020 vsearch/2.28.1

# Run VSEARCH
vsearch --sintax ASVs_Malaise_traps_DADA2.fasta \
    --sintax_random \
    --db SINTAX_COI_v5.1.0ref.fasta \
    --tabbedout rdp_sintax_unoise3_COI.txt \
    --sintax_cutoff 0.8 \
    --strand both \
    --threads 40 \
    --log sintax_COI_MalaiseTraps_log.txt
```
I am unsure why the program was so slow. Could this be due to the very large reference database?
Thank you for your help,
Best wishes,
Mathilde Salamon