bioRxiv preprint doi: https://doi.org/10.1101/2023.12.12.571260; this version posted December 13, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Near chromosome-level and highly repetitive genome assembly of the snake pipefish Entelurus aequoreus (Syngnathiformes: Syngnathidae)

Magnus Wolf1,2,3, Bruno Lopes da Silva Ferrette1, Raphael T. F. Coimbra1,2, Menno de Jong1, Marcel Nebenfuehr1,2, David Prochotta1,2, Yannis Schöneberg1,2, Konstantin Zapf1,2, Jessica Rosenbaum2, Hannah A. Mc Intyre2, Julia Maier2, Clara C.S. de Souza2, Lucas M. Gehlhaar2, Melina J. Werner2, Henrik Oechler2, Marie Wittekind2, Moritz Sonnewald4, Maria A. Nilsson1,5, Axel Janke1,2,5, Sven Winter1,2,6

1 Senckenberg Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
2 Institute for Ecology, Evolution, and Diversity, Goethe University, Frankfurt am Main, Germany
3 Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
4 Senckenberg Research Institute, Department of Marine Zoology, Section Ichthyology, Frankfurt am Main, Germany
5 LOEWE-Centre for Translational Biodiversity Genomics (TBG), Frankfurt am Main, Germany
6 Research Institute of Wildlife Ecology, University of Veterinary Medicine, Vienna, Austria

Authors for correspondence:
Sven Winter | Email: Sven.Winter@senckenberg.de
Magnus Wolf | Email: Magnus.Wolf@senckenberg.de

Name | e-mail | ORCID
Magnus Wolf (MW) | magnus.wolf@senckenberg.de | 0000-0001-9212-9861
Bruno L. S. Ferrette (BF) | bruno.ferrette@senckenberg.de | 0000-0002-3108-9867
Raphael T. F. Coimbra (RC) | raphael.t.f.coimbra@gmail.com | 0000-0002-6075-7203
Menno de Jong (MDJ) | menno.de-jong@senckenberg.de | 0000-0003-2131-9048
Marcel Nebenfuehr (MN) | marcel.nebenfuehr@senckenberg.de | 0000-0001-8802-2105
David Prochotta (DP) | david.prochotta@senckenberg.de | 0009-0000-6275-7752
Yannis Schöneberg (YS) | yannis.schoeneberg@gmx.de | 0000-0003-1113-973X
Konstantin Zapf (KZ) | konstantin.zapf@senckenberg.de |
Jessica Rosenbaum (JR) | s5932510@stud.uni-frankfurt.de | 0009-0008-2306-9015
Hannah A. Mc Intyre (HMI) | s3630184@stud.uni-frankfurt.de | 0009-0002-3275-5048
Julia Maier (JM) | juliaomaier@googlemail.com |
Clara C.S. de Souza (CDS) | s6795932@stud.uni-frankfurt.de |
Lucas M. Gehlhaar (LG) | s6244190@stud.uni-frankfurt.de |
Melina J. Werner (MJW) | s5033619@stud.uni-frankfurt.de |
0009-0000-9560-8905 0009-0007-1081-009X
Henrik Oechler (HO) | henrik.oechler@uni-bayreuth.de | 0009-0001-0413-8731
Marie Wittekind (MWI) | s0097351@stud.uni-frankfurt.de | 0009-0001-2443-7552
Moritz Sonnewald (MS) | moritz.sonnewald@senckenberg.de | 0000-0003-3042-8107
Maria A. Nilsson (MAN) | maria.nilsson-janke@senckenberg.de | 0000-0002-8136-7263
Axel Janke (AJ) | axel.janke@senckenberg.de | 0000-0002-9394-1904
Sven Winter (SW) | sven.winter@vetmeduni.ac.at | 0000-0002-1890-0977

Abstract

The snake pipefish, Entelurus aequoreus (Linnaeus, 1758), is a slender, up to 60 cm long, northern Atlantic fish that dwells in open seagrass habitats and has recently expanded its distribution range.
The snake pipefish is part of the family Syngnathidae (seahorses and pipefish), which has undergone several characteristic morphological changes, such as the loss of pelvic fins and an elongated snout. Here, we present a highly contiguous, near chromosome-scale genome of the snake pipefish assembled as part of a university master’s course. The final assembly has a length of 1.6 Gbp in 7,391 scaffolds, a scaffold and contig N50 of 62.3 Mbp and 45.0 Mbp, and a scaffold and contig L50 of 12 and 14, respectively. The largest 28 scaffolds (>21 Mbp) span 89.7% of the assembly length. A BUSCO completeness score of 94.1% and a mapping rate above 98% suggest a high assembly completeness. Repetitive elements cover 74.93% of the genome, one of the highest proportions so far identified in vertebrate genomes. Demographic modeling using the PSMC framework indicates a peak in effective population size (50–100 kya) during the last interglacial period and suggests that the species might largely benefit from warmer water conditions, as seen today. Our updated snake pipefish assembly forms an important foundation for further analysis of the morphological and molecular changes unique to the family Syngnathidae.

Keywords

long reads, proximity-ligation scaffolding, genome annotation, demographic history, repetitive elements

Introduction

The snake pipefish Entelurus aequoreus (Linnaeus 1758) is a member of the family Syngnathidae, which currently includes over 300 species of seahorses and pipefishes [1]. The species shares typical features with other pipefishes, such as a unique, elongated body plan and fused jaws [2].
However, unlike most pipefish, which are found in benthic habitats, the snake pipefish inhabits more open and deeper seagrass environments and occurs even in pelagic waters [2]. Snake pipefish are ambush predators of small crustaceans and other invertebrates, thereby indirectly contributing to the overall biodiversity and stability of these fragile habitats [3]. Adult snake pipefish are poor swimmers with small fins and rely on their elongated, thin bodies for crypsis in eelgrass habitats [4–6].

The snake pipefish historically ranged from the waters of the Azores northwards to the waters of Norway and Iceland, and eastward to the Baltic [7]. However, since 2003, the species has expanded its distribution [8] into the Arctic waters of Spitsbergen [9], the Barents Sea, and the Greenland Sea [10]. Simultaneously, population sizes seem to have increased within its former range, as indicated by substantially increased catch rates [11, 12]. Several factors have been proposed to cause this expansion and population growth, including rising sea temperatures, an increased potential for long-distance dispersal of juveniles via ocean currents [4, 7], and an increased reproductive success facilitated by the dispersal of invasive seaweeds [6, 8–10, 13]. The latter explanation has been confirmed in local field experiments in the northern Wadden Sea, suggesting a mutual co-occurrence of the invasive Japanese seaweed (Sargassum muticum) and the snake pipefish [5]. Studies based on mtDNA marker regions have not discerned any population structure thus far and suggest a previous population expansion in the Pleistocene ca. 50–100 kya [6]. Yet, demographic events are better studied with genomic data, which requires a high-quality reference genome of ideally the same species or at least a closely related one.
Previously, genomes of Syngnathidae have been used to study the evolution of highly specialized morphologies and life-history traits unique to pipefishes and seahorses [14–16]. The transition to male pregnancy was associated with major genomic restructuring events and parallel modifications of the adaptive immune system. There is a remarkable variability in genome sizes within the family, with estimates ranging from 350 Mbp to 1.8 Gbp [14]. The major shifts in body shape are assumed to be related to gene-family loss and expansion events and higher rates of protein and nucleotide evolution [16]. Genomic data using a direct sequencing approach of ultra-conserved elements (UCEs) improved the understanding of the phylogeny of pipefishes [15] and identified a likely radiation of the group in the waters of the modern Indo-Pacific Ocean. Nevertheless, high-quality genomes of Syngnathidae are only available for a few species, and according to the NCBI genome database, only 7% of the known species diversity has genome sequences available.

A draft genome of the snake pipefish was previously assembled using a combination of paired-end and mate-pair sequencing techniques, yielding an assembly with low continuity (N50 3.5 kbp, BUSCO C: 21%) and a large difference between the estimated and assembled genome sizes (1.8 Gbp vs. 557 Mbp) [14]. To obtain a higher quality, near chromosome-scale genome assembly for the snake pipefish for future population genomic, conservation, and evolutionary studies of fish, we used long-read sequencing technologies.
This allows us to gain insight into the genetic properties of the species and to perform demographic analysis based on the PSMC framework [17]. The data generation and analyses presented here were conducted during a six-week master's course in 2021 at the Goethe University, Frankfurt am Main, Germany. The concept of high-quality genome sequencing in a course setting has so far yielded three reference-quality genomes of fish and has proven to be a successful approach to introduce the technology to a new generation of scientists [18–21].

Results and Discussion

Genome sequencing and assembly

PacBio’s continuous long reads (CLR) technology generated 401 Gbp of long-read data in ~60 million reads with an N50 of 7.9 kbp (Table 1). Illumina sequencing yielded 38 Gbp of standard short-read data in ~257 million reads with a mean length of 148 bp after filtering. Sequencing of the Omni-C library generated 54.7 Gbp of raw short-read data.

The snake pipefish’s genome was assembled de novo to a total size of 1.7 Gbp. It consisted of 2,204 scaffolds, with a scaffold N50 of 62 Mbp and an L50 of 11 (Table 1, Fig. 1A). The finalized assembly has 1.0 N’s per 100 kbp and a GC content of 38.84%. A BUSCO completeness assessment resulted in 94.1% complete core genes, given the actinopterygii_odb10 set, indicating high completeness of the assembly. Both long- and short-read data mapped onto the assembly with high mapping rates of 98.6% and 99.5%, respectively. Hi-C mapping resulted in 28 larger scaffolds (Fig. 1B), indicating the near-chromosome level of the de novo assembly, as past karyotype estimations of other pipefish and seahorses predicted 22 and 22–24 chromosomes, respectively [22–24]. The rest of the genome comprises only smaller scaffolds and contigs, which may result from the high amounts of repetitive regions described in the following section. Our Blobtools analysis of both long- and short-read data (Fig. 1C+D) found no apparent signs of contamination, although background noise of unknown origin was detected and removed in both datasets.

Variant calling resulted in ~301 million sites (including monomorphic sites), of which ~1.3 million were found to be biallelic. Genome-wide heterozygosity was determined to be 0.387%, which is in line with other fish species [25, 26]. The GenomeScope results based on short reads suggested a haploid genome size of 1.15 Gbp and an expected genome-wide heterozygosity of 1%, around 362 Mbp shorter and 0.57% more heterozygous when compared to the final assembly. This, again, might be explained by the high repeat content in the genome.

Annotation

In total, 0.9 Gbp, or 74.93%, of the entire assembly was identified as repetitive during our de novo repeat-modeling and repeat-masking (Fig. 2). This high repeat content contrasts with that of other fish genomes [27], but is similar, although at a smaller scale, to the closest relative Nerophis ophidion (65.7%) [14] and other genomes of syngnathid fish such as seadragons [28].
The first draft assembly of the snake pipefish had a repeat content of 57.2% [14], and our improved long-read assembly identified 17.7% additional repeats that were missing from the previous assembly [14]. So far, among vertebrates, only the lungfish Neoceratodus forsteri [29] has more transposable elements (TEs) than the snake pipefish.

The annotation of the genome featuring de novo and homology-based identification approaches resulted in 33,202 genes with an average length of 13,828 bp. Each gene had on average 7.32 exons and 6.25 introns with average lengths of 188 bp and 2,240 bp, respectively. In total, we identified 243,038 exons and 207,467 introns within our annotation. The total number of genes is ~30% larger compared to other annotated genomes in the order Syngnathiformes, e.g., 23,458 for the tiger tail seahorse (Hippocampus comes) [16] or 24,927 for the greater pipefish (Syngnathus acus) [30], both annotated with the NCBI Eukaryotic Genome Annotation pipeline. Given that these two genomes are also considerably smaller, 492 Mbp and 324 Mbp, respectively, it can be assumed that the large-scale genome increase in this species also included many coding sequences. A high content of repetitive regions as well as a lack of transcriptomic data might also have increased the number of false positive gene calls; however, a BUSCO completeness analysis of the predicted proteins resulted in 82.6% complete sequences, of which only 6.8% were duplicated. 5.3% of the coding sequences appeared fragmented, and 12.1% were missing from the actinopterygii_odb10 OrthoDB set. A functional annotation resulted in hits for 89% of the predicted proteins.

Demographic inference

The demographic inference analysis of the snake pipefish genome using the PSMC framework [17] traced population changes over the past 1 million years. Given the chosen substitution rate and generation time, there was a steady increase in the effective population size (Ne), starting at 15 thousand individuals 1 Mya, which peaked at an Ne of 250 thousand individuals at 100 kya. Thereafter, Ne decreased until reaching 30 thousand individuals at 10 kya and stagnated until the end of the model. The previously suggested population expansion during the Pleistocene (50–100 kya) was therefore confirmed with this model but was followed by another population decline that was not resolved by Braga Goncalves et al. [6]. This result may point to a different conclusion than the one drawn by those authors, because the snake pipefish might have resided in a comparably small population during the Holocene and only recently expanded its distribution, resulting in a large population with a high degree of homogenization as observed by Braga Goncalves and colleagues [6]. Given that the presented peak in population size coincides with the last interglacial period between the Penultimate Glacial Period (135 – 192 kya [31]) and the last glacial period (present – 20 kya [32]), we assume that the snake pipefish largely benefitted from the warmer water conditions during the interglacial period, as seen in the present range expansion.
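The dependence of these estimates on the chosen substitution rate and generation time can be sketched in a few lines. PSMC reports time in units of 2N0 generations and population size in units of N0, so both axes have to be rescaled before they can be read in years and individuals. The following Python snippet is a minimal sketch of that conversion and not the script used in this study; the function name scale_psmc, the example input values, and the bin size of 100 bp (the default of the PSMC utilities) are assumptions made for illustration, while the mutation rate and generation length are the values stated in the Methods.

```python
from typing import List, Sequence, Tuple


def scale_psmc(
    theta0: float,
    scaled_times: Sequence[float],
    scaled_sizes: Sequence[float],
    mutation_rate: float = 1.7e-9,   # per site per generation (value from the Methods)
    generation_years: float = 2.5,   # generation length in years (value from the Methods)
    bin_size: int = 100,             # assumed: default bin size of the PSMC utilities
) -> Tuple[float, List[float], List[float]]:
    """Convert PSMC output to years and effective population sizes.

    theta0 is the per-bin scaled mutation rate estimated by PSMC,
    scaled_times are in units of 2*N0 generations,
    scaled_sizes are in units of N0.
    """
    # theta0 = 4 * N0 * mutation_rate * bin_size, solved for N0.
    n0 = theta0 / (4.0 * mutation_rate * bin_size)
    # Time in years = scaled time * 2 * N0 generations * years per generation.
    years = [2.0 * n0 * t * generation_years for t in scaled_times]
    # Effective population size = relative size * N0.
    sizes = [n0 * lam for lam in scaled_sizes]
    return n0, years, sizes


if __name__ == "__main__":
    # Hypothetical PSMC output values, chosen only to illustrate the arithmetic.
    n0, years, sizes = scale_psmc(0.0068, [0.5, 1.0], [3.0, 1.5])
    for y, ne in zip(years, sizes):
        print(f"{y:,.0f} years ago: Ne = {ne:,.0f}")
```

Because N0 is inversely proportional to the mutation rate, a different rate rescales both the time axis and the Ne axis without changing the shape of the curve, which is why the dating of the peak is conditional on the chosen parameters.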
Material & Methods

Sampling, DNA extraction, and sequencing

A single individual of Entelurus aequoreus (Linnaeus 1758) was caught by trawling during an annual monitoring expedition to the Dogger Bank in the North Sea in July 2021 (trawl start coordinates 54.993633, 2.940833; end coordinates 55.0077, 2.929867) with the permission of the Maritime Policy Unit of the UK Foreign and Commonwealth Office. The study was conducted in compliance with the ‘Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from Their Utilization’. The sample was initially frozen at -20 °C and later stored at -80 °C.

High molecular weight genomic DNA was extracted from muscle tissue, following the protocol by Mayjonade et al. [33] with the addition of Proteinase K. We evaluated the quantity and quality of the DNA with the Genomic DNA ScreenTape on the Agilent 2200 TapeStation system (Agilent Technologies), as well as with the Qubit® dsDNA BR Assay Kit.

For long-read sequencing, a PacBio SMRTbell continuous long read (CLR) library was prepared using the SMRTbell Express Prep kit v3.0 (Pacific Biosciences, Menlo Park, CA, USA) and sequenced on the PacBio Sequel IIe platform. A proximity-ligation library was prepared from muscle tissue following the Dovetail™ Omni-C protocol (Dovetail Genomics, Santa Cruz, California, USA). In addition, a standard whole-genome 150 base pair (bp) paired-end Illumina library was prepared using the NEBNext Ultra II library preparation kit (New England Biolabs Inc., Ipswich, USA).
Finally, the proximity-ligation and the paired-end library were shipped to Novogene (UK) for sequencing on the Illumina NovaSeq 6000 platform.

Pre-processing & Genome size estimation

The PacBio subreads were converted from BAM into FASTQ format using the PacBio Secondary Analysis Tool BAM2fastx v.1.3.0 (https://github.com/PacificBiosciences/pbbioconda). Quality control, trimming, and filtering of the Illumina reads were performed using fastp v0.23.1 [34] with the settings “-g -3 -l 40 -y -Y 30 -q 15 -u 40 -c -p -j -h -R -w N”. To estimate the genome size of the snake pipefish, we performed k-mer profiling using the standard short-read Illumina data. We first ran Jellyfish v2.3.0 [35] to generate a histogram of k-mers with a length of 21 bp. Subsequently, we used this data to obtain a genome profile using GenomeScope v2.0 [36]. We further tested alternative k-mer lengths between 17 and 25, which resulted in no meaningful differences in the estimated genome size except for the 17-mers, which resulted in a smaller genome size estimate of ~500 Mbp.

Genome Assembly & polishing

We assembled the genome from the PacBio long-read data using WTDBG v.2.5 [37]. The resulting assembly was first polished using the PacBio data with Flye v.2.9 [38], using Minimap2 v.2.17 [39] for mapping, followed by two rounds of short-read polishing by mapping reads onto the assembly with BWA-MEM v.0.7.17 [40] and error correction with Pilon v1.23 [41].
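The k-mer profiling described under genome size estimation can be illustrated with a simple calculation. The following Python sketch is not the mixture model fitted by GenomeScope; it only shows the underlying idea that the total number of k-mers divided by the modal k-mer depth gives a rough haploid genome size. The function names, the error cutoff of 5, and the toy histogram are hypothetical, and the two-column input (multiplicity, count) is assumed to follow the layout written by "jellyfish histo".

```python
from typing import Iterable, List, Tuple


def parse_histogram(text: str) -> List[Tuple[int, int]]:
    """Parse lines of 'multiplicity count' into pairs of integers."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def estimate_genome_size(
    histogram: Iterable[Tuple[int, int]],
    min_multiplicity: int = 5,  # assumed cutoff; depends on sequencing depth
) -> float:
    """Naive genome size estimate: total k-mers / modal k-mer depth."""
    # Low-multiplicity k-mers are dominated by sequencing errors and are dropped.
    kept = [(m, c) for m, c in histogram if m >= min_multiplicity]
    if not kept:
        raise ValueError("no k-mers at or above the multiplicity cutoff")
    # The multiplicity with the most distinct k-mers approximates the k-mer depth.
    # Caveat: in a very heterozygous genome the tallest peak can be the
    # half-depth (heterozygous) peak, which would inflate the estimate.
    peak_multiplicity = max(kept, key=lambda pair: pair[1])[0]
    total_kmers = sum(m * c for m, c in kept)
    return total_kmers / peak_multiplicity


if __name__ == "__main__":
    # Toy histogram with an error tail (multiplicities 1-2) and a peak at 20.
    toy = parse_histogram("1 1000\n2 200\n19 50\n20 100\n21 50\n")
    print(estimate_genome_size(toy))
```

If the histogram is capped at a maximum multiplicity, k-mers from high-copy repeats are undercounted in the numerator, which is one way a k-mer-based estimate can fall below the assembled size of a repeat-rich genome.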
Assembly QC & Scaffolding

The polished assembly contigs were anchored into chromosome-scale scaffolds utilizing the generated proximity-ligation Omni-C data. First, the data were mapped to the assembly and filtered following the Arima Hi-C mapping pipeline used by the Vertebrate Genome Project (https://github.com/VGP/vgpassembly/blob/master/pipeline/salsa/arima_mapping_pipeline.sh). In brief, reads were mapped using BWA-MEM v.0.7.17 [40], mapped reads were filtered with samtools v.1.14 [42], and duplicated reads were removed with “MarkDuplicates” in Picard v.2.26.10 (Broad Institute, 2019). The filtered mapped reads were then used for proximity-ligation scaffolding in YaHS v.1.1 [43]. Gaps in the scaffolded assembly were closed with TGS-GapCloser v.1.1.1 [44] using a subset (25%) of the PacBio subreads due to computational constraints. To further improve the assembly's contiguity, scaffolding and gap-closing were performed a second time using a different subset of PacBio reads for gap-closing. The PacBio read subsets were generated with seqtk v.1.3 (https://github.com/lh3/seqtk) using the random number generator seeds 11 and 18. Gene set completeness was analyzed with BUSCO v.5.4.7 [45] using the Actinopterygii set of core genes (actinopterygii_odb10). Assembly continuity was evaluated using QUAST v5.0.2 [46], and mapping rates were assessed by Qualimap v2.2.1 [47]. BlobToolKit v.4.0.6 [48] was used to perform contamination screening.

Repeat landscape analysis & genome annotation

The TE annotation was done in three steps.
First, we used RepeatMasker v4.1.5 [49] to annotate and hard-mask known Actinopterygii repeats from RepBase, a database of eukaryotic repetitive DNA element sequences [50]. Secondly, a de novo library of transposable elements was created from the hard-masked genome assembly using RepeatModeler v2.0.4 [51], which includes RECON v1.08 [52], RepeatScout v1.0.6 [53], and LTRharvest/LTR_retriever [54, 55]. Finally, predicted repeats were annotated with a second run of RepeatMasker on the hard-masked assembly obtained in the first run. The results from both RepeatMasker runs were then combined. A summary of transposable elements and the relative abundance of repeat classes in the genome are shown in Table 2 and Fig. 2.

The genome was annotated using the BRAKER3 pipeline [56–61], combining de novo gene calling and homology-based gene annotation. For protein references, we combined the vertebrate-specific protein collection from OrthoDB and the protein collection of the greater pipefish (Syngnathus acus) genome [30] annotated by the NCBI (see: GCF_901709675.1, last accessed 12th Oct. 2023). To further filter genes based on the support of introns by extrinsic homology evidence, we used TSEBRA [62] with “intron_support=0.1”. The resulting set of proteins was tested for completeness using BUSCO v.5.4.7 [45] in “protein mode” and run against the Actinopterygii-specific set of core genes. Functional annotation was done using InterProScan v5 [63].

Variant calling & demographic inference
The preprocessed short reads were mapped to the final assembly using BWA-MEM v.0.7.17 [40], followed by removal of duplicate reads with "MarkDuplicates" in Picard v.2.26.10 (Broad Institute, 2019) and evaluation of mapping quality using Qualimap v2.2.1 [47]. Indels in the BAM files were first identified and then realigned with "RealignerTargetCreator" and "IndelRealigner" as part of the Genome Analysis Toolkit (GATK) v3.8-1 (https://gatk.broadinstitute.org/). Subsequently, samtools v.1.14 [42] was used to check and remove unmapped, secondary, QC failed, duplicated, and supplementary reads, keeping only reads mapped in proper pairs in non-repetitive regions of the 28 chromosome-scale scaffolds. Sambamba v 1.0.0 [64] was used to estimate site depth statistics. Minimum and maximum thresholds for the global site depth were set to d ± (5 × MAD), where d is the global site depth distribution median and MAD is the median absolute deviation. Variant calling was performed using the bcftools v1.17 [65] commands "mpileup" and "call" [-m]. Variants were then filtered with bcftools "filter" [-e "DP< d – (5 × MAD) || DP> d + (5 × MAD) || QUAL<30"], removing sites with low quality and out-of-range depth. Finally, bcftools was used to estimate the genome-wide heterozygosity as the proportion of heterozygous sites using the "stats" command.

Long-term changes in effective population size (Ne) over time were estimated with the Pairwise Sequentially Markovian Coalescent (PSMC) model [17] based on the diploid consensus genome sequences generated by bcftools v1.17 [65] with the script “vcfutils.pl” from the processed BAM files, as described above. Sites with read-depth up to a third of the average depth or above twice each sample’s median depth and with a consensus base quality < 30 were removed. PSMC was executed using 25 iterations with a maximum 2N0-scaled coalescent time of 15, an initial θ/ρ ratio of 5, and 64 atomic time intervals (4 + 25 × 2 + 4 + 6) to infer the scaled mutation rate, the recombination rate, and the free population size parameters, respectively. We performed 100 bootstrap replicates by randomly sampling with replacement of 1 Mb blocks from the consensus sequence for all individuals. A mutation rate (µ) of 1.7 × 10^-9 per site per generation [66] and a generation length of 2.5 years [67] were employed for plotting.

Availability of Supporting Data

The de novo genome and all underlying raw data were uploaded to NCBI under the BioProject PRJNA1005573, BioSample SAMN36988691, genome assembly JAVRRV000000000. All other data, including the repeat and gene annotation, was uploaded to the GigaDB repository: DOI:XXXXX. [Raw data available for review at https://dataview.ncbi.nlm.nih.gov/object/PRJNA1005573?reviewer=2i5vm98fdb4r9j0asoff8msn3]

Author Contributions

MW, BF, MS, AJ, and SW designed the study. SW, JR, HMI, JM, CDS, LG, MJW, HO, and MWI performed laboratory procedures and sequencing. MW, BF, RC, MDJ, MN, DP, YS, KZ, JR, HMI, JM, CDS, LG, MJW, HO, MWI, MAN, and SW conducted bioinformatic processing and analyses. All authors contributed to writing this manuscript.
290 List of Abbreviations 291 bp: base pairs; BUSCO: Benchmarking Universal Single-Copy Orthologs; CLR: continuous 292 long reads; Gbp: Gigabase pairs; kbp: kilobase pairs; kya: thousand years ago; Mbp: 293 megabase pairs; Mya: million years ago; Ne: effective population size; PacBio: Pacific 294 Biosciences; PSMC: Pairwise Sequentially Markovian Coalescent; TEs: transposable 295 elements; UCEs: ultra-conserved elements. 296 297 Conflict of Interest The authors declare that they have no competing interests. 12 bioRxiv preprint doi: https://doi.org/10.1101/2023.12.12.571260; this version posted December 13, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 298 299 Ethics Approval and Consent for Publication Not Applicable. 300 Acknowledgments 301 The present study is a result of the Centre for Translational Biodiversity Genomics (LOEWE- 302 TBG) and was supported through the program ‘LOEWE-Landes-Offensive zur Entwicklung 303 Wissenschaftlich-ökonomischer Exzellenz’ of Hesse’s Ministry of Higher Education, 304 Research, and the Arts. 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. Froese R, Pauly D. FishBase. 2023. www.fishbase.org. Accessed 9 Aug 2023. Dawson C. Syngnathidae. In: Smith M, Heemstra P, editors. Smiths' sea fishes. Berlin: Springer-Verlag; 1986. p. 445–458. O'Gorman EJ. Multitrophic diversity sustains ecological complexity by dampening top-down control of a shallow marine benthic food web. Ecology. 2021;102:e03274. doi:10.1002/ecy.3274. Vincent ACJ, Berglund A, Ahnesj I. Reproductive ecology of five pipefish species in one eelgrass meadow. Environ Biol Fish. 1995;44:347–61. 
doi:10.1007/BF00008250. Polte P, Buschbaum C. Native pipefish Entelurus aequoreus are promoted by the introduced seaweed Sargassum muticum in the northern Wadden Sea, North Sea. Aquat. Biol. 2008;3:11–8. doi:10.3354/ab00071. Braga Goncalves I, Cornetti L, Couperus AS, van Damme CJG, Mobley KB. Phylogeography of the snake pipefish, Entelurus aequoreus (Family: Syngnathidae) in the northeastern Atlantic Ocean. Biological Journal of the Linnean Society. 2017;122:787–800. doi:10.1093/biolinnean/blx112. Wheeler A. Key to the Fishes of Northern Europe: A guide to the identification of more than 350 species. London: Frederick Warne & Co. Ltd; 1978. Harris MP, Beare D, Toresen R, Nøttestad L, Kloppmann M, Dörner H, et al. A major increase in snake pipefish (Entelurus aequoreus) in northern European seas since 2003: potential implications for seabird breeding success. Mar Biol. 2007;151:973–83. doi:10.1007/s00227-006-0534-7. Fleischer D, Schaber M, Piepenburg D. Atlantic snake pipefish (Entelurus aequoreus) extends its northward distribution range to Svalbard (Arctic Ocean). Polar Biol. 2007;30:1359–62. doi:10.1007/s00300-007-0322-y. Rusyaev SM, Dolgov AV, Karamushko OV. Captures of snake pipefish Entelurus aequoreus in the Barents and Greenland Seas. J. Ichthyol. 2007;47:544–6. doi:10.1134/S0032945207070090. Kloppmann MHF, Ulleweit J. Off-shelf distribution of pelagic snake pipefish, Entelurus aequoreus (Linnaeus, 1758), west of the British Isles. Mar Biol. 2007;151:271–5. doi:10.1007/s00227-006-0480-4. van Damme CJ, Couperus AS. Mass occurrence of snake pipefish in the Northeast Atlantic: Result of a change in climate? Journal of Sea Research. 2008;60:117–25. doi:10.1016/j.seares.2008.02.009. Lindley J, Kirby R, Johns D, Reid C. Exceptional abundance of the snake pipefish (Entelurus aequoreus) in the north-eastern North Atlantic Ocean. ICES Document. 2006. Roth O, Solbakken MH, Tørresen OK, Bayer T, Matschiner M, Baalsrud HT, et al. 
Evolution of male pregnancy associated with remodeling of canonical vertebrate immunity in seahorses and pipefishes. Proc Natl Acad Sci U S A. 2020;117:9431–9. doi:10.1073/pnas.1916251117.
15. Stiller J, Short G, Hamilton H, Saarman N, Longo S, Wainwright P, et al. Phylogenomic analysis of Syngnathidae reveals novel relationships, origins of endemic diversity and variable diversification rates. BMC Biol. 2022;20:75. doi:10.1186/s12915-022-01271-w.
16. Lin Q, Fan S, Zhang Y, Xu M, Zhang H, Yang Y, et al. The seahorse genome and the evolution of its specialized morphology. Nature. 2016;540:395–9. doi:10.1038/nature20595.
17. Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–6. doi:10.1038/nature10231.
18. Prost S, Winter S, Raad J de, Coimbra RTF, Wolf M, Nilsson MA, et al. Education in the genomics era: Generating high-quality genome assemblies in university courses. Gigascience 2020. doi:10.1093/gigascience/giaa058.
19. Prost S, Petersen M, Grethlein M, Hahn SJ, Kuschik-Maczollek N, Olesiuk ME, et al. Improving the Chromosome-Level Genome Assembly of the Siamese Fighting Fish (Betta splendens) in a University Master's Course. G3 (Bethesda). 2020;10:2179–83. doi:10.1534/g3.120.401205.
20. Winter S, Prost S, Raad J de, Coimbra RTF, Wolf M, Nebenführ M, et al. Chromosome-level genome assembly of a benthic associated Syngnathiformes species: the common dragonet, Callionymus lyra. GigaByte. 2020;2020:gigabyte6. doi:10.46471/gigabyte.6.
21. Winter S, Raad J de, Wolf M, Coimbra RTF, Jong MJ de, Schöneberg Y, et al. A chromosome-scale reference genome assembly of the great sand eel, Hyperoplus lanceolatus. J Hered. 2023;114:189–94. doi:10.1093/jhered/esad003.
22. Vitturi R, Catalano E. Karyotypes in two species of the genus Hippocampus (Pisces: Syngnatiformes). Mar Biol. 1988;99:119–21. doi:10.1007/BF00644985.
23. Vitturi R, Libertini A, Campolmi M, Calderazzo F, Mazzola A. Conventional karyotype, nucleolar organizer regions and genome size in five Mediterranean species of Syngnathidae (Pisces, Syngnathiformes). Journal of Fish Biology. 1998;52:677–87. doi:10.1111/j.1095-8649.1998.tb00812.x.
24. Small CM, Bassham S, Catchen J, Amores A, Fuiten AM, Brown RS, et al. The genome of the Gulf pipefish enables understanding of evolutionary innovations. Genome Biol. 2016;17:258. doi:10.1186/s13059-016-1126-6.
25. Tigano A, Jacobs A, Wilder AP, Nand A, Zhan Y, Dekker J, Therkildsen NO. Chromosome-Level Assembly of the Atlantic Silverside Genome Reveals Extreme Levels of Sequence Diversity and Structural Genetic Variation. Genome Biol Evol 2021. doi:10.1093/gbe/evab098.
26. Barry P, Broquet T, Gagnaire P-A. Age-specific survivorship and fecundity shape genetic diversity in marine fishes. Evol Lett. 2022;6:46–62. doi:10.1002/evl3.265.
27. Shao F, Han M, Peng Z. Evolution and diversity of transposable elements in fish genomes. Sci Rep. 2019;9:15399. doi:10.1038/s41598-019-51888-1.
28. Small CM, Healey HM, Currey MC, Beck EA, Catchen J, Lin ASP, et al. Leafy and weedy seadragon genomes connect genic and repetitive DNA features to the extravagant biology of syngnathid fishes. Proc Natl Acad Sci U S A. 2022;119:e2119602119. doi:10.1073/pnas.2119602119.
29. Meyer A, Schloissnig S, Franchini P, Du K, Woltering JM, Irisarri I, et al. Giant lungfish genome elucidates the conquest of land by vertebrates. Nature. 2021;590:284–9. doi:10.1038/s41586-021-03198-8.
30. Scott-Somme K, McTierney S, Brittain R, Perry F, Brenen M. The genome sequence of the greater pipefish, Syngnathus acus (Linnaeus, 1758). Wellcome Open Res. 2023;8:274. doi:10.12688/wellcomeopenres.19528.1.
31. Obrochta SP, Crowley TJ, Channell JE, Hodell DA, Baker PA, Seki A, Yokoyama Y. Climate variability and ice-sheet dynamics during the last three glaciations. Earth and Planetary Science Letters. 2014;406:198–212. doi:10.1016/j.epsl.2014.09.004.
32. Armstrong E, Hopcroft PO, Valdes PJ. A simulated Northern Hemisphere terrestrial climate dataset for the past 60,000 years. Sci Data. 2019;6:265. doi:10.1038/s41597-019-0277-1.
33. Mayjonade B, Gouzy J, Donnadieu C, Pouilly N, Marande W, Callot C, et al. Extraction of high-molecular-weight genomic DNA for long-read sequencing of single molecules. Biotechniques. 2016;61:203–5. doi:10.2144/000114460.
34. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884-i890. doi:10.1093/bioinformatics/bty560.
35. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–70. doi:10.1093/bioinformatics/btr011.
36. Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33:2202–4. doi:10.1093/bioinformatics/btx153.
37. Ruan J, Li H. Fast and accurate long-read assembly with wtdbg2. Nat Methods. 2020;17:155–8. doi:10.1038/s41592-019-0669-3.
38. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37:540–6. doi:10.1038/s41587-019-0072-8.
39. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
doi:10.1093/bioinformatics/bty191.
40. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013.
41. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9:e112963. doi:10.1371/journal.pone.0112963.
42. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9. doi:10.1093/bioinformatics/btp352.
43. Zhou C, McCarthy SA, Durbin R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics 2023. doi:10.1093/bioinformatics/btac808.
44. Xu M, Guo L, Gu S, Wang O, Zhang R, Peters BA, et al. TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. Gigascience 2020. doi:10.1093/gigascience/giaa094.
45. Manni M, Berkeley MR, Seppey M, Zdobnov EM. BUSCO: Assessing Genomic Data Quality and Beyond. Curr Protoc. 2021;1:e323. doi:10.1002/cpz1.323.
46. Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics. 2018;34:i142-i150. doi:10.1093/bioinformatics/bty266.
47. Okonechnikov K, Conesa A, García-Alcalde F.
Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016;32:292–4. doi:10.1093/bioinformatics/btv566.
48. Challis R, Richards E, Rajan J, Cochrane G, Blaxter M. BlobToolKit – Interactive quality assessment of genome assemblies; 2019.
49. Smit A, Hubley R, Green P. RepeatMasker Open-4.0. 2013. http://www.repeatmasker.org.
50. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6:11. doi:10.1186/s13100-015-0041-9.
51. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A. 2020;117:9451–7. doi:10.1073/pnas.1921046117.
52. Bao Z, Eddy SR. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002;12:1269–76. doi:10.1101/gr.88502.
53. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21 Suppl 1:i351-8. doi:10.1093/bioinformatics/bti1018.
54. Ou S, Jiang N. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol. 2018;176:1410–22. doi:10.1104/pp.17.01310.
55. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18. doi:10.1186/1471-2105-9-18.
56. Bruna T, Lomsadze A, Borodovsky M. GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistency with Extrinsic Data; 2023.
57. Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform. 2021;3:lqaa108. doi:10.1093/nargab/lqaa108.
58. Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019;20:278.
doi:10.1186/s13059-019-1910-1.
59. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics. 2016;32:767–9. doi:10.1093/bioinformatics/btv661.
60. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60. doi:10.1038/nmeth.3176.
61. Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, Stanke M. BRAKER3: Fully Automated Genome Annotation Using RNA-Seq and Protein Evidence with GeneMark-ETP, AUGUSTUS and TSEBRA. bioRxiv 2023. doi:10.1101/2023.06.10.544449.
62. Gabriel L, Hoff KJ, Brůna T, Borodovsky M, Stanke M. TSEBRA: transcript selector for BRAKER. BMC Bioinformatics. 2021;22:566. doi:10.1186/s12859-021-04482-0.
63. Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–40. doi:10.1093/bioinformatics/btu031.
64. Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of NGS alignment formats. Bioinformatics. 2015;31:2032–4. doi:10.1093/bioinformatics/btv098.
65. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. Gigascience 2021. doi:10.1093/gigascience/giab008.
66. He L, Long X, Qi J, Wang Z, Huang Z, Wu S, et al. Genome and gene evolution of seahorse species revealed by the chromosome-level genome of Hippocampus abdominalis. Mol Ecol Resour. 2022;22:1465–77. doi:10.1111/1755-0998.13541.
67. Schultz J. Entelurus aequoreus: IUCN Red List of Threatened Species, e.T18258072A44775951; 2014.
Figures

Figure 1 Assembly characteristics and quality assessments of the de novo Entelurus aequoreus genome. (A) The snail plot summarizes different assembly properties. Scaffold statistics are depicted in the innermost circle, where colors from red to orange mark the longest scaffold, the N50, and the N90, respectively. GC composition is shown in the outer blue circle. BUSCO completeness statistics are depicted in the small green circle. (B) Omni-C contact density map indicating 28 larger scaffolds and the near-chromosome level of the assembly. (C-D) The BlobPlot analysis compares GC content (x-axis), assembly coverage (y-axis), and taxonomic BLAST assignments of contigs (color) for both the Omni-C short reads (C) and the PacBio long reads (D).

Figure 2 Repeat landscape of the de novo Entelurus aequoreus genome. Colors represent repetitive element types; gray areas indicate unclassified types of repetitive regions.

Figure 3 Demographic history of the snake pipefish estimated using the PSMC framework. Using a generation time of 2.5 years [67] and a substitution rate of 1.7×10⁻⁸ per site per generation [66], a model was created covering the period from 1 Mya to 10 kya.
The x-axis shows time in years before present, and the y-axis shows the effective population size (Ne) in tens of thousands of individuals. The model indicates a peak in Ne of 250 thousand individuals during the Pleistocene, around 100 thousand years ago.

Tables

Table 1 Summary statistics of the snake pipefish reference genome. The table reports (A) raw read sequencing statistics, (B) scaffold- and contig-level statistics of the de novo assembly, and (C) BUSCO completeness statistics.

(A) Raw read statistics
No. short reads                     264,111,731
Mapped short reads (%)              99.53
Mean short read coverage (x)        23
No. long reads                      130,590,372
Mapped long reads (%)               98.61
Mean long read coverage (x)         205.2

(B) Assembly statistics             Scaffold          Contig
No. scaffolds/contigs               7,387             7,473
No. scaffolds/contigs (>50 kbp)     466               526
Scaffold/contig L50                 12                14
Scaffold/contig N50 (bp)            62,341,166        45,010,074
Total length (bp)                   1,662,053,046     1,662,035,846
GC (%)                              38.87             38.87
No. of N's per 100 kb               1.03              0.0
Heterozygosity (%)                  0.387
Total interspersed repeats (bp)     1,237,929,559 (74.93 %)

(C) BUSCO completeness
Clade: Actinopterygii               C:94.1% [S:92.6%, D:1.5%], F:2.0%, M:3.9%, n:3640

BUSCO: Benchmarking Universal Single Copy Orthologs (65); C, complete; S, single copy; D, duplicated; F, fragmented; M, missing.
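For readers less familiar with the N50 and L50 values in Table 1 (and the N90 mentioned in the Figure 1 caption), the following is a minimal sketch of how such statistics are conventionally computed from a list of sequence lengths. The function name `nx_lx` and the toy lengths are hypothetical illustrations; this is not the tool used by the authors to produce the reported values.

```python
from typing import Iterable, Tuple


def nx_lx(lengths: Iterable[int], fraction: float = 0.5) -> Tuple[int, int]:
    """Return (Nx, Lx) for a collection of sequence lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least `fraction` of the
    total length; Lx is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("fraction must be between 0 and 1")


# Hypothetical toy lengths (bp); these are NOT values from the assembly.
toy_lengths = [10, 8, 6, 4, 2]
n50, l50 = nx_lx(toy_lengths, 0.5)  # half of the total length
n90, l90 = nx_lx(toy_lengths, 0.9)  # 90 % of the total length
print(f"N50={n50} L50={l50} N90={n90} L90={l90}")
```

Read this way, a scaffold L50 of 12 with an N50 of about 62 Mbp means that the 12 longest scaffolds together contain at least half of the assembled bases, and the shortest of those 12 is about 62 Mbp long.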
Table 2 Repeat content of the genome assembly. Class, class of the repetitive regions; Count, number of occurrences of the repetitive region; bpMasked, number of base pairs masked; %masked, percentage of base pairs masked. LINE, Long Interspersed Nuclear Elements (including retroposons); LTR, Long Terminal Repeat elements (including retroposons); SINE, Short Interspersed Nuclear Elements; RC, Rolling Circle.

Class             Count       bpMasked      %masked
ARTEFACT          4           84            0.00%
DNA               2765297     372407739     22.40%
LINE              850222      167337419     10.06%
LTR               177214      55439687      3.33%
PLE               1           0             0.00%
RC                32348       3385084       0.20%
SINE              435464      32709572      1.95%
Unknown           3628328     534216084     32.14%
Low complexity    127733      3095322       0.19%
Satellite         21221       7145469       0.43%
Simple repeat     1437090     61077339      3.67%
rRNA              4394        534599        0.03%
scRNA             5           504           0.00%
snRNA             695         46845         0.00%
tRNA              6029        533812        0.03%
Total             9486045     1237929559    74.93%
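The Figure 3 caption states that the PSMC model was scaled with a generation time of 2.5 years and a substitution rate of 1.7×10⁻⁸ per site per generation. As a rough sketch of how these two parameters usually enter the scaling of PSMC output, the code below converts scaled times and relative population sizes into years and Ne. The function name `scale_psmc`, the bin size of 100 bp (the PSMC default, assumed here and not confirmed for this study), and all input values in the toy example are hypothetical.

```python
from typing import List, Tuple


def scale_psmc(
    theta0: float,
    times: List[float],
    lambdas: List[float],
    mu: float = 1.7e-8,            # per site per generation (Figure 3 caption)
    generation_time: float = 2.5,  # years (Figure 3 caption)
    bin_size: int = 100,           # ASSUMPTION: PSMC default bin size in bp
) -> List[Tuple[float, float]]:
    """Convert scaled PSMC output into (years ago, Ne) pairs.

    theta0  : scaled mutation rate per bin estimated by PSMC
    times   : scaled interval boundaries t_k (units of 2*N0 generations)
    lambdas : relative population sizes lambda_k
    """
    # theta0 = 4 * N0 * mu * bin_size, so solve for the reference size N0.
    n0 = theta0 / (4.0 * mu * bin_size)
    scaled = []
    for t_k, lam_k in zip(times, lambdas):
        years = 2.0 * n0 * t_k * generation_time  # generations -> years
        ne = lam_k * n0                           # relative -> absolute size
        scaled.append((years, ne))
    return scaled


# Hypothetical inputs chosen so that N0 is roughly 10,000; NOT study results.
example = scale_psmc(theta0=0.068, times=[0.5, 1.0, 2.0], lambdas=[1.0, 2.5, 1.5])
for years, ne in example:
    print(f"{years:,.0f} years ago: Ne = {ne:,.0f}")
```

Because both axes scale with 1/mu, and the time axis additionally scales with the generation time, the absolute dates and population sizes read from such a plot depend directly on the two parameter choices cited in the caption.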