Sample preparation, library construction, and sequencing
A Lithuanian human (NCBI:txid9606) female with three generations of ethnic family history was recruited for sequencing. Standard ethical procedures were applied by the Genome Research Foundation under approval IRB-REC-20101202-001. The volunteer signed an informed consent agreement, and a 20 mL blood sample was drawn using heparinized needles and collected into anticoagulant-containing tubes (dipotassium ethylenediaminetetraacetic acid; K2-EDTA).
DNA was extracted from the single donor’s peripheral blood (5 mL) using a DNeasy Blood & Tissue Kit (QIAGEN), according to the manufacturer’s protocol. The quality and concentration of the extracted DNA were evaluated using a NanoDrop™ One/OneC UV-Vis spectrophotometer (Thermo Fisher Scientific™). Library construction and short-read whole-genome sequencing were conducted by the Beijing Genomics Institute (BGI) on the BGISEQ-500 platform (RRID:SCR_017979) using DNBseq™ 100-base-pair (bp) paired-end sequencing.
Sequencing libraries for long reads were prepared using the 1D ligation sequencing kit (SQK-LSK109; Oxford Nanopore Technologies, UK) following the manufacturer’s instructions. The products were quantified using a Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA), and the raw signal data were generated on the PromethION R9.4.5 platform (Oxford Nanopore Technologies, UK). Base-calling from the raw signal data was carried out using the default ONT basecaller in MinKNOW v19.05.1 with the Flip-Flop HAC (High Accuracy) model (Oxford Nanopore Technologies, UK).
Hi-C sequencing data generation
Hi-C chromosome conformation capture data were generated using the Arima-HiC kit (A160105 v01, San Diego, CA, USA), which uses two restriction enzymes. To prepare LT1 samples for Hi-C analysis, white blood cells from the donated blood were harvested and cross-linked, as instructed by the manufacturer. One million cross-linked cells were used as input to the Hi-C protocol. Briefly, chromatin from cross-linked cells or nuclei was solubilized and then digested using the two restriction enzymes, which cut chromatin at ^GATC and G^ANTC. Digested ends were labeled using a biotinylated nucleotide and ligated. Ligation products were purified, fragmented, and size-selected using AMPure XP beads. Biotinylated fragments were then enriched using enrichment beads, and Illumina-compatible sequencing libraries were constructed by end repair, dA-tailing, and adaptor ligation using a modified workflow of the Hyper Prep kit (KAPA Biosystems, Inc.). The bead-bound library was then amplified, and amplicons were purified using AMPure XP beads and subjected to deep sequencing. Sequencing of the prepared Hi-C libraries was performed on the Illumina NovaSeq platform with a read length of 150 bp by Novogene (Beijing, China).
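The two Arima-HiC cut-site motifs (^GATC and G^ANTC, where N is any base and ^ marks the cut position) can be located in a sequence with a short sketch; the function name and the toy sequence below are our own, for illustration only:

```python
import re

# Arima-HiC two-enzyme chemistry digests chromatin at two IUPAC motifs:
#   ^GATC  - cut immediately before the GATC
#   G^ANTC - cut after the leading G; N matches any base
# Lookaheads are used so overlapping occurrences are all reported.
PATTERNS = {
    "^GATC": (re.compile(r"(?=GATC)"), 0),      # cut offset within the motif
    "G^ANTC": (re.compile(r"(?=GA[ACGT]TC)"), 1),
}

def cut_sites(sequence):
    """Return sorted (0-based cut position, motif name) pairs."""
    seq = sequence.upper()
    sites = []
    for name, (pattern, offset) in PATTERNS.items():
        for match in pattern.finditer(seq):
            sites.append((match.start() + offset, name))
    return sorted(sites)

# Toy sequence (illustrative only):
print(cut_sites("TTGATCAAGACTCGGGATC"))
```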
De novo assembly of the LT1 genome
To generate the de novo LT1 genome assembly, we prepared a bioinformatic pipeline including: a preprocessing step, contig assembly, map assembly, gene prediction, and post-analysis. The processes used in the pipeline are summarized in Figure 1.
Figure 1.
A bioinformatic pipeline for generating the LT1 genome assembly.
A total of 142.09 gigabase pairs (Gbp) of short paired-end genomic raw reads were produced by the BGISEQ-500 sequencer, resulting in a 47× sequencing depth of coverage (Table 1). Adapter sequences were trimmed from raw reads using Trimmomatic v0.36 (RRID:SCR_011848) [24] with the parameters ‘ILLUMINACLIP:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 HEADCROP:15 MINLEN:60’. Sequences from vectors and microbial contaminants were removed using the BBTools suite v38.96 (RRID:SCR_016968) [25] and a database merged from the prokaryotic and viral RefSeq databases. For error correction, we used the tadpole.sh program from BBTools. After preprocessing, 106.29 Gbp of cleaned reads were obtained.
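Depth of coverage is simply total sequenced bases divided by the haploid genome size; a minimal check of the figures above, assuming a ~3.0 Gbp human genome for illustration:

```python
# Sequencing depth = total bases sequenced / haploid genome size.
# The 3.0 Gbp genome size is an assumption for illustration.
GENOME_SIZE = 3.0e9

raw_bases = 142_090_614_600   # short-read total from Table 1
clean_bases = 106.29e9        # after trimming and decontamination

raw_depth = raw_bases / GENOME_SIZE
clean_depth = clean_bases / GENOME_SIZE
print(f"raw: {raw_depth:.0f}x, cleaned: {clean_depth:.0f}x")
```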
Table 1
Statistics of long- and short-read whole-genome sequencing for LT1.
| Library type | Sequencing technology | Library name | Number of reads (n) | Total length of reads (bp) | NG50 (bp) |
|---|---|---|---|---|---|
| Short reads | BGISEQ-500 | LT1_PE500 | 1,420,906,146 | 142,090,614,600 | 100 |
| Long reads | ONT PromethION | LT1_PT | 24,339,507 | 171,726,287,587 | 13,283 |
| Hi-C | Illumina NovaSeq | LT1_HiC | 935,185,202 | 140,277,780,300 | 150 |
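The NG50 reported in Table 1 is the length L such that sequences of length ≥ L together cover at least half of the (assumed) genome size. A minimal sketch of the computation; the function name and toy numbers are our own:

```python
def ng50(lengths, genome_size):
    """NG50: length of the sequence at which sorted sequences,
    longest first, first reach half the given genome size."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= genome_size / 2:
            return length
    return 0  # sequences do not reach half the genome size

# Toy example: sequences summing past 50 for an assumed genome of 100.
print(ng50([30, 25, 20, 10, 5, 5, 3, 1, 1], 100))
```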
In total, 172.22 Gbp of raw long reads, giving 57× coverage, were produced from PromethION sequencing (Table 1). Ultra-long reads constituted 0.0075% of the long reads. Base-called raw reads with low quality were filtered by the default function of MinKNOW. Adapter sequences were trimmed using Porechop v0.2.4 (RRID:SCR_016967) [26]. After preprocessing, 171.73 Gbp of cleaned reads (57× coverage) were obtained.
The first de novo assembly was performed using wtdbg2 v2.5 (RRID:SCR_017225) [27] with the cleaned (filtered and trimmed) long reads. Parameters for the assembly were set as ‘-x ont -g 3g -L 5000’. For error correction of the assembled contigs, we used a two-step strategy. First, base errors on contigs were corrected using four iterations of Racon v1.4.3 (RRID:SCR_017642) [28] with the parameters ‘-m 8 -x -6 -g -8 -w 500’. Second, error correction was performed using Medaka v0.11.5 [29] with a pre-trained model for Flip-Flop. To improve the SNP and indel accuracy of the assembly, we polished the consensus with short reads using two rounds of Pilon v1.23 (RRID:SCR_014731) [30].
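Racon, Medaka, and Pilon all derive a corrected consensus from reads aligned to the draft assembly. The core idea can be caricatured as a per-position majority vote; this is a toy model with a hypothetical pre-computed pileup, not how the actual tools are implemented:

```python
from collections import Counter

def polish_column(column):
    """Majority base among the read observations at one draft position."""
    return Counter(column).most_common(1)[0][0]

def polish(draft, pileup):
    """pileup[i] lists the read bases observed at draft position i.
    Positions with no read support keep the draft base (toy model)."""
    return "".join(polish_column(col) if col else base
                   for base, col in zip(draft, pileup))

draft = "ACGT"
pileup = [["A", "A"], ["C", "T", "C"], [], ["A", "A", "A"]]
print(polish(draft, pileup))  # majority vote corrects the last position
```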
The second de novo assembly was performed using the Shasta v0.4.0 [31] assembler with the default parameters. For error correction of assembled contigs, MarginPolish v1.3 [32] and HELEN v0.0.1 [33] were used with default options. Using the two assemblers allowed us to compare the outcomes of the two methods and choose the better-performing one for downstream analyses.
To generate a chromosome-level assembly for the LT1 genome, scaffolding with 72 Gbp of Hi-C reads was performed using Juicer (RRID:SCR_017226) [34] and the 3D-DNA pipeline (RRID:SCR_017227) [35]. To correct misassemblies in the scaffolds, JBAT v1.11.08 (RRID:SCR_021172) [36] was used for manual curation. The map assembly was assessed using Nucmer v4.00beta2 (RRID:SCR_018171) [37] and Dot [38] against the human reference genome, GRCh38. All mitochondrial DNA contigs were scaffolded into a complete mitochondrial (mtDNA) assembly using the same workflow as for the nuclear DNA.
Owing to the absence of LT1 parental genome data, a read-based phasing of the assembly was performed using Medaka and WhatsHap v1.0 [39] and shared on the LT1 webpage [40]. Since the variant-calling module of Medaka includes variant calling and phasing of sequenced ONT reads via WhatsHap, the filtered and trimmed PromethION reads were mapped against the assembled scaffolds, which were then phased using Medaka. As a result of read-based phasing, 2,299,025 of 3,901,968 variants were phased, and the number of phased blocks was 8,879 (see Table S1 in GigaDB [41]). Phased genome sequences were extracted using Bcftools v1.9 (RRID:SCR_005227) [42] with the command line “bcftools consensus -H 1 -f reference.fasta phased.vcf.gz > haplotype1.fasta”.
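From the phasing counts above, the phased fraction and the mean number of phased variants per block follow directly:

```python
# Phasing statistics reported for LT1 (read-based phasing with
# Medaka/WhatsHap); only simple arithmetic on the published counts.
phased_variants = 2_299_025
total_variants = 3_901_968
phased_blocks = 8_879

phased_fraction = phased_variants / total_variants
mean_per_block = phased_variants / phased_blocks
print(f"{phased_fraction:.1%} of variants phased; "
      f"~{mean_per_block:.0f} phased variants per block")
```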
Merqury v1.0 [43] was used to assess the LT1 assembly using k-mers. QUAST v5.0.2 [44] was used against GRCh38 and CHM13 v1.1 [45] to assess misassemblies on the LT1 assembly.
Genome annotation
For annotation, protein-coding genes from GRCh38 were prepared. BRAKER v2.1.4 (RRID:SCR_018964) [48], with GeneMark-ES v4.38 (RRID:SCR_011930) [49] and Augustus v3.3.3 (RRID:SCR_008417) [50], was used to predict genes in LT1. Predicted genes were assessed using BUSCO v4.1.0 (RRID:SCR_015008) [51] with the mammalian orthologous gene set v10. Functional annotation of predicted genes was performed using NCBI BLAST+ v2.9.0 (RRID:SCR_004870) against the National Center for Biotechnology Information (NCBI) non-redundant protein database [52] and the Swiss-Prot database (RRID:SCR_002380) [53].
Short indel and SNV calling
Variant calling was performed on preprocessed short reads using GATK v4.1.7 HaplotypeCaller (RRID:SCR_001876) [56] with the settings ‘--output-mode EMIT_VARIANTS_ONLY’ and ‘-stand-call-conf 30’. Preprocessed short reads were aligned to the GRCh38 reference genome using the BWA aligner v0.7.15 (RRID:SCR_010910) [57], and sorting was carried out with SAMtools v0.1.19 (RRID:SCR_002105) [58]. Duplicate marking and quality-metric assessment were conducted using Picard v1.3.2 (RRID:SCR_006525) [59]. Base quality scores from the alignment files were recalibrated using the BaseRecalibrator and ApplyBQSR tools from GATK [60]. For SNV and indel recalibration, the dbSNP v146 (RRID:SCR_002338) and Mills_and_1000G_gold_standard.indels sets were used, respectively.
Structural variation analysis
SVs were identified from preprocessed nanopore reads using the Sniffles-based meta-pipeline NextSV2 [22] and the independent caller SVIM [21]. The NextSV2 method employs Minimap2 (RRID:SCR_018550) [61], Sniffles v1.0.11 (RRID:SCR_017619) [62], and GRCh38 as the reference; we ran it with all default settings. Default settings were also used for SVIM. Structural variants with genotype 0/0, or supported by fewer than 10 reads, were filtered out and not presented in the final results. For SV merging and to estimate shared variants (as well as the union of the variants), we employed SURVIVOR v1.0.7 [63] with a window size of 1,000 bp and a minimum SV length of 30 bp. The SURVIVOR merging output was further used for multiple-database annotation with AnnotSV v2.3 [64]. Copy number variation (CNV) was estimated using CNVnator v0.3.3 (RRID:SCR_010821) [65] with default parameters and the output filtering setting q0 < 0.5, where q0 is the fraction of reads mapped with zero mapping quality. The complete size distribution of the consensus deletions, insertions, and inversions was plotted in R using data.table, ggplot2 (RRID:SCR_014601) [66], and SiMRiv [67].
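The merging step, collapsing calls of the same SV type whose breakpoints fall within the 1,000-bp window after dropping calls shorter than 30 bp, can be sketched as follows. This is a simplified single-sample model of SURVIVOR-style windowed merging, not SURVIVOR itself; the field names are our own:

```python
def merge_svs(calls, window=1000, min_len=30):
    """Merge SV calls of the same type and chromosome whose start
    positions lie within `window` bp, after dropping calls shorter
    than `min_len` bp (parameters mirror the SURVIVOR settings above)."""
    kept = [c for c in calls if c["len"] >= min_len]
    kept.sort(key=lambda c: (c["type"], c["chrom"], c["start"]))
    merged = []
    for call in kept:
        last = merged[-1] if merged else None
        if (last is not None and last["type"] == call["type"]
                and last["chrom"] == call["chrom"]
                and call["start"] - last["start"] <= window):
            last["support"] += 1          # collapse into the previous call
        else:
            merged.append(dict(call, support=1))
    return merged

# Toy call set (illustrative only):
calls = [
    {"chrom": "1", "start": 10_000, "len": 500, "type": "DEL"},
    {"chrom": "1", "start": 10_400, "len": 480, "type": "DEL"},  # within 1 kb
    {"chrom": "1", "start": 50_000, "len": 20,  "type": "INS"},  # below min_len
    {"chrom": "2", "start": 10_000, "len": 300, "type": "DEL"},
]
print(merge_svs(calls))
```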