P03: Structural variant calling with linked read sequencing data
Principal investigator: Prof. Dr. Birte Kehr
PhD student: Richard Lüpken
Genomic structural variants (SVs) can cause a multitude of human phenotypes, including genetic diseases. However, SVs are notoriously difficult to call from short-read sequencing data because of their length and their frequent location within repetitive regions of the genome, both of which create ambiguities in short-read alignments. Additional long-range information is therefore needed to obtain a more complete picture of structural variation. Both linked and long reads promise to provide this missing long-range sequence information.
Linked reads are a new and cost-effective type of sequencing data that provides long-range sequence information through barcodes, which label short reads originating from the same long (~50,000 bp) DNA molecule. Although a small number of promising linked read analysis tools are available, these tools rely on a given read alignment and do not take advantage of reads that align ambiguously or not at all. We hypothesize that utilizing all the linked read data will result in an SV call set that is more comprehensive than those generated with current short read or linked read analysis tools. Building on our experience in assembling non-reference sequence, we are using the long-range sequence information to develop and implement a genome-wide local assembly approach for identifying SVs in linked read data.
During the first funding period, we developed the linked read mapper bcmap (Lüpken et al. bioRxiv 2022; https://github.com/kehrlab/bcmap), which maps linked reads at barcode level. This allows the use of reads that would remain unmappable individually. Furthermore, the unified barcode mapping can be performed orders of magnitude faster than a conventional read alignment with comparable precision. The barcode mapping information provided by bcmap directly fits the requirements of our genome-wide local assembly tool bccall, which we are finalising in 2023.
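To illustrate the general idea of mapping at barcode level, the following is a minimal sketch and not bcmap's actual algorithm. It assumes a toy k-mer index of the reference and pools the k-mer hits of all reads sharing a barcode into fixed-size reference windows; the function names, the k-mer length, the window size and the hit threshold are all hypothetical choices made for this example.

```python
from collections import defaultdict

K = 5        # toy k-mer length (assumption; real data needs longer k-mers)
WINDOW = 20  # toy reference window size in bp (assumption)


def kmers(seq, k=K):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(reference, k=K):
    """Map each k-mer to the list of its start positions in the reference."""
    index = defaultdict(list)
    for pos, kmer in enumerate(kmers(reference, k)):
        index[kmer].append(pos)
    return index


def map_barcodes(reads, index, window=WINDOW, min_hits=3):
    """Pool k-mer hits per barcode and report supported reference windows.

    reads: iterable of (barcode, sequence) pairs.
    Returns {barcode: sorted list of window start positions}.
    """
    hits = defaultdict(lambda: defaultdict(int))
    for barcode, seq in reads:
        for kmer in kmers(seq):
            for pos in index.get(kmer, ()):
                hits[barcode][pos // window * window] += 1
    return {
        barcode: sorted(w for w, n in windows.items() if n >= min_hits)
        for barcode, windows in hits.items()
    }


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGGTATCGATCGGATTCCAGTCA"
    reads = [("BC1", reference[0:10]), ("BC1", reference[20:32])]
    print(map_barcodes(reads, build_index(reference)))
```

Because hits are accumulated per barcode rather than per read, a read that is too ambiguous to place on its own still contributes evidence to the windows supported by the other reads carrying the same barcode.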
Long reads emerged at roughly the same time as linked reads. They provide the missing long-range information directly through their superior read length, but at the cost of lower per-base accuracy and lower throughput, the latter resulting in higher sequencing cost. Recently, long reads have become cheaper and more accurate, and they have gained popularity in the scientific community compared to linked reads. We have since updated bcmap to also work with long reads and will add long read support for the entire variant calling workflow.
During the second funding period, we are expanding our variant calling workflow to allow for multi-sample variant calling. Through the use of more advanced data structures (i.e. coloured de Bruijn graphs) for the assembly, we aim to detect variation by simultaneously evaluating the read data from sets of samples, such as trios. We expect to achieve higher sensitivity and specificity for our SV call set as well as better comparability between jointly processed samples.
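As a minimal sketch of the data structure mentioned above, and not the project's implementation, a coloured de Bruijn graph can be represented as a mapping from each k-mer (node) to the set of samples ("colours") in which it occurs, with edges implied by (k-1)-base overlaps. The function names and the toy trio sequences below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_coloured_dbg(samples, k):
    """Build a toy coloured de Bruijn graph.

    samples: {sample name: list of sequences}.
    Returns {k-mer: set of sample names containing that k-mer}.
    """
    graph = defaultdict(set)
    for name, seqs in samples.items():
        for seq in seqs:
            for i in range(len(seq) - k + 1):
                graph[seq[i:i + k]].add(name)
    return graph


def successors(graph, kmer):
    """Return the k-mers in the graph that overlap kmer by k-1 bases."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in graph]


def private_kmers(graph, sample):
    """Return the k-mers observed in exactly one given sample."""
    return {kmer for kmer, colours in graph.items() if colours == {sample}}


if __name__ == "__main__":
    trio = {
        "mother": ["ACGTACGGA"],
        "father": ["ACGTACGGA"],
        "child": ["ACGTTTCGGA"],
    }
    graph = build_coloured_dbg(trio, k=4)
    print(successors(graph, "ACGT"))
    print(sorted(private_kmers(graph, "child")))
```

In this toy trio, the node `ACGT` is shared by all three samples and has two successors; the path through k-mers carried only by the child marks where the child's sequence diverges from both parents, which is the kind of signal a joint evaluation of several samples can exploit.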
Our newly implemented tools allow the preparation of a comprehensive SV call set for the patient-derived linked and long read data generated by our RU. Our analysis tool will provide additional information about SVs and long-range haplotypes in whole genome sequencing datasets from patients defying molecular diagnosis by whole exome sequencing, thereby improving the molecular diagnostic rate for patients in the RU’s rare disease cohorts. This work is an integral part of the RU’s overarching goal to significantly reduce the length of the diagnostic odyssey for patients with rare diseases.