Question

Prioritizing structural variants and CNV

4.5 years ago

jan ▴ 170

Hi All,

I'm pretty new to analysing SVs and CNVs. We have germline WGS PacBio long reads that have been aligned to the reference sequence (GRCh37/hg19) using the pbalign tool. Variants were called using variantCaller.py from the GenomicConsensus package (https://github.com/PacificBiosciences/GenomicConsensus). Variants were annotated with ANNOVAR for protein-coding changes, affected genomic regions, allele frequencies reported by some large projects, deleteriousness predictions, etc. The whole process was done by the sequencing company that we sent our sample to.

I am more familiar with Illumina WGS short-read sequencing and with analysing SNVs and small indels. I have a few questions regarding SV and CNV analyses.

1) I would like to do some QC on the results (e.g. check variant counts by variant type) that we obtained from the sequencing company, but I am not sure what the accepted standards are. Could you please point me to the right literature or blogs?

2) In my previous analyses of Illumina WGS short-read data, I used two different variant callers and took the union of both call sets, and the results were better than with just one caller. Upon reading, I found a workflow that combines multiple callers, https://github.com/wdecoster/nano-snakemake, that seems to perform better. Has anyone benchmarked different callers, combined and alone?

3) I am not sure what information I should use to filter for disease-related SVs and CNVs. For SNVs and small indels, for example, I would use information such as population frequency from gnomAD, in silico predictions (e.g. REVEL, PROVEAN and MaxEntScan), and clinical information from the ClinVar database. Could you please suggest the information (and relevant databases) that I can use to prioritize SVs and CNVs?

Thank you.

PacBio CNV SV • 1.8k views

updated 4.5 years ago by WouterDeCoster 47k • written 4.5 years ago by jan ▴ 170

score 5 · Accepted Answer · 2019-10-29

I see you separate CNV from SV in your question, while in my opinion, a CNV is a subtype of an SV. So I don't consider them separately.

1) I would like to do some QC on the results (e.g. check variant counts by variant type) that we obtained from the sequencing company, but I am not sure what the accepted standards are. Could you please point me to the right literature or blogs?

For a human genome, you would expect ~25k-30k structural variants (defined as any variant larger than 50 bp), so counting them is a good first check. A variant length histogram is also useful: for humans you expect peaks around 300 bp (SVs involving Alu elements) and 6 kb (SVs involving L1 elements). I have a tool called surpyvor which does this and other SV handling tasks. For some of the more complex tasks it is a convenient wrapper around SURVIVOR. surpyvor lengthplot will give you the SV length plot and variant counts per type.
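If you want to sanity-check the counts by hand before installing anything, the idea can be sketched in plain Python. This is only a rough illustration, not how surpyvor does it: it assumes the VCF carries the standard SVTYPE and SVLEN INFO keys (not every caller writes both), and the function names and bin edges are made up for the example.

```python
from bisect import bisect_right
from collections import Counter


def parse_info(info):
    """Turn a VCF INFO string into a dict (flag-only keys map to True)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def summarize_svs(vcf_lines, min_len=50):
    """Count SVs per SVTYPE and collect absolute lengths.

    Records with an SVLEN shorter than min_len are skipped; records
    without a usable SVLEN (e.g. breakends) are counted but contribute
    no length.
    """
    counts = Counter()
    lengths = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue
        info = parse_info(cols[7])
        svtype = info.get("SVTYPE")
        if not isinstance(svtype, str):
            continue  # no SVTYPE: not annotated as an SV
        svlen = None
        raw = info.get("SVLEN")
        if isinstance(raw, str):
            try:
                # SVLEN is negative for deletions; use the first value if several
                svlen = abs(int(raw.split(",")[0]))
            except ValueError:
                svlen = None
        if svlen is not None and svlen < min_len:
            continue
        counts[svtype] += 1
        if svlen is not None:
            lengths.append(svlen)
    return counts, lengths


def length_histogram(lengths, edges=(50, 100, 300, 1000, 5000, 10000)):
    """Bin lengths by the largest edge <= length (hypothetical bin edges)."""
    hist = Counter()
    for length in lengths:
        idx = bisect_right(edges, length) - 1
        if idx >= 0:
            hist[edges[idx]] += 1
    return hist
```

You would feed it the lines of an uncompressed VCF, e.g. `counts, lengths = summarize_svs(open("calls.vcf"))` (file name hypothetical), and then look for an excess in the bins around 300 bp and 6 kb.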

There have been a couple of papers on SVs from long reads, including the one from Audano et al., which can give you some idea of what to expect.

2) In my previous analyses of Illumina WGS short-read data, I used two different variant callers and took the union of both call sets, and the results were better than with just one caller. Upon reading, I found a workflow that combines multiple callers, https://github.com/wdecoster/nano-snakemake, that seems to perform better. Has anyone benchmarked different callers, combined and alone?

Sensitivity can be boosted by taking the union of calls, specificity by taking the intersection. We have done some benchmarking of callers in our paper. I welcome all feedback on our snakemake workflow.
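To make the union/intersection trade-off concrete, here is a toy sketch in Python. It is a deliberately simplified matching rule (same chromosome, same type, start positions within a hypothetical 500 bp window), not the algorithm used by SURVIVOR or by the workflow; real merging also has to deal with strand, length and breakend pairs.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    pos: int
    svtype: str


def same_event(a, b, max_dist=500):
    """Treat two calls as the same SV if chromosome and type agree
    and the start positions are within max_dist (assumed threshold)."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.pos - b.pos) <= max_dist
    )


def intersect_calls(calls_a, calls_b, max_dist=500):
    """Calls from A supported by at least one call in B (favours specificity)."""
    return [
        a for a in calls_a
        if any(same_event(a, b, max_dist) for b in calls_b)
    ]


def union_calls(calls_a, calls_b, max_dist=500):
    """All calls from A plus calls from B with no match in A (favours sensitivity)."""
    only_b = [
        b for b in calls_b
        if not any(same_event(a, b, max_dist) for a in calls_a)
    ]
    return list(calls_a) + only_b
```

With two callers that agree on one deletion and each report one private call, the intersection keeps only the shared deletion while the union keeps all three events.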