I'm fairly new to analysing SVs and CNVs. We have germline WGS PacBio long reads that were aligned to the reference genome (GRCh37/hg19) with the pbalign tool. Variants were called using variantCaller.py from the GenomicConsensus package (https://github.com/PacificBiosciences/GenomicConsensus). Variants were then annotated with ANNOVAR for protein-coding changes, affected genomic regions, allele frequencies reported by several large projects, deleteriousness predictions, etc. The whole process was done by the sequencing company that we sent our sample to.
I am more familiar with Illumina short-read WGS and with analysing SNVs and small indels. I have a few questions regarding SV and CNV analysis.
1) I would like to do some QC on the results that we obtained from the sequencing company (e.g. check variant counts by variant type), but I am not sure what the accepted standards are. Could you please point me to the relevant literature or blogs?
2) In my previous analyses of Illumina short-read WGS data, I used two different variant callers and took the union of their calls, and the results were better than with a single caller. While reading around, I found a workflow that combines multiple callers, https://github.com/wdecoster/nano-snakemake, which seems to perform better. Has anyone benchmarked different callers, both combined and alone?
3) I am not sure what information I should use to filter for disease-related SVs and CNVs. For SNVs and small indels, for example, I would use population frequency from gnomAD, in silico predictions (e.g. REVEL, PROVEAN and MaxEntScan), and clinical information from the ClinVar database. Could you please suggest the information (and relevant databases) that I can use to prioritize SVs and CNVs?
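
To make question 1 more concrete, here is a minimal sketch of the kind of per-type count I have in mind. It assumes the calls are available as a standard VCF and that SV records carry the SVTYPE INFO key or a symbolic ALT such as <DEL>; records without either are classified from the REF/ALT lengths. The function names and the example records are made up for illustration and are not from our data.

```python
from collections import Counter


def variant_type(ref, alt, info):
    """Classify one ALT allele of a VCF record (hypothetical helper)."""
    # Prefer the SVTYPE INFO key when the caller provides it.
    for field in info.split(";"):
        if field.startswith("SVTYPE="):
            return field.split("=", 1)[1]
    # Symbolic alleles such as <DEL> or <DUP> name the type directly.
    if alt.startswith("<") and alt.endswith(">"):
        return alt[1:-1]
    # Otherwise fall back to comparing allele lengths.
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    return "INS" if len(alt) > len(ref) else "DEL"


def count_variant_types(lines):
    """Count variants per type from an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        ref, alts, info = cols[3], cols[4], cols[7]
        for alt in alts.split(","):  # multi-allelic sites: count each ALT
            counts[variant_type(ref, alt, info)] += 1
    return counts


# Made-up records, only to show the expected input shape.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t200\t.\tA\tATTT\t.\tPASS\t.",
    "1\t300\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=900",
    "1\t400\t.\tACG\tA\t.\tPASS\t.",
]
for vtype, n in sorted(count_variant_types(example).items()):
    print(vtype, n)
```

For a real file I would pass an open file handle instead of the list, e.g. count_variant_types(open("calls.vcf")), where calls.vcf is a placeholder name. What I do not know is which counts per type are considered normal for a PacBio germline genome.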