Question

vg call problem request

23 months ago

20110700081 • 0

Hello, I have recently been genotyping some SVs using Illumina data, but I have run into some problems that I cannot figure out. My questions are as follows:

1. Some large DEL SVs (for example, SV length > 1 kb) are filtered out of the final VCF file. Should I change some parameters to avoid this?
2. For some INS SVs, the position may not be precise. For example, I used PCR to confirm the position of one INS, then genotyped it with both the corrected position and the original position, and the two genotyping results were not consistent. Because it is not practical to correct the position of every SV by PCR, could you please give me some suggestions for dealing with this situation?

Thanks so much!

DEL accuracy INS call vg • 1.1k views

updated 23 months ago by dthorbur ★ 3.1k • written 23 months ago by 20110700081 • 0

You need to provide more information. What SV calling pipeline did you use, and what parameters did you use for filtering? What is the depth of your sequencing? Is it paired-end data? How many samples do you have? All of these are important to consider when analysing variant calling results.

I wouldn't consider SVs greater than 1kb to be large, and they are usually retained in my datasets. Is this due to a parameter you set or the software? If you are looking for cutoffs, you can either make that data driven (i.e., trimming the top and bottom percentiles to remove long tails) or use literature in your field to find common kb thresholds.
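For the data-driven option, a rough sketch in shell/awk could look like the following. It assumes the VCF stores explicit REF/ALT sequences (records with symbolic or multi-allelic ALT fields are simply skipped), and the function names `sv_lengths` and `length_percentile` are made up for illustration rather than taken from any tool.

```shell
# Sketch: data-driven SV length cutoffs from a VCF with explicit REF/ALT alleles.
# Assumption: one plain-sequence ALT per record; anything else is skipped.

sv_lengths() {
    # $1 = VCF path. Prints |len(ALT) - len(REF)| for each record where it is > 0.
    awk -F'\t' '
        /^#/ { next }
        $4 ~ /^[ACGTNacgtn]+$/ && $5 ~ /^[ACGTNacgtn]+$/ {
            d = length($5) - length($4)
            if (d < 0) d = -d
            if (d > 0) print d
        }' "$1"
}

length_percentile() {
    # $1 = VCF path, $2 = integer percentile (1-100).
    # Uses the nearest-rank definition: the value at rank ceil(p * N / 100).
    sv_lengths "$1" | sort -n | awk -v p="$2" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            k = int((p * NR + 99) / 100)
            if (k < 1) k = 1
            print v[k]
        }'
}
```

For example, `length_percentile calls.vcf 99` (with `calls.vcf` standing in for your own file) would print the length at the 99th percentile, which you could then compare against the thresholds used in the literature.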

Was the insertion fixed in the population you sampled? If there is heterozygosity for this INS at this locus, the PCR results will be inconsistent.

23 months ago by dthorbur ★ 3.1k

Here is my pipeline:

vg giraffe -Z test.gbz -m test.min -t 8 -b fast --rescue-algorithm "dozeu" -N test -d test.dist -f sample1_1.fq.gz -f sample_2.fq.gz 2>run.log  1>test.mapped.gam 
vg pack -x test.xg -g test.mapped.gam -o test.mapped.gam.pack  -Q 5
vg call test.xg -r test.snarls  -k test.mapped.gam.pack  -t 30 -a -s sample1  > sample1.genotypes.vcf

The sequencing depth is 38X, the data are paired-end, and I have 736 samples to genotype. Before genotyping them all, I want to test the calling pipeline on a sample that has both second- and third-generation sequencing data. I got ~57,000 SVs from the third-generation sequencing data and constructed the corresponding files (.dist, .min and .xg) from the VCF file produced from those results. However, I could not obtain DEL SVs longer than 1 kb with this pipeline; it seems the longer DEL SVs were filtered. Could you please give me some suggestions? Thanks so much!

updated 23 months ago by GenoMax 154k • written 23 months ago by 20110700081 • 0

Unfortunately, I haven't used a pangenome tool before so I don't know what considerations need to be made to use them or how they may have contributed to this strange result. So these suggestions are very general in how I would approach this.

Have you tried inspecting individual samples to see if you can spot some realistic long DELs? I don't know if this tool emits a BAM, but if it does you could inspect it using IGV.
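A quick way to start that inspection, without needing a BAM, is to tabulate how the long deletions appear in one sample's VCF: whether they are absent altogether, present but flagged in the FILTER column, or present with a reference genotype. The sketch below is only an illustration; it assumes a single-sample VCF with explicit REF/ALT sequences and GT as the first FORMAT field, and `summarise_long_dels` is a hypothetical helper name.

```shell
# Sketch: count deletions of at least a given length, grouped by FILTER and genotype.
# Assumptions: single-sample VCF, explicit REF/ALT alleles, GT is the first FORMAT field.

summarise_long_dels() {
    # $1 = VCF path, $2 = minimum deletion length in bp (defaults to 1000).
    awk -F'\t' -v min="${2:-1000}" '
        /^#/ { next }
        {
            # Take the longest deletion among the plain-sequence ALT alleles.
            n = split($5, alts, ",")
            longest = 0
            for (i = 1; i <= n; i++) {
                if (alts[i] !~ /^[ACGTNacgtn]+$/) continue
                d = length($4) - length(alts[i])
                if (d > longest) longest = d
            }
            if (longest < min) next
            split($10, fmt, ":")          # fmt[1] is the GT value
            count[$7 "\t" fmt[1]]++
        }
        END { for (k in count) print k "\t" count[k] }
    ' "$1" | LC_ALL=C sort
}
```

Running this on both the input VCF used to build the graph and the genotyped output would show whether the long deletions never reach the output or are only marked as filtered. Each output line is FILTER, genotype and count, separated by tabs.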

Are there any other pangenome tools you could use? Because inferring breakpoints from sequencing data is messy, you often see two tools used for SV analyses like DELLY2 and LUMPY, though I don't know if these work for pangenomes (I doubt it). Might be worth running a few samples through another pipeline to see if the results are similar.

23 months ago by dthorbur ★ 3.1k