Question

Duplicate results in vardict vcf

2.9 years ago

ww22runner ▴ 60

Hello everyone,

I am using VarDict to call SNVs in an unpaired sample, with the following command:

$VARDICT -G $REF_GENOME_b37 -f 0.03 -N "my_sample" -b "${SAMPLE_PFX}_UN.bam" -r 5 -z 1 -t -c 1 -S 2 -E 3 -g 4 $BEDFILE | sed '1d' | teststrandbias.R | var2vcf_valid.pl -N "unpaired" -f 0.03 -v 5 > "${SAMPLE_PFX}_vardict.vcf"

However, whether I use the -t flag or the -F 0x500 flag, I keep getting duplicate rows in my VCF:

[image: the duplicated rows in the VCF]

Both rows are identical, and I cannot understand why the duplicate was not filtered out. Any advice would be greatly appreciated!

Thank you

vardict • 1.2k views

updated 2.9 years ago by Ram 43k • written 2.9 years ago by ww22runner ▴ 60

Do not open VCF files in Excel. Instead, extract those two lines on the command line and use uniq to check whether they have the same content. If they differ, the difference gives you a lead on the cause. If they are exactly identical, a VarDict expert would be better placed to answer your question.
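For Python users, the same check can be sketched without uniq. This is only an illustration: the function name `find_duplicate_records` and the file name in the comment are hypothetical, and the comparison is a plain byte-for-byte match on each record line, so rows that differ in any field are not reported.

```python
from collections import Counter


def find_duplicate_records(vcf_lines):
    """Return {record_line: count} for data lines that occur more than once.

    Header lines (starting with '#') and blank lines are ignored.
    """
    counts = Counter(
        line.rstrip("\n")
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    )
    return {record: n for record, n in counts.items() if n > 1}


# Tiny in-memory example. With a real file, pass an open handle instead, e.g.
#   with open("sample_vardict.vcf") as handle:   # hypothetical file name
#       duplicates = find_duplicate_records(handle)
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tG\tA\t.\t.\t.\n",
    "chr2\t101\t.\tT\tC\t.\t.\t.\n",
    "chr2\t101\t.\tT\tC\t.\t.\t.\n",
]
duplicates = find_duplicate_records(example)
for record, n in duplicates.items():
    print(n, record)
```

An empty result means no two record lines are exactly identical, in which case the rows that look the same in a spreadsheet differ somewhere in their fields.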

2.9 years ago by Ram 43k

Hello Ram, thank you for your suggestion. The two rows have identical content. Somehow, the --deldupvar option removes the duplicate, but neither the -t option nor -F 0x504, both of which should remove duplicates, has any effect. If you know the difference between these options, please let me know. Thanks.

updated 2.9 years ago by Ram 43k • written 2.9 years ago by ww22runner ▴ 60

score 1 · Answer 1 · 2021-06-09

If you are a Python user, you may want to check out the pyvcf submodule of the fuc package, which I wrote. If the duplicate rows are true duplicates, in the sense that every field is identical, you can clean up your VCF file this way:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2', 'chr2'],
...     'POS': [100, 101, 101],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'C'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Steven': ['0/1', '1/1', '1/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1
2  chr2  101  .   T   C    .      .    .     GT    1/1
>>> df = vf.df.drop_duplicates()
>>> filtered_vf = pyvcf.VcfFrame([], df)
>>> filtered_vf.to_file('filtered.vcf')
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1

If your duplicate rows share the same CHROM, POS, REF, and ALT but have slightly different genotype data for some reason, the approach above will not work, because drop_duplicates() only removes rows that match in every column. Let me know if this is the case.
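For that second case, one possible starting point is to deduplicate on the site and allele columns only. The sketch below uses plain pandas on a table shaped like `vf.df` above. The choice of key columns and the keep-first policy are assumptions made for illustration, so inspect the conflicting rows before discarding anything.

```python
import pandas as pd

# Same layout as the example above, except that the two chr2 rows now
# disagree on the genotype, so a plain drop_duplicates() keeps both.
df = pd.DataFrame({
    'CHROM': ['chr1', 'chr2', 'chr2'],
    'POS': [100, 101, 101],
    'ID': ['.', '.', '.'],
    'REF': ['G', 'T', 'T'],
    'ALT': ['A', 'C', 'C'],
    'QUAL': ['.', '.', '.'],
    'FILTER': ['.', '.', '.'],
    'INFO': ['.', '.', '.'],
    'FORMAT': ['GT', 'GT', 'GT'],
    'Steven': ['0/1', '1/1', '0/1'],
})

# Assumed definition of "the same variant": identical site and alleles.
key = ['CHROM', 'POS', 'REF', 'ALT']

# Every row that shares its key with at least one other row, for inspection.
conflicts = df[df.duplicated(subset=key, keep=False)]
print(conflicts)

# Keep the first row per key (an arbitrary policy) and renumber the index.
deduped = df.drop_duplicates(subset=key, keep='first').reset_index(drop=True)
print(deduped)
```

The resulting frame can then be wrapped and written out as before, with `pyvcf.VcfFrame([], deduped).to_file('filtered.vcf')`.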