Amount of unique variants in VCF with linux terminal
2
0
Entering edit mode
6 weeks ago
HL • 0

I have been trying to search how could I get the amount of unique variants from a very big VCF file with linux terminal, without any results. Also I would like to get the unique variants to a different file without any duplicates.

variants unique VCF linux • 686 views
1
Entering edit mode

What do you mean by 'unique variants'?

0
Entering edit mode

I would need to find variants that has same CHROM POS REF and ALT. And then remove the duplicates so that I would get these variants only once to the new file. Also I would need to know the amount of duplicated variants.

For example this file I need to know that the total amount of variants is 4 and amount of different variants is 3.

CHROM   POS  REF  ALT
1                1234    A        T
1                1234    A        T
1                3457    T        C
1                5468    G        C


And then I need to get the different variants to their own file like this.

CHROM   POS  REF  ALT
1                1234    A        T
1                3457    T        C
1                5468    G        C

0
Entering edit mode

Maybe try bcftools:

bcftools query -f '%ID\n' myFile.vcf

0
Entering edit mode

Thanks, this print all the ID:s but I would need only the ones that has two or more same CHROM POS REF and ALT

0
Entering edit mode

0
Entering edit mode
5 weeks ago
sbstevenlee ▴ 140

Check out the pyvcf.VcfFrame.drop_duplicates method I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['1', '1', '1', '1'],
...     'POS': [1234, 1234, 3457, 5468],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['A', 'A', 'T', 'G'],
...     'ALT': ['T', 'T', 'C', 'C'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', './.', '0/1', '0/0'],
...     'B': ['./.', '0/1', '0/0', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
CHROM   POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0     1  1234  .   A   T    .      .    .     GT  0/1  ./.
1     1  1234  .   A   T    .      .    .     GT  ./.  0/1
2     1  3457  .   T   C    .      .    .     GT  0/1  0/0
3     1  5468  .   G   C    .      .    .     GT  0/0  0/1
>>> filtered_vf = vf.drop_duplicates(['CHROM', 'POS', 'REF', 'ALT'])
>>> filtered_vf.df
CHROM   POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0     1  1234  .   A   T    .      .    .     GT  0/1  ./.
1     1  3457  .   T   C    .      .    .     GT  0/1  0/0
2     1  5468  .   G   C    .      .    .     GT  0/0  0/1
>>> filtered_vf.to_file('out.vcf')

0
Entering edit mode

The above operation seems to lose information - the GT for sample B at locus 1:1234 is lost after the drop operation. You may want to pick called GTs over uncalled GTs when processing duplicates (this won't be an easy operation)

1
Entering edit mode

Thanks for the reply, Ram. You're correct. By default (keep='first'), the method will only keep the first of the duplicate records. I thought this was sufficient for the OP's case because it sounded like he or she is more interested in getting the total number of unique records than getting the collapsed genotypes. If the desired output is a collapsed VCF, then the OP is advised to take a look at the pyvcf.VcfFrame.collapse method.

0
Entering edit mode
5 weeks ago
mbk0asis ▴ 630

Why don't you just paste(?) columns together as single strings and use 'uniq' command?

CHROM POS REF ALT
1 1234 A T
1 1234 A T
1 3457 T C
1 5468 G C


Somethin like this?

sed 's/\t/_/g' FILE | sort | uniq -c

2 1_1234_A_T
1 1_3457_T_C
1 1_5468_G_C
1 CHROM_POS_REF_ALT


Numbers in front are the duplicate counts.

0
Entering edit mode

I don't think the paste is required, uniq should work fine without it too.

0
Entering edit mode

Thanks this worked and also I got the unique ones with this too:

bcftools query -f '%CHROM %POS %REF %ALT\n' FILE > UNIQFILE


After this I just had to make some headers for this UNIQFILE to be a vcf file again. Also I have to find a way to get for these variants all of the information (for example SAMPLE QUAL...) but I think I can get it with bedtools intersect.

1
Entering edit mode

That doesn't make any sense, HL. You're not performing any uniquification operations. There should at least be a uniq before you redirect the output to UNIQFILE.

You either need to apply uniq directly to the data lines in the VCF file or you need to apply it to a subset of the columns. What you're doing here will just recreate the file after taking a few detours that do a bunch of stuff serving no purpose.