Question

Filter unique SNPs (rows) in VCF/text file

1

Entering edit mode

5.7 years ago

evelyn ▴ 230

Dear all,

I want to filter for unique SNPs. For example, these two different positions have the same SNPs called A C C. So I just want to keep one row for all duplicate rows.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO     Sample1 Sample2 Sample3
Chr01  4560 .   A   C   20.6048 .  A C C
Chr01  6476 .   A   C   37.3745 .  A C C

In R, I used unique() function from data.table package. But I want to do this using Linux command to reduce the file size for downstream analysis in R.

I have two files.

vcf file with multiple samples: - SNP calling with bcftools and filtered for INDELS. I have numbers for SNP calls such as 0/0, 0/1, 1/1, etc. for this file.
text file with multiple samples: - SNP calling with bcftools and filtered for INDELS. Numbers for SNP calls such as 0/0, 0/1, 1/1, etc are converted to letters such as A, G, T, C, etc using bcftools query and awk. This file does not have header information and is no longer a vcf. For example

Sample1 Sample2 Sample3
```
        G G G

        G G G

        A A A
```

I want to remove the same SNP calls for different samples at different positions as shown above. I can use either of these files but I am not sure which one will be better and how.

Can anyone suggest something for this job? Thank you very much for your help!

snp vcf bash • 3.0k views

ADD COMMENT • link updated 5.7 years ago by zx8754 12k • written 5.7 years ago by evelyn ▴ 230

1

Entering edit mode

I think this is called "LD pruning". See related post, using bcftools:

A: VCFtools version for LD calculations specifying bin size

ADD REPLY • link 5.7 years ago by zx8754 12k

0

Entering edit mode

Thank you! I am not sure if LD pruning is the right thing for this. I also tried bcftools norm -d snps for file type 1, but nothing is removed. I have the vcf filtered for INDELS already. I want to filter duplicate SNP calls at different POS.

ADD REPLY • link 5.7 years ago by evelyn ▴ 230

0

Entering edit mode

How big is the VCF/text file, number of samples, number of SNPs? Maybe just use R? For example for importing file the data.table::fread is fast.

ADD REPLY • link 5.7 years ago by zx8754 12k

1

Entering edit mode

The vcf file is 80 GB large before doing any filtering. R crashes if I don't filter duplicate SNPs in Linux.

ADD REPLY • link 5.7 years ago by evelyn ▴ 230

score 1 · Answer 1 · 2019-11-13

I don't fully understand why you want to do this, but if what you need is just to remove lines considered duplicated attending to certain columns, then this perl code should work:

perl -lane '
$index = join "\t", @F[0,7..$#F];
unless ( exists $found{$index} ) {
 print; $found{$index} = 1;
}' input.txt

what it does is to create an index string which will be checked before printing each line. in this example, the index string is built using columns 1 and 8-end, since these are the chromosome column (considering that you wouldn't like to remove variants from different chromosomes) and the samples' columns. you just have to select other columns in @F[0,7..$#F] if you need it.

zx8754 · Answer 2 · 2019-11-14

1

Entering edit mode

5.7 years ago

evelyn ▴ 230

I have found the solution using awk:

awk '! a[$0]++' input.txt > output.txt

It removes repeat SNP rows and keeps one.

ADD COMMENT • link updated 5.7 years ago by zx8754 12k • written 5.7 years ago by evelyn ▴ 230

0

Entering edit mode

Maybe sort -u input.txt > output.txt ?

ADD REPLY • link 5.7 years ago by zx8754 12k

0

Entering edit mode

sorting such a big file only for being able to use the uniq function wouldn't be the more efficient way to do it, plus a simple sort -u wouldn't address the problem described in the question as the lines that need to be merged/skipped do vary. if sorting would be considered (and it would be really demanding on big files, such as the 80GB vcf file mentioned), then the columns of interest should be specified, as in

sort -u -k1,1 -k7,7 -k8,8 -k9,9 input.txt

anyway, I would really go for the perl solution I suggested previously. it is much faster than sort and it is the less resource demanding option I can think of.

ADD REPLY • link 5.6 years ago by Jorge Amigo 14k

0

Entering edit mode

this is the same as an uniq function on entire lines, and it doesn't work with your example considering that positions qualities vary from line to line. you need to group (index) only the columns you expect to be repeated, as explained in the example I suggested previously.

ADD REPLY • link 5.6 years ago by Jorge Amigo 14k