Question: Filter unique SNPs (rows) in VCF/text file
evelyn wrote, 12 months ago:

Dear all,

I want to filter for unique SNPs. For example, these two different positions have the same SNPs called A C C. So I just want to keep one row for all duplicate rows.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO     Sample1 Sample2 Sample3
Chr01  4560 .   A   C   20.6048 .  A C C
Chr01  6476 .   A   C   37.3745 .  A C C

In R, I used the unique() function from the data.table package. But I want to do this with a Linux command, to reduce the file size before downstream analysis in R.

I have two files.

  1. A VCF file with multiple samples: SNPs called with bcftools and filtered for indels. Genotype calls are numeric, such as 0/0, 0/1, 1/1, etc.

  2. A text file with multiple samples: the same SNP calls, but with the numeric genotypes (0/0, 0/1, 1/1, etc.) converted to letters (A, G, T, C, etc.) using bcftools query and awk. This file has no header information and is no longer a VCF. For example:

    Sample1 Sample2 Sample3
    G G G
    G G G
    A A A

I want to remove rows where the SNP calls are the same across samples at different positions, as shown above. I can use either of these files, but I am not sure which one is better or how to proceed.
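(For reference, a conversion like the one described in file 2 might look roughly like this. This is a sketch, not necessarily the poster's exact command: the %TGT format tag is bcftools' translated-genotype output, such as A/A, and the awk step here simply keeps the first allele of each unphased call.)

```shell
# Hypothetical sketch: print CHROM, POS and per-sample translated genotypes,
# then reduce each call like "A/A" to its first allele "A".
bcftools query -f '%CHROM\t%POS[\t%TGT]\n' input.vcf \
  | awk 'BEGIN{OFS="\t"} { for (i = 3; i <= NF; i++) sub(/\/.*/, "", $i); print }' \
  > genotypes.txt
```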

Can anyone suggest something for this job? Thank you very much for your help!

Tags: bash snp vcf
I think this is called "LD pruning". See the related post on doing it with bcftools.

— zx8754, 12 months ago

Thank you! I am not sure LD pruning is the right thing here. I also tried bcftools norm -d snps on file type 1, but nothing was removed; the VCF is already filtered for indels. I want to filter duplicate SNP calls at different POS values.

— evelyn, 12 months ago

How big is the VCF/text file (number of samples, number of SNPs)? Maybe just use R? For importing the file, data.table::fread is fast.

— zx8754, 12 months ago
The VCF file is 80 GB before any filtering. R crashes if I don't filter out duplicate SNPs in Linux first.

— evelyn, 12 months ago
Jorge Amigo (Santiago de Compostela, Spain) wrote, 12 months ago:

I don't fully understand why you want to do this, but if all you need is to remove lines considered duplicates according to certain columns, then this Perl one-liner should work:

perl -lane '
  # build an index from the chromosome (field 1) and the sample fields (8..end)
  $index = join "\t", @F[0,7..$#F];
  # print each line only the first time its index is seen
  unless ( exists $found{$index} ) {
    print; $found{$index} = 1;
  }' input.txt

What it does is build an index string that is checked before printing each line. In this example the index is built from column 1 and columns 8 to the end: the chromosome column (assuming you wouldn't want to merge variants from different chromosomes) and the sample columns. Just select other columns in @F[0,7..$#F] if you need to.

evelyn wrote, 12 months ago:

I have found the solution using awk:

awk '! a[$0]++' input.txt > output.txt

It removes repeated SNP rows and keeps one copy of each.


Maybe sort -u input.txt > output.txt ?

— zx8754, 12 months ago

Sorting such a big file just to be able to deduplicate it wouldn't be the most efficient approach, and a plain sort -u wouldn't address the problem described in the question, since the lines that need to be merged/skipped differ in some columns. If sorting were considered (and it would be really demanding on big files, such as the 80 GB VCF mentioned), the columns of interest should be specified, as in

sort -u -k1,1 -k7,7 -k8,8 -k9,9 input.txt

Anyway, I would really go for the Perl solution I suggested previously: it is much faster than sort, and it is the least resource-demanding option I can think of.

— Jorge Amigo, 12 months ago

This is the same as running uniq on entire lines, and it doesn't work with your example, since positions and qualities vary from line to line. You need to group (index) only the columns you expect to be repeated, as explained in the example I suggested previously.

— Jorge Amigo, 12 months ago
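For what it's worth, the keyed approach above can also be written in awk. A sketch, not from the thread, assuming the chromosome is in field 1 and the samples in fields 8 onwards; adjust the field range to your file:

```shell
# Build the deduplication key from field 1 (chromosome) and fields 8..NF
# (the samples), and print only the first line seen for each key.
awk '{ key = $1; for (i = 8; i <= NF; i++) key = key SUBSEP $i }
     !seen[key]++' input.txt > output.txt
```

Like the Perl one-liner, this streams the file once and keeps only the keys in memory, so it avoids sorting the whole 80 GB file.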
Powered by Biostar version 2.3.0