Dear all,
I want to filter for unique SNPs. For example, the two positions below have the same SNP calls, A C C, so I want to keep only one row out of each set of duplicate rows.
#CHROM POS ID REF ALT QUAL FILTER INFO Sample1 Sample2 Sample3
Chr01 4560 . A C 20.6048 . A C C
Chr01 6476 . A C 37.3745 . A C C
In R, I used the unique() function from the data.table package, but I want to do this with a Linux command to reduce the file size for downstream analysis in R.
I have two files:
1. A VCF file with multiple samples: SNPs called with bcftools and filtered for INDELs. In this file the SNP calls are numeric genotypes such as 0/0, 0/1, 1/1, etc.
2. A text file with multiple samples: SNPs called with bcftools and filtered for INDELs. The numeric genotypes (0/0, 0/1, 1/1, etc.) were converted to letters (A, G, T, C, etc.) using bcftools query and awk. This file has no header information and is no longer a VCF. For example:
Sample1 Sample2 Sample3
G G G G G G A A A
I want to remove rows whose SNP calls across the samples are identical to those at another position, as shown above. I can use either of these files, but I am not sure which one would be better or how to do it.
Can anyone suggest something for this job? Thank you very much for your help!
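For the text file, one possible approach is an awk filter that keeps a row only the first time its combination of sample calls is seen. This is a rough sketch, not a tested pipeline: it assumes the file is tab-separated with CHROM and POS in the first two columns and one call per sample after that, and the file names calls.txt and calls.dedup.txt are made up for illustration.

```shell
# Hypothetical toy input: CHROM, POS, then one call per sample (tab-separated).
printf 'Chr01\t4560\tA\tC\tC\nChr01\t6476\tA\tC\tC\nChr01\t7001\tA\tA\tC\n' > calls.txt

# Build a key from the sample columns only (3..NF), ignoring CHROM and POS,
# and print a row only the first time that key appears.
awk -F'\t' '{
    key = $3
    for (i = 4; i <= NF; i++) key = key FS $i
    if (!(key in seen)) { seen[key] = 1; print }
}' calls.txt > calls.dedup.txt

cat calls.dedup.txt
```

The file is read once, line by line, so it never has to fit in memory as a whole. awk does keep one key per distinct call pattern, so memory use grows with the number of unique patterns.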
I think this is called "LD pruning". See related post, using bcftools:
Thank you! I am not sure if LD pruning is the right thing for this. I also tried bcftools norm -d snps for file type 1, but nothing was removed. I have already filtered the VCF for INDELs. I want to filter duplicate SNP calls at different POS.
How big is the VCF/text file, and how many samples and SNPs does it have? Maybe just use R? For importing the file, for example, data.table::fread is fast.
The VCF file is 80 GB before any filtering. R crashes if I don't filter duplicate SNPs in Linux first.
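As far as I know, bcftools norm -d only collapses records that share the same position, which would explain why it removed nothing here. A streaming awk filter on the VCF itself might work instead. The sketch below assumes a standard tab-separated VCF with FORMAT in column 9, samples from column 10 onward, and GT as the first FORMAT subfield. The key is REF, ALT and each sample's GT, so two positions count as duplicates only when they would translate to the same letters. toy.vcf and toy.dedup.vcf are placeholder names.

```shell
# Tiny hypothetical VCF standing in for the real file.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\tSample2\tSample3\n'
  printf 'Chr01\t4560\t.\tA\tC\t20.6\t.\t.\tGT:DP\t0/0:9\t1/1:7\t1/1:8\n'
  printf 'Chr01\t6476\t.\tA\tC\t37.4\t.\t.\tGT:DP\t0/0:5\t1/1:6\t1/1:4\n'
  printf 'Chr01\t7001\t.\tG\tT\t50.1\t.\t.\tGT:DP\t0/0:5\t1/1:6\t1/1:4\n'
} > toy.vcf

awk -F'\t' '
    /^#/ { print; next }              # pass header lines through unchanged
    {
        key = $4 FS $5                # REF and ALT
        for (i = 10; i <= NF; i++) {
            split($i, f, ":")         # f[1] is GT when GT is the first FORMAT subfield
            key = key FS f[1]
        }
        if (!(key in seen)) { seen[key] = 1; print }   # keep first occurrence only
    }
' toy.vcf > toy.dedup.vcf

cat toy.dedup.vcf
```

For a compressed file, the same awk program could presumably read from a pipe (for example from bcftools view, which writes uncompressed VCF to standard output by default) rather than from a file on disk. With 80 GB of input, whether the table of seen keys fits in memory depends on how many distinct genotype patterns there are, so it may be worth trying on one chromosome first.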