I want to filter for unique SNPs. For example, the two positions below have the same SNP calls (A C C) across the samples, so I want to keep only one row from each set of duplicate rows.
#CHROM POS ID REF ALT QUAL FILTER INFO Sample1 Sample2 Sample3
Chr01 4560 . A C 20.6048 . A C C
Chr01 6476 . A C 37.3745 . A C C
In R, I used the unique() function from the data.table package. However, I would like to do this with a Linux command instead, to reduce the file size before downstream analysis in R.
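A minimal sketch of the kind of command I have in mind, assuming the sample calls are whitespace-separated and start at column 8 of each data row, as in the rows shown above. The file names, the `first` variable, and the third data row are made up for illustration, and I am not sure this is the best approach. In a standard VCF, the sample columns would start at column 10 (after FORMAT) instead.

```shell
# Hypothetical input: the two rows from the example plus one invented row.
cat > snps.txt <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO Sample1 Sample2 Sample3
Chr01 4560 . A C 20.6048 . A C C
Chr01 6476 . A C 37.3745 . A C C
Chr01 7001 . G T 41.2000 . G T G
EOF

# first = index of the first sample column (assumed to be 8 here; adjust as needed).
awk -v first=8 '
  /^#/ { print; next }               # pass header lines through unchanged
  {
    key = ""
    for (i = first; i <= NF; i++)    # build a key from the sample calls only
      key = key SUBSEP $i
    if (!(key in seen)) {            # keep only the first row for each call pattern
      seen[key] = 1
      print
    }
  }
' snps.txt > snps.unique.txt
```

This keeps the first position seen for each pattern of calls and preserves the original row order, which I think is what unique() does in R.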
I have two files:
1. A VCF file with multiple samples: SNPs were called with bcftools and filtered for indels. In this file the SNP calls are numeric genotypes such as 0/0, 0/1, 1/1, etc.
2. A text file with multiple samples: SNPs were called with bcftools and filtered for indels. The numeric calls (0/0, 0/1, 1/1, etc.) were converted to letters such as A, G, T, C, etc. using bcftools query and awk. This file has no header information and is no longer a VCF. For example:
Sample1 Sample2 Sample3
G G G
G G G
A A A
I want to remove rows where the SNP calls across the samples are identical to those at another position, as shown above, keeping only one of them. I can use either of these files, but I am not sure which one would be better or how to do it.
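For the text file, a minimal sketch of what I imagine could work, assuming there is one row per position, one column per sample, and no header line. The file names `calls.txt` and `calls.unique.txt` are hypothetical.

```shell
# Hypothetical input: one row per position, one column per sample, no header.
cat > calls.txt <<'EOF'
G G G
G G G
A A A
EOF

# Print a line only the first time it is seen; later duplicates are skipped.
awk '!seen[$0]++' calls.txt > calls.unique.txt
```

As far as I understand, `sort -u calls.txt` would also remove the duplicates, but it reorders the rows, whereas the awk version keeps them in their original order.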
Can anyone suggest something for this job? Thank you very much for your help!