Filter dosage file by list of SNP IDs
7 weeks ago
James ▴ 10

Hello, does anyone know of a fast, computationally efficient way to select the lines of a .dosage file whose first-column SNP ID also appears in a .txt file of SNP IDs?

The .dosage file is in the following format:

SNPID Position REF ALT Sample1Dosage Sample2Dosage Sample3Dosage . . .
1:100:A:C A C 0 2 1 . . .
1:101:C:T C T 1 2 1 . . .
. . .


The list of SNP IDs in a .txt document is in the following format:

1:100:A:C
1:101:C:T
1:103:G:A
1:105:C:T


. . .

I have tried grep -f snp_IDs.txt example.dosage > filtered_example.dosage, but the command is too slow: it hits my server's maximum wall time before it finishes.

7 weeks ago
James ▴ 10

Found the solution myself, but keeping this question up for others who may run into the same problem. Instead of using:

grep -f snp_IDs.txt example.dosage > filtered_example.dosage


Use:

grep -F -f snp_IDs.txt example.dosage > filtered_example.dosage


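This runs much faster, because -F makes grep treat each line of snp_IDs.txt as a fixed string instead of a regular expression. It only works as long as you don't need any regex patterns in the ID list.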
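One caveat for anyone reusing this: grep -F -f matches an ID anywhere on the line and as a substring, so 1:100:A:C would also select a row whose ID is 1:100:A:CT. If that matters for your data, an exact match on the first column may be safer. Below is a minimal sketch using awk, not something tested on a real .dosage file. The tiny demo files and the workdir variable are made up for illustration, and it assumes whitespace-separated columns and a header on the first line of the dosage file.

```shell
# Hypothetical demo files; replace these with your real snp_IDs.txt and example.dosage.
workdir=$(mktemp -d)
printf '%s\n' '1:100:A:C' '1:103:G:A' > "$workdir/snp_IDs.txt"
printf '%s\n' \
  'SNPID REF ALT Sample1Dosage Sample2Dosage' \
  '1:100:A:C A C 0 2' \
  '1:100:A:CT A CT 1 1' \
  '1:101:C:T C T 1 2' \
  '1:103:G:A G A 2 0' > "$workdir/example.dosage"

# While reading the first file (NR==FNR): store each ID as an array key.
# While reading the second file: keep the header line (FNR==1) and any row
# whose first column is exactly one of the stored IDs.
awk 'NR==FNR { ids[$1]; next } FNR==1 || ($1 in ids)' \
  "$workdir/snp_IDs.txt" "$workdir/example.dosage" \
  > "$workdir/filtered_example.dosage"
```

On the real files the same one-liner would be awk 'NR==FNR { ids[$1]; next } FNR==1 || ($1 in ids)' snp_IDs.txt example.dosage > filtered_example.dosage. The ID lookup is a hash-table check, so it should scale roughly linearly with file size, though I have not benchmarked it against grep -F.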