21 months ago

Hello, I have a first file containing a lot of lines.

9   141016262   rs2278973   T   G   .   PASS    ENSG00000148408;ENST00000277551|ENST00000277549 GT  T|G G   G   T|G G   G   G   G   G   G   G   G   T|G G|G G   G   G   G   G   G   T   T|G G   T|G G   T|G G   T|G G   T|G G   G   G   T|G T|G G   G   G   G   G
9   141016271   rs201383337 C   T   .   PASS    ENSG00000148408;ENST00000277551|ENST00000277549 GT  C   C   C   C   C   C   C   C   C   C   C   C   C   C|C C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
9   141016441   rs150679456 A   G   .   PASS    ENSG00000148408;ENST00000371372|ENST00000371363|ENST00000371357|ENST00000371355 GT  A   A   A   A   A   A   A   A   A   A   A   A   A   A|A A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A   A
10  225960  rs370081585 G   A   .   PASS    ENSG00000015171;ENST00000439456|ENST00000397962|ENST00000509513|ENST00000381591|ENST00000403354|ENST00000402736|ENST00000602682|ENST00000397955|ENST00000558098|ENST00000381607|ENST00000397959|ENST00000309776 GT  G   G   G   G|G G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G
10  292763  rs150625727 C   T   .   PASS    ENSG00000015171;ENST00000397962|ENST00000509513|ENST00000381591|ENST00000403354|ENST00000402736|ENST00000602682|ENST00000381584|ENST00000558098|ENST00000627286|ENST00000381607|ENST00000397959|ENST00000309776|ENST00000381604 GT  C   C   C   C|C C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
10  294892  rs781271016 T   C   .   PASS    ENSG00000015171;ENST00000397962|ENST00000509513|ENST00000381591|ENST00000403354|ENST00000402736|ENST00000602682|ENST00000381584|ENST00000558098|ENST00000627286|ENST00000381607|ENST00000397959|ENST00000309776|ENST00000381604 GT  T   T   T   T|T T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T   T
10  327162  rs142438404 C   T   .   PASS    ENSG00000151240;ENST00000280886|ENST00000634311|ENST00000381496 GT  C   C   C   C|C C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C


However, I would like to keep only the lines whose identifier appears in my second file.

rs370081585
rs150625727


desired result:

10  225960  rs370081585 G   A   .   PASS    ENSG00000015171;ENST00000439456|ENST00000397962|ENST00000509513|ENST00000381591|ENST00000403354|ENST00000402736|ENST00000602682|ENST00000397955|ENST00000558098|ENST00000381607|ENST00000397959|ENST00000309776 GT  G   G   G   G|G G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G   G
10  292763  rs150625727 C   T   .   PASS    ENSG00000015171;ENST00000397962|ENST00000509513|ENST00000381591|ENST00000403354|ENST00000402736|ENST00000602682|ENST00000381584|ENST00000558098|ENST00000627286|ENST00000381607|ENST00000397959|ENST00000309776|ENST00000381604 GT  C   C   C   C|C C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C


code tried:

while read line; do awk '/$line/ { print$0 }' vcf.vcf; done < rs.txt


thank you

Thank you very much, I was completely stuck. Thank you!

I have moved the comments pointing you to the right solution to an "Answer", so you can mark it as accepted and thereby indicate that this thread is solved.

21 months ago
GenoMax

Take a look at the inline help for grep, especially the -w and -f (a file with things you want to search for) options.

If you are filtering a VCF file then use tools meant for managing VCF files such as bcftools and vcftools.

+1 on using bcftools. Your awk will benefit greatly from matching just one column ($3) instead of the entire line. But definitely go for bcftools view.
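For what it's worth, the awk attempt in the question cannot match anything: inside single quotes the shell never expands $line, so awk sees the literal regex /$line/ (an end-of-line anchor followed by "line"). Below is a minimal sketch of the column-based approach, assuming the ID is in the third whitespace-separated column and the ID list has one ID per line. The temporary files and their shortened contents are stand-ins made up for illustration, not the real data.

```shell
# Create small stand-in files (hypothetical, shortened records).
workdir=$(mktemp -d)
printf 'rs370081585\nrs150625727\n' > "$workdir/rs.txt"
printf '9\t141016262\trs2278973\tT\tG\n10\t225960\trs370081585\tG\tA\n10\t292763\trs150625727\tC\tT\n10\t294892\trs781271016\tT\tC\n' > "$workdir/vcf.vcf"

# First file (NR==FNR): store every ID from rs.txt as an array key.
# Second file: keep header lines (starting with #) and records
# whose third column is one of the stored IDs.
awk_out=$(awk 'NR==FNR { ids[$1]; next } /^#/ || ($3 in ids)' \
    "$workdir/rs.txt" "$workdir/vcf.vcf")

printf '%s\n' "$awk_out"
```

This reads the VCF only once instead of once per ID, and an ID that happens to appear in another column will not cause a false match.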

21 months ago

You can try this, where one file (List_of_ids) contains only the list of IDs and the other (all_data) contains all the data, including the IDs:

grep -Fwf List_of_ids all_data

However, it only looks for the last line of the List_of_ids file in the all_data file and not the whole list...

No, it shouldn't do that, it should work. Did you make the List_of_ids on Windows? If so, use dos2unix on it, as the line endings might be different.
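A small sketch of why Windows line endings would produce exactly this symptom, assuming the list was saved with CRLF line endings and no newline after the last ID. Every ID except the last then carries an invisible trailing carriage return, so grep searches for "rs370081585\r", which never occurs in the data. The file contents here are made-up stand-ins, and tr -d '\r' is used as a substitute for dos2unix in case that tool is not installed.

```shell
workdir=$(mktemp -d)

# Simulate an ID list saved on Windows: CRLF after the first ID,
# no newline at all after the last one.
printf 'rs370081585\r\nrs150625727' > "$workdir/List_of_ids"
printf '10\t225960\trs370081585\tG\tA\n10\t292763\trs150625727\tC\tT\n10\t294892\trs781271016\tT\tC\n' > "$workdir/all_data"

# Count matching lines while the carriage return is still in the list.
before=$(grep -cFwf "$workdir/List_of_ids" "$workdir/all_data")

# Strip carriage returns, then count again.
tr -d '\r' < "$workdir/List_of_ids" > "$workdir/List_of_ids.unix"
after=$(grep -cFwf "$workdir/List_of_ids.unix" "$workdir/all_data")

echo "matching lines before: $before, after: $after"
```

Running file List_of_ids is a quick way to check a real list: it typically reports "with CRLF line terminators" when the file has this problem.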

Thank you! This is exactly the solution to my problem: an encoding problem!