I have a multiline FASTA file comprising 38667 sub-sequences.
Each header starts with ">" and is very long.
How do I extract some of the sub-sequences using a specific keyword?
Example IDs (one ID per line):
XP_034407502.1
XP_034416580.1
XP_034403031.1
I have 5000 IDs in a text file.
I tried faSomeRecords, but it generates an empty file.
Before running faSomeRecords, I removed the ">" from the IDs in the IDs.txt file.
faSomeRecords lumpsC.fa IDs.txt IDs.txt.fa
ls -lh IDs.txt.fa
-rw-rw-r-- 1 sun sun 0 Jun 22 05:11 IDs.txt.fa
I know this can be done with grep, but it is quite hard to do manually for 5000 IDs. Any suggestions, please?
grep "XP_034407502.1" lumpsC.fa > retreive_IDs_IDs.txt
cat retreive_IDs_IDs.txt
>lcl|NC_046980.1_cds_XP_034407502.1_23875 [gene=LOC117743771] [db_xref=GeneID:117743771] [protein=sprouty-related, EVH1 domain-containing protein 2-like] [protein_id=XP_034407502.1] [location=join(2056311..2056336,2070742..2070922,2071008..2071176,2071587..2071651,2075371..2075529,2075652..2076434)] [gbkey=CDS]
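A possible explanation for the empty output, offered as an assumption rather than a confirmed diagnosis: faSomeRecords appears to compare each listed name with the first word of the header (here lcl|NC_046980.1_cds_XP_034407502.1_23875), so a bare protein ID such as XP_034407502.1 would never match. The Python sketch below takes a different route and is not part of faSomeRecords: it assumes every header carries a [protein_id=...] tag like the one in the grep output, and keeps a record when that value is in the ID list. The function names read_ids, filter_fasta and extract_records are hypothetical, chosen only for this illustration.

```python
import re

# Assumption: the ID of interest is the value of the [protein_id=...] tag
# in each header, as seen in the grep output above.
PROTEIN_ID = re.compile(r"\[protein_id=([^\]]+)\]")


def read_ids(lines):
    """Collect the wanted IDs from an iterable of lines, one ID per line.

    Blank lines are skipped and a leading '>' is tolerated.
    """
    return {line.strip().lstrip(">") for line in lines if line.strip()}


def filter_fasta(lines, wanted):
    """Yield every line of each record whose protein_id is in `wanted`.

    Works on multiline FASTA: the keep/skip decision made at a header
    applies to all sequence lines up to the next header.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            match = PROTEIN_ID.search(line)
            keep = match is not None and match.group(1) in wanted
        if keep:
            yield line


def extract_records(ids_path, fasta_path, out_path):
    """Write the matching records from fasta_path to out_path."""
    with open(ids_path) as ids_file:
        wanted = read_ids(ids_file)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta(fasta, wanted))
```

With the file names from the question, this would be called as extract_records("IDs.txt", "lumpsC.fa", "IDs.txt.fa"). Holding the IDs in a set keeps each header lookup cheap, so 5000 IDs against 38667 records needs only one pass over the file. If some headers lack the protein_id tag, those records are silently skipped, so it is worth comparing the number of headers in the output with the number of IDs requested.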