I would like to filter sequences on the Unix command line, with grep or BBMap, based on a list of names stored in a separate file. The list does not contain the full sequence names, only the first part of each name.
The name list looks like:
cre
cln
pab
pde
pta
ppt
smo
atr
seu
pgi
cca
han
The sequences look like this (the names from my list are at the beginning of each header, as the first three characters):
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cel-let-7-3p MIMAT0015091 Caenorhabditis elegans let-7-3p
CUAUGCAAUUUUCUACCUUACC
>cel-lin-4-5p MIMAT0000002 Caenorhabditis elegans lin-4-5p
UCCCUGAGACCUCAAGUGUGA
>pad-lin-4-3p MIMAT0015092 Caenorhabditis elegans lin-4-3p
ACACCUGGGCUCUCCGGGUACC
>pad-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p
CAUACUUCCUUACAUGCCCAUA
>cel-miR-1-3p MIMAT0000003 Caenorhabditis elegans miR-1-3p
UGGAAUGUAAAGAAGUAUGUA
>cel-miR-2-5p MIMAT0020302 Caenorhabditis elegans miR-2-5p
CAUCAAAGCGGUGGUUGAUGUG
>cca-miR-2-3p MIMAT0000004 Caenorhabditis elegans miR-2-3p
UAUCACAGCCAGCUUUGAUGUGC
>cca-miR-34-5p MIMAT0000005 Caenorhabditis elegans miR-34-5p
AGGCAGUGUGGUUAGCUGGUUG
My BBMap command is the following:
./bbmap/filterbyname.sh in=mature.fa out=filtered.fa include=t names=names.txt substring
I don't have an idea for how to do this with grep.
The problem is that this command also keeps other (wrong) sequences, because substring matching accepts a name anywhere in the header and I don't know how to restrict the match to names that appear at the beginning. Maybe grep would be better?
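To make concrete what "only at the beginning" means, here is a minimal sketch using awk rather than grep or BBMap. It is not a tested solution for the full data set. It assumes that the names in names.txt are separated by whitespace or newlines, and that the part of each header between ">" and the first "-" is the name to compare. The function name filter_fasta_by_prefix is hypothetical, chosen only for this illustration.

```shell
# Hypothetical helper: print only the FASTA records whose header prefix
# (the text between ">" and the first "-") appears in the names file.
# Usage: filter_fasta_by_prefix NAMES_FILE FASTA_FILE
filter_fasta_by_prefix() {
    awk '
        # First file: remember every whitespace-separated name.
        NR == FNR { for (i = 1; i <= NF; i++) keep[$i]; next }
        # Header line: compare only the prefix before the first "-".
        /^>/ { split(substr($1, 2), part, "-"); printing = (part[1] in keep) }
        # Print the header and all of its sequence lines while the flag is set.
        printing
    ' "$1" "$2"
}
```

It would be called as filter_fasta_by_prefix names.txt mature.fa > filtered.fa. Because the comparison is an exact match on the prefix, a name that occurs later in the header is ignored, and sequences wrapped over several lines stay with their header.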
Please help. Best wishes, thend