Question: sequence extraction from a multi-FASTA file using IDs in another text file
tcf.hcdg (European Union) wrote 3.7 years ago:

Hello

I have a large multi-FASTA file and want to extract some of the sequences from it. I tried grep on Linux, and it does produce output:

HP-Pavilion-dv6-Notebook-PC ~/bioinformatics/tu/fastaretrival $ grep -A1 -wFf id.txt input.fasta > result.fa

But the result file contains unexpected lines consisting of -- after almost every sequence.

The "id" file is as follows:

>C_c04661_3
>C_i00001_3
>C_i00001_6
>C_i00002_4
>C_i00007_4
>C_i00008_3
>C_i00009_4
>C_i00011_2
>C_i00012_5
>C_i00013_6

The input file is as follows:

>C_c04661_2
KKETFHSIQNVWIDNNHLTETVKFYENAVHEKRIRIFLELISHTIRLACSILVQMAFLPHHSSFLELQSLWSLTCVIQISHIPLCSSSGKDILHSLPQCFHPSHSSHCGLLLGDLKRSNQHASMDLFPVVGSPQHPESHLLPFFSSLPMLVQPQQVCVVPYHCHTNTVLLPVDYPEHPNPSVSPFLLTFPRLLWQDPLHSQKEQKFQIHLHLCTRFWGSILSHFPTLGSLQNISCSNHLSLSVEELEQQEGPEELKEISHHECAMRHLHLITLSCAPQCEPNPYPCIQHPPSIYDHQPWSQSHQHHPSHLIQEFQFHWQARQHQQQVSQGKKCCLFPGSVHTCVRHQKDSRSPYNQSPLSLCTEACSTQQMAPSFHHQHGNHKEQSFAQIPQNNIYELPPSYQPVQGGGLIGQVAINPLSFSIQKVSTVKDCAVLYPWLFRKMQIEPKRPYPTKKEIGFYIYVCVYASARHREIMYPNMCTAVEKKIIIIKK
>C_c04661_3
RKKHFIPFRMCGKTIEISILKQSSFMKMLSMRNELEYFWNATPKSGWLVVFCRYKWHFCDSPIIVLSSNCNLNFGPLVSRFHISHSYVLLQARTKSSTRDFPNVFTLLILNLTVASCWGTRDQTNTPDRWTCFQLFEDHLSTQKATFCPSSLPYSPCSSLNRCDVLFHIIAIQTQSCFQSKTIPSTQTRQSHLSFLFQDCFGKIHCILRRNRNFKSIFTCVPGSGEEAFAIDFQHSEVSKIYLAQIIFHVLKSWSSKRALKSKRSLIMNVPDISTFRFEHFHVLLNVSQILIPASSIDHQVYMIISNLGDHRVINSTTLLICKYSKSSSSIGKRGNISNNKFLKERNAVFSLEAQSTHVDIKKTPVLPTIDSRVHDVFVLNRHAPPSKWHHLSTISNMEIIKNSLFELRFRRIISTSYRLLISLCKAAGDLAKLPTPSHFQYKKYRLLRIVQYCILGFSEKCKSESPKDHIPPRKRLAFIYMCVYMPLQGTEERCIQICVLQKKKK
>C_c04661_4
FFIIIIFFSTAVHIFGYIISLQCLAEAYTHTYIKPISFLVGYGLLGSQICIFLKSQGYSTAQSLTVDTFCIENERGFMATWPINHPPPCTGEGGSSILFCGIAQKDCSLFPCCWWKDGAICWVEHAYSVQRLNHGLYCQLGERESFCLTHVWTEPQGKRQHFFPETCYCCCLACQWNWNSYCIYKEGWCYLCDHQGYSYILGGQCWMQGGFGSHGAHESVQIRWRCLMAHSEISFSSSGPSCCSNSSTLNERFEQDIFWRLPSVGNQWLKMLPHQNRVHRRWINFCSFECSGSCQSNLGKVRRKGETDGFGCSGSTGSRTVFVWQYGTTHHTCGWTNMGYREEKKGRRWLSGCGDPQTTGNRSIDQACWFDLFRSPSKRPQDEEGKHWGSHEWRISLPEEEHSYGICEIWITQVRDQSDCSSRKELWGYHKNAICTYKILQANLIVWLISSRNILIRFSWTAFSNLTVSVKCFQLSIYHTFMENVSFF
>C_c04661_5
FFYYYYFFFYCSTHIWIHYLSSVPCRGIYTHIYIKANLFLGGIWSFGLSDLHFSEKPRIQYCTILNSRYFLYKEGVYGNLANQSPAALHRLIRRRLVDIILRNLSSKRLFFMISMLLMVERWCHLLGGACLFSTKTQSWTLLSIVGRTGVFLMSHTCVDASREKTAFLSLRNLLLLMLPRLPMELELLLYLQMRRVVLLMTLSPRLLMIIYTWWSMLDAGIRIWLTLRSTKCSNLKVEMSHGTFMMRDLFLFRALLLLQLFNTKMIARYILETSECWKSMAQNASSPEPGTQVKMDLKFLFLLRMQWILPKQSWKSQKERDRVWVLGIVLDWKQDCVCMAMIWNNTSHLLRLDHGLGREEGQKVAFWVLRSSNNWKQVHRSGVLVSLQVPQQEATVRLRMRRVKTLGKSRVEDLVLARRTLWDMNLDHTSQGPKLRLQFEERTMMGLSQKCHLYLQNTTSQPDLGVAYQFQKYSNSFLMDSIFIKLDCFSQMLISIVYLPHILNGMKCFFL
>C_c04661_6
FLLLLFFFLLQYTYLDTLSLFSALQRHIHTHIYKSQSLSWWDMVFWALRFAFFKAKDTVLHNPQSILFVLKMRGGLWQLGQSITRRLAQADKKAVARRYYSAESELKKTVLYDFHVANGGKMVPFAGWSMPIQYKDSIMDSTVNCRENGSLFDVSHMCGLSLKGKDSISFLEKLVIADVASLANGTGTLTVFTNEKGGAIDDSVITKVTDDHIYLVVNAGCRDKDLAHIEEHMKVFKSKGGDVSWHIHDERSLLALQGPLAAPTLQHLMKDDLSKIYFGDFRVLEINGSKCFLTRTGYTGEDGFEISVPSENAVDLAKAILEKSEGKVRLTGLGARDSLRLEAGLCLYGNDMEQHITPVEAGLTWAIGKRRRAEGGFLGAEVILKQLETGPSIRRVGLISSGPPARGHSEIKNEKGENIGEVTSGGFSPCLKKNIAMGYVKSGSHKSGTKVKIAVRGKNYDGAITKMPFVPTKYYKPTFRCGLSVPEIFFVSHGQHFHKTLFQSNANFNCLFTTHSEWNEMFLSS
>C_i00001_1
TKIYDSLIKQKIKISVKMNNQERMKTLTHFPCSHPLCVLSKWFKVVKQICNIHLLVLRPEIYKFKNLSCSKSKLNNALIALTELALMSGLSTKLTFNIRQVKEGKLLQRIIYIEALRHSTLRKITHSLLHSQDTFLVCIFVCLGDKRKFFWSLHCSFSDLCAHRLLVPRCYLYFAFTVTIKLCCSKFFLHSYCFNLDNHLCLLHSLIKILHVSKNSCLCRGFFTLIIIVLLSSETINTKRPTVLLKTSISRPTNLLTNYICSIYITQHNGAISCPLSDWWVGNRRSHFQSSDEERERERERERERPWLVYRDLHRLEDKARQVWYGTIGSWESYINVKKKNKKENKPKKKLTEEQEVIRRSTLTHSNPLNQSIPLIEADPTEDEATEQARLRRSNRLRQKYPPVVYAIHLANPLRIRVIGERRQLLLVLRVSADRGNRWWEFVPAKIWNILGYVLYYWRLILWIGFDDAVFVFIGLGICYLEKLLLGFLLWWLNSVQCSSCFCFSIRSNRDGLGWGFHVITCVNLLLFLGFFLKKNCNLFVIIIKKIL
>C_i00001_2
PNKSMIHLLNDKNKKSNQSKTTRRGRHHIFPVHIHYSVFFLSGSKNKYATFIYYDQRYISSKTSPVQRVNTMLNLSPLQSHLCLVVQNHLILDKKRGNNYFSASFTLRPGIPKLCARSLILSFIAKTLFWCASLYASVTRRESFSGPNSIAASVISVRTGCWFPDSVTFISSPSQPSNSAAASSFCTRTASTWTTICAFFTVSRFCMFPKIPAFAGASSPLSLLSFFPPNRPTQRDLRFCLKLRYPDQQICLPTTFAVEFTLHSDIEMEPFLVHSVTGGFDRGIVEAIFRVRIEKRERERERERERGHGWFTEICNIVKTRLVRYGMGRVLGRATSTESKRRTRKKTSPRKNSRRSRRSDGRRPTQIPTNQYRKPIQRRTRLPNRQGSSGDRTAFAKSIRLWFMQSIWQIHEESENEDNYYYEAQIAVTGGGSLYRRRFGIFLGMFCIIGDFNYGSDLMMQSSSSDVNEFVINKNYYVFFYGGIVFNVHHVFVSQLDLIEMVWAGGFMLHVICYSSVFFKKTVIFLYKLLLRRY
>C_i00001_3
QINLFTYMTKINKNLISQNEQPGEDEDIDTFSLFTSTIVCSFVVQSSETNMQHSFISIETRDIVQKPLLFKEIKQCLTYRPYRVSTYVWFEYKIDIYTSKRGEIITSAHHLHGLEAFLNSAQDHSFSPSPRHFSGVHLCMPRLEEKVFLVLTPLQLQSLCAQVAGSQIVLPLLVRLHSNHQTLLQQVLSALVLLQLGQPFVPSSQSHKDSACFQKFLPLQGLLHPYHYCPSFLRIDHKHKETYGFANFDIQTNKSAYQLHLQLNLHYTVTLKWSHFLSTQLVGLIGESSKPFSEFGLRREREREREREREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDT
>C_i00001_4

The result file, with the unexpected -- lines:

>C_c04661_3
RKKHFIPFRMCGKTIEISILKQSSFMKMLSMRNELEYFWNATPKSGWLVVFCRYKWHFCDSPIIVLSSNCNLNFGPLVSRFHISHSYVLLQARTKSSTRDFPNVFTLLILNLTVASCWGTRDQTNTPDRWTCFQLFEDHLSTQKATFCPSSLPYSPCSSLNRCDVLFHIIAIQTQSCFQSKTIPSTQTRQSHLSFLFQDCFGKIHCILRRNRNFKSIFTCVPGSGEEAFAIDFQHSEVSKIYLAQIIFHVLKSWSSKRALKSKRSLIMNVPDISTFRFEHFHVLLNVSQILIPASSIDHQVYMIISNLGDHRVINSTTLLICKYSKSSSSIGKRGNISNNKFLKERNAVFSLEAQSTHVDIKKTPVLPTIDSRVHDVFVLNRHAPPSKWHHLSTISNMEIIKNSLFELRFRRIISTSYRLLISLCKAAGDLAKLPTPSHFQYKKYRLLRIVQYCILGFSEKCKSESPKDHIPPRKRLAFIYMCVYMPLQGTEERCIQICVLQKKKK
--
>C_i00001_3
QINLFTYMTKINKNLISQNEQPGEDEDIDTFSLFTSTIVCSFVVQSSETNMQHSFISIETRDIVQKPLLFKEIKQCLTYRPYRVSTYVWFEYKIDIYTSKRGEIITSAHHLHGLEAFLNSAQDHSFSPSPRHFSGVHLCMPRLEEKVFLVLTPLQLQSLCAQVAGSQIVLPLLVRLHSNHQTLLQQVLSALVLLQLGQPFVPSSQSHKDSACFQKFLPLQGLLHPYHYCPSFLRIDHKHKETYGFANFDIQTNKSAYQLHLQLNLHYTVTLKWSHFLSTQLVGLIGESSKPFSEFGLRREREREREREREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDT
--
>C_i00001_6
SIFLIIIYTKRLQFFFKKKPKKSNKFTHVITNPQPKPSLLDLIEKQKHDEHTLFNHHKRKPNNSFSNQIHPNPMKTKTASSNPIHNINRQYKTYPRIFQIFAGTNSHHRLPRSALTLSTSSSCLRSPITLILSGFAKWIATTGGYFWRRRFDRRSYLACSVASSSVGSASINGIDWFKGFEWVNVDRLITSCSSVSFFLGLFSFLFFFLTQLMLSQEPIVPYHTRALSSKRCYRSLTSHGLSLSLSLSLSLSSQSELKWLRRFPYQTHQSLSGQEMAPFQCHCVMIQLQMLVSRFVGLDIEVLSKTVGLFVFMVYSEERRTIMIRVKKPLQRQEFLETCRIFMRLRRHKWLSKLKQYECRKNLLQQSLMVTVKANRHYLGTSNLCAQRSLKLQWSDQKNFLFSPRHTKMHTRKVSWLRREVILRRVECLKASMMMRSNYFPSFTCLILNVNFVLKPDISANSVRAISALFNLLFEQERFLNLYISGLNTNKMLHICFTTLNHLERTHYSGCEQGKCVNVFILSWLFILTDIFIYFCHLISESIYLV
--
>C_i00002_4
VSSFIQKDYSFFLKKNLRRVTNLHMSHETPSPNHLYILRNKNMMNIEHYSTTIKENLIIVFLINNKFINLILRRRLHHQIRSIIKLIANNTKHTQEYSKSSPVQTPTTGYRDLRLLLVLVVVVFVLRLLLFLVDLPNGLHKPQADTFGEGGSIAGATLPVRPRPPLDRLLSTVLIGLRDLSGSTSTVSPPAPPVFSWACFLSCSSFLSCSSPKNLSSHTIPDEPCLLNDVTDLCKPAMASLSLSLSLSLSLSLSLLVHRQNYIYDKPQLYQFSFYSYRARQNMNNNYHFVLCTHVLSIPIPPSPTLLLQFNKRKALHIFSVLIMIIVSFIILVLPLDLLVPGLLIHSLVWILLIFIKIRDIEVRRNLLHLYTRSMLNVTKILQHLHFDCTKIRFRVCIIYYMPMWNLQIFWPKIFNVIVVGDLVWKLCVKNSSFNSPTPSNILFCVSATSSNQGQVEFLHKLNTLSMTINGKIEAAAISSICSTLEDYDTWS
--
>C_i00007_4
LSLSLSLSLFSIRTLKMASTIPLSNPPVTEWRRNGSISMSLCNVNSTANVVGKQISWSGYRSFKQNRRSLCVYGLFGGKKDNNDKGEEAPAKAGIFGNMQNLYETVKKAQMVVQVEAVRVQKELAAAEFDGYCEGELIKVTLSGNQQPVRTEITEAAMELGPEKLSLLVTEAYKDAHQKSVLAMKERMSDLAQSLGMPQGLNVNDALKLFPLFYLSNIKCQFCTQTRHKCLCKGDKLSIVFTLTGEVFELIYLWSQYMNVAYLFHYFEPLRKNTLWMTGKMCQCLHPLLVVHFDLDFYLFLSFNKIIDLFG
--
>C_i00008_3
EREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDTLERERERERERERDFFQILAMEGFDGYKPAMAMVGLQCIYTGLALFTRAA
--
>C_i00009_4
VRTYVRTYVRTYVRTTTTTTTTTTLSLSLSLSLSLSLSLSLSLYLSLSLSHNFLVTLLSVLLLTTTSSSDLAKLYILIMKQVVLKLDFHDDRTKKKIMKTVSGHSGIDSISMDSKDMKLTVTGDIDPVSLVSKLRKLCNAEILSVGPPKAPEKKKEEAKKEEPKKQEPKKDELTELQKIWIAHQNAQMVSRPQPQYFVRSVEEDPNACVICAFIDCCDLPSRDVNFFNVGLGELMEGRLICFILFYFINSFELIIIVCLIFIYNSLFP
--
>C_i00011_2
IHSNKNYHDVRTYFVDLNNLHLNLYRLSNVKFIEYFTKNKRKRIEKIQPISTNPILKQITYFYKNQNPKKRKFKDLNSGFFGRFWFGCRFLRFSRLSLLRQNWVNVGKNTTAGDCNTVKQFPQFLIVPHSQLNVSRVDSSLLVVPGSISGQFQNFSGEVFKNGSVDGSTGTSTLGVSSLLEESSDTTHGKLKSSLDGLSDRLLPVSAFPSSGSLGSSLGFCSFHCNEIWKLFSETIRFEFWREQKFVEFVDLV
--
>C_i00012_5
SQISSNSKLNKLLFPPKLESNRFREKLPNFVSAMETTKSTKGGAKGAGGRKGGDRKKSVTKSVKAGLQFPVGRIARFLKKGRYAQRTGTGAPVYLAAVLEYLAAEVLELAGNAARDNKKTRINPRHVQLAVRNDEELGKLLHGVTIASGGVLPNINPVLLPKKTKSAESEKPATKSPKSPKKAVVFKFPFFWVLVLVEICNLFKNGICTNRLDLFNPFSFVLGKIFNEFYLILSLPVQSIMQIILQIYKICSHIMIIFVGMN
--
>C_i00013_6
NQQRITCFPSISNFFKNSNLFTENLLRSFSNGNYKSNQGRSQGSRRKERRRQEEVGDVRQGWTSVPRGSYRSIPQEGKIRSTYWYRCSRLPCCCSIPRRRGFGVGRKCCSQQEDNQPTRSIGCEERGIREVASRCYNRQRWCSSQHPSFATKEDQVCIETCNQITQISQKSLSLGLISFFLGFGSCRNMFVEWDLYVGSFQSFFFCSWNIQILPNFIMIITCTVNLDANYFTDLQNMFSHHDNFCWNE

Can anybody tell me what I am doing wrong with the syntax, the input file, or the id file?

Thanks

grep fasta • 1.4k views
modified 3.7 years ago by Matt Shirley • written 3.7 years ago by tcf.hcdg
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 3.7 years ago:

It's because you used '-A 1'. With a context option such as -A, grep prints a "--" line between non-adjacent groups of matches. See http://stackoverflow.com/questions/2168065 "How do I get rid of “--” line separator when using grep with context lines?"

written 3.7 years ago by Pierre Lindenbaum

Yes, you can use --no-group-separator to remove these extra lines. Contrary to what is mentioned in the thread http://stackoverflow.com/questions/2168065/how-do-i-get-rid-of-line-separator-when-using-grep-with-context-lines, this is not an undocumented option. It is not described in man grep, but it is described in info grep.

written 3.7 years ago by Frédéric Mahé
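
To make the above concrete, here is a minimal sketch on toy data (the records and the scratch directory are invented for illustration; only the file names follow the question). It assumes GNU grep, since --no-group-separator is not a POSIX option, and it assumes each sequence sits on a single line, which is what -A1 relies on.

```shell
# Build a tiny example in a scratch directory (toy data, not the poster's files).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '>seq1\nAAAA\n>seq2\nCCCC\n>seq3\nGGGG\n>seq4\nTTTT\n' > input.fasta
printf '>seq1\n>seq3\n' > id.txt

# Default behaviour: non-adjacent groups of matches are separated by a "--" line.
grep -A1 -wFf id.txt input.fasta > with_separator.fa

# GNU grep only: suppress the separator at the source.
grep --no-group-separator -A1 -wFf id.txt input.fasta > result.fa

cat result.fa
```

Comparing with_separator.fa and result.fa shows that the only difference is the "--" line between the two records.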

Thanks, it works.

grep -v "^--" result.fa > finalresult.fa

 

written 3.7 years ago by tcf.hcdg
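
A small caveat on the grep -v "^--" filter: it drops every line that starts with "--", which could also hit a gapped (aligned) sequence line. A slightly stricter sketch, on toy data invented for illustration, removes only lines that are exactly "--":

```shell
# Scratch directory with a toy grep -A1 output; the second record is a
# gapped sequence that happens to begin with "--".
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1\nAAAA\n--\n>seq3\n--GGGG\n' > result.fa

# -x matches whole lines only, so "--GGGG" is kept;
# -e stops the pattern "--" from being parsed as an option.
grep -vx -e '--' result.fa > finalresult.fa

cat finalresult.fa
```

For plain protein or nucleotide records without gap characters, both filters should give the same result.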

I have a space character at the end of each ID, and I want to get rid of these spaces.

I used this sed command:

sed -i 's/ *$//' id.txt

but it's not working.

Any suggestions?

 

modified 3.7 years ago • written 3.7 years ago by tcf.hcdg

"It's not working" is not that helpful. Perhaps give a description of what behavior you expect and what your command is actually doing. From what I can guess, your issue is that "*$" removes every line in your input file and leaves you with a blank output. You're also using the  -i flag which overwrites your existing file. It also appears you have a backtick (`) instead of a single quote starting your expression. Try matching a space before the end of the line explicitly:

sed 's/ $//' < id.txt > id_nospace.txt
written 3.7 years ago by Matt Shirley

As you said, the issue was "*$". I tried the code without the backtick (') and now it gives me the desired output.

sed s/" "// < id.txt > idfinal.txt

As I expected, the code removes the blank spaces from id.txt and stores the result in a new file, idfinal.txt.

 

Thanks for the help. 

 

written 3.7 years ago by tcf.hcdg
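
For reference, a minimal sketch of trailing-whitespace cleanup on a toy ID list (the IDs and output file name are made up). As far as I can tell, 's/ *$//' only strips trailing spaces and should not empty a file, so when it appears to do nothing the trailing character may be a tab or a Windows carriage return rather than a space; that is an assumption about the file, not something shown in this thread. Note also that s/" "// without the g flag removes only the first space on each line, which is enough only when each line has a single space. The version below anchors on the end of the line and uses the POSIX class [[:space:]]:

```shell
# Scratch directory with a toy ID list: one line ends in a space, one in a
# tab, one in a carriage return, and one is already clean.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1 \n>seq2\t\n>seq3\r\n>seq4\n' > id.txt

# Strip any run of trailing whitespace (spaces, tabs, CR) from each line,
# writing to a new file so the original stays untouched.
sed 's/[[:space:]]*$//' id.txt > id_clean.txt

# "sed -n l" prints each line with a "$" marking its end and escapes
# invisible characters, which makes leftover whitespace easy to spot.
sed -n l id_clean.txt
```

Writing to a new file rather than using -i keeps the original list available in case the expression turns out to be wrong.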
Matt Shirley (Cambridge, MA) wrote 3.7 years ago:

You can use a tool like "samtools faidx" or pyfaidx to do this:

$ (sudo) pip install pyfaidx
$ xargs faidx input.fasta < ids.txt > output.fasta
written 3.7 years ago by Matt Shirley
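
One thing to watch with faidx-style tools, as far as I know: they look sequences up by name without the leading ">", so the ">" in the ID file from the question would probably need stripping first (for example with sed 's/^>//'). If neither tool is installed, a plain awk filter is a rough alternative. This is a swapped-in technique, not what faidx does internally, and the records below are toy data invented for illustration. Unlike grep -A1, it also copes with sequences wrapped over several lines and with headers that carry a description after the ID.

```shell
# Scratch directory with a toy FASTA (one wrapped record, one header with a
# description) and an ID list that has a leading ">" and a trailing space.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1\nAAAA\nAAAA\n>seq2\nCCCC\n>seq3 some description\nGGGG\n' > input.fasta
printf '>seq1\n>seq3 \n' > id.txt

awk '
    # Tolerate Windows line endings in either file.
    { sub(/\r$/, "") }
    # First file (ID list): take the first field, drop the leading ">", remember it.
    NR == FNR {
        if (NF) { id = $1; sub(/^>/, "", id); wanted[id] = 1 }
        next
    }
    # Second file (FASTA): on each header, decide whether this record is wanted.
    /^>/ { keep = (substr($1, 2) in wanted) }
    # Print the header and every following sequence line while keep is true.
    keep
' id.txt input.fasta > result.fa

cat result.fa
```

The ID comparison is an exact match on the first whitespace-delimited word of the header, so seq1 does not pull in a record named seq10.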