9.3 years ago
tcf.hcdg
Hello
I have a large multi-FASTA file and want to extract some of the sequences from it. I tried grep on Linux, and it does produce output:
HP-Pavilion-dv6-Notebook-PC ~/bioinformatics/tu/fastaretrival $ grep -A1 -wFf id.txt input.fasta > result.fa
But the result file contains unexpected lines with -- after almost every sequence.
The id file is as follows:
>C_c04661_3
>C_i00001_3
>C_i00001_6
>C_i00002_4
>C_i00007_4
>C_i00008_3
>C_i00009_4
>C_i00011_2
>C_i00012_5
>C_i00013_6
The input file is as follows
>C_c04661_2
KKETFHSIQNVWIDNNHLTETVKFYENAVHEKRIRIFLELISHTIRLACSILVQMAFLPHHSSFLELQSLWSLTCVIQISHIPLCSSSGKDILHSLPQCFHPSHSSHCGLLLGDLKRSNQHASMDLFPVVGSPQHPESHLLPFFSSLPMLVQPQQVCVVPYHCHTNTVLLPVDYPEHPNPSVSPFLLTFPRLLWQDPLHSQKEQKFQIHLHLCTRFWGSILSHFPTLGSLQNISCSNHLSLSVEELEQQEGPEELKEISHHECAMRHLHLITLSCAPQCEPNPYPCIQHPPSIYDHQPWSQSHQHHPSHLIQEFQFHWQARQHQQQVSQGKKCCLFPGSVHTCVRHQKDSRSPYNQSPLSLCTEACSTQQMAPSFHHQHGNHKEQSFAQIPQNNIYELPPSYQPVQGGGLIGQVAINPLSFSIQKVSTVKDCAVLYPWLFRKMQIEPKRPYPTKKEIGFYIYVCVYASARHREIMYPNMCTAVEKKIIIIKK
>C_c04661_3
RKKHFIPFRMCGKTIEISILKQSSFMKMLSMRNELEYFWNATPKSGWLVVFCRYKWHFCDSPIIVLSSNCNLNFGPLVSRFHISHSYVLLQARTKSSTRDFPNVFTLLILNLTVASCWGTRDQTNTPDRWTCFQLFEDHLSTQKATFCPSSLPYSPCSSLNRCDVLFHIIAIQTQSCFQSKTIPSTQTRQSHLSFLFQDCFGKIHCILRRNRNFKSIFTCVPGSGEEAFAIDFQHSEVSKIYLAQIIFHVLKSWSSKRALKSKRSLIMNVPDISTFRFEHFHVLLNVSQILIPASSIDHQVYMIISNLGDHRVINSTTLLICKYSKSSSSIGKRGNISNNKFLKERNAVFSLEAQSTHVDIKKTPVLPTIDSRVHDVFVLNRHAPPSKWHHLSTISNMEIIKNSLFELRFRRIISTSYRLLISLCKAAGDLAKLPTPSHFQYKKYRLLRIVQYCILGFSEKCKSESPKDHIPPRKRLAFIYMCVYMPLQGTEERCIQICVLQKKKK
>C_c04661_4
FFIIIIFFSTAVHIFGYIISLQCLAEAYTHTYIKPISFLVGYGLLGSQICIFLKSQGYSTAQSLTVDTFCIENERGFMATWPINHPPPCTGEGGSSILFCGIAQKDCSLFPCCWWKDGAICWVEHAYSVQRLNHGLYCQLGERESFCLTHVWTEPQGKRQHFFPETCYCCCLACQWNWNSYCIYKEGWCYLCDHQGYSYILGGQCWMQGGFGSHGAHESVQIRWRCLMAHSEISFSSSGPSCCSNSSTLNERFEQDIFWRLPSVGNQWLKMLPHQNRVHRRWINFCSFECSGSCQSNLGKVRRKGETDGFGCSGSTGSRTVFVWQYGTTHHTCGWTNMGYREEKKGRRWLSGCGDPQTTGNRSIDQACWFDLFRSPSKRPQDEEGKHWGSHEWRISLPEEEHSYGICEIWITQVRDQSDCSSRKELWGYHKNAICTYKILQANLIVWLISSRNILIRFSWTAFSNLTVSVKCFQLSIYHTFMENVSFF
>C_c04661_5
FFYYYYFFFYCSTHIWIHYLSSVPCRGIYTHIYIKANLFLGGIWSFGLSDLHFSEKPRIQYCTILNSRYFLYKEGVYGNLANQSPAALHRLIRRRLVDIILRNLSSKRLFFMISMLLMVERWCHLLGGACLFSTKTQSWTLLSIVGRTGVFLMSHTCVDASREKTAFLSLRNLLLLMLPRLPMELELLLYLQMRRVVLLMTLSPRLLMIIYTWWSMLDAGIRIWLTLRSTKCSNLKVEMSHGTFMMRDLFLFRALLLLQLFNTKMIARYILETSECWKSMAQNASSPEPGTQVKMDLKFLFLLRMQWILPKQSWKSQKERDRVWVLGIVLDWKQDCVCMAMIWNNTSHLLRLDHGLGREEGQKVAFWVLRSSNNWKQVHRSGVLVSLQVPQQEATVRLRMRRVKTLGKSRVEDLVLARRTLWDMNLDHTSQGPKLRLQFEERTMMGLSQKCHLYLQNTTSQPDLGVAYQFQKYSNSFLMDSIFIKLDCFSQMLISIVYLPHILNGMKCFFL
>C_c04661_6
FLLLLFFFLLQYTYLDTLSLFSALQRHIHTHIYKSQSLSWWDMVFWALRFAFFKAKDTVLHNPQSILFVLKMRGGLWQLGQSITRRLAQADKKAVARRYYSAESELKKTVLYDFHVANGGKMVPFAGWSMPIQYKDSIMDSTVNCRENGSLFDVSHMCGLSLKGKDSISFLEKLVIADVASLANGTGTLTVFTNEKGGAIDDSVITKVTDDHIYLVVNAGCRDKDLAHIEEHMKVFKSKGGDVSWHIHDERSLLALQGPLAAPTLQHLMKDDLSKIYFGDFRVLEINGSKCFLTRTGYTGEDGFEISVPSENAVDLAKAILEKSEGKVRLTGLGARDSLRLEAGLCLYGNDMEQHITPVEAGLTWAIGKRRRAEGGFLGAEVILKQLETGPSIRRVGLISSGPPARGHSEIKNEKGENIGEVTSGGFSPCLKKNIAMGYVKSGSHKSGTKVKIAVRGKNYDGAITKMPFVPTKYYKPTFRCGLSVPEIFFVSHGQHFHKTLFQSNANFNCLFTTHSEWNEMFLSS
>C_i00001_1
TKIYDSLIKQKIKISVKMNNQERMKTLTHFPCSHPLCVLSKWFKVVKQICNIHLLVLRPEIYKFKNLSCSKSKLNNALIALTELALMSGLSTKLTFNIRQVKEGKLLQRIIYIEALRHSTLRKITHSLLHSQDTFLVCIFVCLGDKRKFFWSLHCSFSDLCAHRLLVPRCYLYFAFTVTIKLCCSKFFLHSYCFNLDNHLCLLHSLIKILHVSKNSCLCRGFFTLIIIVLLSSETINTKRPTVLLKTSISRPTNLLTNYICSIYITQHNGAISCPLSDWWVGNRRSHFQSSDEERERERERERERPWLVYRDLHRLEDKARQVWYGTIGSWESYINVKKKNKKENKPKKKLTEEQEVIRRSTLTHSNPLNQSIPLIEADPTEDEATEQARLRRSNRLRQKYPPVVYAIHLANPLRIRVIGERRQLLLVLRVSADRGNRWWEFVPAKIWNILGYVLYYWRLILWIGFDDAVFVFIGLGICYLEKLLLGFLLWWLNSVQCSSCFCFSIRSNRDGLGWGFHVITCVNLLLFLGFFLKKNCNLFVIIIKKIL
>C_i00001_2
PNKSMIHLLNDKNKKSNQSKTTRRGRHHIFPVHIHYSVFFLSGSKNKYATFIYYDQRYISSKTSPVQRVNTMLNLSPLQSHLCLVVQNHLILDKKRGNNYFSASFTLRPGIPKLCARSLILSFIAKTLFWCASLYASVTRRESFSGPNSIAASVISVRTGCWFPDSVTFISSPSQPSNSAAASSFCTRTASTWTTICAFFTVSRFCMFPKIPAFAGASSPLSLLSFFPPNRPTQRDLRFCLKLRYPDQQICLPTTFAVEFTLHSDIEMEPFLVHSVTGGFDRGIVEAIFRVRIEKRERERERERERGHGWFTEICNIVKTRLVRYGMGRVLGRATSTESKRRTRKKTSPRKNSRRSRRSDGRRPTQIPTNQYRKPIQRRTRLPNRQGSSGDRTAFAKSIRLWFMQSIWQIHEESENEDNYYYEAQIAVTGGGSLYRRRFGIFLGMFCIIGDFNYGSDLMMQSSSSDVNEFVINKNYYVFFYGGIVFNVHHVFVSQLDLIEMVWAGGFMLHVICYSSVFFKKTVIFLYKLLLRRY
>C_i00001_3
QINLFTYMTKINKNLISQNEQPGEDEDIDTFSLFTSTIVCSFVVQSSETNMQHSFISIETRDIVQKPLLFKEIKQCLTYRPYRVSTYVWFEYKIDIYTSKRGEIITSAHHLHGLEAFLNSAQDHSFSPSPRHFSGVHLCMPRLEEKVFLVLTPLQLQSLCAQVAGSQIVLPLLVRLHSNHQTLLQQVLSALVLLQLGQPFVPSSQSHKDSACFQKFLPLQGLLHPYHYCPSFLRIDHKHKETYGFANFDIQTNKSAYQLHLQLNLHYTVTLKWSHFLSTQLVGLIGESSKPFSEFGLRREREREREREREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDT
>C_i00001_4
The result file, with the unexpected -- lines:
>C_c04661_3
RKKHFIPFRMCGKTIEISILKQSSFMKMLSMRNELEYFWNATPKSGWLVVFCRYKWHFCDSPIIVLSSNCNLNFGPLVSRFHISHSYVLLQARTKSSTRDFPNVFTLLILNLTVASCWGTRDQTNTPDRWTCFQLFEDHLSTQKATFCPSSLPYSPCSSLNRCDVLFHIIAIQTQSCFQSKTIPSTQTRQSHLSFLFQDCFGKIHCILRRNRNFKSIFTCVPGSGEEAFAIDFQHSEVSKIYLAQIIFHVLKSWSSKRALKSKRSLIMNVPDISTFRFEHFHVLLNVSQILIPASSIDHQVYMIISNLGDHRVINSTTLLICKYSKSSSSIGKRGNISNNKFLKERNAVFSLEAQSTHVDIKKTPVLPTIDSRVHDVFVLNRHAPPSKWHHLSTISNMEIIKNSLFELRFRRIISTSYRLLISLCKAAGDLAKLPTPSHFQYKKYRLLRIVQYCILGFSEKCKSESPKDHIPPRKRLAFIYMCVYMPLQGTEERCIQICVLQKKKK
--
>C_i00001_3
QINLFTYMTKINKNLISQNEQPGEDEDIDTFSLFTSTIVCSFVVQSSETNMQHSFISIETRDIVQKPLLFKEIKQCLTYRPYRVSTYVWFEYKIDIYTSKRGEIITSAHHLHGLEAFLNSAQDHSFSPSPRHFSGVHLCMPRLEEKVFLVLTPLQLQSLCAQVAGSQIVLPLLVRLHSNHQTLLQQVLSALVLLQLGQPFVPSSQSHKDSACFQKFLPLQGLLHPYHYCPSFLRIDHKHKETYGFANFDIQTNKSAYQLHLQLNLHYTVTLKWSHFLSTQLVGLIGESSKPFSEFGLRREREREREREREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDT
--
>C_i00001_6
SIFLIIIYTKRLQFFFKKKPKKSNKFTHVITNPQPKPSLLDLIEKQKHDEHTLFNHHKRKPNNSFSNQIHPNPMKTKTASSNPIHNINRQYKTYPRIFQIFAGTNSHHRLPRSALTLSTSSSCLRSPITLILSGFAKWIATTGGYFWRRRFDRRSYLACSVASSSVGSASINGIDWFKGFEWVNVDRLITSCSSVSFFLGLFSFLFFFLTQLMLSQEPIVPYHTRALSSKRCYRSLTSHGLSLSLSLSLSLSSQSELKWLRRFPYQTHQSLSGQEMAPFQCHCVMIQLQMLVSRFVGLDIEVLSKTVGLFVFMVYSEERRTIMIRVKKPLQRQEFLETCRIFMRLRRHKWLSKLKQYECRKNLLQQSLMVTVKANRHYLGTSNLCAQRSLKLQWSDQKNFLFSPRHTKMHTRKVSWLRREVILRRVECLKASMMMRSNYFPSFTCLILNVNFVLKPDISANSVRAISALFNLLFEQERFLNLYISGLNTNKMLHICFTTLNHLERTHYSGCEQGKCVNVFILSWLFILTDIFIYFCHLISESIYLV
--
>C_i00002_4
VSSFIQKDYSFFLKKNLRRVTNLHMSHETPSPNHLYILRNKNMMNIEHYSTTIKENLIIVFLINNKFINLILRRRLHHQIRSIIKLIANNTKHTQEYSKSSPVQTPTTGYRDLRLLLVLVVVVFVLRLLLFLVDLPNGLHKPQADTFGEGGSIAGATLPVRPRPPLDRLLSTVLIGLRDLSGSTSTVSPPAPPVFSWACFLSCSSFLSCSSPKNLSSHTIPDEPCLLNDVTDLCKPAMASLSLSLSLSLSLSLSLLVHRQNYIYDKPQLYQFSFYSYRARQNMNNNYHFVLCTHVLSIPIPPSPTLLLQFNKRKALHIFSVLIMIIVSFIILVLPLDLLVPGLLIHSLVWILLIFIKIRDIEVRRNLLHLYTRSMLNVTKILQHLHFDCTKIRFRVCIIYYMPMWNLQIFWPKIFNVIVVGDLVWKLCVKNSSFNSPTPSNILFCVSATSSNQGQVEFLHKLNTLSMTINGKIEAAAISSICSTLEDYDTWS
--
>C_i00007_4
LSLSLSLSLFSIRTLKMASTIPLSNPPVTEWRRNGSISMSLCNVNSTANVVGKQISWSGYRSFKQNRRSLCVYGLFGGKKDNNDKGEEAPAKAGIFGNMQNLYETVKKAQMVVQVEAVRVQKELAAAEFDGYCEGELIKVTLSGNQQPVRTEITEAAMELGPEKLSLLVTEAYKDAHQKSVLAMKERMSDLAQSLGMPQGLNVNDALKLFPLFYLSNIKCQFCTQTRHKCLCKGDKLSIVFTLTGEVFELIYLWSQYMNVAYLFHYFEPLRKNTLWMTGKMCQCLHPLLVVHFDLDFYLFLSFNKIIDLFG
--
>C_i00008_3
EREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDTLERERERERERERDFFQILAMEGFDGYKPAMAMVGLQCIYTGLALFTRAA
--
>C_i00009_4
VRTYVRTYVRTYVRTTTTTTTTTTLSLSLSLSLSLSLSLSLSLYLSLSLSHNFLVTLLSVLLLTTTSSSDLAKLYILIMKQVVLKLDFHDDRTKKKIMKTVSGHSGIDSISMDSKDMKLTVTGDIDPVSLVSKLRKLCNAEILSVGPPKAPEKKKEEAKKEEPKKQEPKKDELTELQKIWIAHQNAQMVSRPQPQYFVRSVEEDPNACVICAFIDCCDLPSRDVNFFNVGLGELMEGRLICFILFYFINSFELIIIVCLIFIYNSLFP
--
>C_i00011_2
IHSNKNYHDVRTYFVDLNNLHLNLYRLSNVKFIEYFTKNKRKRIEKIQPISTNPILKQITYFYKNQNPKKRKFKDLNSGFFGRFWFGCRFLRFSRLSLLRQNWVNVGKNTTAGDCNTVKQFPQFLIVPHSQLNVSRVDSSLLVVPGSISGQFQNFSGEVFKNGSVDGSTGTSTLGVSSLLEESSDTTHGKLKSSLDGLSDRLLPVSAFPSSGSLGSSLGFCSFHCNEIWKLFSETIRFEFWREQKFVEFVDLV
--
>C_i00012_5
SQISSNSKLNKLLFPPKLESNRFREKLPNFVSAMETTKSTKGGAKGAGGRKGGDRKKSVTKSVKAGLQFPVGRIARFLKKGRYAQRTGTGAPVYLAAVLEYLAAEVLELAGNAARDNKKTRINPRHVQLAVRNDEELGKLLHGVTIASGGVLPNINPVLLPKKTKSAESEKPATKSPKSPKKAVVFKFPFFWVLVLVEICNLFKNGICTNRLDLFNPFSFVLGKIFNEFYLILSLPVQSIMQIILQIYKICSHIMIIFVGMN
--
>C_i00013_6
NQQRITCFPSISNFFKNSNLFTENLLRSFSNGNYKSNQGRSQGSRRKERRRQEEVGDVRQGWTSVPRGSYRSIPQEGKIRSTYWYRCSRLPCCCSIPRRRGFGVGRKCCSQQEDNQPTRSIGCEERGIREVASRCYNRQRWCSSQHPSFATKEDQVCIETCNQITQISQKSLSLGLISFFLGFGSCRNMFVEWDLYVGSFQSFFFCSWNIQILPNFIMIITCTVNLDANYFTDLQNMFSHHDNFCWNE
Can anybody tell me what I am doing wrong with the syntax, the input file, or the id file?
Thanks
Yes, you can use --no-group-separator to remove these extra lines. Contrary to what is mentioned in the thread http://stackoverflow.com/questions/2168065/how-do-i-get-rid-of-line-separator-when-using-grep-with-context-lines, this is not an undocumented option. It is not described in man grep, but it is described in info grep.
Thanks, it works.
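For reference, here is a minimal sketch of the --no-group-separator suggestion above. It assumes GNU grep and a FASTA file with each sequence on a single line, as in the question. The scratch directory, the toy file contents and the name result_fallback.fa are made up for illustration. The second command is a commonly used fallback for grep builds that lack the option.

```shell
# Scratch directory and toy files (hypothetical stand-ins for id.txt and input.fasta).
demo_dir=$(mktemp -d)
printf '>seq1\nAAAA\n>seq2\nCCCC\n>seq3\nGGGG\n>seq4\nTTTT\n' > "$demo_dir/input.fasta"
printf '>seq1\n>seq3\n' > "$demo_dir/id.txt"

# -A1 prints each matching header plus the one line after it.
# Because seq1 and seq3 are not adjacent, grep would normally print "--"
# between the two groups; --no-group-separator (GNU grep) suppresses that.
grep --no-group-separator -A1 -wFf "$demo_dir/id.txt" "$demo_dir/input.fasta" \
    > "$demo_dir/result.fa"

# Fallback without the option: drop lines that consist only of "--".
grep -A1 -wFf "$demo_dir/id.txt" "$demo_dir/input.fasta" \
    | grep -v '^--$' > "$demo_dir/result_fallback.fa"

cat "$demo_dir/result.fa"
```

Note that -A1 only captures a complete record when the whole sequence sits on one line. A FASTA file with wrapped sequences would need a different approach.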
I have a space character at the end of IDs. I want to get rid of these spaces. I used this sed function:
but it's not working. Any suggestions?
"It's not working" is not that helpful. Perhaps give a description of what behavior you expect and what your command is actually doing. From what I can guess, your issue is that "*$" removes every line in your input file and leaves you with a blank output. You're also using the -i flag, which overwrites your existing file. It also appears you have a backtick (`) instead of a single quote starting your expression. Try matching a space before the end of the line explicitly:
As you said, the issue was *$. I tried the code without the backtick (`) and now it is giving me the desired output. As I was expecting, the code removes all the blank spaces from the id.txt file and stores the result in a new file, idfinal.txt.
Thanks for the help.
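The sed commands exchanged in this part of the thread were not preserved. As a rough sketch only, one way to strip trailing whitespace from id.txt into idfinal.txt might look like the following. The file names come from the thread; the scratch directory and the toy contents are assumptions.

```shell
# Scratch directory and a toy id.txt with trailing spaces on some lines (hypothetical data).
trim_dir=$(mktemp -d)
printf '>C_c04661_3 \n>C_i00001_3  \n>C_i00001_6\n' > "$trim_dir/id.txt"

# Match any run of whitespace anchored at the end of the line and delete it.
# Single quotes (not backticks) keep the shell from interpreting the expression,
# and writing to a new file leaves id.txt untouched (no -i).
sed 's/[[:space:]]*$//' "$trim_dir/id.txt" > "$trim_dir/idfinal.txt"

# "sed -n l" prints each line with a visible "$" at its end, which makes
# any leftover trailing spaces easy to spot.
sed -n 'l' "$trim_dir/idfinal.txt"
```

Trailing spaces matter here because grep -F treats each line of the id file as a literal string, so an ID followed by a space will not match a header that has no space after it.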