Question: Extract FASTA if two matches
0
gravatar for zoegward
19 months ago by
zoegward60
zoegward60 wrote:

Hi, I have a fasta file and I want to only extract the sequences and the part of the header before and after the '|' (eg P00533 EGFR_HUMAN Epidermal growth factor receptor) if OS=Homo Sapiens. Is this possible quth a quick awk one liner??

    >sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2
    MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
    VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
    VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
    QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
    TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
    VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
    NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
    >sp|P31749|AKT1_HUMAN RAC-alpha serine/threonine-protein kinase OS=Homo sapiens GN=AKT1 PE=1
    MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
    QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
    RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
    >tr|P91634|P91634_DROME PI-3 kinase OS=Drosophila melanogaster GN=Pi3K92E PE=1 SV=1
    MNMMDNRALAYVAHQPKYETPPEEAEPPCMRFSVNLWKNEMLNWVDLICLLPNGFLLELR
    VNPANTIQVIKVEMVNQAKQMPLGYVIKEACEYQVYGISTFNIEPYTDETKRLSEVQPYF
    GILSLGERTDTTSFSSDYELTKMVNGMIGTTFDHNRTHGSPEIDDFRLYMTQTCDNIELE
sequence • 566 views
ADD COMMENTlink modified 19 months ago by cpad011211k • written 19 months ago by zoegward60

Thank you very much Kevin. I'm new to bash, I understand from your code how you separate the line and only print either side of the '|' with the 'split(a[1], b, "|"); print ">"b[2]" "b[3]}'. Can you explain what the bprint is doing??

ADD REPLYlink written 19 months ago by zoegward60

Hey Zoe, could you repost this as a comment to my answer? Just to maintain fluidity of the thread. You can delete this and then re-post above. I have answered your comment already there!

ADD REPLYlink written 19 months ago by Kevin Blighe43k

Awesome, thank you Kevin!

ADD REPLYlink written 19 months ago by zoegward60

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLYlink written 19 months ago by WouterDeCoster39k
0
gravatar for Kevin Blighe
19 months ago by
Kevin Blighe43k
Republic of Ireland
Kevin Blighe43k wrote:

You can try this (assumes your data is in 'test.fasta'):

awk '{if (/^>/ && /OS=Homo sapiens/) {bPrint=1; split($0, a, " OS="); split(a[1], b, "|"); print ">"b[2]" "b[3]}; if (/^>/ && !/OS=Homo sapiens/) {bPrint=0}; if (!/^>/ && bPrint==1) print $0}' test.fasta

>P00533 EGFR_HUMAN Epidermal growth factor receptor
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
>P31749 AKT1_HUMAN RAC-alpha serine/threonine-protein kinase
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
ADD COMMENTlink written 19 months ago by Kevin Blighe43k

Hey Zoe, bPrint is just a boolean flag that I created myself.

I'll break the command into pseudo code:


Matching header

if (/^>/ && /OS=Homo sapiens/) {bPrint=1; split($0, a, " OS="); split(a[1], b, "|"); print ">"b[2]" "b[3]};

If the line begins with '>' AND contains 'OS=Homo sapiens', set bPrint to 'TRUE' and then split/print the header out


Non-matching header

if (/^>/ && !/OS=Homo sapiens/) {bPrint=0};

If the line begins with '>' AND does NOT contain 'OS=Homo sapiens', set bPrint to 'FALSE'


Line of sequence following a matching header

if (!/^>/ && bPrint==1) print $0}'

If the line does NOT begin with '>' AND bPrint is 'TRUE', print the line


If there are empty lines, it will just print them.

ADD REPLYlink modified 19 months ago • written 19 months ago by Kevin Blighe43k
0
gravatar for cpad0112
19 months ago by
cpad011211k
India
cpad011211k wrote:

with seqkit and sed:

$ seqkit grep -nrp  "OS=Homo sapiens" test.fa | sed '/^>/ s/.*|\(P.*\)OS.*/>\1/g'

>P00533|EGFR_HUMAN Epidermal growth factor receptor 
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
>P31749|AKT1_HUMAN RAC-alpha serine/threonine-protein kinase 
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
ADD COMMENTlink modified 19 months ago • written 19 months ago by cpad011211k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1072 users visited in the last hour