Question: Extract FASTA if two matches
0
gravatar for zoegward
2.2 years ago by
zoegward90
zoegward90 wrote:

Hi, I have a fasta file and I want to only extract the sequences and the part of the header before and after the '|' (eg P00533 EGFR_HUMAN Epidermal growth factor receptor) if OS=Homo Sapiens. Is this possible quth a quick awk one liner??

    >sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2
    MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
    VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
    VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
    QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
    TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
    VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
    NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
    >sp|P31749|AKT1_HUMAN RAC-alpha serine/threonine-protein kinase OS=Homo sapiens GN=AKT1 PE=1
    MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
    QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
    RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
    >tr|P91634|P91634_DROME PI-3 kinase OS=Drosophila melanogaster GN=Pi3K92E PE=1 SV=1
    MNMMDNRALAYVAHQPKYETPPEEAEPPCMRFSVNLWKNEMLNWVDLICLLPNGFLLELR
    VNPANTIQVIKVEMVNQAKQMPLGYVIKEACEYQVYGISTFNIEPYTDETKRLSEVQPYF
    GILSLGERTDTTSFSSDYELTKMVNGMIGTTFDHNRTHGSPEIDDFRLYMTQTCDNIELE
sequence • 685 views
ADD COMMENTlink modified 2.2 years ago by cpad011212k • written 2.2 years ago by zoegward90

Thank you very much Kevin. I'm new to bash, I understand from your code how you separate the line and only print either side of the '|' with the 'split(a[1], b, "|"); print ">"b[2]" "b[3]}'. Can you explain what the bprint is doing??

ADD REPLYlink written 2.2 years ago by zoegward90

Hey Zoe, could you repost this as a comment to my answer? Just to maintain fluidity of the thread. You can delete this and then re-post above. I have answered your comment already there!

ADD REPLYlink written 2.2 years ago by Kevin Blighe53k

Awesome, thank you Kevin!

ADD REPLYlink written 2.2 years ago by zoegward90

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLYlink written 2.2 years ago by WouterDeCoster42k
0
gravatar for Kevin Blighe
2.2 years ago by
Kevin Blighe53k
Kevin Blighe53k wrote:

You can try this (assumes your data is in 'test.fasta'):

awk '{if (/^>/ && /OS=Homo sapiens/) {bPrint=1; split($0, a, " OS="); split(a[1], b, "|"); print ">"b[2]" "b[3]}; if (/^>/ && !/OS=Homo sapiens/) {bPrint=0}; if (!/^>/ && bPrint==1) print $0}' test.fasta

>P00533 EGFR_HUMAN Epidermal growth factor receptor
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
>P31749 AKT1_HUMAN RAC-alpha serine/threonine-protein kinase
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
ADD COMMENTlink written 2.2 years ago by Kevin Blighe53k

Hey Zoe, bPrint is just a boolean flag that I created myself.

I'll break the command into pseudo code:


Matching header

if (/^>/ && /OS=Homo sapiens/) {bPrint=1; split($0, a, " OS="); split(a[1], b, "|"); print ">"b[2]" "b[3]};

If the line begins with '>' AND contains 'OS=Homo sapiens', set bPrint to 'TRUE' and then split/print the header out


Non-matching header

if (/^>/ && !/OS=Homo sapiens/) {bPrint=0};

If the line begins with '>' AND does NOT contain 'OS=Homo sapiens', set bPrint to 'FALSE'


Line of sequence following a matching header

if (!/^>/ && bPrint==1) print $0}'

If the line does NOT begin with '>' AND bPrint is 'TRUE', print the line


If there are empty lines, it will just print them.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Kevin Blighe53k
0
gravatar for cpad0112
2.2 years ago by
cpad011212k
India
cpad011212k wrote:

with seqkit and sed:

$ seqkit grep -nrp  "OS=Homo sapiens" test.fa | sed '/^>/ s/.*|\(P.*\)OS.*/>\1/g'

>P00533|EGFR_HUMAN Epidermal growth factor receptor 
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
>P31749|AKT1_HUMAN RAC-alpha serine/threonine-protein kinase 
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQC
QLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDF
RSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKI
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by cpad011212k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1102 users visited in the last hour