Question: Perl#replace a sequence name in fasta file with another name
0
gravatar for ashaneev07
11 months ago by
ashaneev0710
ashaneev0710 wrote:

i want to replace name of the sequences in fasta file after '>' sign with the protein id within the header line.i wrote a perl script but it only prints the firstline of the sequence itself(given below).

>XP_020088267.1
ATGGCTGATGCTGAGGATATTCAGCCACTCGTCTGTGACAATGGAACTGGAATGGTGAAGGCTGGATTTGCTGGTGATGA
>XP_020087759.1
ATGGTTGTTTCTAGCTCACCTAAAACTGCATATGATGCTTGGAGAATTATTGAGGCATATTTTCTTGATAAGACTGCTTC
>XP_020089204.1
ATGACGGCGACGACTTCTGACGCTATGGAGGAGGATCCGGCGCCGTCTTCTGATATGACGGAGGAGGAGGGGGGCGACCC

Here is the script;

my ($data,$seqline);
$infile=$ARGV[0];
open(IN,"$infile") || die "Can't open input file";
while($data=<IN>)
{
    chomp $data;
    $seqline=<IN>;
    if($data=~s/^>//g)
    {

        my ($protn_id)= (split "=",$data)[4];
        print ">$protn_id\n","$seqline\n";
    }

}
close IN;
sequence • 433 views
ADD COMMENTlink modified 11 months ago by JC8.9k • written 11 months ago by ashaneev0710
1

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 11 months ago by WouterDeCoster41k

Thanks a lot... Please help me on my program

ADD REPLYlink written 11 months ago by ashaneev0710
1

Is there a reason you aren’t just using Bio::Perl?

ADD REPLYlink written 11 months ago by Joe14k

Hii.. I'm a beginner in Perl scripting and scripting in general.Also i'm not familiar with bioperl modules. that's why..

ADD REPLYlink written 11 months ago by ashaneev0710

It's generally easier to use something that already exists, rather than reinventing the wheel.

ADD REPLYlink written 11 months ago by WouterDeCoster41k

I'm biased, but Perl is also not the easiest language to start with as a beginner (consider python).

Can you show us your input data too (just 3 or 4 sequences would do). It's not clear where the headers are coming from (I'm guessing they're already in the fasta and you're just deleting part of the header judging by your perl substitution.

ADD REPLYlink written 11 months ago by Joe14k

Here with the example of my input file..

>lcl|NC_033621.1_cds_XP_020098752.1_1 [gene=LOC109717392] [db_xref=GeneID:109717392] [protein=ervatamin-B-like] [protein_id=XP_020098752.1] [location=join(31068..31412,32757..32880,32987..33597)] [gbkey=CDS]
    ATGGTTGCCACTAAGTTGATGATGACCTCTTTAATCTTAGTTCAACTGTGGGTGCTTATGCCACTGATGGCGTGTGGTAC
    AACGTTAGATCCCATGAGAGAGAGGTATGAACAATGGATTAGCCGATATAGCCGAGTCTACAAGGATAAGAACGAGAAGG
    AGTGGCGCTTTAGGATATATGAATCCAACGTCCAGCTCATCAACATCTTTAATACCATTAGTGAGGAGTACAAGCTCATT
    GACAACAAGTTTGCCGACCTAACTAGTGAGGAGTTCAAGGCCAAGTCTGTTTGCTTAAGGGATCTCCGTAATCATCGTCC
    GCCTTCTCGACAGTCGCAGCAGTAGAAGGCATCAACAAAATTAAGGCGGGTAGATTGGTAG
    GTCAGTAGCAATAGATGCTGGGGGTTTCGCCTTCCAGTTCTACTCAAAGGGCATCTTCACC
    GATTGCCATGAAGCCTTCCTATCCTCTCAAGATAGATTAA
    >lcl|NC_033621.1_cds_XP_020098740.1_2 [gene=LOC109717385] [db_xref=GeneID:109717385] [protein=protein STRICTOSIDINE SYNTHASE-LIKE 3-like] [protein_id=XP_020098740.1] [location=join(38611..38948,39167..39445,40690..40857,40946..41339)] [gbkey=CDS]
    ATGGCGGTGCGGGCAGCTGGTCTGCTGGCAGCGTTGGTTGTGGGTTTGGCCGCGGTGTACTGCGCGATGGACCCCCTACG
    GCTGAGCGCCGTAGCCGACTTCCCGGGCTTCGAGAGCCATCCCGTAGAGCTTCCCCCTTGGTCGGAGCTGCCGGCCGCCA
    GGGACGCCGAGGATCGGCTGCGGAGAGCGGAGATCCGCTTCCTGAACCAGGTGCAGGGCCCCGAGAGCATCGCCTTCGAC
    GTACGGCCCAGAGGGGGAGTTGCTTGAAATTCTGGAAGACCGGCAGGGGAAGGTTGTTAGGGCAGTTAGCGAAGTCGAAG
    AGAAGGATGGGAAGCTTTGGATAGGATCAGTGCTCATGCCATTTATTGCCGTTTATTGA
    >lcl|NC_033621.1_cds_XP_020084954.1_3 [gene=LOC109707836] [db_xref=GeneID:109707836] [protein=uncharacterized protein LOC109707836] [protein_id=XP_020084954.1] [location=complement(44128..44742)] [gbkey=CDS]
    ATGCGAGCTCGTGCAGAGTCGATAATGGGCGGCGATCAGGGGAAGACAGCAATGGCGCAGAGGAAGCTTCTTCTCCGTGG
    TCCGACGGCACTCGCCCAACGGCAGCCGGATGCCGCCTCTGCCTCAAAGCCACTGGGCCGGCGGCGGATAGCGGAGATGG
    CTGGGGAGACGGCAGCGGAGTGTGCGGCCATCTGGTGCTGCTGTCCCTGCGGCCTCCTCAACCTCGTCGTCCTCGCCCTC
    GTCAAACTTCCCGCCGGGCTCGTCATCCGCGCTCTGCGCCGCCAAAGGCGAGTGATGAGGAAGCACAGCAGAGGAGGAGC
    CGCTGCCCTGAGGCTGCTGGGAGGGCCCAAAGGCAAAGGCAGAGGCAGAGGCTCAACTGGCGAAGGCGAAGGCGAAGGCG

and the output should be like the following.. thanks.

   >XP_020086305.1
ATGGTGATGGCCTCAGCTCCATTATCTCGCGGAGAAGATGAAATAGACATAGAACTTCAAGTTCACTGCTTGGAGACAAAAGCGTATAGTGCTGTTTTAAGAGCATTCAATGCTCAAT
CTCAAGATAGACAGCCACGAAGCATGGCAAAACCTGGAGGTTTGCTGGTTGTACGGACATCTAAGAAGTGTTCTGTTCCTTCCGGGACTGATAGCTTGAAACCTAGACCAGATGTAAT
CGAGATCCGTGCAACTGATGAACTTATCAATGAGATCGAGAAATTCTGTCGAGAGAACCCAGACCGAGCTAATGTGAGAGGGCAAAATCAATCCTCAGAAGCTAGAAAGGAACGGACG

>XP_020086314.1
ATGGTGATGGCCTCAGCTCCATTATCTCGCGGAGAAGATGAAATAGACATAGAACTTCAAGTTCACTGCTTGGAGACAAAAGCGTATAGTGCTGTTTTAAGAGCATTCAATGCTCAAT
CAGATGTTCTTTCATGGGAGAAGGCAGAACTTATATCAGATATAAGGAAGGGACTCAGAATTTCTAGTGCTGAGCACAAGGAAATTCTCAGAAAAATTAGTTCTGATGAATTGGTCAA
CTCAAGATAGACAGCCACGAAGCATGGCAAAACCTGGAGGTTTGCTGGTTGTACGGACATCTAAGAAGTGTTCTGTTCCTTCCGGGACTGATAGCTTGAAACCTAGACCAGATGTAAT
CGAGATCCGTGCAACTGATGAACTTATCAATGAGATCGAGAAATTCTGTCGAGAGAACCCAGACCGAGCTAATGTGAGAGGGCAAAATCAATCCTCAGAAGCTAGAAAGGAACGGACG

>XP_020086322.1
ATGGTGATGGCCTCAGCTCCATTATCTCGCGGAGAAGATGAAATAGACATAGAACTTCAAGTTCACTGCTTGGAGACAAAAGCGTATAGTGCTGTTTTAAGAGCATTCAATGCTCAAT
CAGATGTTCTTTCATGGGAGAAGGCAGAACTTATATCAGATATAAGGAAGGGACTCAGAATTTCTAGTGCTGAGCACAAGGAAATTCTCAGAAAAATTAGTTCTGATGAATTGGTCA
ACTCAAGATAGACAGCCACGAAGCATGGCAAAACCTGGAGGTTTGCTGGTTGTACGGACATCTAAGAAGTGTTCTGTTCCTTCCGGGACTGATAGCTTGAAACCTAGACCAGATGTAA
ADD REPLYlink modified 11 months ago • written 11 months ago by ashaneev0710

An example of your input file and how do you want the output could be helpful.

BTW I proficient in Perl and Python, still using primarily Perl ;)

ADD REPLYlink written 11 months ago by JC8.9k
2
gravatar for fishgolden
11 months ago by
fishgolden420
fishgolden420 wrote:

I think the Error is in structures of input fasta file. Your script seems not work fine unless your fasta file is as follows, i guess.

>seq1 something=something=something=something=XP_020088267.1
ATGGCTGATGCTGAGGATATTCAGCCACTCGTCTGTGACAATGGAACTGGAATGGTGAAGGCTGGATTTGCTGGTGATGA
>seq2 something=something=something=something=XP_020087759.1
ATGGTTGTTTCTAGCTCACCTAAAACTGCATATGATGCTTGGAGAATTATTGAGGCATATTTTCTTGATAAGACTGCTTC
>seq3 something=something=something=something=XP_020089204.1
ATGACGGCGACGACTTCTGACGCTATGGAGGAGGATCCGGCGCCGTCTTCTGATATGACGGAGGAGGAGGGGGGCGACCC

Then the following script may be more robust.

my $infile=$ARGV[0];
open(IN,$infile) || die "Can't open input file";
while(my $line=<IN>){
    $line =~ s/[\r\n]//g;
    if($line=~ /^>/){
        my ($protn_id)= (split "=",$line)[4];
        print ">$protn_id\n";
    }else{
        print $line."\n";
    }
}
close IN;

BTW, Perl v5 is an old language. Perl v6 is a very minor language.

If you are a beginner, I recommend python 3 as jrj.healey says.

ADD COMMENTlink written 11 months ago by fishgolden420

Thankyou so much..your script works perfectly. ;) Even though i've merged the multifasta sequence as single sequence,then my program also worked nicely.

ADD REPLYlink written 11 months ago by ashaneev0710
2
gravatar for JC
11 months ago by
JC8.9k
Mexico
JC8.9k wrote:

That can be done using a Perl-one-liner:

$ perl -lne '(m/>.+\[protein_id=(.+?)\]/) ? print ">$1" : print $_' < seqs.fa
>XP_020098752.1
ATGGTTGCCACTAAGTTGATGATGACCTCTTTAATCTTAGTTCAACTGTGGGTGCTTATGCCACTGATGGCGTGTGGTAC
AACGTTAGATCCCATGAGAGAGAGGTATGAACAATGGATTAGCCGATATAGCCGAGTCTACAAGGATAAGAACGAGAAGG
AGTGGCGCTTTAGGATATATGAATCCAACGTCCAGCTCATCAACATCTTTAATACCATTAGTGAGGAGTACAAGCTCATT
GACAACAAGTTTGCCGACCTAACTAGTGAGGAGTTCAAGGCCAAGTCTGTTTGCTTAAGGGATCTCCGTAATCATCGTCC
GCCTTCTCGACAGTCGCAGCAGTAGAAGGCATCAACAAAATTAAGGCGGGTAGATTGGTAG
GTCAGTAGCAATAGATGCTGGGGGTTTCGCCTTCCAGTTCTACTCAAAGGGCATCTTCACC
GATTGCCATGAAGCCTTCCTATCCTCTCAAGATAGATTAA
>XP_020098740.1
ATGGCGGTGCGGGCAGCTGGTCTGCTGGCAGCGTTGGTTGTGGGTTTGGCCGCGGTGTACTGCGCGATGGACCCCCTACG
GCTGAGCGCCGTAGCCGACTTCCCGGGCTTCGAGAGCCATCCCGTAGAGCTTCCCCCTTGGTCGGAGCTGCCGGCCGCCA
GGGACGCCGAGGATCGGCTGCGGAGAGCGGAGATCCGCTTCCTGAACCAGGTGCAGGGCCCCGAGAGCATCGCCTTCGAC
GTACGGCCCAGAGGGGGAGTTGCTTGAAATTCTGGAAGACCGGCAGGGGAAGGTTGTTAGGGCAGTTAGCGAAGTCGAAG
AGAAGGATGGGAAGCTTTGGATAGGATCAGTGCTCATGCCATTTATTGCCGTTTATTGA
>XP_020084954.1
ATGCGAGCTCGTGCAGAGTCGATAATGGGCGGCGATCAGGGGAAGACAGCAATGGCGCAGAGGAAGCTTCTTCTCCGTGG
TCCGACGGCACTCGCCCAACGGCAGCCGGATGCCGCCTCTGCCTCAAAGCCACTGGGCCGGCGGCGGATAGCGGAGATGG
CTGGGGAGACGGCAGCGGAGTGTGCGGCCATCTGGTGCTGCTGTCCCTGCGGCCTCCTCAACCTCGTCGTCCTCGCCCTC
GTCAAACTTCCCGCCGGGCTCGTCATCCGCGCTCTGCGCCGCCAAAGGCGAGTGATGAGGAAGCACAGCAGAGGAGGAGC
CGCTGCCCTGAGGCTGCTGGGAGGGCCCAAAGGCAAAGGCAGAGGCAGAGGCTCAACTGGCGAAGGCGAAGGCGAAGGCG

Explanation:

  • perl -lne activates perl in execute mode and new line read
  • ( condition ) ? exec1 : exec2 this is a if-then-else abbreviated, if condition, then do exec1, otherwise do exec2
  • m/>.+\[protein_id=(.+?)\]/ this is a regular expression evaluation, looks for a line with >, then followed by any chars (.+) and looks for a string [protein_id=, the text after that will be recorded in in a match variable ($1), it is marked as any char but expanded until a ] is seen (.+?)].
  • Finally, if the match exists, it will print a new line as >$1, otherwise will print the line as read
ADD COMMENTlink written 11 months ago by JC8.9k

Thank you.. Glad to know about the perl one-liner.. :)

ADD REPLYlink written 11 months ago by ashaneev0710
0
gravatar for h.mon
11 months ago by
h.mon27k
Brazil
h.mon27k wrote:

You have an if($data=~s/^>//g), what happens when a line does not start at >? You are just missing an else and a print.

ADD COMMENTlink written 11 months ago by h.mon27k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 3237 users visited in the last hour