Question: Replace FASTA headers with names from a text file
3
5.6 years ago by
Jemo30
United States
Jemo30 wrote:

Hi everyone,

I have a FASTA file and a text file with one name per row.

The fasta file looks like this:

>BQG3565;size=525 

AGGCTT.....

>BGET752;size=3

TTGCCAG.....and so on

The text file looks like this:

ANT_39

ANT_5676

ANT_3 ... and so on. 

I would like to replace each header in the FASTA file with the name on the corresponding row of the text file. I am a beginner in bioinformatics and was wondering if anyone could help me with this.

Many thanks!

 

perl • 15k views
modified 5.6 years ago by Kenosis1.2k • written 5.6 years ago by Jemo30
5
5.6 years ago by
Sukhdeep Singh10.0k
Netherlands
Sukhdeep Singh10.0k wrote:

How about this:

# fetch every alternate line (sequence in our case)
awk 'NR%2==0' fasta.fas > seq.fas

# merge line by line using headers from the text file
paste -d'\n' headerFile.txt seq.fas > output

Or, as a one-liner:

awk 'NR%2==0' fasta.fas | paste -d'\n' headerFile.txt - > output
modified 10 days ago by RamRS25k • written 5.6 years ago by Sukhdeep Singh10.0k

But this assumes sequences span only one line, right?

written 5.6 years ago by dariober10k

Yes, you are right: this will fail if the sequences span multiple lines.

written 5.6 years ago by Sukhdeep Singh10.0k
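
The awk/paste approach above relies on every record occupying exactly two lines. One possible workaround, sketched here in Python, is to join wrapped sequence lines first so that each record becomes one header line followed by one sequence line. The helper name unwrap_fasta is hypothetical and was not posted in the thread; treat this as a sketch rather than a tested tool.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    `lines` is any iterable of strings, e.g. an open file. Blank lines are
    skipped, and anything before the first header is ignored.
    """
    seq_parts = []
    seen_header = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record's sequence before the new header.
            if seen_header:
                yield "".join(seq_parts)
            seq_parts = []
            seen_header = True
            yield line
        elif seen_header:
            seq_parts.append(line)
    # Flush the sequence of the final record.
    if seen_header:
        yield "".join(seq_parts)


wrapped = [">a;size=1\n", "AAGG\n", "CCTT\n", "\n", ">b;size=2\n", "TTGC\n"]
print(list(unwrap_fasta(wrapped)))
# ['>a;size=1', 'AAGGCCTT', '>b;size=2', 'TTGC']
```

Writing these lines to a new file (one per line) should give input that satisfies the two-lines-per-record assumption.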
4
5.6 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

I haven't tested this at all. It's Python; see if it works:

# Replace each FASTA header with the next line from the names file.
with open('seq.fa') as fasta, open('newnames.txt') as newnames, \
        open('seqnew.fa', 'w') as newfasta:
    for line in fasta:
        if line.startswith('>'):
            # Header line: write the next replacement name instead.
            newfasta.write(newnames.readline())
        else:
            # Sequence line: copy it through unchanged.
            newfasta.write(line)
modified 10 days ago by RamRS25k • written 5.6 years ago by dariober10k
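
A more defensive variant of the script above might look like the following sketch. It was not posted in the thread, and the function name rename_headers is hypothetical. It makes three assumptions that the original script does not: a name without a leading ">" gets one added (so the result is still valid FASTA), stray carriage returns and blank name lines are dropped, and a mismatch between the number of headers and the number of names is reported as an error rather than silently producing a broken file.

```python
def rename_headers(fasta_lines, names):
    """Return FASTA lines with each header replaced by the next name.

    Assumptions (not from the thread): names lacking a leading '>' get one
    added, blank name lines are ignored, and a count mismatch between
    headers and names raises ValueError.
    """
    # Strip whitespace and line endings, including a stray '\r' from DOS files.
    clean = [n.strip() for n in names if n.strip()]
    out = []
    used = 0
    for raw in fasta_lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if used >= len(clean):
                raise ValueError("more FASTA headers than names")
            name = clean[used]
            used += 1
            out.append(name if name.startswith(">") else ">" + name)
        elif line:
            # Sequence line: keep it as is (blank lines are dropped).
            out.append(line)
    if used != len(clean):
        raise ValueError("more names than FASTA headers")
    return out


records = [">BQG3565;size=525\n", "AGGCTT\n", ">BGET752;size=3\n", "TTGCCAG\n"]
print(rename_headers(records, ["ANT_39\r\n", "ANT_5676\n"]))
# ['>ANT_39', 'AGGCTT', '>ANT_5676', 'TTGCCAG']
```

To use it on files, pass two open file objects and write "\n".join(result) plus a final newline to the output file.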

Hi, thanks for the helpful insights. I ran your suggested code after saving it as replace_name.py:

#!/usr/bin/env python

fasta= open('terS_non1.fasta')
newnames= open('terS_name.txt')
newfasta= open('terS_new_non1.fasta', 'w')

for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()

But it doesn't work; I get the following error message:

File "replace_name.py", line 3
SyntaxError: Non-ASCII character '\xe2' in file replace_name.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
modified 10 days ago by RamRS25k • written 5.6 years ago by Jemo30

What editor did you use to copy and paste the script? If you used MS Word or similar, the script will contain non-ASCII characters (for example, typographic quotation marks) that look fine to you but that Python cannot parse.

written 5.6 years ago by dariober10k
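
The '\xe2' in the error message fits this explanation: it is the first byte of the UTF-8 encoding of typographic quotes such as U+2018 and U+2019. As a rough sketch (the helper names find_non_ascii and straighten_quotes are made up for illustration, and the mapping covers only the four most common "smart" quotes), one could locate and replace such characters like this:

```python
# Map common "smart" quotes to ASCII equivalents (illustrative, not exhaustive).
SMART_TO_ASCII = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}


def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


def straighten_quotes(text):
    """Replace the smart quotes listed above with plain ASCII quotes."""
    for smart, plain in SMART_TO_ASCII.items():
        text = text.replace(smart, plain)
    return text


# A script pasted through a word processor, with curly quotes on line 3.
pasted = "#!/usr/bin/env python\n\nfasta= open(\u2018seq.fa\u2019)\n"
print([(line, col) for line, col, _ in find_non_ascii(pasted)])
# [(3, 13), (3, 20)]
print(find_non_ascii(straighten_quotes(pasted)))
# []
```

Re-typing the quotes by hand in a plain-text editor achieves the same result.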

In case anyone ever needs this code: the quotation marks in the pasted script weren't ASCII characters, which is what complicates running it.

#!/usr/bin/env python

fasta= open('Galaxy58-[Extract_Genomic_DNA_on_data_46_and_data_37].fasta')
newnames= open('names_for_fasta_file.txt')
newfasta= open('trial_new_non1.fasta', 'w')

for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()

This is the edited version.

modified 2.5 years ago • written 2.5 years ago by jjrin10
1
5.6 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Here's a Perl option:

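use strict;
use warnings;

my @arr;

# First pass: read the new names from the first file on the command line.
while (<>) {
    chomp;
    push @arr, $_ if length;   # skip blank lines
    last if eof;               # stop at the end of the first file
}

# Second pass: print the FASTA file, replacing each header with the next name.
while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}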

Usage:

perl script.pl textFile fastaFile [>outFile]

The last, optional parameter directs output to a file.

Hope this helps!

modified 10 days ago by RamRS25k • written 5.6 years ago by Kenosis1.2k

I tried it and it partly worked, but in the new file the headers are separated from their respective sequences.

My original fasta file would be something like this:

>650_16551;size=22371;
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRDFTTGAVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNLELEKVYWPYFL

>bs5_4497;size=326624;
EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL

And when I execute your code I get something like this:

ANT_1
ANT_2
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRDFTTGAVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNLELEKVYWPYFL

EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL

So the format of the new file is not like a fasta file. Any idea why?

Thanks!!

modified 10 days ago by RamRS25k • written 5.6 years ago by Jemo30

I got your desired output using the datasets you've just included. However, I'm not sure about your text file's formatting, so I've refactored the code block after the first while. Perhaps that will be helpful.

I got the following from both versions:

ANT_1
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGIL
PCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
ELEKVYWPYFL
ANT_2
EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL
modified 10 days ago by RamRS25k • written 5.6 years ago by Kenosis1.2k

Thanks for the prompt reply; I really appreciate your input! My text file contains just a single column of ID names, one per row (ANT_1, ANT_2, etc.).

I re-executed your updated code, and for some reason I still get the same output as before.

Thanks!

modified 10 days ago by RamRS25k • written 5.6 years ago by Jemo30

You're most welcome!

I accidentally omitted the name of the Perl script in the directions and have fixed the original post. You should do the following (the last parameter being optional):

perl script.pl textFile fastaFile [>outFile]

My apologies for this oversight.

modified 10 days ago by RamRS25k • written 5.6 years ago by Kenosis1.2k

Dear Kenosis,

It's really weird, because I still get the same output file using your posted code, which I saved as rename.pl:

#!/usr/local/bin/perl 
use strict;
use warnings;

my @arr;

while (<>) {
    chomp;
    push @arr, $_ if length;
    last if eof;
}

while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}

Then I execute my code:

perl rename.pl name.txt seq.fasta > new seq.fasta

Is this the updated code?

Many thanks for your time!

modified 10 days ago by RamRS25k • written 5.6 years ago by Jemo30

The update only ensures that blank lines are skipped, just in case any exist.

You have:

perl rename.pl name.txt seq.fasta > new seq.fasta

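Did you mean new_seq.fasta? You do need the underscore in the name. With a space, the shell redirects the output to a file called new and passes seq.fasta to the script as an extra input file.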

modified 10 days ago by RamRS25k • written 5.6 years ago by Kenosis1.2k
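
To see why the space in the output name matters, the command line can be tokenized roughly the way a shell would. The sketch below uses Python's shlex.split; the helper split_redirect is hypothetical and only understands a standalone ">" followed by a file name, so it is a simplified illustration and not a real shell parser.

```python
import shlex


def split_redirect(command):
    """Split a simple command line into (arguments, stdout_target).

    Only a standalone '>' token followed by a file name is recognised.
    """
    tokens = shlex.split(command)
    args, target = [], None
    i = 0
    while i < len(tokens):
        if tokens[i] == ">" and i + 1 < len(tokens):
            # The token right after '>' is the redirect target.
            target = tokens[i + 1]
            i += 2
        else:
            args.append(tokens[i])
            i += 1
    return args, target


print(split_redirect("perl rename.pl name.txt seq.fasta > new seq.fasta"))
# (['perl', 'rename.pl', 'name.txt', 'seq.fasta', 'seq.fasta'], 'new')
print(split_redirect("perl rename.pl name.txt seq.fasta > new_seq.fasta"))
# (['perl', 'rename.pl', 'name.txt', 'seq.fasta'], 'new_seq.fasta')
```

In the first case the output lands in a file named new, and the script is given seq.fasta twice.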

Thanks for noticing my typo, but yes I made sure to add the underscore in the name of the new file. Using your script as below, the output is still not formatted as it would need to be. Something must be missing.

#!/usr/bin/perl 
use strict;
use warnings;

my @arr;

while (<>) {
    chomp;
    push @arr, $_ if length;
    last if eof;
}

while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}

Here's what the output looks like after executing the script:

Ant_1

Ant_2

Ant_3
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

It would be wonderful if it were in this format instead:

Ant_1
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
Ant_2
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
Ant_3
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

Please let me know if you have any advice or tricks that could improve the output.

Thanks!!

modified 10 days ago by RamRS25k • written 5.6 years ago by Jemo30
1

This was really helpful for me, as I am very new to bioinformatics. I used the Python script to change my FASTA file headers. However, I initially had the same formatting problem and found that my text file had DOS line endings, which were incompatible with the Unix system I was using. View the text file in a terminal with less name.txt; if your list appears as one contiguous line separated by ^M, it was created in DOS format. I converted it to Unix format by re-saving the text file in TextWrangler with different line-ending settings. Then the script worked perfectly.

modified 10 days ago by RamRS25k • written 5.2 years ago by emily.remnant10
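
For anyone without such an editor, the same check and conversion can be sketched in a few lines of Python. The function names detect_line_endings and to_unix are hypothetical; the sketch works on raw bytes and treats both CRLF and bare carriage returns (either of which can show up as ^M) as line endings to convert.

```python
def detect_line_endings(data):
    """Classify raw bytes as 'dos' (CRLF), 'mac' (bare CR), 'unix' (LF) or 'none'."""
    if b"\r\n" in data:
        return "dos"
    if b"\r" in data:
        return "mac"
    if b"\n" in data:
        return "unix"
    return "none"


def to_unix(data):
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that its '\r' is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


names = b"ANT_1\r\nANT_2\r\n"
print(detect_line_endings(names))
# dos
print(to_unix(names))
# b'ANT_1\nANT_2\n'
```

To apply this to a real file, read it with open(path, 'rb'), convert, and write the result back with open(path, 'wb'); path here stands for whatever your names file is called.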