Question: replace fasta headers with another name in a text file
Jemo30 wrote:

Hi everyone,

I have a fasta file and a text file with names on each row:

The fasta file looks like this:

>BQG3565;size=525
AGGCTT.....
>BGET752;size=3
TTGCCAG.....and so on

The text file looks like this:

ANT_39
ANT_5676
ANT_3 ... and so on.

I would like to replace each header in the fasta file with the name from the corresponding row of the text file (first header with the first name, and so on). I am a beginner in bioinformatics and was wondering if anyone could help me with this.

Many thanks!

 

perl • 18k views
ADD COMMENTlink modified 6.6 years ago by Kenosis1.2k • written 6.6 years ago by Jemo30
Sukhi Singh10k wrote:

How about this:

# fetch every second line (the sequences); assumes each record is exactly one header line followed by one sequence line
awk 'NR%2==0' fasta.fas > seq.fas

# merge line by line using headers from the text file
paste -d'\n' headerFile.txt seq.fas > output

Or, as a one-liner:

awk 'NR%2==0' fasta.fas | paste -d'\n' headerFile.txt - > output
ADD COMMENTlink modified 12 months ago by _r_am32k • written 6.6 years ago by Sukhi Singh10k

But this assumes sequences span only one line, right?

ADD REPLYlink written 6.6 years ago by dariober11k

Yes, you are right: this will fail if the sequences span multiple lines.

ADD REPLYlink written 6.6 years ago by Sukhi Singh10k
dariober11k wrote:

I haven't tested this at all. It's python, see if it works:

fasta= open('seq.fa')
newnames= open('newnames.txt')
newfasta= open('seqnew.fa', 'w')

for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()
ADD COMMENTlink modified 12 months ago by _r_am32k • written 6.6 years ago by dariober11k
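A slightly more defensive variant of the same idea, as a sketch only. It assumes the names file has one name per row without the leading ">" (as in the question), so it adds the ">" itself, and it complains if the number of names and headers differ. The function name rename_headers is hypothetical, and it collects the output in a list, which is fine for modest files but not for genome-sized ones.

```python
def rename_headers(fasta_lines, new_names):
    """Return the FASTA lines with the i-th header replaced by the i-th name."""
    # strip() also removes a trailing '\r' left over from DOS line endings
    names = [n.strip() for n in new_names if n.strip()]
    out, used = [], 0
    for raw in fasta_lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if used >= len(names):
                raise ValueError("more FASTA headers than replacement names")
            name = names[used]
            used += 1
            # keep the output valid FASTA: headers must start with '>'
            out.append(name if name.startswith(">") else ">" + name)
        elif line:
            out.append(line)        # sequence line, copied unchanged
    if used != len(names):
        raise ValueError("%d replacement names were not used" % (len(names) - used))
    return out


if __name__ == "__main__":
    fasta = [">BQG3565;size=525\n", "AGGCTT\n", ">BGET752;size=3\n", "TTGCCAG\n"]
    names = ["ANT_39\n", "ANT_5676\n"]
    print("\n".join(rename_headers(fasta, names)))
```

With real files you could pass the two open file handles straight in, since both arguments only need to be iterables of lines.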

Hi thanks for the helpful insights. I executed your suggested code by saving it as replace_name.py:

#!/usr/bin/env python

fasta= open('terS_non1.fasta')
newnames= open('terS_name.txt')
newfasta= open('terS_new_non1.fasta', 'w')

for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()

But it doesn't seem to work, with the following error message:

File "replace_name.py", line 3
SyntaxError: Non-ASCII character '\xe2' in file replace_name.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
ADD REPLYlink modified 12 months ago by _r_am32k • written 6.6 years ago by Jemo30

In case anyone ever needs this code: the quotation marks in the pasted script were not ASCII characters, which is what breaks the script.

#!/usr/bin/env python

fasta= open('Galaxy58-[Extract_Genomic_DNA_on_data_46_and_data_37].fasta')
newnames= open('names_for_fasta_file.txt')
newfasta= open('trial_new_non1.fasta', 'w')

for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()

This is the edited version

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by jjrin40
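If you want to check a pasted script for this problem before running it, something like the following sketch may help. It reports every non-ASCII character with its position and swaps the usual curly quotes for plain ones. The '\xe2' in the error message above is the first byte of a curly quote encoded as UTF-8. The names find_non_ascii and straighten_quotes are made up for this example, and the sketch assumes Python 3 and that the script text has already been read into a string.

```python
# Typographic quotes that word processors like to substitute for plain ones.
SMART_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def find_non_ascii(text):
    """Return (line_number, column, character) for every non-ASCII character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for col, ch in enumerate(line, 1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits


def straighten_quotes(text):
    """Replace curly quotes with their plain ASCII equivalents."""
    for fancy, plain in SMART_QUOTES.items():
        text = text.replace(fancy, plain)
    return text


if __name__ == "__main__":
    pasted = "fasta = open(\u2018seq.fa\u2019)\n"
    for lineno, col, ch in find_non_ascii(pasted):
        print("line %d, column %d: U+%04X" % (lineno, col, ord(ch)))
    print(straighten_quotes(pasted), end="")
```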

Here's a modified version of jjrin's code that uses argparse so that you can use flags to indicate an input file, record replacement file, and an output file.

import argparse

parser = argparse.ArgumentParser(description="program that replaces fasta records")
# argparse.FileType opens the file for us; the bare `file` type only exists in Python 2
parser.add_argument("-i", help="input fasta", type=argparse.FileType('r'))
parser.add_argument("-r", help="replacement records file", type=argparse.FileType('r'))
parser.add_argument("-o", help="output file")
args = parser.parse_args()

# each header line is replaced by the next line of the replacement file, in order
with open(args.o, 'w') as newfasta:
    for line in args.i:
        if line.startswith('>'):
            newname = args.r.readline()
            newfasta.write(newname)
        else:
            newfasta.write(line)
ADD REPLYlink written 11 months ago by Digsby10

Here is a modified version that takes a tab-delimited lookup table of headers to replace. It works even if only a subset of the headers needs replacing, and even if those headers appear in a different order than the entries in the lookup table.

[edit] I just tested this on a very large fa file (hg38) and it worked.

# Replace specific headers in a fa file using a custom-made, tab-delimited lookup table.
# Use grep "^>" fasta.fa to help generate that lookup table.
# Code based on the solution from "replace fasta headers with another name in a text file".

# Example lookup table line (the two columns are separated by a tab):
# >old_line<TAB>>new_line

import argparse
import csv

parser = argparse.ArgumentParser(description="program that replaces fasta headers")
# argparse.FileType opens the file for us; the bare `file` type only exists in Python 2
parser.add_argument("-i", help="input fasta", type=argparse.FileType('r'))
parser.add_argument("-l", help="lookup table with replacement header lines")
parser.add_argument("-o", help="output fasta")
args = parser.parse_args()

# load lookup table into dict format: old header -> new header
lookup_dict = {}
with open(args.l) as lookup_handle:
    for entry in csv.reader(lookup_handle, delimiter='\t'):
        if len(entry) >= 2:  # skip blank or malformed rows
            lookup_dict[entry[0]] = entry[1]

# read in the fa line by line and replace the header if it is in the lookup table
with open(args.o, 'w') as newfasta:
    for line in args.i:
        line = line.rstrip("\n")
        if line.startswith('>'):
            # fall back to the original header when it has no entry in the table
            line = lookup_dict.get(line, line)
        newfasta.write(line + "\n")
ADD REPLYlink modified 7 months ago • written 7 months ago by brismiller40
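As an alternative to grep for getting the lookup table started, here is a small Python sketch that builds a template with every header in both columns, so you only have to edit the second column for the headers you want to change. The function name lookup_template is made up, and the tab-delimited two-column layout is assumed from the script above.

```python
def lookup_template(fasta_lines):
    """Return one 'old_header<TAB>old_header' row per FASTA header line."""
    rows = []
    for raw in fasta_lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # second column starts as a copy; edit it by hand to the new header
            rows.append(line + "\t" + line)
    return rows


if __name__ == "__main__":
    demo = [">BQG3565;size=525\n", "AGGCTT\n", ">BGET752;size=3\n", "TTGCCAG\n"]
    print("\n".join(lookup_template(demo)))
```

Rows left unedited are harmless, because they map a header to itself.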

What editor did you use to copy and paste the script? If you used MS Word or similar, the script may contain non-ASCII characters (typographic quotes, for example) that are hard to spot by eye but that Python will reject.

ADD REPLYlink written 6.6 years ago by dariober11k
Kenosis1.2k wrote:

Here's a Perl option:

use strict;
use warnings;

my @arr;

while (<>) {
    chomp;
    push @arr, $_ if length;
    last if eof;
}

while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}

Usage:

perl script.pl textFile fastaFile [>outFile]

The last, optional parameter directs output to a file.

Hope this helps!

ADD COMMENTlink modified 12 months ago by _r_am32k • written 6.6 years ago by Kenosis1.2k

I tried it and it kind of worked, but in the new file the headers are separated from their respective fasta sequences.

My original fasta file would be something like this:

>650_16551;size=22371;
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRDFTTGAVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNLELEKVYWPYFL

>bs5_4497;size=326624;
EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL

And when I execute your code I get something like this:

ANT_1
ANT_2
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRDFTTGAVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNLELEKVYWPYFL

EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL

So the format of the new file is not like a fasta file. Any idea why?

Thanks!!

ADD REPLYlink modified 12 months ago by _r_am32k • written 6.6 years ago by Jemo30

I got your desired output with the data you've just included. However, I'm not sure about your text file's formatting, so I've refactored the code block after the first while loop. Perhaps that will help.

I got the following from both versions:

ANT_1
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGIL
PCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
ELEKVYWPYFL
ANT_2
EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL
ADD REPLYlink modified 12 months ago by _r_am32k • written 6.6 years ago by Kenosis1.2k

Thanks for the prompt reply, I really appreciate your inputs! My txt file contains just a single column with rows of ID names (ANT_1, ANT_2, etc...).

I re-executed your updated code, and for some reason I still get the same output as before.

Thanks!

ADD REPLYlink modified 12 months ago by _r_am32k • written 6.6 years ago by Jemo30

You're most welcome!

I accidentally omitted the name of the perl script in the directions and have fixed the original posting. You should do the following (the last parameter being optional):

perl script.pl textFile fastaFile [>outFile]

My apologies for this oversight.

ADD REPLYlink modified 12 months ago by _r_am32k • written 6.6 years ago by Kenosis1.2k

Dear Kenosis,

It's really weird, because I still get the same output file using your posted code, which I saved as rename.pl:

#!/usr/local/bin/perl 
use strict;
use warnings;

my @arr;

while (<>) {
    chomp;
    push @arr, $_ if length;
    last if eof;
}

while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}

Then I execute my code:

perl rename.pl name.txt seq.fasta > new seq.fasta

Is it the new updated code?

Many thanks for your time!

ADD REPLYlink modified 12 months ago by _r_am32k • written 6.6 years ago by Jemo30

The update only ensures that blank lines are skipped -- just in case any exist.

You have:

perl rename.pl name.txt seq.fasta > new seq.fasta

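Did you mean new_seq.fasta? You do need the underscore in the name. Without it, the shell writes the output to a file called new and passes seq.fasta to the script a second time as an extra input file.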

ADD REPLYlink modified 12 months ago by _r_am32k • written 6.6 years ago by Kenosis1.2k

Thanks for noticing my typo, but yes I made sure to add the underscore in the name of the new file. Using your script as below, the output is still not formatted as it would need to be. Something must be missing.

#!/usr/bin/perl 
use strict;
use warnings;

my @arr;

while (<>) {
    chomp;
    push @arr, $_ if length;
    last if eof;
}

while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}

Here's what the output looks like after executing the script:

Ant_1

Ant_2

Ant_3
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

It would be wonderful if it could have been in this format, instead:

Ant_1
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
Ant_2
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
Ant_3
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

Please let me know if you have any advice or tricks that could improve the output.

Thanks!!

ADD REPLYlink modified 12 months ago by _r_am32k • written 6.6 years ago by Jemo30

This was really helpful for me, as I am very new to bioinformatics. I used the python script to change my fasta file headings. However, I had the same formatting problem initially, and found out that my text file had DOS line endings that were incompatible with the Unix system I was using. View the text file in the terminal with less name.txt: if your list appears as one contiguous line separated by ^M, then it was created using DOS format. I converted it to Unix format by re-saving my text file in TextWrangler with different line-ending settings. Then the script worked perfectly.

ADD REPLYlink modified 12 months ago by _r_am32k • written 6.2 years ago by emily.remnant10
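If you would rather check and fix the line endings from a script than from an editor, a rough Python sketch could look like the one below. It counts the three common line-ending styles in the raw bytes of a file and converts everything to Unix "\n". The function names are made up, the file name name.txt is borrowed from the post above, and the sketch assumes the whole names file fits in memory.

```python
def detect_line_endings(data):
    """Count line-ending styles in a bytes object."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,                     # DOS/Windows: \r\n
        "lf": data.count(b"\n") - crlf,   # Unix: bare \n
        "cr": data.count(b"\r") - crlf,   # old Mac: bare \r
    }


def to_unix(data):
    """Convert \r\n and bare \r line endings to \n."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    # In-memory demo; for a real file, read it with open("name.txt", "rb").read()
    raw = b"ANT_1\r\nANT_2\r\nANT_3\r\n"
    print(detect_line_endings(raw))
    print(to_unix(raw))
```

Reading in binary mode ("rb") matters here, because text mode in Python 3 translates line endings on the way in and would hide the problem.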