Question: Adding Unique Identifiers To Fasta Headers
0
gravatar for bioinfo
5.4 years ago by
bioinfo690
EU
bioinfo690 wrote:

I have 5000 FASTA sequences with Uniprot ids. Now, I want to add a unique identifier at the beginning of each FASTA header. An example will explain you a bit better.

>sp|A12345|ref|COG_REF human ribosomal protein
-------------------------------------------------------------

>tr|B57384|ref|DRF_ERG ribosomal protein 
-------------------------------------------------------------

And so on

I want to add ABC0001 to ABC5000 at the beginning of the fasta header. And the corresponding gene name from my txt file.

gopA  ABC0001 A12345
gopD  ABC0002 B57384
........................
fotR   ABC5000  C12345

Output:

>ABC0001|gopA|sp|A12345|ref|COG_REF human ribosomal protein
    -------------------------------------------------------------

>ABC0002|gopD|tr|B57384|ref|DRF_ERG ribosomal protein 
    -------------------------------------------------------------

    And so on

As I understand, I have to match the uniprot IDs from the txt file to the FASTA sequence file and then grab the ABC ids ( e.g. ABC0001) and Gene name (gopA) and add them at the beginning of the FASTA header.

fasta perl awk • 2.9k views
ADD COMMENTlink modified 5.3 years ago by Biostar ♦♦ 20 • written 5.4 years ago by bioinfo690

What have you tried?

ADD REPLYlink written 5.4 years ago by Kenosis1.2k

I have modified the txt file in a better way

less uniprotid_ABCid_genename.txt| sed 's/\t/|/g'

output>

P87546|ABC0447|yohF
P12345|ABC0448|pogR

Now trying to match –f1 (P87546) and replacing the Uniprot id (P87546) with “P87546|ABC0447|yohF” in FASTA file below. Later I can change the order of the ids in the FASTA header to >ABC0447|yohF|tr|P87546.

 tr|P87546|ref|TRG_DEF ribosomal protein
   ..................................................................

I don’t know any one liner solution that can do it. Should I try in perl by getting each id and then loop through the FASTA sequences followed by matching ID to header. I just added non-unique identifier to each sequence but that’s not what I want. I’m really close, need a bit more time I guess.

less file.fasta | sed 's/^>/>ABC0001|gopT|/g'| less

>ABC0001|gopT|tr|B5LX01|B123451_ABCFF OS=Campylobacter jejuni GN=xxx PE=4 SV=1
...............................................
ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by bioinfo690

Should I try in perl by getting each id and then loop through the FASTA sequences followed by matching ID to header. Yes--you will need to load your gene names first, then apply a substitution to the header. If you don't mind, I'll post a solution candidate...

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by Kenosis1.2k

Join could probably do this. General syntax:

    join -1 1 -2 1 -o 1.3,2.1 <(sort -k1,1 file1) <(sort -k2,2 file2) > output

Would join based on file1 column 1 and file2 column2, and output file1 column 3 file2 column1. You might first have to strip the fasta headers from the fasta file. After making new headers you can replace the old ones with the new ones, again with join.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by 5heikki8.3k
1
gravatar for Kenosis
5.4 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

I think your comment about processing the gene names first, then the fasta file is on target. Given this, consider the following:

use strict;
use warnings;

my $fasta = pop;
my %hash;

while (<>) {
    my @x = split;
    $hash{ $x[2] } = "$x[1]|$x[0]";
}

push @ARGV, $fasta;

while (<>) {
    s/>(\w+)\|(\w+)/>$hash{$2}|$1|$2/;
    print;
}

Usage: perl script.pl geneNames fastaFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

>ABC0001|gopA|sp|A12345|ref|COG_REF human ribosomal protein
-------------------------------------------------------------

>ABC0002|gopD|tr|B57384|ref|DRF_ERG ribosomal protein 
-------------------------------------------------------------

The script first (implicitly) pops the fasta file name off of @ARGV. It then processes the geneName lines, using the ID as a hash key and the other, combined items as the assocciated value. Next, the fasta file's processed, the ID's captured, and a substitution is made using the hash value.

Hope this helps!

Edit: Added |$2 to the substitution.

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Kenosis1.2k
my @x = split;
    $hash{ $x[2] } = "$x[1]|$x[0]";

here, you mean split by "|" of the GeneNames txt file (as P123456|gopA|ABC0001 single column format?) and then hash them as gopA = ABC0001|P123456. I will give a test run.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by bioinfo690

my @x = split; splits your gene name data on whitespace, and "$x[1]|$x[0]" creates a string (the value associated to an ID as the key) that uses the pipe character, like in the header. Thus, the line "gopA ABC0001 A12345" would become { A12345 => 'ABC0001|gopA' }.

In case you didn't notice my Edit remark at the end of the solution, I added |$2 to the substitution, as I had inadvertently left out part of the header.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by Kenosis1.2k

perfect. I realised that the genenames txt file should be in this order.

corT    ABC0447    P12345

I just added the ID ($2) as well to get the exact output.

while (<>) {
    s/>(\w+)\|(\w+)/>$hash{$2}|$1|$2/;
    print;

I really appreciate for your help.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by bioinfo690

You're most welcome, C. Pal!

ADD REPLYlink written 5.4 years ago by Kenosis1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 884 users visited in the last hour