Compare fasta header with id fasta sequence
2
0
Entering edit mode
7.0 years ago
Amy • 0

Hi guys.

I have two files:

  • The first is a tab file. In its first column i have list of location and description of fasta sequence in the 2nd column.
  • The second is a multi fasta file. Some sequences begin with a normal header and others with the location in it. I'd like to compare the two files and replace the "LOC" header in the multi fasta with the location and the corresponding description in the tab file, if they share the same location.

The tab file look like this :

LOC105031928       regulator of telomere elongation helicase 1 

LOC105031929       pathogenesis-related protein 1B-like

In my multi fasta reference I may have some sequences likes:

>LOC105031928
GCTTGCCAGGGTTCCTCGACACCTTGTGCCGAGTCTTCACTATCTCCTTCCACAAGAAGC
TTTCTAGGGTTTCCCAAGAACCCTCATACCTGTCCTCCATCCCATTCGTCGAAAAAATTT
CTAGGGTGTCCTCAAGAATCCCCGTGCCCTCTTCCGAACGAACGGTGCGAAGGTCGAGGG
AAATGCCGATCTACAAGATTAGGGGGATCGATGTGGATTTCCCCTTCGAAGCCTACGATT

I would like to modify this sequence for example and get as output:

>LOC105031928 | regulator of telomere elongation helicase 1 
GCTTGCCAGGGTTCCTCGACACCTTGTGCCGAGTCTTCACTATCTCCTTCCACAAGAAGC
TTTCTAGGGTTTCCCAAGAACCCTCATACCTGTCCTCCATCCCATTCGTCGAAAAAATTT
CTAGGGTGTCCTCAAGAATCCCCGTGCCCTCTTCCGAACGAACGGTGCGAAGGTCGAGGG
AAATGCCGATCTACAAGATTAGGGGGATCGATGTGGATTTCCCCTTCGAAGCCTACGATT

I tried to do it with this code sample

#!/usr/local/perl-5.24.0/bin/perl
use strict;
use warnings;
use strict;
use Data::Dumper;

my $tabfile = $ARGV[0];
my $Reference = $ARGV[1];

open REF, "<", $Reference or die "Cannot open $Reference";
open TAB, "<", $tabfile or die "Cannot open $tabfile";
open RES, ">", "RES.fasta" or die "Cannot open RES.txt";

my %hashref;
while (my $ligne = <REF>){
    if ($ligne =~ /^>LOC/){
        my $reflines = (split (m/>/, $ligne))[1];
        $hashref{$reflines} = $ligne;
    }
}

my ($LOC, $DES, $header);
my @lines = <TAB>;
foreach my $line(@lines){
    while (my ($key, $value) = each %hashref){
    $LOC = (split (m/\t/, $line))[0];
    $DES = (split (m/\t/, $line))[1];
    $header = $LOC."|".$DES;
    if ($LOC !~ $key){
        ## Here is the part where I stuck. I tried printing only $key into RES  
               ## to debug the program but it prints all possible combinations 
                  ## between $line and $key;
    }
}

}

close (TAB);
close (RES);
close(REF);
exit;

Any idea of an outcome. I tried several methods but still can't figure out how to manage both files and edit my multi fasta sequence. Thanks.

RNA-Seq perl • 2.2k views
ADD COMMENT
1
Entering edit mode

This has almost certainly been answered before. @Pierre generally has a list of all answered fasta-related threads handy.

Here is one: modifying fasta header

ADD REPLY
0
Entering edit mode

Thank you for the links. I'll take a look.

ADD REPLY
0
Entering edit mode

There have been multiple questions/answers similar to this on biostars.

I tried several methods but still can't figure out how to manage both files and edit my multi fasta sequence.

What have you tried? Please show in your question.

ADD REPLY
0
Entering edit mode

I Just posted a part of my script. Thank you for the python code but unfortunately since the program I'd like to get is part of a huge script written in perl I'm afraid I can't combine both script in the same.

ADD REPLY
0
Entering edit mode

I'd like to get is part of a huge script written in perl

That should be added to the original post.

ADD REPLY
0
Entering edit mode

Firstly, you have to know how to parse FASTA file in Perl. I wrote a function or this optimized but hard to read one

No need to iterate the hash, just check if a key exists in a hash using exists.

ADD REPLY
1
Entering edit mode
7.0 years ago
st.ph.n ★ 2.7k

Here's a Python solution:

(1) Make your multi-line fasta a single line fasta. This makes it simpler to code.

#!/usr/bin/env python
import sys

with open(sys.argv[1], 'r') as f:    # open tab-delimited description file
    descript = {}    # Empty dictionary
    for line in f:
        descript[line.strip().split('\t')[0]] = line.strip().split('\t')[1]     # Place the location:description in 'descript' dictionary

with open(sys.argv[2], 'r') as fasta:    # Open fasta file
    seqs = {}    
    for line in fasta:
        if line.startswith(">"):
            seqs[line.strip().split('>')[1]] = next(f).strip()       # Place the header:sequence in 'seqs' dictionary

with open(sys.argv[3], 'w') as out:      # Open output file
    for i in descript:         # For each location in 'descript' dictionary
        if i in seqs:             # If the location is contained in the FASTA header
            out.write(">" + i + '|' + descript[i], '\n', seqs[i])   #Print to the ouput FASTA, a new header, with '> LOCATION | Description, sequence

(2) Save the above as comp_fasta.py, run as python comp_fasta.py descriptions_file FASTA_file output_filename

ADD COMMENT
0
Entering edit mode

LOC|its descriptions All linearized sequence sequences without their header.

ADD REPLY
0
Entering edit mode
7.0 years ago

A fast and easy solution using SeqKit for people having no time to write a script. SeqKit supports Linux/Windows/OS X, just download, decompress and immediately run.

seqkit replace -k loc2desciption.txt -p "(.+)" -r "\$1 | {kv}" seqs.fa

Explain:

  • -k/--kv-file loc2desciption.txt, a tab-delimited file with first column as key and the second as value.
  • -p/--pattern "(.+)", regular expression to capture the key, here is the whole FASTA header.
  • -r/--replacement "\$1 | {kv}", $1 refers to the 1th matched group in "(.+)", and {kv} is the the value of the key.
ADD COMMENT

Login before adding your answer.

Traffic: 2826 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6