Question: Compare fasta header with id fasta sequence
0
gravatar for Amy
3.4 years ago by
Amy0
Amy0 wrote:

Hi guys.

I have two files:

  • The first is a tab file. In its first column i have list of location and description of fasta sequence in the 2nd column.
  • The second is a multi fasta file. Some sequences begin with a normal header and others with the location in it. I'd like to compare the two files and replace the "LOC" header in the multi fasta with the location and the corresponding description in the tab file, if they share the same location.

The tab file look like this :

LOC105031928       regulator of telomere elongation helicase 1 

LOC105031929       pathogenesis-related protein 1B-like

In my multi fasta reference I may have some sequences likes:

>LOC105031928
GCTTGCCAGGGTTCCTCGACACCTTGTGCCGAGTCTTCACTATCTCCTTCCACAAGAAGC
TTTCTAGGGTTTCCCAAGAACCCTCATACCTGTCCTCCATCCCATTCGTCGAAAAAATTT
CTAGGGTGTCCTCAAGAATCCCCGTGCCCTCTTCCGAACGAACGGTGCGAAGGTCGAGGG
AAATGCCGATCTACAAGATTAGGGGGATCGATGTGGATTTCCCCTTCGAAGCCTACGATT

I would like to modify this sequence for example and get as output:

>LOC105031928 | regulator of telomere elongation helicase 1 
GCTTGCCAGGGTTCCTCGACACCTTGTGCCGAGTCTTCACTATCTCCTTCCACAAGAAGC
TTTCTAGGGTTTCCCAAGAACCCTCATACCTGTCCTCCATCCCATTCGTCGAAAAAATTT
CTAGGGTGTCCTCAAGAATCCCCGTGCCCTCTTCCGAACGAACGGTGCGAAGGTCGAGGG
AAATGCCGATCTACAAGATTAGGGGGATCGATGTGGATTTCCCCTTCGAAGCCTACGATT

I tried to do it with this code sample

#!/usr/local/perl-5.24.0/bin/perl
use strict;
use warnings;
use strict;
use Data::Dumper;

my $tabfile = $ARGV[0];
my $Reference = $ARGV[1];

open REF, "<", $Reference or die "Cannot open $Reference";
open TAB, "<", $tabfile or die "Cannot open $tabfile";
open RES, ">", "RES.fasta" or die "Cannot open RES.txt";

my %hashref;
while (my $ligne = <REF>){
    if ($ligne =~ /^>LOC/){
        my $reflines = (split (m/>/, $ligne))[1];
        $hashref{$reflines} = $ligne;
    }
}

my ($LOC, $DES, $header);
my @lines = <TAB>;
foreach my $line(@lines){
    while (my ($key, $value) = each %hashref){
    $LOC = (split (m/\t/, $line))[0];
    $DES = (split (m/\t/, $line))[1];
    $header = $LOC."|".$DES;
    if ($LOC !~ $key){
        ## Here is the part where I stuck. I tried printing only $key into RES  
               ## to debug the program but it prints all possible combinations 
                  ## between $line and $key;
    }
}

}

close (TAB);
close (RES);
close(REF);
exit;

Any idea of an outcome. I tried several methods but still can't figure out how to manage both files and edit my multi fasta sequence. Thanks.

rna-seq perl • 1.1k views
ADD COMMENTlink modified 3.4 years ago by shenwei3565.3k • written 3.4 years ago by Amy0
1

This has almost certainly been answered before. @Pierre generally has a list of all answered fasta-related threads handy.

Here is one: modifying fasta header

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by genomax89k

Thank you for the links. I'll take a look.

ADD REPLYlink written 3.4 years ago by Amy0

There have been multiple questions/answers similar to this on biostars.

I tried several methods but still can't figure out how to manage both files and edit my multi fasta sequence.

What have you tried? Please show in your question.

ADD REPLYlink written 3.4 years ago by st.ph.n2.5k

I Just posted a part of my script. Thank you for the python code but unfortunately since the program I'd like to get is part of a huge script written in perl I'm afraid I can't combine both script in the same.

ADD REPLYlink written 3.4 years ago by Amy0

I'd like to get is part of a huge script written in perl

That should be added to the original post.

ADD REPLYlink written 3.4 years ago by genomax89k

Firstly, you have to know how to parse FASTA file in Perl. I wrote a function or this optimized but hard to read one

No need to iterate the hash, just check if a key exists in a hash using exists.

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by shenwei3565.3k
1
gravatar for st.ph.n
3.4 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:

Here's a Python solution:

(1) Make your multi-line fasta a single line fasta. This makes it simpler to code.

#!/usr/bin/env python
import sys

with open(sys.argv[1], 'r') as f:    # open tab-delimited description file
    descript = {}    # Empty dictionary
    for line in f:
        descript[line.strip().split('\t')[0]] = line.strip().split('\t')[1]     # Place the location:description in 'descript' dictionary

with open(sys.argv[2], 'r') as fasta:    # Open fasta file
    seqs = {}    
    for line in fasta:
        if line.startswith(">"):
            seqs[line.strip().split('>')[1]] = next(f).strip()       # Place the header:sequence in 'seqs' dictionary

with open(sys.argv[3], 'w') as out:      # Open output file
    for i in descript:         # For each location in 'descript' dictionary
        if i in seqs:             # If the location is contained in the FASTA header
            out.write(">" + i + '|' + descript[i], '\n', seqs[i])   #Print to the ouput FASTA, a new header, with '> LOCATION | Description, sequence

(2) Save the above as comp_fasta.py, run as python comp_fasta.py descriptions_file FASTA_file output_filename

ADD COMMENTlink written 3.4 years ago by st.ph.n2.5k

LOC|its descriptions All linearized sequence sequences without their header.

ADD REPLYlink written 3.4 years ago by Amy0
0
gravatar for shenwei356
3.4 years ago by
shenwei3565.3k
China
shenwei3565.3k wrote:

A fast and easy solution using SeqKit for people having no time to write a script. SeqKit supports Linux/Windows/OS X, just download, decompress and immediately run.

seqkit replace -k loc2desciption.txt -p "(.+)" -r "\$1 | {kv}" seqs.fa

Explain:

  • -k/--kv-file loc2desciption.txt, a tab-delimited file with first column as key and the second as value.
  • -p/--pattern "(.+)", regular expression to capture the key, here is the whole FASTA header.
  • -r/--replacement "\$1 | {kv}", $1 refers to the 1th matched group in "(.+)", and {kv} is the the value of the key.
ADD COMMENTlink written 3.4 years ago by shenwei3565.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1977 users visited in the last hour