Question: How do you extract longest transcript from fasta file using length and header IDs?
gravatar for DNAngel
7 months ago by
DNAngel40 wrote:

Here is an example file that I made up to practise and it is called Medaka_TEST.fa where the header is GENE_ID | TRANSCRIPT_ID:

$ cat Medaka_TEST.fa
>ENSORLG000000000AA|ENSORLT000000000AD   # This gene has two transcripts (next sequence)
>ENSORLG000000000CC|ENSORLT000000000AG  # Only one occurrence of this gene
>ENSORLG000000000FF|ENSORLT0000000000G   # This gene has two transcripts (next sequence)

I have searched a lot and found a post (How to extract the longest isoform from multi fasta file) that created a command that worked nicely to extract the longest transcript when there were multiple transcripts belonging to the same GENE. However, I cannot manipulate it properly when my headers have a two-part name separated by "|". I run into a lot of weird issues.

Here is my modified script after so many trials, but I cannot get it to include IDs:

$ cat Medaka_TEST.fa | \
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;} {printf("%s",$0);} \
END {if(N>0) printf("\n");}' | \ # put sequences on same line as headers
tr "|" "\t" | \ # separate GENE_ID and TRANSCRIPT_ID
awk -F '      ' '{printf("%s\t%d\n",$0,length($3));}' | \ # get the length of the sequence should be third column?
sort -t '     ' -k1,1 -k4,4nr | \ # sort by name, reverse length
sort -t '     ' -k1,1 -u -s | \ # sort on name, unique ID, and keeping it in original order
sed 's/       /|/' | \ # return to original name
cut -f 1,3 | \ # cut header and sequence
tr "\t" "\n" # convert back to fasta


>ENSORLG000000000AA|ENSORLT000000000AD   # correct!
>ENSORLG000000000CC|ENSORLT000000000AG    # correct! 
>ENSORLG000000000FF|ENSORLT0000000000A  #wrong! It is grabbing this sequence based on alphabetical order of the second ID (the transcript ID)

I cannot get it to exclude returning the sequence based on alphabetical order of the TRANSCRIPT_ID. I tested this by changing this third sequence's TRANSCRIPT_ID from ENSORLT0000000000A to ENSORLT0000000000Z, and then it worked by returning the other ENSORLG000000000FF with the longer transcript but this should not happen.

I also think the issue is here in my first awk call:

awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;}

In the second printf statement I have to use multiple "\t" in order to properly separate my ID name and the sequence, otherwise it just overwrites my ID names.

Example, if I use the original poster's command with just one "\t" my names are grossly overwritten (why??):

$ cat Medaka_TEST.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}'

But after trial and error for some reason if I put 6 "\t" the name and sequence is fine:

$cat Medaka_TEST.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t\t\t\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}'

I've tried different combinations as I think the way my names are split are causing the script to not obtain proper lengths?

Here is my desired output if I can fix my commands to obtain longest transcripts when GENE_IDs have multiple occurrences.

awk fasta • 301 views
ADD COMMENTlink modified 7 months ago by JC10k • written 7 months ago by DNAngel40
gravatar for JC
7 months ago by
JC10k wrote:

The commands are complex to control, imho it's better if you put that logic in a script:


use strict;
use warnings;

my %seqs =();
my ($g, $t);
while (<>) {
    if (/>/) {
        ($g, $t) = split (/\|/, $_);
    else {
        $seqs{$g}{$t} .= $_;

foreach $g (sort keys %seqs) { # iterate over genes
    my $longest = -1;
    my $fasta = '';
    foreach $t (keys %{ $seqs{$g} }) { # iterate over transcripts
        my $seq = $seqs{$g}{$t};
        if (length $seq > $longest) {
           $fasta = "$g|$t\n$seq\n";
    print $fasta;

Output example

$ perl < file.fa
ADD COMMENTlink written 7 months ago by JC10k

It worked! Thank you :)

ADD REPLYlink modified 7 months ago • written 7 months ago by DNAngel40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1585 users visited in the last hour