Question

How do you extract longest transcript from fasta file using length and header IDs?

0

Entering edit mode

5.7 years ago

DNAngel ▴ 250

Here is an example file that I made up to practise and it is called Medaka_TEST.fa where the header is GENE_ID | TRANSCRIPT_ID:

$ cat Medaka_TEST.fa
>ENSORLG000000000AA|ENSORLT000000000AD   # This gene has two transcripts (next sequence)
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000AA|ENSORLT000000000AT
TOOSHORT
>ENSORLG000000000CC|ENSORLT000000000AG  # Only one occurrence of this gene
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G   # This gene has two transcripts (next sequence)
IWANTTHISONEKEEPMEPLEASE
>ENSORLG000000000FF|ENSORLT0000000000A
SHORTAGAIN

I have searched a lot and found a post (How to extract the longest isoform from multi fasta file) that created a command that worked nicely to extract the longest transcript when there were multiple transcripts belonging to the same GENE. However, I cannot manipulate it properly when my headers have a two-part name separated by "|". I run into a lot of weird issues.

Here is my modified script after so many trials, but I cannot get it to include IDs:

$ cat Medaka_TEST.fa | \
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;} {printf("%s",$0);} \
END {if(N>0) printf("\n");}' | \ # put sequences on same line as headers
tr "|" "\t" | \ # separate GENE_ID and TRANSCRIPT_ID
awk -F '      ' '{printf("%s\t%d\n",$0,length($3));}' | \ # get the length of the sequence should be third column?
sort -t '     ' -k1,1 -k4,4nr | \ # sort by name, reverse length
sort -t '     ' -k1,1 -u -s | \ # sort on name, unique ID, and keeping it in original order
sed 's/       /|/' | \ # return to original name
cut -f 1,3 | \ # cut header and sequence
tr "\t" "\n" # convert back to fasta

Output:

>ENSORLG000000000AA|ENSORLT000000000AD   # correct!
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000CC|ENSORLT000000000AG    # correct! 
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000A  #wrong! It is grabbing this sequence based on alphabetical order of the second ID (the transcript ID)
SHORTAGAIN

I cannot get it to exclude returning the sequence based on alphabetical order of the TRANSCRIPT_ID. I tested this by changing this third sequence's TRANSCRIPT_ID from ENSORLT0000000000A to ENSORLT0000000000Z, and then it worked by returning the other ENSORLG000000000FF with the longer transcript but this should not happen.

I also think the issue is here in my first awk call:

awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;}

In the second printf statement I have to use multiple "\t" in order to properly separate my ID name and the sequence, otherwise it just overwrites my ID names.

Example, if I use the original poster's command with just one "\t" my names are grossly overwritten (why??):

$ cat Medaka_TEST.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}'
>ENSORLGKEEPTHISONETHISISTHELONGEST0AD
>ENSORLGTOOSHORT0AA|ENSORLT000000000AT
>ENSORLGTHISISFINENODUPLICATES000000AG
>ENSORLGIWANTTHISONEKEEPMEPLEASE00000G
>ENSORLGSHORTAGAINF|ENSORLT0000000000A

But after trial and error for some reason if I put 6 "\t" the name and sequence is fine:

$cat Medaka_TEST.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t\t\t\t\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}'
>ENSORLG000000000AA|ENSORLT000000000AD  KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000AA|ENSORLT000000000AT  TOOSHORT
>ENSORLG000000000CC|ENSORLT000000000AG  THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G  IWANTTHISONEKEEPMEPLEASE
>ENSORLG000000000FF|ENSORLT0000000000A  SHORTAGAIN

I've tried different combinations as I think the way my names are split are causing the script to not obtain proper lengths?

Here is my desired output if I can fix my commands to obtain longest transcripts when GENE_IDs have multiple occurrences.

>ENSORLG000000000AA|ENSORLT000000000AD
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000CC|ENSORLT000000000AG
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G
IWANTTHISONEKEEPMEPLEASE

awk fasta • 2.9k views

ADD COMMENT • link updated 5.7 years ago by JC 13k • written 5.7 years ago by DNAngel ▴ 250

score 2 · Accepted Answer · 2019-10-17

The commands are complex to control, imho it's better if you put that logic in a script:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs =();
my ($g, $t);
while (<>) {
    chomp;
    if (/>/) {
        ($g, $t) = split (/\|/, $_);
    }
    else {
        $seqs{$g}{$t} .= $_;
    }
}

foreach $g (sort keys %seqs) { # iterate over genes
    my $longest = -1;
    my $fasta = '';
    foreach $t (keys %{ $seqs{$g} }) { # iterate over transcripts
        my $seq = $seqs{$g}{$t};
        if (length $seq > $longest) {
           $fasta = "$g|$t\n$seq\n";
        }
    }
    print $fasta;
}

Output example

$ perl getLongest.pl < file.fa
>ENSORLG000000000AA|ENSORLT000000000AD
KEEPTHISONETHISISTHELONGEST
>ENSORLG000000000CC|ENSORLT000000000AG
THISISFINENODUPLICATES
>ENSORLG000000000FF|ENSORLT0000000000G
IWANTTHISONEKEEPMEPLEASE