Question: How to extract the longest isoform from multi fasta file
1
gravatar for jon.brate
4.7 years ago by
jon.brate200
Norway
jon.brate200 wrote:

Hi,

I have a fasta file of transcript sequences and some of the transcripts are in multiple isoforms. I want to make a uniq list of the transcripts and choose the longest sequence where a transcript has several isoforms.

Like this:

Original:

>scitn003313.1
CCCTGGCAATCTAAGCCACTGCCGG
>scitn003313.2
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

 

Unique:

>scitn003313.2
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

 

python isoform extract perl fasta • 5.9k views
ADD COMMENTlink modified 23 months ago by Biostar ♦♦ 20 • written 4.7 years ago by jon.brate200
8
gravatar for Pierre Lindenbaum
4.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:
$ cat input.fa |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' |\ #linearize fasta
tr "." "\t" |\ #extract version from header
awk -F '   '  '{printf("%s\t%d\n",$0,length($3));}' |\ #extact length
sort -t '    ' -k1,1 -k4,4nr |\ #sort on name, inverse length
sort -t '    ' -k1,1 -u -s |\ #sort on name, unique, stable sort (keep previous order)
sed 's/    /./' |\ #restore name
cut -f 1,2 |\ #cut name, sequence
tr "\t" "\n"  |\ #back to fasta
fold -w 60 #pretty fasta
ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Pierre Lindenbaum119k

Thanks Pierre!

I seem to manage up to the sort part, and I get this message: sort: multi-character tab `    '

I tried to read about this but don't understand whats wrong.

ADD REPLYlink written 4.7 years ago by jon.brate200

it's a 'tab' , not some spaces : http://stackoverflow.com/questions/10627989/how-do-i-insert-a-tab-character-in-iterm

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Pierre Lindenbaum119k

I see. It works great. Thanks!

edit: I removed the fold command to avoid line breaks in the sequences

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by jon.brate200

when I type above command one by one I also got this message: sort: multi-character tab ` '.

Please share how did you resolve. I am not able to understand from above link

ADD REPLYlink written 2.1 years ago by Bioinfonext140

Hi Pierre:

Thank you for this answer. I have nearly a identical situation with one key difference, which I will explain below.

The annotation I am working with denotes slightly different nomenclature for protein isoforms.

For the sake of clarity I'll illustrate this using Phillippe's example.

>scitn003313.1a
CCCTGGCAATCTAAGCCACTGCCGG
>scitn003313.1b
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

Where the following represents unique proteins.

>scitn003313.1b
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

Thus, for me to similarly retrieve only unique ids from my fasta file would require tweaking the tr expression. However, it appears splitting with a multi-character delimiter with tr is not straightforward.

How could we modify the your script to obtain a unique set of proteins?

Here's a link to the fasta file in question ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS256.protein.fa.gz

Thanks in advance! - Taylor

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by R. Taylor Raborn290

I don't get the difference with the original question.

it appears splitting with a multi-character delimiter with tr is not straightforward.

what is the multi-character delimiter ?

ADD REPLYlink written 2.2 years ago by Pierre Lindenbaum119k

The difference is the nomenclature in this new file.

Letters following the digit after the period represent different isoforms of the same protein.

e.g.

scitn003313.1a

scitn003313.1b

are different isoforms. Whereas

scitn003313.1a

scitn003313.2

Are entirely different proteins.

I tried to delimit with the period and the number that followed, but that may not be the solution we need here.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by R. Taylor Raborn290

when I type above command one by one I also got this message: sort: multi-character tab ` '.

Please share how did you resolve. I am not able to understand from above link

ADD REPLYlink written 2.1 years ago by Bioinfonext140

of course in sort -t ' ' the argument of the option '-t' is a tabulation....

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Pierre Lindenbaum119k

Hi Pierre,

when I type above command It just mention number of filtered sequences not the fasta file.

[root@psgl mapped_bam]# cat stringtie_transcript.fasta |awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' |tr "." "\t" |awk -F ' ' '{printf("%s\t%d\n",$0,length($3));}' |sort -t ' ' -k1,1 -k4,4nr |sort -t ' ' -k1,1 -u -s |sed 's/ /./' |cut -f 1,2 |tr "\t" "\n" |fold -w 60 > filter_strg.fasta

head filter_strg.fasta

STRG 41115

there were around 77104 sequences including isoforms in the input file.

Please suggest how I can get fasta sequences for these 41115 transcripts.

ADD REPLYlink written 2.1 years ago by Bioinfonext140
4
gravatar for Philippe
4.7 years ago by
Philippe1.9k
Barcelona, Spain.
Philippe1.9k wrote:

A small script using the seqinr library in R. Maybe not the most straightforward but I wrote this quickly.

 

# Loading library seqinr
library(seqinr)

# Loading your input file
fa <- read.fasta("input.fa")

# Getting sequences and gene names from fa object
genes <- sapply(strsplit(names(fa), "\\."), function(v) {return(v[1])})
sequences <- sapply(fa, c2s)

# Extracting longest transcript for each gene
filtered_seq <- tapply(sequences, genes, function(v) {return(v[which(nchar(v)==max(nchar(v)))])})

# Creating an object suitable for the write.fasta function
obj <- tapply(filtered_seq, 1:length(filtered_seq), s2c)

# Writing output file
write.fasta(obj , names(filtered_seq), file="output.fa")

 

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Philippe1.9k

Hi,

The above code works fine. If I want to retain only those sequences which starts with "atg" and ends with "tag or tga ot taa". How can I achieve that with above command?

ADD REPLYlink written 3.6 years ago by EVR510
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 719 users visited in the last hour