Question

How to extract the longest isoform from multi fasta file

1

Entering edit mode

9.7 years ago

jon.brate ▴ 290

Hi,

I have a fasta file of transcript sequences and some of the transcripts are in multiple isoforms. I want to make a uniq list of the transcripts and choose the longest sequence where a transcript has several isoforms.

Like this:

Original:

>scitn003313.1
CCCTGGCAATCTAAGCCACTGCCGG
>scitn003313.2
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

Unique:

>scitn003313.2
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

perl extract fasta python isoform • 14k views

ADD COMMENT • link updated 15 days ago by Madeleine • 0 • written 9.7 years ago by jon.brate ▴ 290

5

Entering edit mode

9.7 years ago

Philippe ★ 1.9k

A small script using the seqinr library in R. Maybe not the most straightforward but I wrote this quickly.

# Loading library seqinr
library(seqinr)

# Loading your input file
fa <- read.fasta("input.fa")

# Getting sequences and gene names from fa object
genes <- sapply(strsplit(names(fa), "\\."), function(v) {return(v[1])})
sequences <- sapply(fa, c2s)

# Extracting longest transcript for each gene
filtered_seq <- tapply(sequences, genes, function(v) {return(v[which(nchar(v)==max(nchar(v)))])})

# Creating an object suitable for the write.fasta function
obj <- tapply(filtered_seq, 1:length(filtered_seq), s2c)

# Writing output file
write.fasta(obj , names(filtered_seq), file="output.fa")

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by Philippe ★ 1.9k

0

Entering edit mode

Hi,

The above code works fine. If I want to retain only those sequences which starts with "atg" and ends with "tag or tga ot taa". How can I achieve that with above command?

ADD REPLY • link 8.6 years ago by EVR ▴ 610

0

Entering edit mode

It doesn't work for me at the line

obj <- tapply(filtered_seq, 1:length(filtered_seq), s2c)

ADD REPLY • link 4.2 years ago by lessismore ★ 1.3k

0

Entering edit mode

I had a problem at the same point. This was my inelegant but effective workaround:

# Creating an object suitable for the write.fasta function
obj <- tapply(filtered_seq, 1:length(filtered_seq), c2s)
obj <- tapply(obj, 1:length(obj), s2c)
names(obj) <- names(filtered_seq)
# Writing output file
write.fasta(obj, names(obj), file="./longest.fa")

ADD REPLY • link 15 days ago by Madeleine • 0

0

Entering edit mode

Hi Philippe, I wonder if there is any way of indicate second point and not first one in this part of the script genes <- sapply(strsplit(names(fa), "\."), function(v) {return(v[1])}) The names of my transcript have this format MSTRG.5.1 And I need my gene names like this: MSTRG.5 Best regards!

ADD REPLY • link 3.1 years ago by damigonzalez38 • 0

Ram · Accepted Answer · 2014-07-28

10

Entering edit mode

9.7 years ago

Pierre Lindenbaum 161k

$ cat input.fa |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' |\ #linearize fasta
tr "." "\t" |\ #extract version from header
awk -F '   '  '{printf("%s\t%d\n",$0,length($3));}' |\ #extact length
sort -t '    ' -k1,1 -k4,4nr |\ #sort on name, inverse length
sort -t '    ' -k1,1 -u -s |\ #sort on name, unique, stable sort (keep previous order)
sed 's/    /./' |\ #restore name
cut -f 1,2 |\ #cut name, sequence
tr "\t" "\n"  |\ #back to fasta
fold -w 60 #pretty fasta

ADD COMMENT • link 9.7 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Thanks Pierre!

I seem to manage up to the sort part, and I get this message:

sort: multi-character tab `    '

I tried to read about this but don't understand whats wrong.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by jon.brate ▴ 290

0

Entering edit mode

it's a 'tab' , not some spaces : http://stackoverflow.com/questions/10627989/how-do-i-insert-a-tab-character-in-iterm

ADD REPLY • link 9.7 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I see. It works great. Thanks!

edit: I removed the fold command to avoid line breaks in the sequences

ADD REPLY • link 9.7 years ago by jon.brate ▴ 290

0

Entering edit mode

when I type above command one by one I also got this message: sort: multi-character tab ` '.

Please share how did you resolve. I am not able to understand from above link

ADD REPLY • link 7.1 years ago by Bioinfonext ▴ 460

0

Entering edit mode

Hi Pierre:

Thank you for this answer. I have nearly a identical situation with one key difference, which I will explain below.

The annotation I am working with denotes slightly different nomenclature for protein isoforms.

For the sake of clarity I'll illustrate this using Phillippe's example.

>scitn003313.1a
CCCTGGCAATCTAAGCCACTGCCGG
>scitn003313.1b
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

Where the following represents unique proteins.

>scitn003313.1b
GCAATTGTTACTGTCAAAATGATACAACAAAAAAAAGGTCC
>scitn005976.1
GGCAAAGAAGGAGACAAACCAGCAGGATATACATGAAACCTATAATTGAGCAGAGATTTTA

Thus, for me to similarly retrieve only unique ids from my fasta file would require tweaking the tr expression. However, it appears splitting with a multi-character delimiter with tr is not straightforward.

How could we modify the your script to obtain a unique set of proteins?

Here's a link to the fasta file in question ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS256.protein.fa.gz

Thanks in advance! - Taylor

ADD REPLY • link 7.2 years ago by R. Taylor Raborn ▴ 290

0

Entering edit mode

I don't get the difference with the original question.

it appears splitting with a multi-character delimiter with tr is not straightforward.

what is the multi-character delimiter ?

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

The difference is the nomenclature in this new file.

Letters following the digit after the period represent different isoforms of the same protein.

e.g.

scitn003313.1a

scitn003313.1b

are different isoforms. Whereas

scitn003313.1a

scitn003313.2

Are entirely different proteins.

I tried to delimit with the period and the number that followed, but that may not be the solution we need here.

ADD REPLY • link 7.2 years ago by R. Taylor Raborn ▴ 290

0

Entering edit mode

when I type above command one by one I also got this message: sort: multi-character tab ` '.

Please share how did you resolve. I am not able to understand from above link

ADD REPLY • link 7.1 years ago by Bioinfonext ▴ 460

0

Entering edit mode

of course in sort -t ' ' the argument of the option '-t' is a tabulation....

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre,

when I type above command It just mention number of filtered sequences not the fasta file.

[root@psgl mapped_bam]# cat stringtie_transcript.fasta |awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' |tr "." "\t" |awk -F ' ' '{printf("%s\t%d\n",$0,length($3));}' |sort -t ' ' -k1,1 -k4,4nr |sort -t ' ' -k1,1 -u -s |sed 's/ /./' |cut -f 1,2 |tr "\t" "\n" |fold -w 60 > filter_strg.fasta

head filter_strg.fasta

STRG 41115

there were around 77104 sequences including isoforms in the input file.

Please suggest how I can get fasta sequences for these 41115 transcripts.

ADD REPLY • link 7.1 years ago by Bioinfonext ▴ 460