extract the longest isoform from multi fasta file
1
0
Entering edit mode
5.6 years ago
Alex ▴ 50

Hi; I have a fasta file of transcript sequences and some of the transcripts are in multiple isoforms. I want to make a uniq list of the transcripts and choose the longest sequence where a transcript has several isoforms. Original:

PB.1.1|1:94-3818(+)|c14521/f1p1/240(240 is the sequence length)
AATGGGAGAAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAAGTATCCAT GGATTCTTTCAACGAGTGTGATAGAGAACTACAGTCAAATGCTGTTGTTTTTCAACCAAA AAAGGATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCCTCATATTCTTGGCT CCTCTTCAAAAAGAATGAATTCAGTTTTGGAGCTGTTTCATGTTCTGGGCATCAGTAAAA

PB.1.2|1:1699-3803(+)|c70289/f1p2/124(124 is the sequence length)
TTACAGTATTGAATTTGTTATGAAACCAAAGCTTGAGTTTCTGCTAAGAACCATGAAGAA GCCACTTAAAGCAGTTGTAGAATACCCAAGGTACTTCAGTTATTCACTCGAGGGGAAGATTTAC


ideal Unique isoform:

PB.1.1|1:94-3818(+)|c14521/f1p1/214
AATGGGAGAAGGCTGGCATTGAACAAGACTATGTTAGTAGGATGTTGTTGAAGTATCCAT GGATTCTTTCAACGAGTGTGATAGAGAACTACAGTCAAATGCTGTTGTTTTTCAACCAAA AAAGGATTTCCAGTACAGTCCTCGCTATTGCTGTGAAAAGTTGGCCTCATATTCTTGGCT CCTCTTCAAAAAGAATGAATTCAGTTTTGGAGCTGTTTCATGTTCTGGGCATCAGTAAAA


Have any awk scripts or tools to obtain the sequence,Thank you very much

Thanks

Alex

RNA-Seq • 3.5k views
0
Entering edit mode
0
Entering edit mode

Would PB.1.1 be the first isoform, and PB.1.2 be the second for the gene cluster, given your headers? You mentioned awk, but I like to do this with python, and a default dict where the "gene id" is the key, and the values are the sequence lengths. Then you can grab the longest sequence in each value list. Your approach would depend on how big your fasta file is, whether you would store this dict to memory or loop through is sequence and write out.

0
Entering edit mode
5.6 years ago
Farbod ★ 3.3k

Dear Alex, Hi.

You can have a look at perl Script to get the longest isoform for each 'gene' of Trinity team.

And please be aware that : The longest transcript isn't always the best transcript!

Instead you can collect longest ORF for each gene using tools like Transdecoder.

0
Entering edit mode

Thanks ,I have solved it