9.1 years ago
apelin20
Hello,
Oases outputs multiple isoforms per locus. Does anyone have a script to pick the longest isoform? I don't really care how many isoforms there are; I just want to know which genes are present, and the many isoforms are complicating the job.
Adrian
The isoforms belonging to the same locus are already known from the output, since they all share the same locus number. I will try using cd-hit, but it seems to ignore that information.
If sequences from the same locus share any identifying header content, you could write a quick script to group them and pick the longest one in each group.
This is typical Oases header output:
>Locus_1_Transcript_1/1_Confidence_1.000_Length_2277
>Locus_2_Transcript_1/1_Confidence_1.000_Length_422
>Locus_3_Transcript_1/3_Confidence_0.667_Length_1455
>Locus_3_Transcript_2/3_Confidence_0.333_Length_968
>Locus_3_Transcript_3/3_Confidence_0.778_Length_1752
>Locus_4_Transcript_1/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_2/9_Confidence_0.619_Length_3767
>Locus_4_Transcript_3/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_4/9_Confidence_0.571_Length_3771
>Locus_4_Transcript_5/9_Confidence_0.381_Length_2272
>Locus_4_Transcript_6/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_7/9_Confidence_0.524_Length_3713
>Locus_4_Transcript_8/9_Confidence_0.429_Length_4008
>Locus_4_Transcript_9/9_Confidence_0.571_Length_3767
>Locus_5_Transcript_1/23_Confidence_1.000_Length_114
>Locus_5_Transcript_2/23_Confidence_1.000_Length_111
>Locus_5_Transcript_3/23_Confidence_1.000_Length_181
>Locus_5_Transcript_4/23_Confidence_1.000_Length_229
>Locus_5_Transcript_5/23_Confidence_1.000_Length_193
>Locus_5_Transcript_6/23_Confidence_1.000_Length_227
>Locus_5_Transcript_7/23_Confidence_1.000_Length_284
>Locus_5_Transcript_8/23_Confidence_1.000_Length_322
>Locus_5_Transcript_9/23_Confidence_1.000_Length_181
>Locus_5_Transcript_10/23_Confidence_1.000_Length_203
>Locus_5_Transcript_11/23_Confidence_1.000_Length_184
>Locus_5_Transcript_12/23_Confidence_1.000_Length_199
>Locus_5_Transcript_13/23_Confidence_1.000_Length_193
>Locus_5_Transcript_14/23_Confidence_0.419_Length_4097
>Locus_5_Transcript_15/23_Confidence_0.355_Length_4063
>Locus_5_Transcript_16/23_Confidence_0.419_Length_4097
I am not proficient enough in scripting to write something for this. Can it be done with an awk one-liner? I just need, for each locus, the transcript with the largest length.
This should be easy. Split each header on _, so that $2 is the locus number and $8 is the length. For each unique value of $2, find the record with the largest $8 and print its $0 to a file. Then use those lines to extract the FASTA records by ID.
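If awk feels awkward, here is a rough Python sketch of the same idea that does the grouping and the FASTA extraction in one pass. It assumes every header looks like the ones posted above (Locus_N_Transcript_i/n_Confidence_x_Length_L) and trusts the Length field in the header rather than measuring the sequence. The function names read_fasta and longest_per_locus are just names I made up for this example, and ties are broken by keeping the first transcript seen.

```python
import re

# Assumed header layout, taken from the example headers in this thread:
#   Locus_<id>_Transcript_<i>/<n>_Confidence_<x>_Length_<len>
HEADER_RE = re.compile(
    r"^>?Locus_(\d+)_Transcript_(\d+)/(\d+)_Confidence_([\d.]+)_Length_(\d+)"
)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The leading '>' is stripped from the header; wrapped sequence
    lines are joined into a single string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def longest_per_locus(records):
    """Return one (header, sequence) pair per locus: the longest transcript.

    Length is read from the header's Length field. Records whose header
    does not match the assumed layout are skipped. On a tie, the first
    transcript encountered is kept. Output is sorted by locus number.
    """
    best = {}  # locus number -> (length, header, sequence)
    for header, seq in records:
        match = HEADER_RE.match(header)
        if match is None:
            continue
        locus = int(match.group(1))
        length = int(match.group(5))
        if locus not in best or length > best[locus][0]:
            best[locus] = (length, header, seq)
    return [(best[k][1], best[k][2]) for k in sorted(best)]


# Example usage (file names are placeholders):
#
# with open("transcripts.fa") as fin, open("longest.fa", "w") as fout:
#     for header, seq in longest_per_locus(read_fasta(fin)):
#         fout.write(f">{header}\n{seq}\n")
```

If you would rather not trust the header, swapping int(match.group(5)) for len(seq) should give the same grouping based on the actual sequence length.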