Parsing output of Velvet Oases
9.2 years ago
apelin20 ▴ 480

Hello,

Oases outputs multiple isoforms per locus. Does anyone have a script to pick the longest isoform? I don't really care about how many there are, I just want to know what genes are there, and the many isoforms are complicating the job.

Adrian

RNA-Seq velvet oases assembly denovo • 2.1k views
9.2 years ago
Ram 43k

Use cd-hit-est to cluster the isoforms and pick the representative isoform of each cluster. You can tweak the cluster identity percentage to get optimal output.


The isoforms belonging to the same locus are already known from the output, since they all share the same locus number. I will try using cd-hit-est, but it seems to ignore that information.


If sequences from the same locus share identifying header content, you could write a quick script to group them by that content and pick the longest one in each group.


This is typical Oases header output:

>Locus_1_Transcript_1/1_Confidence_1.000_Length_2277
>Locus_2_Transcript_1/1_Confidence_1.000_Length_422
>Locus_3_Transcript_1/3_Confidence_0.667_Length_1455
>Locus_3_Transcript_2/3_Confidence_0.333_Length_968
>Locus_3_Transcript_3/3_Confidence_0.778_Length_1752
>Locus_4_Transcript_1/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_2/9_Confidence_0.619_Length_3767
>Locus_4_Transcript_3/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_4/9_Confidence_0.571_Length_3771
>Locus_4_Transcript_5/9_Confidence_0.381_Length_2272
>Locus_4_Transcript_6/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_7/9_Confidence_0.524_Length_3713
>Locus_4_Transcript_8/9_Confidence_0.429_Length_4008
>Locus_4_Transcript_9/9_Confidence_0.571_Length_3767
>Locus_5_Transcript_1/23_Confidence_1.000_Length_114
>Locus_5_Transcript_2/23_Confidence_1.000_Length_111
>Locus_5_Transcript_3/23_Confidence_1.000_Length_181
>Locus_5_Transcript_4/23_Confidence_1.000_Length_229
>Locus_5_Transcript_5/23_Confidence_1.000_Length_193
>Locus_5_Transcript_6/23_Confidence_1.000_Length_227
>Locus_5_Transcript_7/23_Confidence_1.000_Length_284
>Locus_5_Transcript_8/23_Confidence_1.000_Length_322
>Locus_5_Transcript_9/23_Confidence_1.000_Length_181
>Locus_5_Transcript_10/23_Confidence_1.000_Length_203
>Locus_5_Transcript_11/23_Confidence_1.000_Length_184
>Locus_5_Transcript_12/23_Confidence_1.000_Length_199
>Locus_5_Transcript_13/23_Confidence_1.000_Length_193
>Locus_5_Transcript_14/23_Confidence_0.419_Length_4097
>Locus_5_Transcript_15/23_Confidence_0.355_Length_4063
>Locus_5_Transcript_16/23_Confidence_0.419_Length_4097

I am not proficient enough in scripting to write something for this. Can it be done with an awk one-liner? For each locus, I just need the transcript with the greatest length.


This should be easy. Split each header on _; then, for each unique value of $2 (the locus number), find the record with the largest $8 (the length) and print its $0 to a file. Then use those lines to extract the matching FASTA records by ID.
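
For anyone who prefers a script to an awk one-liner, here is a minimal Python sketch of the same idea. It assumes every header follows the layout shown in the example above (>Locus_N_Transcript_i/total_Confidence_c_Length_len) and trusts the Length value in the header rather than measuring the sequence. The function names (read_fasta, longest_per_locus) and the script name used below are made up for illustration; they are not part of Oases.

```python
import re
import sys

# Assumed header layout, copied from the example headers in this thread:
# >Locus_<n>_Transcript_<i>/<total>_Confidence_<c>_Length_<len>
HEADER_RE = re.compile(
    r"^>Locus_(\d+)_Transcript_\d+/\d+_Confidence_[\d.]+_Length_(\d+)"
)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def longest_per_locus(records):
    """Return {locus: (length, header, sequence)} for the longest transcript.

    When several transcripts of a locus share the maximum length,
    the first one encountered is kept.
    """
    best = {}
    for header, seq in records:
        match = HEADER_RE.match(header)
        if match is None:
            continue  # skip headers that do not follow the assumed layout
        locus, length = int(match.group(1)), int(match.group(2))
        if locus not in best or length > best[locus][0]:
            best[locus] = (length, header, seq)
    return best


def main(path):
    """Print the longest transcript of each locus in FASTA format."""
    with open(path) as handle:
        best = longest_per_locus(read_fasta(handle))
    for locus in sorted(best):
        _, header, seq = best[locus]
        print(header)
        print(seq)


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved as, say, pick_longest.py (a hypothetical name), it could be run as "python pick_longest.py transcripts.fa > longest.fa". Each sequence is written on a single line, and ties such as the several 3767 bp transcripts of Locus_4 are broken by keeping the first one seen.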

