Question

Parsing output of Velvet Oases

9.2 years ago

apelin20 ▴ 480

Hello,

Oases outputs multiple isoforms per locus. Does anyone have a script to pick the longest isoform? I don't really care about how many there are, I just want to know what genes are there, and the many isoforms are complicating the job.

Adrian

RNA-Seq velvet oases assembly denovo • 2.1k views

ADD COMMENT • link updated 9.2 years ago by Ram 43k • written 9.2 years ago by apelin20 ▴ 480

score 0 · Answer 1 · 2015-02-14

9.2 years ago

Ram 43k

Use cd-hit-est to cluster the isoforms and pick the representative isoform of each cluster. You can tweak the cluster identity percentage to get optimal output.

ADD COMMENT • link 9.2 years ago by Ram 43k

The isoforms belonging to the same locus are already known from the output, since they all share the same locus number. I will try cd-hit, but it seems to ignore that information.

ADD REPLY • link 9.2 years ago by apelin20 ▴ 480

If sequences from the same locus share any identifying header content, you could write a quick script to group them and pick the longest in each group.

ADD REPLY • link 9.2 years ago by Ram 43k

This is typical Oases header output:

>Locus_1_Transcript_1/1_Confidence_1.000_Length_2277
>Locus_2_Transcript_1/1_Confidence_1.000_Length_422
>Locus_3_Transcript_1/3_Confidence_0.667_Length_1455
>Locus_3_Transcript_2/3_Confidence_0.333_Length_968
>Locus_3_Transcript_3/3_Confidence_0.778_Length_1752
>Locus_4_Transcript_1/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_2/9_Confidence_0.619_Length_3767
>Locus_4_Transcript_3/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_4/9_Confidence_0.571_Length_3771
>Locus_4_Transcript_5/9_Confidence_0.381_Length_2272
>Locus_4_Transcript_6/9_Confidence_0.571_Length_3767
>Locus_4_Transcript_7/9_Confidence_0.524_Length_3713
>Locus_4_Transcript_8/9_Confidence_0.429_Length_4008
>Locus_4_Transcript_9/9_Confidence_0.571_Length_3767
>Locus_5_Transcript_1/23_Confidence_1.000_Length_114
>Locus_5_Transcript_2/23_Confidence_1.000_Length_111
>Locus_5_Transcript_3/23_Confidence_1.000_Length_181
>Locus_5_Transcript_4/23_Confidence_1.000_Length_229
>Locus_5_Transcript_5/23_Confidence_1.000_Length_193
>Locus_5_Transcript_6/23_Confidence_1.000_Length_227
>Locus_5_Transcript_7/23_Confidence_1.000_Length_284
>Locus_5_Transcript_8/23_Confidence_1.000_Length_322
>Locus_5_Transcript_9/23_Confidence_1.000_Length_181
>Locus_5_Transcript_10/23_Confidence_1.000_Length_203
>Locus_5_Transcript_11/23_Confidence_1.000_Length_184
>Locus_5_Transcript_12/23_Confidence_1.000_Length_199
>Locus_5_Transcript_13/23_Confidence_1.000_Length_193
>Locus_5_Transcript_14/23_Confidence_0.419_Length_4097
>Locus_5_Transcript_15/23_Confidence_0.355_Length_4063
>Locus_5_Transcript_16/23_Confidence_0.419_Length_4097

I am not proficient enough in scripting to write something for this. Can it be done with an awk one-liner? For each locus, I just need the transcript with the greatest length.

ADD REPLY • link 9.2 years ago by apelin20 ▴ 480

This should be easy. Split each header on _; then, for each unique value of $2 (the locus number), find the record with the largest $8 (the length) and print its $0 to a file. Finally, use those lines to extract the FASTA records by ID.
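
The same steps can be sketched in Python rather than as an awk one-liner. This is only a rough sketch, assuming every header follows the Locus_N_Transcript_i/n_Confidence_c_Length_L layout shown in the example headers; the function names (read_fasta, longest_per_locus, write_longest_isoforms) are made up for illustration, and it trusts the Length field in the header rather than measuring the sequence.

```python
import re

# Assumed header layout, based only on the example headers in this thread:
#   >Locus_<n>_Transcript_<i>/<total>_Confidence_<c>_Length_<len>
HEADER_RE = re.compile(
    r"^Locus_(\d+)_Transcript_\d+/\d+_Confidence_[\d.]+_Length_(\d+)"
)


def read_fasta(lines):
    """Yield (header without '>', sequence) pairs from an iterable of lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_locus(records):
    """Return {locus number: (header, sequence)} for the longest transcript.

    Length is read from the header's Length field; on ties the first
    transcript seen is kept. Headers that do not match the assumed
    layout are skipped.
    """
    best = {}
    best_len = {}
    for header, seq in records:
        match = HEADER_RE.match(header)
        if match is None:
            continue
        locus, length = int(match.group(1)), int(match.group(2))
        if locus not in best or length > best_len[locus]:
            best[locus] = (header, seq)
            best_len[locus] = length
    return best


def write_longest_isoforms(in_path, out_path):
    """Read a FASTA file and write one record (the longest) per locus."""
    with open(in_path) as handle:
        best = longest_per_locus(read_fasta(handle))
    with open(out_path, "w") as out:
        for locus in sorted(best):
            header, seq = best[locus]
            out.write(f">{header}\n{seq}\n")
```

Something like write_longest_isoforms("transcripts.fa", "longest.fa") would then write one sequence per locus; both file names are placeholders. Each sequence is written on a single line, so re-wrap it if a downstream tool expects fixed-width FASTA.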

ADD REPLY • link 9.2 years ago by Ram 43k