Dear all,
Recently, I used Trinity to assemble my Illumina reads into a transcriptome and masked the repetitive elements using RepeatMasker.
I then predicted the coding regions using TransDecoder to generate a protein sequence set.
I would like to exclude the protein sequences shorter than 21 amino acids and then identify orthologous gene pairs using OrthoMCL.
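However, after running the scripts orthomclAdjustFasta and orthomclFilterFasta (with the arguments 21 and 20), the sequences shorter than 21 aa are still there. For details, please see the examples below: the first block shows the sequences as produced by TransDecoder, and the block after the ########### separator shows the same sequences after running the OrthoMCL scripts.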
Could you please tell me what I should do?
Any advice would be greatly appreciated.
Thank you very much in advance.
>TRINITY_DN0_c0_g2_i1_orf1 type:internal len:106 KGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXTKSSN >TRINITY_DN2826_c0_g1_i1_orf1 type:3prime_partial len:112 MVRDDHXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX >TRINITY_GG_2915_c1_g1_i1_orf1 type:internal len:133 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEAVEFEAATAEVKIDLRGLEDLLFGTA >TRINITY_GG_3876_c0_g1_i1_orf1 type:internal len:120 RMQASGVQYGMADVSQFMVGRGPSTRVQNIFQVSPSSDHQQQQYSSQTXXXXXXXXXXXXXXXXXXLLRQQEHRKDQMVAAAEKVGEGSAYNSPCKHLEPSPTPAHQAAQAGNISTDKA
###########
>pde|TRINITY_DN0_c0_g2_i1_orf1 KGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXTKSSN >pde|TRINITY_DN2826_c0_g1_i1_orf1 MVRDDHXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX >pde|TRINITY_GG_2915_c1_g1_i1_orf1 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEAVEFEAATAEVKIDLRGLEDLLFGTA >pde|TRINITY_GG_3876_c0_g1_i1_orf1 RMQASGVQYGMADVSQFMVGRGPSTRVQNIFQVSPSSDHQQQQYSSQTXXXXXXXXXXXXXXXXXXLLRQQEHRKDQMVAAAEKVGEGSAYNSPCKHLEPSPTPAHQAAQAGNISTDKA
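To make clear what I am trying to achieve, here is a minimal Python sketch of the filter I have in mind. It rests on an assumption: residues masked as X (and stop symbols written as *) should not count toward the length, so a sequence is kept only if it has at least 21 unmasked amino acids. The function names read_fasta, unmasked_length and filter_by_unmasked_length are hypothetical names of my own and are not part of OrthoMCL or TransDecoder.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unmasked_length(seq):
    """Count residues that are neither masked (X) nor stop symbols (*).

    Assumption: masked positions are written as X, as in the records above.
    """
    return sum(1 for aa in seq.upper() if aa not in "X*")


def filter_by_unmasked_length(records, min_len=21):
    """Keep only records with at least min_len unmasked residues."""
    for header, seq in records:
        if unmasked_length(seq) >= min_len:
            yield header, seq
```

With the protein FASTA file opened as a file object, filter_by_unmasked_length(read_fasta(handle)) would yield only the records that pass this threshold, and these could then be written out and given to orthomclAdjustFasta.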