I want to assemble Illumina reads from an RNA-seq project. I have a reference genome, so I plan to do a reference-guided transcript assembly with Cufflinks. However, my reads come from a parasite with a complicated mRNA maturation process that includes the addition of a 35-nucleotide mini-exon and poly-A tailing before translation, so I want to remove the 35 mini-exon nucleotides and the poly-A tail from the reads before assembly. How can I do that? I tried trimming with a simple grep, but I have 34 million reads in which the mini-exon may contain insertions or deletions, and grep does not work in those cases. Does anybody know of a Perl or Python script for this? Thanks :-)
Are the 35 nucleotides the same or are they somehow individually derived from every mRNA?
Thanks for the answer. They are the same 35 nucleotides in every mature mRNA; the sequence is called the mini-exon and it is not present in the genome.
That's why I want to remove it from the reads, in order to improve the assembly.
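Since the mini-exon is one fixed sequence, one possible approach is approximate matching: find the best match of the mini-exon in each read while allowing a few mismatches, insertions, or deletions, cut everything up to the end of that match, and then strip a trailing run of A's. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes the mini-exon sits at the 5' end of the transcript, so everything before the match is discarded too. `MINI_EXON` is a made-up placeholder sequence, and the names `find_leader_end` and `trim_read` and the thresholds `max_edits=3` and `min_poly_a=10` are illustrative choices, not values from the thread.

```python
import re

# Placeholder only: substitute the real 35-nt mini-exon of your organism.
MINI_EXON = "ACGTTGCAAGCTTAGGCTAACCGTATTGACCGTAG"


def find_leader_end(read, leader, max_edits=3):
    """Return the read index just past the best approximate match of
    `leader` (mismatches, insertions and deletions each cost 1), or
    None if no match has at most `max_edits` edits."""
    m = len(leader)
    # prev[i]: cost of aligning leader[:i] to a substring of the read
    # ending at the previous read position (match may start anywhere).
    prev = list(range(m + 1))
    best_end, best_cost = None, max_edits + 1
    for j, base in enumerate(read, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: free start position in the read
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (leader[i - 1] != base)
            extra_read_base = prev[i] + 1      # insertion in the read
            missing_read_base = cur[i - 1] + 1  # deletion in the read
            cur[i] = min(substitution, extra_read_base, missing_read_base)
        if cur[m] < best_cost:  # keep the earliest end with the lowest cost
            best_cost, best_end = cur[m], j
        prev = cur
    return best_end


def trim_read(seq, qual, leader=MINI_EXON, max_edits=3, min_poly_a=10):
    """Remove the mini-exon (and anything before it) plus a 3' poly-A run.
    The quality string is cut at the same coordinates as the sequence."""
    start = find_leader_end(seq, leader, max_edits) or 0
    # Exact run of A's at the very end; sequencing errors inside the
    # tail are not tolerated by this simple pattern.
    tail = re.search(r"A{%d,}$" % min_poly_a, seq[start:])
    end = start + tail.start() if tail else len(seq)
    return seq[start:end], qual[start:end]
```

You would call `trim_read(seq, qual)` on each FASTQ record and write out the returned pair. Two caveats: this only finds a full-length mini-exon (a read that starts in the middle of it is left untouched), and pure Python will be slow on 34 million reads, so a dedicated trimmer with error-tolerant adapter matching, such as cutadapt, is probably the more practical choice for the full dataset.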