I want to assemble Illumina reads from an RNA-seq project. I have a reference genome, so I plan to do a reference-guided transcript assembly with Cufflinks. However, my reads come from a parasite with a complicated transcript maturation process that includes the insertion of a 35-nucleotide mini-exon and poly-A tailing before the mRNA is translated. I would like to remove the 35-nt mini-exon and the poly-A tail from the reads before assembly. How can I do that? I tried trimming with a simple grep, but I have 34 million reads in which the mini-exon may be present with insertions or deletions, and grep does not work in those cases. Does anybody know of a Perl or Python script for this? Thanks :-)
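You could use BBDuk from the BBMap tools. Add the 35 nucleotides as a separate FASTA entry to the "adapters.fa" file in the "resources" directory. If you expect runs of A's (the poly-A tail) to show up in your reads, you could add an entry for that as well. With ktrim=r, BBDuk trims a read to the right when it encounters one of the sequences from the adapter file: the matching sequence and everything after it are removed. If the mini-exon sits at the 5' end of your reads, ktrim=l (trim to the left) may be the more appropriate mode for that entry.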
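If you would still like a script, as the question asks, here is a minimal Python sketch of the idea. It is not how BBDuk works internally: it uses a plain dynamic-programming approximate match (edit distance, so substitutions, insertions and deletions all count) to find the mini-exon, cuts everything up to and including it, and then strips a trailing run of A's. The `LEADER` sequence, the function names, and the thresholds `max_edits` and `min_len` are all placeholders I made up for illustration; put your real 35-nt mini-exon in `LEADER`. It assumes the full mini-exon is present in the read (partial copies at the very start of a read are not handled) and that reads are in the transcript orientation.

```python
import re

# Hypothetical placeholder; replace with your real 35-nt mini-exon sequence.
LEADER = "AACTAACGCT"


def find_leader_end(read, leader, max_edits):
    """Return the end index (exclusive) in `read` of the best approximate
    occurrence of `leader`, or None if none is within `max_edits` edits.

    The match may start anywhere in the read; substitutions, insertions
    and deletions each cost one edit.
    """
    m = len(leader)
    # prev[i]: cost of aligning leader[:i] to a substring ending at the
    # previous read position. Before any read base, that cost is i.
    prev = list(range(m + 1))
    best_end, best_cost = None, max_edits + 1
    for j, base in enumerate(read, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start at any position
        for i in range(1, m + 1):
            mismatch = 0 if leader[i - 1] == base else 1
            cur[i] = min(
                prev[i - 1] + mismatch,  # match or substitution
                prev[i] + 1,             # extra base in the read
                cur[i - 1] + 1,          # base missing from the read
            )
        if cur[m] < best_cost:  # keep the earliest end with the lowest cost
            best_cost, best_end = cur[m], j
        prev = cur
    return best_end


def trim_poly_a(read, min_len=8):
    """Remove a run of at least `min_len` A's from the 3' end of the read."""
    return re.sub(r"A{%d,}$" % min_len, "", read)


def trim_read(read, leader=LEADER, max_edits=2, min_len=8):
    """Cut the leader (and anything before it), then the poly-A tail."""
    end = find_leader_end(read, leader, max_edits)
    if end is not None:
        read = read[end:]
    return trim_poly_a(read, min_len)


if __name__ == "__main__":
    body = "GGGCCCTTTG"
    exact = LEADER + body + "A" * 12
    one_deletion = "AACTACGCT" + body  # leader with one A missing
    for r in (exact, one_deletion, body):
        print(r, "->", trim_read(r))
```

This is quadratic per read in pure Python, so for 34 million reads a compiled trimmer such as BBDuk is likely to be far more practical; the sketch is mainly useful for checking a subset of reads or for seeing how many edits the mini-exon copies in your data typically carry.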