Hello,
I am new to sequencing data analysis and am trying to run a complete analysis of raw RNA-seq data in order to detect miRNA variability and, if possible, identify novel miRNAs. My data are in Illumina tag count format, and before aligning the reads to the genome I should remove erroneous reads from the raw data. However, I do not know how to identify reads containing erroneous base calls using only the count values. I have read that error-containing reads generated by Illumina contain more adenines than any other base. Is it enough to remove these reads, or should anything else be considered?
The adapters have been removed and the sequences are in the following format:
TTTTTTTTTTTTGTTTTTATGCTTTAGTCTTCTTTG 34
How can one identify the ambiguous base calls in order to remove those sequences?
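To make the question concrete, here is a minimal sketch of the kind of filter I have in mind. It assumes each row is "SEQUENCE COUNT" separated by whitespace, treats any character other than A, C, G, or T (for example N or '.') as an ambiguous call, and drops reads dominated by a single base. The function names and the 0.8 cutoff are my own placeholders, not values taken from any tutorial or tool.

```python
from collections import Counter


def parse_tag_count(lines):
    """Yield (sequence, count) pairs from 'SEQUENCE COUNT' rows."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        seq, count = fields
        yield seq.upper(), int(count)


def has_ambiguous_base(seq):
    """True if the read contains anything other than A, C, G, or T."""
    return any(base not in "ACGT" for base in seq)


def max_base_fraction(seq):
    """Fraction of the read occupied by its most frequent base."""
    return max(Counter(seq).values()) / len(seq)


def filter_tags(lines, max_fraction=0.8):
    """Keep rows with no ambiguous bases and no single dominant base.

    max_fraction is an arbitrary, hypothetical cutoff; it would need
    tuning against real data.
    """
    kept = []
    for seq, count in parse_tag_count(lines):
        if has_ambiguous_base(seq):
            continue
        if max_base_fraction(seq) > max_fraction:
            continue
        kept.append((seq, count))
    return kept
```

Calling filter_tags(open("tags.txt")) on a file of such rows (the file name is only an example) would return the rows that pass both checks. This only looks at the sequence itself, because the count column carries no per-base quality information.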
Thank you!
@ Sean Devis - The adapters have been removed; the rows look something like AAAAAGAGAAAAAAATTGTTTTTCGTGTGTTGTTTT 1. So don't I need to remove the reads containing ambiguous bases before aligning them? Sorry, I have got a bit confused by all these tutorials, and I don't really understand this format. Does the data go from qseq to FASTQ to tag_count? How should this be approached?
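As far as I understand it, the last step (FASTQ to tag count) just collapses identical reads and records how many times each one was seen. A rough sketch of that step, assuming strict four-line FASTQ records with unwrapped sequence lines, would be the following; the function names are mine and not from any particular pipeline.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record.

    Assumes every record is exactly four lines with no line wrapping.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def collapse_to_tag_counts(lines):
    """Return (sequence, count) pairs, most frequent sequence first."""
    counts = Counter(fastq_sequences(lines))
    return counts.most_common()
```

Each returned pair corresponds to one "SEQUENCE COUNT" row like the ones shown above. Note that the quality strings are discarded in this step, which is why I cannot see how to judge base-call errors from the tag count file alone.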