I have a data set obtained by sequencing material from an Illumina TruSeq® Small RNA Library. I have already extracted the 22 nt read population I am primarily interested in and successfully analyzed these siRNAs. However, the small RNA subset is only a small fraction of my total reads. Looking at the frequency of small RNAs in my different samples, 22 nt reads are overrepresented, with 200K to 600K reads per sample. Still, I have around 20M 50 nt reads for each sample. I have been playing around with assemblies of these 50 nt reads and get interesting results. First I mapped the 50 nt reads to my virus genome (a large DNA virus), then kept the mapped reads for a subsequent assembly, which gave contigs that make sense when compared with the ORFs in the virus. I am hoping to use the contigs to quantify expression of viral genes and thus get more bang for my buck.
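To make the length split above concrete, here is a minimal sketch of tallying read lengths and separating the 22 nt and 50 nt populations. It assumes plain, uncompressed FASTQ with four lines per record; the function names are hypothetical, and it is only an illustration, not necessarily how the filtering was actually done.

```python
from collections import Counter
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def length_distribution(handle):
    """Count how many reads there are of each length."""
    return Counter(len(seq) for _, seq, _ in read_fastq(handle))


def split_by_length(handle, small_len=22, long_len=50):
    """Separate records into the small-RNA-sized and full-length populations.

    The 22 and 50 defaults are taken from the read lengths mentioned above;
    reads of any other length are ignored.
    """
    small, long_reads = [], []
    for record in read_fastq(handle):
        n = len(record[1])
        if n == small_len:
            small.append(record)
        elif n == long_len:
            long_reads.append(record)
    return small, long_reads


if __name__ == "__main__":
    # Tiny made-up example; real input would come from open("sample.fastq").
    demo = "@r1\n" + "A" * 22 + "\n+\n" + "I" * 22 + "\n"
    print(length_distribution(io.StringIO(demo)))
```

A tally like this per sample is a quick way to check whether the 50 nt population is made up of untrimmed reads, meaning inserts longer than the read length, before deciding how to treat them.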
Before I put a lot of work into this analysis, I would like to ask whether there are any biases I should be aware of that might be introduced into my data by working with an Illumina TruSeq® Small RNA Library.
I did search to see whether anyone had answered this in previous posts, without luck. My apologies in advance if I missed something.