I am working on human RNA-seq data and am trying to build an rRNA database in order to remove rRNA contamination. I explored several resources: the SILVA rRNA database, the UCSC browser (to get entries with the rRNA gene_type), and Ensembl BioMart. In the SILVA database, I found the following counts of rRNA sequences:
SILVA: 3198; SILVA Ref: none; EMBL: 104; RDP: none
SILVA: 2662 (human + other organisms); SILVA Ref: 1999 (human + other); SILVA Ref NR: 353 (human_ref; NR presumably denotes the non-redundant dataset); Greengenes: none; RDP: none
I downloaded all of these datasets and, after removing duplicate sequences, ended up with approximately 1500 sequences.
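For reference, the duplicate-removal step can be sketched roughly as follows. This is only a minimal illustration, assuming that "duplicate" means an exact sequence match after uppercasing and converting U to T (SILVA exports are often written in the RNA alphabet). The function names and the file handling in `dedupe_files` are hypothetical, not part of any existing tool, and near-identical sequences are not collapsed.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalize(seq: str) -> str:
    """Uppercase and map U to T so RNA and DNA spellings compare equal."""
    return seq.upper().replace("U", "T")


def dedupe(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record for each distinct normalized sequence."""
    seen = {}
    for header, seq in records:
        key = normalize(seq)
        if key and key not in seen:
            seen[key] = header
    return [(header, seq) for seq, header in seen.items()]


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format records as FASTA text with fixed-width sequence lines."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "\n".join(out) + "\n"


def dedupe_files(paths: List[str], out_path: str) -> int:
    """Merge several FASTA files, drop exact duplicates, return the count kept."""
    records: List[Tuple[str, str]] = []
    for path in paths:
        with open(path) as handle:
            records.extend(parse_fasta(handle))
    unique = dedupe(records)
    with open(out_path, "w") as handle:
        handle.write(to_fasta(unique))
    return len(unique)
```

Calling `dedupe_files` with the downloaded FASTA files and an output path (both placeholders here) would write one merged, duplicate-free FASTA file that could then be used as the reference set.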
On the other hand, from the UCSC browser I obtained a list of approximately 560 rRNA sequences.
Can anybody suggest which set I should use for the next step, i.e. constructing the SortMeRNA database to remove rRNA contamination from human RNA-seq data?
I would appreciate any suggestions.