Dear all,
My project is about mastitis in dairy cows, and I am working with miRNA-seq data. I now want to remove contaminants from my miRNA sequencing data. I know Rfam is the usual database for this purpose, but I do not know how to get the Rfam dataset as a FASTA file.
I used the following link to get the data:
http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/
But there are nearly a thousand files on this page.
Can anyone guide me to solve this problem?
Best regards,
S. Sharifi
Rfam is a dataset of RNA families. How are you planning to use it to remove contaminants (and what is your definition of contaminant)? If you used a miRNA-specific kit to prepare your libraries, there should only be miRNA in your data.
I know Rfam is an RNA family database and is used to discard other non-coding RNAs (e.g. rRNAs and tRNAs) from the data. The remaining reads will therefore include miRNAs and unmapped sequences, which will be used to search for novel miRNAs.
I'm not sure what you are referring to as "contaminants". If you mean cow RNAs that are not miRNAs, then what you want is a complete reference sequence of all RNAs excluding miRNAs. Each of the files at the link you posted contains the sequences of all RNAs in a given family. I don't know how many of them would be relevant to the cow. On the flip side, they would also include the sequences of miRNAs, so if you filtered against them, you would also remove miRNA sequences.
RNAcentral has a more accessible set of RNA reference sequences - but again I don't know if it includes Bos taurus sequences, and I know that it definitely does include miRNA sequences, which at the very least you'd have to filter out of it before use.
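To make the "filter the miRNAs out first" step concrete, here is a minimal sketch in plain Python. It assumes the reference is an ordinary FASTA file and that miRNA records can be recognised from keywords in the header line. The keyword list, the function names and the example headers are all hypothetical, so check them against the headers in the file you actually download before trusting the result.

```python
import io

# Hypothetical keywords marking a record as miRNA-related.
# Adjust these to match the headers in your own reference file.
MIRNA_KEYWORDS = ("mirna", "microrna", "mir-", "let-7")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_mirna(header, keywords=MIRNA_KEYWORDS):
    """Return True if any keyword occurs in the header (case-insensitive)."""
    lowered = header.lower()
    return any(keyword in lowered for keyword in keywords)


def drop_mirnas(in_handle, out_handle):
    """Copy non-miRNA records to out_handle; return (kept, dropped) counts."""
    kept = dropped = 0
    for header, seq in read_fasta(in_handle):
        if is_mirna(header):
            dropped += 1
            continue
        out_handle.write(f">{header}\n{seq}\n")
        kept += 1
    return kept, dropped


if __name__ == "__main__":
    # Made-up records, only to show the behaviour.
    demo = io.StringIO(
        ">seq1 Bos taurus tRNA-Ala\nGGGG\nCCCC\n"
        ">seq2 Bos taurus bta-mir-21 microRNA precursor\nACGU\n"
    )
    result = io.StringIO()
    print(drop_mirnas(demo, result))
    print(result.getvalue(), end="")
```

Keyword matching on headers is crude: it will miss miRNAs with unusual descriptions and may drop unrelated records whose header happens to contain one of the keywords, so it is worth eyeballing the dropped headers once.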
Why do you want to filter out such "contaminants"? If you map to the genomic sequence, you can just count the reads that DO map to miRNAs. Inflation of counts by reads mapping to miRNAs when they should map elsewhere is likely to be minimal. In fact, I don't really think it would be a major problem even if you mapped directly to the miRNA sequences.
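As a rough illustration of "just count the reads that DO map to miRNAs", the sketch below takes read alignments and miRNA annotations as plain coordinate tuples and tallies overlaps. Everything in it is an assumption made for illustration: the function name, the tuple layout, 0-based half-open coordinates, and the decision to ignore strand. In practice you would get the coordinates from your aligner output and an annotation file, and would probably use an established counting tool rather than this.

```python
from collections import Counter


def count_mirna_reads(reads, mirnas):
    """Count reads overlapping annotated miRNA loci.

    reads:  iterable of (chrom, start, end)
    mirnas: iterable of (name, chrom, start, end)
    Coordinates are assumed 0-based, half-open; strand is ignored.
    Returns (Counter of reads per miRNA, number of reads hitting no miRNA).
    """
    # Group annotations by chromosome so each read is only compared
    # with loci on its own chromosome.
    by_chrom = {}
    for name, chrom, start, end in mirnas:
        by_chrom.setdefault(chrom, []).append((start, end, name))

    counts = Counter()
    other = 0
    for chrom, start, end in reads:
        hits = [
            name
            for locus_start, locus_end, name in by_chrom.get(chrom, [])
            if start < locus_end and locus_start < end
        ]
        if hits:
            for name in hits:
                counts[name] += 1
        else:
            other += 1
    return counts, other


if __name__ == "__main__":
    # Made-up coordinates, only to show the behaviour.
    annotations = [("mirA", "chr1", 100, 172), ("mirB", "chr2", 500, 565)]
    alignments = [("chr1", 110, 132), ("chr2", 490, 505), ("chr3", 10, 32)]
    print(count_mirna_reads(alignments, annotations))
```

Reads that hit no annotated locus are only counted here, not classified, so this says nothing about whether they are novel miRNAs or something else.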
We almost always map to RNAcentral, and then look at what is and isn't a miRNA post hoc, but then we work almost exclusively in mouse or human.
Okay. I now understand why, but I still don't think it's the best idea. Do you have any reason to believe that unmapped sequences are more likely to be novel miRNAs, rather than novel tRNAs or novel snRNAs, or even fragments of novel protein-coding genes?