My project is about mastitis in dairy cows. I am working on miRNA-seq data, and I now want to remove contaminants from it. I know Rfam is the usual reference dataset for this purpose, but I do not know how to get the Rfam dataset as a FASTA file.
I used the following link to get the data:
But there are nearly a thousand files on this page.
Can anyone guide me to solve this problem?
I'm not sure what you mean by "contaminants". If you mean cow RNAs that are not miRNAs, then what you want is a complete reference set of all RNA sequences excluding miRNAs. Each of the files at the link you posted contains the sequences of all RNAs in a given family. I don't know how many of them would be relevant to cow. On the flip side, they also include miRNA sequences, so if you filtered against them, you would also remove your miRNA sequences.
RNAcentral has a more accessible set of RNA reference sequences, but again I don't know whether it includes Bos taurus sequences, and I know that it definitely does include miRNA sequences, which at the very least you'd have to filter out before use.
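As a rough illustration of the "filter the miRNAs out before use" step, here is a minimal Python sketch. It assumes that miRNA records can be recognised from words in the FASTA header (such as "microRNA" or "miR-21"); that is an assumption about how the headers are written, not something guaranteed by any particular database, so check the headers of the file you actually download. The function names and the pattern are hypothetical.

```python
import re

# Assumption: miRNA records can be spotted from the header text alone.
# Adjust this pattern after inspecting the real headers in your file.
MIRNA_PATTERN = re.compile(r"microRNA|miRNA|\bmir-?\d|\blet-7", re.IGNORECASE)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_mirnas(records):
    """Keep only the records whose header does not look like a miRNA."""
    return [(h, s) for h, s in records if not MIRNA_PATTERN.search(h)]


def write_fasta(records, handle):
    """Write (header, sequence) pairs to an open text file handle."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")
```

To use it on a real file, pass an open file handle to `read_fasta` (a file object iterates over its lines) and write the result of `drop_mirnas` back out with `write_fasta`. A header-based filter like this is only as good as the annotation, so it is worth spot-checking what was removed.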
Why do you want to filter out such "contaminants"? If you map to the genomic sequence, you can just count the reads that DO map to miRNAs. Inflation of counts by reads mapping to miRNAs when they should map elsewhere is likely to be minimal. In fact, I don't really think it would be a major problem even if you mapped directly to the miRNA sequences.
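To make "count the reads that DO map to miRNAs" concrete, here is a small Python sketch. It assumes you already have read alignments and miRNA annotations as simple coordinate tuples; the coordinates, names and function names are all made up for illustration, and strand is ignored. In practice a dedicated read-counting tool would normally do this on the alignment file.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def count_mirna_reads(alignments, mirna_annotations):
    """Count reads per annotated miRNA.

    alignments: iterable of (chrom, start, end), one per mapped read.
    mirna_annotations: list of (chrom, start, end, name), one per miRNA.
    Reads that overlap no annotated miRNA are simply not counted.
    """
    counts = Counter()
    for chrom, start, end in alignments:
        for m_chrom, m_start, m_end, name in mirna_annotations:
            if chrom == m_chrom and overlaps(start, end, m_start, m_end):
                counts[name] += 1
                break  # count each read at most once
    return counts
```

Reads from other RNA classes fall outside the miRNA intervals and drop out of the counts on their own, which is why no separate "contaminant" filtering step is needed in this approach.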
We almost always map to RNAcentral and then look at what is and isn't a miRNA post hoc, but then we work almost exclusively in mouse or human.