Question

Cleaning/filtering RNA seq reads from novel species

3.7 years ago

Dunois ★ 2.5k

Hi all,

I am working with RNA seq reads from some novel eukaryotes from diverse phyla. I want to assemble the reads de novo but I'd like to clean the reads first.

I've checked the initial FastQC reports, and that GC content plot everyone refers to is multimodal for most of my samples.
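To look at the same per-read GC signal outside FastQC, a minimal Python sketch could bin reads by GC percentage; more than one clear peak in the histogram would be consistent with reads from more than one source. The function names and the 5% bin width are assumptions made for illustration and are not taken from FastQC.

```python
from collections import Counter
from typing import Iterable, Iterator


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of every record holds the bases
            yield line.strip().upper()


def gc_percent(seq: str) -> float:
    """GC percentage of one read, ignoring ambiguous bases such as N."""
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    return 100.0 * gc / called if called else 0.0


def gc_histogram(seqs: Iterable[str], bin_width: int = 5) -> Counter:
    """Count reads per GC bin; keys are the lower edge of each bin."""
    hist = Counter()
    for seq in seqs:
        lower = int(gc_percent(seq) // bin_width) * bin_width
        # Put reads with 100% GC into the top bin instead of a bin of their own.
        hist[min(lower, 100 - bin_width)] += 1
    return hist
```

With a file handle opened on an uncompressed FASTQ file, `gc_histogram(read_fastq_sequences(handle))` returns the counts per bin, which can then be printed or plotted.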

I think I could filter out rRNA (if any is present) first using SortMeRNA.

But I'm a bit lost as to what to do next.

A lot of posts that I read here on BioStars suggest using BBSplit to map the reads against multiple references to bin them. I guess, ideally, in this case, I'd have to have a reference genome to bin against, or have some idea of what the contaminant(s) could be (so that I could use their genomes as references). But the problem is none of my organisms have a sequenced genome available. I don't know what contaminants could be present in them either (if it helps, these are all marine protists).

One idea I had in this regard is to assemble the transcriptomes first, and then BLAST them against their nearest available phylogenetic neighbors. Then I'd retain everything that has a significant match, and throw away the rest as "contaminants". The problem is, I don't know if I should BLAST against transcriptomes (fraught with lack of overlap due to different expression profiles), genomes (I have no clue how to even implement this), or proteins (the protein set available might not be complete, so I'd stand to lose novel sequences).

So my question is: what is the best way to filter/clean RNA seq reads in a situation like this? Are there any general guidelines or recommendations that I could follow, perhaps? Does anybody here have any experience in this? Any inputs and help would be much appreciated!

RNA-Seq sequencing Assembly denovo • 1.2k views

updated 3.6 years ago by Biostar 20 • written 3.7 years ago by Dunois ★ 2.5k

I am working with RNA seq reads from some novel eukaryotes from diverse phyla

Are these samples independent, or is this a metagenomic sample where a single sample contains multiple organisms? If you are going to assemble them into individual transcriptomes, you should only worry about cleaning the sequences of adapters (if they are present). If this is a metatranscriptomic sample, then the same advice would apply for step 1.

3.7 years ago by GenoMax 141k

Hi @genomax. These are independent samples--each sample is a different species (sequenced from a pool of individuals). Adapter filtering has already been performed. My main concern is that reads from the transcriptomes of whatever organisms are present inside these guys (e.g., bacteria, or undigested small plankton they ate) would have been captured by the RNA seq, and would end up in the final assemblies. I'm concerned about this because the source individuals were taken directly from the wild, without any laboratory staging in between, before being sequenced. Should I not be concerned about this?

3.7 years ago by Dunois ★ 2.5k

Depending on the method used for RNAseq, there is a good possibility that RNA from bacteria would be excluded (did you do poly-A capture or bacterial rRNA depletion?). Plankton may have tougher cell walls and thus may not contaminate the final RNA pool. Again, that would depend on how you treated your organism of interest to get its RNA. If there are other eukaryotes, then those would be harder to exclude. Since you are working with a special sample, you may need to do this analysis iteratively.

Start building the transcriptome (Trinity?). Then use whatever sequences may be available for the nearest relatives (if genomes are available, great; otherwise look in the EST database at NCBI and get as many sequences as you can). Identify the sequences that hit known sequences in the database and separate them out from your data. Then you will have to carefully weed through the remaining sequences to assess what may or may not be from the species you are interested in. The answer in any case may not be cut and dry.
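The "separate out what hits" step could be sketched in Python as follows, assuming the assembled contigs were searched against the relatives' sequences with BLAST+ and the results written in the standard 12-column tabular format (`-outfmt 6`), where column 1 is the query ID and column 11 is the e-value. The function names and the `1e-5` cutoff are illustrative assumptions, not a recommended threshold.

```python
from typing import Iterable, List, Set, Tuple


def contigs_with_hits(blast_lines: Iterable[str], max_evalue: float = 1e-5) -> Set[str]:
    """Return query IDs with at least one hit at or below max_evalue."""
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(qseqid)
    return hits


def split_contigs(all_ids: Iterable[str], hit_ids: Set[str]) -> Tuple[List[str], List[str]]:
    """Partition contig IDs into (matched, unmatched), preserving input order."""
    matched, unmatched = [], []
    for contig_id in all_ids:
        (matched if contig_id in hit_ids else unmatched).append(contig_id)
    return matched, unmatched
```

The unmatched list is the set that still needs manual weeding; a contig with no hit is not necessarily a contaminant, since it may simply be novel or not expressed in the relatives.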

3.7 years ago by GenoMax 141k

I believe the mRNA was acquired using poly-A capture (I didn't do the wet lab stuff, so I can't really tell, unfortunately).

I will try out what you've suggested. I was hoping I'd not have to do that, but as you pointed out, this is probably not going to be cut and dry.

I had one more idea on how to approach filtering out the contaminants, and it'd be nice if you could voice your thoughts on it. Say, for example, that the sample in question is a nematode of some sort. What if I classify the reads (not the assemblies?) with a tool like Kraken2 against a database like RefSeq, and keep only whatever doesn't match that database for downstream assembly?
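As a rough illustration of this idea, the sketch below reads Kraken2's standard per-read output (tab-separated, with `C` or `U` in the first column and the read ID in the second) and keeps only the FASTQ records of unclassified reads. The function names are hypothetical, and single-end reads in uncompressed four-line FASTQ are assumed. Kraken2 can also write unclassified reads itself via its `--unclassified-out` option, so this is mainly useful for seeing what the filter does.

```python
from typing import Iterable, Iterator, Set


def unclassified_read_ids(kraken_lines: Iterable[str]) -> Set[str]:
    """Collect IDs of reads that Kraken2 marked as unclassified ('U')."""
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] == "U":
            ids.add(fields[1])
    return ids


def filter_fastq(lines: Iterable[str], keep_ids: Set[str]) -> Iterator[str]:
    """Yield the lines of each 4-line FASTQ record whose read ID is in keep_ids."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # The read ID is the header up to the first whitespace, minus '@'.
            read_id = record[0][1:].split()[0]
            if read_id in keep_ids:
                yield from record
            record = []
```

Whether "unclassified" really means "from the target organism" depends entirely on what is in the database, so the retained reads would still need checking.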

3.7 years ago by Dunois ★ 2.5k

So what if I map the reads (not the assemblies?) with a tool like Kraken2 against a database like RefSeq,

You could try it out, but I have some doubts as to how well this will work. Short reads may match many different sequences in RefSeq even if they did not originate from those organisms. Doing that with the assemblies may be more reasonable.

Some contaminants would be obvious and you can tackle them easily post-assembly. Others are going to be harder to discriminate.