Identify contaminants in my transcriptomic sequences
7 months ago

Dear community, I'm currently working with the transcriptome of a non-model plant organism. For this study I began by assembling a transcriptome using the genome as a guide, with short + long reads. Afterward, I extracted all the non-mapping read pairs (from the short reads) and the non-mapping long reads (around 10% of the data) to build a new, reference-free transcriptome. I also checked the quality of the non-mapping reads. Not surprisingly, I suspect some of them are contamination, since some of the samples show two peaks in the GC content plot. I assembled the reads using Trinity and BLASTed 100 randomly chosen sequences against nr. I was expecting to find fungal, human or animal sequences, but instead I only got plant sequences in my results. Although this appears to be good news, I want to make sure I really do not have contaminant sequences. What would be the best path to make sure I do not have contaminant sequences?

Assembly transcriptome de-novo • 331 views
7 months ago
GenoMax 107k

What would be the best path to make sure I do not have contaminant sequences?

More sequencing and careful annotation work. While you have done your assemblies, I doubt you have a finished genome for your organism (that last 10% may take as much work as the first 90%). So it is quite possible that what you have are sequences from your organism. You can design some PCR primers and see if you can recover the sequences from the genome.

I was expecting to find fungi, human or animal sequences,

Why should that happen? If experimental/sequencing people were diligent in their experimental protocols then random contamination is not likely.


I forgot to add this to the post, but some of the samples were grown outside without controlled conditions (this is a weird experimental setup, but it was relevant for our biological question).

More sequencing and careful annotation work. While you have done your assemblies, I doubt you have a finished genome for your organism (that last 10% may take as much work as the first 90%). So it is quite possible that what you have are sequences from your organism. You can design some PCR primers and see if you can recover the sequences from the genome.

I do not have a finished genome (it is not even at chromosome level). What I am trying to figure out is whether these sequences include transcripts that do not belong to my species. Below is the GC plot I mentioned in my post.


Judging this purely using informatics is going to be inconclusive. As you already discovered, these are coming up as plant sequences. That of course does not say much, since they could be from diverse plant lineages and may still represent contaminants.
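
If you do want to keep spot-checking with BLAST, it may help to sample from each GC peak separately rather than uniformly, so the smaller peak is not under-represented. Below is a minimal Python sketch of that idea, not a tested pipeline. The `cutoff` of 0.5 between the two peaks, the 50 sequences per bin, and the function names are all assumptions for illustration; pick the cutoff from the valley in your own GC plot. The inline FASTA stands in for your Trinity output.

```python
import random
from io import StringIO

def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Keep only the first word of the header as the ID.
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def gc_fraction(seq):
    """G+C over A+C+G+T, ignoring ambiguous bases; 0.0 if none are present."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0

def sample_per_gc_bin(gc_by_id, cutoff=0.5, n_per_bin=50, seed=1):
    """Split IDs at a GC cutoff (an assumed value) and sample each side."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    low = sorted(i for i, gc in gc_by_id.items() if gc < cutoff)
    high = sorted(i for i, gc in gc_by_id.items() if gc >= cutoff)
    return {
        "low_gc": rng.sample(low, min(n_per_bin, len(low))),
        "high_gc": rng.sample(high, min(n_per_bin, len(high))),
    }

# Tiny stand-in for an assembly FASTA (made-up sequences).
demo = StringIO(">t1 len=8\nGGCCGGCC\n>t2\nATATATAT\n>t3\nATGC\nGCNN\n")
gc_by_id = {name: gc_fraction(seq) for name, seq in read_fasta(demo)}
picks = sample_per_gc_bin(gc_by_id, cutoff=0.5, n_per_bin=50)
```

The IDs in `picks["low_gc"]` and `picks["high_gc"]` can then be BLASTed separately. If one peak consistently hits a different lineage than the other, that is a hint worth following up, though still not proof.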

I forgot to add this to the post, but some of the samples were grown outside without controlled conditions (this is a weird experimental setup, but it was relevant for our biological question).

You could try assembling a separate transcriptome from the controlled samples and see if anything similar to these sequences shows up there. If not, that observation could be considered a +1 for these being possible contaminants.

You could also build two transcriptomes (controlled samples and not) and then select only the transcripts that are common to both.
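
As a rough sketch of that comparison, suppose the outdoor-sample transcripts were BLASTed against the controlled-sample assembly with tabular output (`-outfmt 6`, whose default columns start with qseqid, sseqid, pident, length). The Python below splits the outdoor transcripts into those with a good hit in the controlled assembly and those without. The thresholds (95% identity, 200 bp alignment), the function name, and the demo IDs are assumptions for illustration, not recommended values.

```python
import csv
from io import StringIO

def shared_and_unique(all_query_ids, blast_tsv, min_pident=95.0, min_len=200):
    """Return (shared, unique) sets of query IDs from BLAST tabular hits.

    A query counts as shared if at least one hit passes both thresholds.
    The thresholds are illustrative assumptions and should be tuned.
    """
    shared = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        # Skip blank lines, comment lines and truncated rows.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        qseqid, pident, length = row[0], float(row[2]), int(row[3])
        if pident >= min_pident and length >= min_len:
            shared.add(qseqid)
    all_ids = set(all_query_ids)
    return shared & all_ids, all_ids - shared

# Made-up hits: only out_1 passes both identity and length thresholds.
rows = [
    ["out_1", "ctl_7", "99.1", "850", "7", "0", "1", "850", "10", "859", "0.0", "1500"],
    ["out_2", "ctl_3", "82.0", "300", "54", "0", "1", "300", "5", "304", "1e-60", "230"],
    ["out_3", "ctl_9", "98.0", "120", "2", "0", "1", "120", "1", "120", "1e-50", "200"],
]
demo_hits = StringIO("\n".join("\t".join(r) for r in rows) + "\n")
outdoor_ids = ["out_1", "out_2", "out_3", "out_4"]  # out_4 has no hit at all

shared, unique = shared_and_unique(outdoor_ids, demo_hits)
```

Here `shared` would be the conservative set to carry forward, while `unique` holds the candidates to inspect as possible contaminants. Keep in mind that genes expressed only under outdoor conditions would also land in `unique`.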

What is the long term goal of your experiment? To generate a transcriptome and stop there?


What is the long term goal of your experiment? To generate a transcriptome and stop there?

Currently, we have two goals: assembling a transcriptome for our species that could be used in future studies, and using that transcriptome ourselves for DGE and network analysis.