SRA screening for parasite contamination
8 months ago
e.bessette ▴ 10

Dear all,

I am writing in the hope that someone can help me with my PhD project. I have started a PhD in insect pathology and need to establish a microsporidia-insect model (I don't have a specific microsporidian species for my project). My current work therefore aims to assess the diversity of microsporidia (fungi-like parasites) that may affect insects reared for feed and food (such as locusts, crickets, mealworms, etc.).

The literature is full of experimental infections, but I would like to find more "natural infections". To do so, I am using lab techniques such as microscopy and PCR on insect samples from pet shops. My supervisor and I would also like to use a bioinformatic approach, screening data from the Sequence Read Archive (SRA, NCBI), where a lot of data are available but have not been analysed. I have zero experience in bioinformatic / genomic studies, which is why I am looking for help and inspiration.

So, I screened some SRA data in the hope of finding microsporidia species as "contaminants" in insect SRA data. I used microsporidia 18S rRNA gene sequences (registered in NCBI as "16S" because of the gene's short length, although microsporidia are eukaryotes) as query sequences for BLAST analysis against SRA runs (AF069063 is one example of a query sequence; all the work is done with the NCBI web tool). I selected insect SRA projects that had used a whole-genome sequencing (WGS) approach, because many projects used 16S amplicon-based sequencing and are unusable for microsporidia screening, since microsporidia are eukaryotes.

For now I don't have great results: I found a lot of hits with low Query Cover and high Percent Identity, and most of the time the hits correspond to insects, plants or fungi. When I do find a hit that corresponds to a microsporidian, it also corresponds to other eukaryotes, so the microsporidian match does not seem specific to the insect I screened.

Therefore I have several questions:

1. Are the rRNA gene sequences specific enough for my screening?
2. Can I use rRNA gene sequences as query sequences against RNA-Seq SRA projects?
3. Could I use other free software (rather than the NCBI web tool) or another approach to get more specific outputs, i.e. only microsporidia? Besides, it is tedious to look through the BLAST results for each query sequence.
4. Could I directly use microsporidia proteins, which would be more specific, for the BLAST search? If yes, how could I run a BLAST analysis with protein data against nucleotide data (i.e. WGS from SRA projects)?

Thank you for your responses and insights. Best wishes, Edouard

blast data screening rRNA parasite host • 242 views
8 months ago
Mensur Dlakic ★ 14k

While rRNA can be used for mapping, you may want to improve your chances of recruiting reads by using a larger reference sequence. I suggest the bbsplit.sh script from the BBMap package, or mirabait, which is part of MIRA.

Let's say you collect all the DNA sequences you can find for your organism (genomic DNA works fine) and put them in a file called myseq.fa. If the paired-end reads you want to check for contamination are in all_reads.fastq, you could try these two commands, one for each package:

bbsplit.sh build=1 in=all_reads.fastq ref_x=myseq.fa out_x=output_reads.fastq
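
Once bbsplit.sh has written output_reads.fastq, a quick sanity check is to count how many reads were recruited to the reference. The following is only a minimal sketch, assuming standard four-line FASTQ records; the two reads written to the temporary file are made-up stand-ins so that the snippet runs on its own, and the variable names are illustrative.

```shell
# Hypothetical stand-in for the bbsplit.sh output so this sketch runs on its own;
# in practice, set fastq to the path of your real output_reads.fastq.
tmpdir=$(mktemp -d)
fastq="$tmpdir/output_reads.fastq"
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGCAAGC\n+\nIIIIIIII\n' > "$fastq"

# Assumption: each FASTQ record spans exactly four lines, so reads = lines / 4.
n_lines=$(wc -l < "$fastq")
n_reads=$((n_lines / 4))
echo "recruited reads: $n_reads"
```

If the output is interleaved paired-end data, this count includes both mates of each pair. A count of zero would suggest that nothing in the run matched myseq.fa under the mapper's default settings.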