Question

Filtering parasite reads from host reads

0

Entering edit mode

8 months ago

m.habib • 0

Hi all,

I have RNA-seq PE fastq reads for control and infected (at different time points) samples. The host species is lacking a reference genome. Thus, I will be doing a de novo transcriptome assembly and later differential gene expression analysis.

My questions:

how to get the host only reads to be used for generating the transcriptome with trinity?
how to specify the percentage of parasite reads in my samples?

In my initial trials, I downloaded and indexed the parasite genome and aligned my preprocessed reads (corrected, trimmed) to the index using HISAT2 but I noticed that the alignment rate is not logic (50-75%) given that the parasite used to infect the samples is very minute (in number and size compared to the host size).

Here are parts of the codes I used: - indexing after obtaining the splicing sites and list of exons files I ran the following:

hisat2-build -p 20 --ss $DATA_DIR/GCF_000699445.3_UoM_Shae.V3_genomic.ss --exon $DATA_DIR/GCF_000699445.3_UoM_Shae.V3_genomic.exon $DATA_DIR/GCF_000699445.3_UoM_Shae.V3_genomic.fna $OUT_DIR/GCF_000699445.3_UoM_Shae.V3_genomic.index

alignment

hisat2 -x $INDEX/GCF_000699445.3_UoM_Shae.V3_genomic.index --threads 20 --rna-strandness RF \
-1 "$DATA_DIR/${i}R1.fq.gz" -2 "$DATA_DIR/${i}R2.fq.gz" \
--bowtie2-dp --phred33 --very-sensitive \
 -k 10 --end-to-end \
 -S "$OUT_DIR/${i}.sam"

I was planing to use samtools to get the mapped (parasite) and unmapped reads (host) but I got confused with the high percentage of alignment. I appreciate also any suggestion for samtools flags to be used for the separation of host and parasite reads.

Thanks

tools RNA-seq • 1.4k views

ADD COMMENT • link updated 8 months ago by GenoMax 141k • written 8 months ago by m.habib • 0

2

Entering edit mode

I got confused with the high percentage of alignment

Looks like the parasite is an eukaryote but what about the host? Unless you are working with some exotic species there is bound to be some data (may be for a related species if not the same) in the databases. What kind of prep was used for the library prep (poly-A capture, ribodepletion?).

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

I can expect some similarity but not that high!. Both are eukaryotic.The host is a snail mollusk and the parasite is a trematode. Poly-A capture was used for library prep.

ADD REPLY • link 8 months ago by m.habib • 0

1

Entering edit mode

You could try to change alignment parameters as suggested by @liorglic but that may not fix this issue instantly. The end result may likely stay similar. It is possible that for whatever reason the RNA from the parasite made it into libraries preferentially and that is what you have. Since the libraries are already made that can't be changed short of making another library.

Since these are short reads you could still be getting crossmapping of host RNA with the parasite so this analysis is going to be tricky.

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

Thanks @GenoMax for your comments. I also tried bbduk filtering using the parasite fasta file against the PE reads.

bbduk.sh -Xmx100g in1=$DATA_DIR/"${base}"_R1.fq.gz in2=$DATA_DIR/"${base}"_R2.fq.gz out1=$OUT_DIR/"${base}"_clean.R1.fq.gz out2=$OUT_DIR/"${base}"_clean.R2.fq.gz outm1=$OUT_DIR/"${base}"_bad.R1.fq.gz outm2=$OUT_DIR/"${base}"_bad.R2.fq.gz ref=/users/habibm/taos-scratch/GCA_000699445.3_UoM_Shae.V3_genomic.fna stats=$OUT_DIR/"${base}" k=31

I am not sure if using bbduk is a vaild approach bu I got percentage of parasite reads in my samples that seems logic (~0.4-0.5%) but the problem is that there was no difference between controls and treatments samples. I have the same pattern in the replicates (n=7) of control and different time points of infections.

ADD REPLY • link 8 months ago by m.habib • 0

0

Entering edit mode

bbduk.sh is not appropriate to separate out reads like this. You should use bbsplit.sh for this application (use genome for the host that is close to the one you are using + genome of parasite) or even bbmap.sh with just parasite genome.

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

There is a genome for a very similar host species that I can use but is this reliable? Do you have suggestions for the parameters to be used with bbsplit.sh with paired end reads?

ADD REPLY • link 8 months ago by m.habib • 0

0

Entering edit mode

There is a genome for a very similar host species that I can use but is this reliable?

It is better than having no reference at all. The similar it is the better the result will be.

Use the default parameters for bbsplit.sh to begin with. Pay attention to the ambig2= parameter since it decides what happens with reads that align to both genomes. You may want to keep reads that align to both genomes since you can't be sure which one they come from. If you choose to "toss" them then you may lose data. Use refstats=<file> option and write results to a file and then post those stats here.

Here is some info: BBSplit syntax for generating builds for the reference genome and how to call different builds.

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

Thanks @GenoMax for your patience and time. Is the following approach correct?

1- indexing the two genomes

bbsplit.sh ref=parasite_genome.fasta
bbsplit.sh ref=host_genome.fasta

2- classification of PE reads

bbsplit.sh in1=reads_R1.fq in2=reads_R2.fq \
          ref=/path/to/parasite_genome/index,ref=/path/to/host_genome/index \
          basename=out_%.fq \
          outu1=clean1.fq outu2=clean2.fq \
          ambig2=all minratio=0.56 minhits=1 maxindel=16000 refstats=<file>

Based on basename=out_%.fq I should get two files for host_reads.fq and parasite_reads.fq then I use reformat.sh to obtain r1 and r2, right? what about the clean reads? what should I do with reads that does not map to either genomes? it could be reads related to differential expression!

for the ambiguous reads, I want to keep those that map to each reference as shared reads added to the host genome. is ambig2=all the right option for that? ambig2=all Write a copy to the output for each reference to which it maps.

ADD REPLY • link 8 months ago by m.habib • 0

1

Entering edit mode

You don't need to pre-index the genomes unless you want to run this multiple times. Try the following. We don't want too many reads going into the "clean" files (that name was in ref to something else, you could call them "uninteresting", since they should not be from either of your genomes). Those could be assembled separately to see what you get later.

bbsplit.sh in1=reads_R1.fq in2=reads_R2.fq \
          ref=/path/to/parasite_genome/parasite.fa,/path/to/host_genome/host.fa \
          basename=out_%.fq \
          outu1=clean1.fq outu2=clean2.fq \
          ambig2=all maxindel=16000 refstats=<file>

Set maxindel= to a number that you know the average intron size is for your organism. For eukaryotes it could be as large as 100k(b). You may need to specify memory explicitly by -XmxNNg in case you run out of memory. You may need 30-60GB of RAM depending on size of your data and genome sizes.

You can add .gz extension to all fastq files to keep files compressed. BBMap will be fine with that.

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

Hi GenoMax I am still running the script:

 bbsplit.sh in1=$DATA_DIR/${i}R1.fq.gz in2=$DATA_DIR/${i}R2.fq.gz \
ref=parasite.fna,snail.fna \
basename=$OUT_DIR/${i}_out_%.fq.gz \
outu1=$OUT_DIR/clean1_${i}.fq.gz outu2=$OUT_DIR/clean2_${i}.fq.gz \
ambig2=all maxindel=1900 refstats=$OUT_DIR/${i}_refstats.txt

I adjusted the maxindel=1900 which is the average intron length for both the snails (1.3-2 kb) and parasite (1.9kb).

Almost half of the sample are done now. Attached is a photo for sample stats file for one of the control replicates and one of the infected (6 hr post infection) replicates

Attached is a photo for sample stats file for one of the control replicates and one of the infected (6 hr post infection) replicates .

I am still waiting for the all samples to be finished. But based on these stats it looks like even the control (not infected with the parasite) has some parasite reads. I assume that these may be shared reads with other contaminants within the control samples.

Now, how to move forward with these results?

For de novo transcriptome of host reads only should I merge the snail reads with the clean.fq.gz files (which does not map to either the snail or the parasite) because it may represent reads specific to my snail species and not shared with the closely related species that I used in the alignment!

How to calculate the % of parasite reads in my samples?

ADD REPLY • link 8 months ago by m.habib • 0

1

Entering edit mode

At least now this result is more in line with what your expectation was (more host and much less parasite). Looks like you will get plenty of reads here from all of your samples to pool and do a de novo assembly using trinity for the snail? The percentage of ambiguous reads is small enough that you probably can set them aside for now.

I would suggest merging "clean" reads and assembling them separately to see what you get. At least in round 1. Looks like you are getting at least 20-25% in this category? Are snails/parasites cultivated aseptically, if not these may simply be contaminants that you can't use.

You seem to have much less of parasite reads (2500 odd compared to 30-40 M for the host). If you consider the "clean" reads to be also host then the % will drop even further.

ADD REPLY • link 8 months ago by GenoMax 141k

0

Entering edit mode

Thanks GenoMax . The snails (live in freshwater) and parasites (develop in snails and rodents) are subject for contamination from water and during processing (infection and later on RNA extraction!).

The percentage of ambiguous reads is small enough that you probably can set them aside for now.

These are also included in each of the snail and parasite reads, right? because I want to keep them in the snail reads.

But I am also concerned that there might be some reads specific to the species!

I would suggest merging "clean" reads and assembling them separately to see what you get.

what is the purpose of assembling the clean reads separately?

to get the percentage of the parasite reads within each sample should it be:

N=number of unambiguous parasite reads / number of assigned snail reads + number of clean snail reads (from clean2_${i}.fq.gz)

ADD REPLY • link 8 months ago by m.habib • 0

2

Entering edit mode

Ah yes, you did ambig2=all so those reads would be in both bins. That number is small (~2500) compared to rest of the data so that should not cause big issues.

what is the purpose of assembling the clean reads separately?

Those reads were neither snail or parasite. While you hope there are reads specific for you species in there based on where the snails live (and how they were collected) there is a definite chance that the reads may be from contaminants. So keeping them separate, assembling and the blasting a few will give you an idea of what may be in there. Depending on what you find those may ultimately be non-usable or ambiguous with respect to the snail or the parasite. You may need to do additional experiments to prove that they came from parasite or snail.

ADD REPLY • link 8 months ago by GenoMax 141k

score 1 · Answer 1 · 2023-08-08

I am not familiar with the specific HISAT parameters (I usually use STAR), but I suspect the the high ratio of alignment is because your definition for what reads are considered mapped is too permissive. As you suggested, try using samtools to filter the alignment results, e.g. based on MAPQ, alignment length, number of mismatches etc. It would be hard to suggest specific cutoffs, as they may differ based on the specific data, so you could, for example, plot a histogram of MAPQ values and see if any "natural" cutoffs can be detected.

Another option is to use a dedicated tool for assigning reads to taxa. One popular option is Kraken2, which uses k-mers to determine the source of reads. Since your host and parasite are from different phyla, it shouldn't be too hard to tell them apart.