Bacterial WGS via Illumina short reads - is it normal to have 0 coverage in some regions?
2
0
9 months ago
hwen7 ▴ 10

(EDITED 8/4/23 for clarity)

I recently miniprepped supernatant from E. coli lysed by prophage induction and submitted the sample for NGS (2 x 150 bp). (Note that this sample was pooled with similarly lysed samples from other species to get the most out of the sequencing run.)

When the .fastq data came back, I trimmed the bulk data (Trimmomatic) and mapped the reads back to the individual genomes using BWA. I used samtools to get a .bam file, sorted it, and then visualized it in IGV just to get a sense of how the reads mapped.
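
In case it helps, here is roughly what that pipeline looks like as shell commands (a minimal sketch; file names, thread counts, and trimming parameters are placeholders, not necessarily the exact ones I used):

    # Trim the pooled reads (paired-end mode)
    trimmomatic PE -threads 4 \
        pooled_R1.fastq.gz pooled_R2.fastq.gz \
        trimmed_R1.fastq.gz unpaired_R1.fastq.gz \
        trimmed_R2.fastq.gz unpaired_R2.fastq.gz \
        SLIDINGWINDOW:4:20 MINLEN:50

    # Map trimmed reads to one genome, then sort and index for IGV
    bwa index ecoli_ref.fasta
    bwa mem -t 4 ecoli_ref.fasta trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
        | samtools sort -o ecoli.sorted.bam -
    samtools index ecoli.sorted.bam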

In the picture below, where the reads were mapped back to E. coli, the first row is the coverage track and the second row shows the actual reads. I am only showing a representative 20 kb segment of the genome, and you can clearly see some regions getting far better coverage, whereas some regions have no reads at all. By eye, this pattern held across almost the entire genome. I was wondering if this is normal? The reference genome I used for mapping came from Nanopore WGS done by a company, so it should be reliable.

[image: IGV view of reads mapped to E. coli - coverage track (top) and reads (bottom) over a representative 20 kb segment, with some regions well covered and others empty]

However, in the picture of reads mapped back to B. subtilis (also prophage-lysed, miniprepped in the same way as above, and pooled with the E. coli sample for submission), the data look much better:

[image: IGV view of reads mapped to B. subtilis, showing even, continuous coverage]

NGS illumina WGS • 1.6k views
ADD COMMENT
0

This does not look like WGS data. The coverage should not be sparse like this. What is the expected size of the genome and how much data (in terms of gigabases) did you end up with after trimming?

miniprepped the whole genome of strain of interest

Did you actually do a simple miniprep without any additional purification/effort to keep the genomic DNA intact/high-MW? Did you make the libraries yourself, or did your sequencing provider make them?

ADD REPLY
0

Thanks for the reply!

The expected genome size is 4.2 Mbp (E. coli), and this particular sample resulted in a sorted .bam file of 820 MB. Because this sample was pooled with samples from other species, the data came back in bulk and were trimmed together, so I can't say for sure how much of the trimming was due to this E. coli sample. However, the bulk fastq files went from 1.8 GB/1.8 GB (F+R) before trimming to 1.6 GB/1.6 GB (F+R) after.

I apologize; maybe I used the wrong term. This sample was actually bacterial supernatant (containing bacteria lysed by prophage induction), which I cleaned up using a total DNA kit (https://www.zymoresearch.com/products/quick-dna-miniprep). I did the library prep myself using a NEBNext Ultra kit.

My goal was actually to see how much DNA mapped back to the prophage and how much mapped back to the backbone genome.

ADD REPLY
0

Because this sample was pooled with samples from other species, the data came back in bulk and were trimmed together.

What does this mean? Did your sample not have its own index? I hope it had its own index, because otherwise any analysis of a mixed sample would be meaningless.

My goal was actually to see how much DNA mapped back to the prophage and how much mapped back to the backbone genome.

If you had the correct references you should have been able to find the answer to this.

The top screenshot you posted above is misleading and does not provide a complete picture. What do other areas of the genome look like?

Edit: Did you add the second screenshot after your original posting? I don't recall seeing it before. But that is how WGS data should look.

ADD REPLY
0

What does this mean? Did your sample not have its own index? I hope it had its own index, because otherwise any analysis of a mixed sample would be meaningless.

Each sample did indeed have its own index, so they were separated out.

The top screenshot you posted above is misleading and does not provide a complete picture. What do other areas of the genome look like?

The other areas look similar. What I showed above is a representative portion, but by eye, most regions had this sort of spotty mapping.

Edit: Did you add the second screenshot after your original posting? I don't recall seeing it before. But that is how WGS data should look.

Yes, after your post I edited my original post to provide more information. Indeed, the second read mapping was good. I was surprised, since I followed the exact same workflow for both samples. I wonder whether this means the reference I'm mapping to in the first picture is bad?

ADD REPLY
0

The other areas look similar.

This is puzzling. Can you post the result of samtools idxstats your.bam? How many reads did you have (number and length), and how many of them mapped to the genome (%)? E. coli is not an odd bacterium, and your reads should align at a relatively high % no matter which E. coli genome you use.
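
For example (a sketch; the BAM file name is a placeholder):

    samtools idxstats your.bam    # per reference: name, length, # mapped, # unmapped
    samtools flagstat your.bam    # overall read counts, including % mapped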

ADD REPLY
0
For E. coli:

    # ref name                length     mapped    unmapped
    Escherichia_coli_genome   4526074    949668    7624
    *                         0          0         9085814

For B. subtilis (which had good mapping):

    # ref name                length     mapped    unmapped
    Bacillus_subtilis_genome  4215606    1001983   2085
    *                         0          0         9042358

I also used samtools stats and visualized some of the data. Here is what the table looked like for E. coli (spotty mapping in the original post), B. subtilis (good mapping in the original post), and S. Typhimurium (not shown in the original post, but another sample that had phenomenal mapping):

[image: samtools stats summary table for the E. coli, B. subtilis, and S. Typhimurium samples]

ADD REPLY
0

This is getting stranger. <10% mapping is not great for a WGS sample (unless it is grossly contaminated with something else, is a metagenomic sample, etc.). What are the other ~90% of reads? Did you take a few and BLAST them at NCBI to check?

With WGS we normally expect the opposite: 90+% of reads mapping and only a small fraction not.

Which aligner did you use, and are these paired-end reads?
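
To grab a few unmapped reads for that BLAST check (a sketch; the BAM file name is a placeholder):

    # -f 4 keeps only reads with the "unmapped" flag set
    samtools fastq -f 4 ecoli.sorted.bam > unmapped.fastq
    head -n 40 unmapped.fastq    # first 10 reads; paste a few into BLAST @NCBI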

ADD REPLY
0

The way the samples were prepped, they were not true WGS minipreps. They were supernatants taken from lysed bacteria --> cleaned up with a generic DNA cleanup kit --> pooled with samples from other species (e.g., E. coli samples were pooled together with B. subtilis and Salmonella typhimurium) and sequenced. The bulk fastq data were mapped back only to the source genome, though.

I guess that of the ~10M pooled reads, 10% went to E. coli, 10% to B. subtilis, and 50% to Salmonella... in that sense it is a 'metagenomic' sample, in that it was artificially pooled.

These were 2 x 150 bp paired-end reads. I used the bwa mem aligner.

ADD REPLY
0

I see. I thought the tables above were discrete stats for three samples, but the totals you are showing are for all reads. This is not how we generally represent data for independent "samples".

Sorry to be pedantic, but did these samples have their own indexes, and were they separate library preps?

The bulk fastq data were mapped back only to the source genome, though.

It sounds to me like when you say "bulk fastq" you are referring to a pool of all samples. Keep in mind that aligners will make a best effort to align sequence data. If the data contain reads that originated from a different genome, those reads may still be aligned to a heterologous genome. The aligner has no way of discriminating beyond the alignment it sees.

Considering all this, it seems that only parts of the E. coli genome are showing up in the supernatant, whereas the other two genomes seem to be well represented. Reason? Inefficient lysis of E. coli?

ADD REPLY
1

Sorry to be pedantic, but did these samples have their own indexes, and were they separate library preps?

This is my first time troubleshooting this sort of experiment, so I really appreciate all the questions, thank you! The samples I am showing data for did not have their own indexes. The plan was to differentiate the samples by mapping them to different genomes, with the assumption that only E. coli reads (in the 'bulk fastq') would map back to E. coli, etc.

I know I mentioned indexes earlier, but indexes in this experiment were only used to separate different treatment groups. Within each treatment group, no per-sample index was used. (Again, the idea was to differentiate samples by genome.)

Here's what I mean:

It sounds to me like when you say "bulk fastq" you are referring to a pool of all samples. Keep in mind that aligners will make a best effort to align sequence data. If the data contain reads that originated from a different genome, those reads may still be aligned to a heterologous genome. The aligner has no way of discriminating beyond the alignment it sees.

I see... perhaps this is where I had an incorrect assumption. I assumed E. coli reads would only map back to E. coli, since it is presumably different enough from B. subtilis and Salmonella. As I mentioned above, this was how I thought to differentiate the samples --> should I have used an index for the different strains? Although, how would one explain that Salmonella still had very good mapping despite the bulk fastq data coming from a mixed bag?

Considering all this, it seems that only parts of the E. coli genome are showing up in the supernatant, whereas the other two genomes seem to be well represented. Reason? Inefficient lysis of E. coli?

Indeed, that's possible. It turns out that for E. coli, while there was spotty mapping, the region harboring the lambda prophage had extremely high coverage and was not at all spotty like the other regions... I wonder if that 'took up' most of the E. coli reads.

ADD REPLY
1

with the assumption that only E. coli reads (in the 'bulk fastq') would map back to E. coli, etc.

You can't be 100% certain with short reads. You did not map the data to all three genomes at the same time, so you don't know how much multi-mapping there is going to be. Within a single genome there is not a lot (the reads with MQ=0, which I assume stands for mapping quality, are multi-mapping reads).
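
One way to check is competitive mapping: concatenate all three references into a single FASTA, map the pooled reads once, and let each read go to its best reference. A sketch (file names are placeholders):

    cat ecoli.fasta bsubtilis.fasta styphimurium.fasta > combined.fasta
    bwa index combined.fasta
    bwa mem -t 4 combined.fasta trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
        | samtools sort -o combined.sorted.bam -
    samtools index combined.sorted.bam
    samtools idxstats combined.sorted.bam    # reads now partitioned among the three genomes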

should I have used an index for the different strains?

Yes, each sample should have been indexed separately. That way you know which sample/genome each read came from, which you need to know when dealing with multiple samples/genomes. Also, when you compare among groups, you need to be sure the comparison is valid and that you can do appropriate normalization on the data.

Although, how would one explain that Salmonella still had very good mapping despite the bulk fastq data coming from a mixed bag?

There is no scientific explanation; it could be entirely by chance. It looks like the "prep" you did somehow worked better for that strain than the others (assuming you did exactly the same prep for all at the same time). If you repeated the experiment (you will need biological replicates later, BTW), it might be some other genome, or all of them.

the region harboring the lambda prophage had extremely high coverage and was not at all spotty like the other regions

The virus must have had a chance to replicate during the lytic cycle, and its genome, being relatively small, ended up at more copies in the prep.

I wonder if that 'took up' most of the E. coli reads

Probably not. Generally there is an excess of library reagents, so you are not going to run out of those. Every fragment of DNA in a sample has an equal chance of being made into a library fragment; if the viral DNA is present in a much greater amount than the bacterial fragments, then that is what you are going to see.

The ratio of viral to bacterial reads is something you will need to account for in your comparisons. Depending on what this experiment is about, it may be an important consideration.
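
To quantify that ratio, you could compare the mean depth inside the prophage region with the genome-wide mean, e.g. (a sketch; the region coordinates are placeholders for wherever the lambda prophage sits in your reference):

    # mean depth across the prophage region
    samtools depth -a -r Escherichia_coli_genome:100000-150000 ecoli.sorted.bam \
        | awk '{sum+=$3} END {print sum/NR}'
    # mean depth across the whole genome
    samtools depth -a ecoli.sorted.bam | awk '{sum+=$3} END {print sum/NR}'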

ADD REPLY
1
9 months ago

This is only a missing ~27 kbp region of the whole genome, I'd guess. Pull in the gene annotation in GFF3 or GTF format to check which genes are missing. And yes, 0-coverage regions are completely normal (think deletions in the isolate of interest, i.e., the reads, vs. the reference).
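
A sketch of that check with bedtools (file names are placeholders):

    # intervals with zero coverage
    bedtools genomecov -bga -ibam ecoli.sorted.bam | awk '$4 == 0' > zero_cov.bed
    # annotated genes overlapping any zero-coverage interval
    bedtools intersect -a annotation.gff3 -b zero_cov.bed -u | awk '$3 == "gene"'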

If this reference sequence is supposed to come from the exact same isolate the short reads came from, then perhaps the assembly is poor/wrong in this region. By the way, the regions with coverage are likely to be false positives here (lots of SNPs and indels in the reads vs. the genomic average, which you don't show here).

In future, I would suggest using Nanopore or PacBio to sequence your genomes, since whole genomes can then be assembled de novo into just a few contigs, vs. hundreds of contigs if you use short reads. Short reads are of course great for resequencing, i.e., finding SNPs and population variability.

ADD COMMENT
0
9 months ago
Prash ▴ 280

Excellent points by GenoMax and the thread. May I suggest FastQ Screen for multi-genome mapping? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6124377/ Nevertheless, I'd prefer taking this approach with Nanopore, avoiding PCR and getting longer reads. Also, pulling in the annotation using a GFF is a great idea, as suggested by colindaven!
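
For example, a minimal FastQ Screen setup might look like this (a sketch; the index paths are placeholders, and FastQ Screen expects a pre-built Bowtie2 index per genome):

    # fastq_screen.conf
    #   DATABASE  Ecoli         /path/to/bowtie2_index/ecoli
    #   DATABASE  Bsubtilis     /path/to/bowtie2_index/bsubtilis
    #   DATABASE  Styphimurium  /path/to/bowtie2_index/styphimurium

    fastq_screen --conf fastq_screen.conf pooled_R1.fastq.gz pooled_R2.fastq.gz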

ADD COMMENT
