Tutorial:How to find the mapping percentages for data deposited in the Zaire ebolavirus bioproject from the 2014 outbreak
1
7
Entering edit mode
7.8 years ago

I was working on a material that I put together as an example of studying the Ebola data and strangely code that ran fine in the Fall now did not produce any results. Quite the head scratcher ...

I ended up troubleshooting it for a while only to realize that most of the new data deposited into the main Ebola project does not actually map to Ebola. At all. In fact 541 files map at rates below 1%!

Below is the code used to evaluate the mapping percentages for all data (891 files) in this project, see the results at the end). Hopefully it might save some time to someone else.

ebola bwa • 2.2k views
1
Entering edit mode
7.8 years ago
piet ★ 1.8k
0.25 bam/SRR1735115.bam 100 + 0 properly paired (0.25%:nan%)


I have repeated the mapping for a read-set you have found only 100 reads for. I mapped SRR1735115 to KR817241.1 with bwa mem. I found 7690 reads mapping. This is a 40-fold coverage in average (the Ebola genome is only 18000 nt in size). I inspected the mapping with Tablet. The reads show a very high error rate. I have never inspected RNA sequencing experiments before. Maybe that is normal? Nevertheless, I could clearly identify a few SNP with respect to the reference genome (which is from the 2014 outbreak).

In my opinion the challenge is sample preparation (as always). Ebola is a RNA virus. You have to prepare RNA from human blood serum, then reverse-transcribe it into DNA. Isolating RNA from blood is tricky and error-prone.

You stated that all samples were positive in qPCR (quantitative PCR). PCR is very sensitive. If PCR is positive, this does not mean that there is enough RNA in the serum for sequencing. With qPCR you can determine the viral load. The viral load varies by several orders of magnitude depending on the stage of the disease, but we currently know little about this for Ebola. It would be very interesting to compare the viral loads estimated from qPCR with the results from read mapping.

1
Entering edit mode

I was curious to find out where the more than 2.5 million reads not mapping to Ebola come from. First I mapped them to some human genes (dhfr, actb, 18S rRNA), but does not got any hits. The serum seams to be free of human RNA. Then I mapped to some bacterial rRNA operons from several phyla (proteobacteria, firmicutes, actinomycete). I got diffuse mapping with thousands of reads for all of them. Thus the major fraction of RNA in this sample is bacterial rRNA from a broad spectrum of species. Presumably the serum was stored a room temperature for an elevated period of time.

I also noted that all of the read sets with high amount of reads mapping to Ebola belong to the 99 fully assembled sequences submitted to Genbank along with the Science paper. The other read sets seam to be just the trash of that great study.

0
Entering edit mode

That's pretty cool. Thanks for sharing, I wish more studies had posts like this were we can talks about it.

Also since we are talking about it, here is the original post that I was trying to reproduce: Mission Impossible: you have 1 minute to analyze the Ebola Genome As it turns out this time one needs to pick the right data.

0
Entering edit mode

I've selected the first 20K (paired) reads from the samples so that I could map all samples (891). I wonder was the mapping percentage 0.25% a good approximation?. But you are right in that this may be sufficient. I have also noted low sequencing accuracy for the samples with low coverage.

0
Entering edit mode

The ratio of mapped reads is 0.3 % in my mapping, which is very close to your result obtained with only the first 20K.