Question

What is the range of coverage depth for bacteria in an environmental sample?

7.8 years ago

thustar ▴ 130

Hi biostars!

I want to simulate shotgun sequencing of bacterial genomes. However, I do not know the range of coverage depth for bacteria in a real sample. For example, the minimum, maximum, and average coverage depth in real data would be helpful. By coverage depth I mean the concept used in next-generation sequencing.
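
For reference, mean coverage depth is usually estimated as (number of reads × read length) / genome size. In a mixed sample the sequenced bases are shared among genomes in proportion to genome size times copy number. The sketch below works that out; the function name, genome names, and all numbers are made up for illustration, and it assumes uniform, unbiased sampling of the DNA pool.

```python
def expected_coverage(total_reads, read_length, genome_sizes, abundances):
    """Expected mean coverage depth per genome in a mixed sample.

    abundances are relative genome copy numbers (e.g. cells), not read fractions.
    """
    # Total DNA in the pool, weighting each genome by its copy number.
    pool_bp = sum(genome_sizes[name] * abundances[name] for name in genome_sizes)
    sequenced_bp = total_reads * read_length
    # Every genome copy is sampled at the same rate, so depth scales
    # with copy number and is independent of the genome's own size.
    return {
        name: sequenced_bp * abundances[name] / pool_bp
        for name in genome_sizes
    }


# Illustrative values only, not taken from real data.
genome_sizes = {"genome_A": 5_000_000, "genome_B": 2_000_000}  # bp
abundances = {"genome_A": 1.0, "genome_B": 10.0}               # relative copies
depths = expected_coverage(1_000_000, 100, genome_sizes, abundances)
print(depths)
```

Under these assumptions the ratio of depths between two genomes equals the ratio of their copy numbers, which is why relative abundance matters so much when choosing depths for a simulation.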

Thanks

gene next-gen-seq reads Assembly • 1.8k views

It varies a lot. Sample complexity can differ between any two samples depending on origin, isolation protocol, DNA extraction and more. You could use public data instead of simulating it.

ADD REPLY • link 7.8 years ago by Asaf 10k

Thanks.

Where could I find such a dataset? And is there a related statistical report on it?

I want to simulate mostly because I want a ground truth, i.e. I want to know which genome an assembled sequence comes from. With real data, if two or more genomes share a repeated subsequence, I cannot tell how the short reads behave in that situation, which is exactly what I want to figure out.

ADD REPLY • link 7.8 years ago by thustar ▴ 130

Of general interest would be this journal: http://msystems.asm.org/

I did a simulation with a mix of 20 bacterial genomes last year, using the BBMap suite (randomreads.sh) to generate reads at varying depth from this pool of genomes (1-50 million reads of varying quality). In most cases it was possible to map the reads back to the reference with good accuracy. BBSplit was also able to assign the reads to unique bins. About 2% of the reads were multi-mappers (not much can be done about those in terms of assignment) and could not be assigned to a bin.

ADD REPLY • link 7.8 years ago by GenoMax 141k

Thanks for providing so much information.

I am very interested in your previous simulation work. I think I am doing a similar task. Do you have a detailed report? It would be helpful to compare our results.

Besides, the link you provided is about environmental bacterial communities. It seems to concentrate more on the diversity of bacteria (i.e. the total number of bacterial species) than on the relative abundance of the different species. The latter is critical for validating a simulation. Do you know of other reports related to that?

ADD REPLY • link 7.8 years ago by thustar ▴ 130

You can easily do the simulation yourself. Get the genomes (in my case they were diverse genera, if I recall correctly), create a multi-FASTA file, use randomreads.sh to simulate a dataset with the read length and depths you want, and then map the reads back to the "mega-genome" using BBMap. With BBSplit, use the individual genomes as references instead.
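
To make the idea concrete, here is a minimal pure-Python stand-in for the read-generation step. It is not randomreads.sh and does not reproduce its behaviour; the function name and the read ID format are made up for illustration. It samples error-free reads uniformly from each genome at a chosen depth, treats genomes as linear, and writes the source genome and start position into each read ID so the ground truth is available after mapping or assembly.

```python
import random


def simulate_reads(genomes, depths, read_length, seed=0):
    """Sample error-free reads uniformly from each genome at a target depth.

    genomes: dict of name -> sequence string (each at least read_length long)
    depths:  dict of name -> target mean coverage depth
    Returns a shuffled list of (read_id, sequence) tuples, where read_id is
    "name|start|index" and start is the 0-based position in the source genome.
    """
    rng = random.Random(seed)
    reads = []
    for name, seq in genomes.items():
        # depth = n_reads * read_length / genome_size, solved for n_reads
        n_reads = round(depths[name] * len(seq) / read_length)
        for i in range(n_reads):
            start = rng.randrange(len(seq) - read_length + 1)
            reads.append((f"{name}|{start}|{i}", seq[start:start + read_length]))
    rng.shuffle(reads)  # mix genomes as they would be in a real run
    return reads


# Toy random "genomes"; real input would come from the multi-FASTA file.
rng = random.Random(42)
genomes = {
    "genome_A": "".join(rng.choice("ACGT") for _ in range(1000)),
    "genome_B": "".join(rng.choice("ACGT") for _ in range(2000)),
}
reads = simulate_reads(genomes, {"genome_A": 10, "genome_B": 5}, read_length=50)
print(len(reads), reads[0])
```

After mapping, comparing the genome each read was assigned to against the name stored in its ID gives the assignment accuracy directly.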

Note: Took out the link for the paper in post above since it was not of direct interest.

ADD REPLY • link 7.8 years ago by GenoMax 141k

Thanks a lot.

I have written a Python script based on pysam myself. The simulated shotgun reads assemble well. However, these short reads are error-free, which still differs from reality. I was wondering how to mimic the real error noise in short reads, for example the errors Illumina machines tend to make. For instance, Illumina machine X might read 'A' instead of the true 'T' with a probability of 1% (this is an imaginary case, not a real one).

ADD REPLY • link 7.8 years ago by thustar ▴ 130
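
One simple way to approach the error question above is a per-base substitution table: for each true base, give the probability of reading each other base. The sketch below is only an illustration of that idea. The function name is hypothetical and the probabilities are invented, not measured Illumina rates. It also ignores indels, quality scores, and the position-dependent error rates that real reads show.

```python
import random

# Hypothetical substitution probabilities: SUBSTITUTION[true_base][read_base].
# Illustrative numbers only; whatever probability remains means "read correctly".
SUBSTITUTION = {
    "A": {"C": 0.002, "G": 0.003, "T": 0.001},
    "C": {"A": 0.002, "G": 0.001, "T": 0.003},
    "G": {"A": 0.003, "C": 0.001, "T": 0.002},
    "T": {"A": 0.010, "C": 0.003, "G": 0.002},  # the imaginary T->A 1% case
}


def add_errors(read, rng, table=SUBSTITUTION):
    """Return a copy of read with random substitutions drawn from table."""
    out = []
    for base in read:
        r = rng.random()
        cumulative = 0.0
        new_base = base  # default: base is read correctly
        # Walk the cumulative distribution over the alternative bases.
        for alt, p in table.get(base, {}).items():
            cumulative += p
            if r < cumulative:
                new_base = alt
                break
        out.append(new_base)
    return "".join(out)


rng = random.Random(1)
read = "ACGT" * 25
noisy = add_errors(read, rng)
mismatches = sum(a != b for a, b in zip(read, noisy))
print(noisy, mismatches)
```

A table like this could be estimated from real data by mapping reads to a known reference and counting mismatches per true base, then plugged in place of the invented numbers.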