6 weeks ago

I’m simulating shotgun reads for RNA data (e.g., I have a set of bacterial genomes and I want to generate simulated reads from them). Should I use as my input only the coding sequences of these reference genomes (from the NCBI Nucleotide DB), or the complete genomes? If I were simulating DNA reads, I’d use the complete FASTA; for RNA I think I should use only the coding sequences. If so, should I concatenate the coding sequences and keep just one header?

Thanks!

6 weeks ago
Mensur Dlakic ★ 23k

If you are interested in strain differences only in coding DNA, then your simulation can be from predicted genes. That comes with the explicit understanding that only SNPs within coding regions will be measured, and that may be all you care about. I would not use complete genomes.

You will need to simulate more than 1,000 reads. A general rule of thumb is that short-read sequencing technologies need at least 50-100x coverage for de novo assembly. If the length of your total coding sequence is 1.2 million bases, then for 1x coverage (on average) you need 1,200,000 / 150 = 8,000 reads of length 150 bp. Realistically, for these genomic parameters you would need at least 400,000 simulated reads, and 800,000 would be even better.
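The arithmetic above can be sketched as a small helper (the 1.2 Mb reference length and 150 bp read length come from this thread; the function name is mine):

```python
# Reads needed to reach a target average coverage of a reference of known
# length, assuming fixed-length single-end reads.

def reads_for_coverage(ref_length_bp, coverage, read_length=150):
    """Number of reads so that coverage ~= reads * read_length / ref_length."""
    return int(ref_length_bp * coverage / read_length)

# 1.2 Mb of total coding sequence, 150 bp reads:
print(reads_for_coverage(1_200_000, 1))    # 8000 reads for 1x
print(reads_for_coverage(1_200_000, 50))   # 400000 reads for 50x
print(reads_for_coverage(1_200_000, 100))  # 800000 reads for 100x
```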

Both 85-88% and 92% coding densities are common for bacteria. Plasmids must have at least one origin of replication, and sometimes two, which, given the ratio of replication-origin size to plasmid size, lowers their coding density compared to chromosomal DNA.


Thanks very much for your time and explanation. I really appreciate it. If I could trouble you with one more question: how do people learn these rules of thumb? From reading and synthesizing many papers? Are there other sources for learning these things? I know I can learn them by designing and running small experiments with publicly available data, but I often don't have enough time to do those experiments before the real ones. I do them later, in my spare time, but by then they don't help with the problem in front of me. I feel like every physics textbook contains the information needed to solve my physics problems if I look and think hard enough, but the same is not true of bioinformatics, I think? Thank you again for any hints/recommendations. :)


I have learned too many things to be able to point out exactly how I learned this or that. Like most things in life, it is a combination of formal learning in the classroom and experiential learning by trial and error. Luckily for younger generations, Google can find almost anything very quickly. It certainly cuts down on the time once required to go to the library, browse the shelves, and then still work through an actual book or journal.

A general solution is to research things before doing them. Googling "synthetic shotgun DNA reads" or "simulating shotgun DNA sequencing" will give you plenty of information on all aspects of the process. There will be papers describing the theoretical aspects, guidelines for error rates, read depth, etc. There will also be links to programs that can do this, and even scripts that you can copy and modify as needed. I don't mean to sound like your parent, but trust me on this one: it is easier than it has ever been to find information and learn something in do-it-yourself fashion.

6 weeks ago
Mensur Dlakic ★ 23k

I don't think you should use complete genomes, but I'm not sure that using coding sequences will suffice either. It depends on the purpose of your simulation: whether you want it to be very realistic, or just need a collection of sequences to test existing or new software.

First, transcripts are longer than the coding sequence alone because of 5' and 3' UTRs. Second, many bacterial transcripts are polycistronic, so a single transcript runs through several coding sequences, including the intergenic regions between them. I'm not sure how one would simulate that without knowing, or having predictions for, the promoter and terminator sequences.


Thanks, Mensur. The purpose of this simulation is to see whether I can do strain typing for a specific strain of a bacterial species within a metatranscriptomic sample, and at what level of presence I can still detect it. I was thinking of running several simulations (e.g., 1. generate 1,000 shotgun reads; 2. add specific numbers of them to my metatranscriptomic sample, say 50, 100, 500, or 1,000 reads; 3. shuffle; 4. attempt strain typing by BLASTing against VFDB) and assessing the results. I'm generating 150 bp reads. From what you say, it seems that if I use coding sequences I risk underestimating, because some of the simulated reads will be nonsensical (i.e., the end of one coding sequence running straight into the beginning of the next, not respecting the polycistronic structure of the transcripts). But if I use complete genomes, I get many reads that wouldn't be present in an RNA-extracted sample. Right? Are there other limitations I've missed?
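A minimal sketch of the spike-in step described above, assuming both inputs are plain FASTA; the file names and helper names are placeholders, not part of any established tool:

```python
# Spike a chosen number of simulated reads into a background sample, then
# shuffle so the spiked reads are not clustered at the end.
import random

def read_fasta(path):
    """Return a list of (header, sequence) records from a FASTA file."""
    records, header, seq = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            records.append((header, "".join(seq)))
    return records

def spike_in(background, simulated, n_spike, seed=0):
    """Add n_spike randomly chosen simulated reads to the background, shuffled."""
    rng = random.Random(seed)          # fixed seed for a reproducible mix
    mixed = background + rng.sample(simulated, n_spike)
    rng.shuffle(mixed)
    return mixed

# background = read_fasta("metatranscriptome.fasta")   # placeholder names
# simulated  = read_fasta("simulated_reads.fasta")
# for n in (50, 100, 500, 1000):
#     sample = spike_in(background, simulated, n)
#     ...  # write the sample out, then BLAST against VFDB
```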

So, I calculated the percent coding sequence for my bacterial genome as well as its plasmids (I compared the number of lines in the complete genome with the number of lines in the coding-sequence file, excluding the headers of course). The genome is 92% coding sequence, and the plasmids are 85-88%. Perhaps a good solution is to use the complete genome and generate 2,000 reads instead of 1,000?
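One caveat on the line-count approach: it only agrees with a base count if both FASTA files are wrapped at the same line width. A sketch that sums sequence lengths directly instead (the file names are placeholders):

```python
# Coding density from total base counts rather than line counts.

def total_bases(path):
    """Sum sequence lengths across all records in a FASTA file."""
    total = 0
    with open(path) as fh:
        for line in fh:
            if not line.startswith(">"):
                total += len(line.strip())
    return total

# genome_bp = total_bases("genome.fasta")   # placeholder names
# cds_bp    = total_bases("cds.fasta")
# print(f"coding density: {cds_bp / genome_bp:.1%}")
```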

A side note: is it common for plasmids to be only 85-88% coding? I expected them to be above the genome's 92%.
