I’m simulating shotgun reads for RNA data (e.g. I have a bunch of bacterial genomes and I want to generate simulated reads). Should I use as input only the “coding sequences” of these reference genomes (from the NCBI Nucleotide DB), or should I use the complete genomes? If I were simulating DNA reads, I’d use the complete FASTA; for RNA I think I should use only the coding sequences. If I should use only coding sequences, should I concatenate them and just keep one header?
If you are interested in strain differences only in coding DNA, then your simulation can be from predicted genes. That would be with an explicit understanding that only SNPs within coding regions would be measured, and that may be all you care about. I would not use complete genomes.
You will need to simulate more than 1000 reads. A general rule of thumb is that at least 50-100x coverage is needed for short-read sequencing technologies to assemble de novo. If the length of your total coding sequence is 1.2 million bases, then for 1x coverage (on average) you need 1,200,000 / 150 = 8,000 reads of length 150 bp. Realistically, for these genomic parameters you would need at least 400,000 simulated reads, and 800,000 would be even better.
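The arithmetic above is just coverage = (num_reads × read_length) / target_length, rearranged. A small sketch (the 1.2 Mb target and 150 bp read length are the example numbers from this thread, not universal values):

```python
def reads_needed(target_length_bp, read_length_bp, coverage):
    """Reads required to reach a given average coverage.

    coverage = num_reads * read_length / target_length,
    so num_reads = coverage * target_length / read_length.
    """
    return round(coverage * target_length_bp / read_length_bp)

# 1.2 Mb of total coding sequence, 150 bp reads
print(reads_needed(1_200_000, 150, 1))    # 1x   -> 8000
print(reads_needed(1_200_000, 150, 50))   # 50x  -> 400000
print(reads_needed(1_200_000, 150, 100))  # 100x -> 800000
```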
Both 85-88% and 92% coding densities are common for bacteria. A plasmid must carry at least one origin of replication, and sometimes two; given the size of a replication origin relative to the plasmid, that lowers the coding density compared to genomic DNA.
If either or both of my answers solved your problem, please upvote them and/or accept the answer.
I don't think you should use complete genomes. Still, I'm not sure that using coding sequences will suffice either. It depends on the purpose of your simulation: whether you want it to be very realistic, or just need a collection of sequences to test existing or new software.
First, transcripts are longer than just the coding sequence because of 5' and 3' UTRs. Second, many transcripts in bacteria are polycistronic, so a single transcript spans several coding sequences plus the intergenic regions between them. I'm not sure how one would simulate that without knowing, or having predictions for, promoter and terminator sequences.
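To make the point concrete, here is a toy sketch of why concatenating CDSs differs from a polycistronic transcript: the transcript spans from upstream of the first CDS to downstream of the last, keeping the intergenic spacers. The coordinates and fixed UTR lengths below are made-up illustrations, not real annotation:

```python
def operon_transcript(genome, cds_coords, utr5=30, utr3=30):
    """Build one polycistronic transcript from CDS (start, end)
    coordinates (0-based, half-open) on the same strand.

    Runs from utr5 bases upstream of the first CDS to utr3 bases
    downstream of the last, so intergenic spacers between the CDSs
    are retained -- unlike simply concatenating the CDSs.
    """
    start = max(0, min(s for s, _ in cds_coords) - utr5)
    end = min(len(genome), max(e for _, e in cds_coords) + utr3)
    return genome[start:end]

# Toy genome: two "genes" separated by a 5-base intergenic spacer
genome = "A" * 50 + "ATGAAACCC" + "TTTTT" + "ATGGGGCCC" + "A" * 50
tx = operon_transcript(genome, [(50, 59), (64, 73)], utr5=10, utr3=10)
# tx includes both CDSs, the "TTTTT" spacer, and 10 bp of flank each side
```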
Thanks very much for your time and explanation. I really appreciate it. If I could trouble you with one more question: how do people learn these rules of thumb? From reading and synthesizing many papers? Are there other sources for learning these things? I know I can learn them by designing and conducting small experiments with publicly-available data, but I find that I often don't have enough time to do experiments before my experiments...I do them later, in my spare time, but they don't help with the problem in front of me. I feel like every physics textbook has the information to solve my physics problems if I look and think hard enough, but the same is not true of bioinformatics, I think? Thank you again for any hints/recommendations. :)
I have learned too many things to be able to point out how exactly I learned this or that. Like most things in life, it is a combination of formal learning in the classroom and experiential learning by trial and error. Lucky for the younger generations, Google can find almost anything very quickly. It sure cuts down on time required to go to the library, browse through the shelves, and then still go through an actual book or journal.
A general solution is to research things before doing them. Googling "synthetic shotgun DNA reads" or "simulating shotgun DNA sequencing" will give you plenty of information on all aspects of the process. There will be papers describing theoretical aspects of the process, guidelines for error percentages, depth of reads, etc. There will also be links to programs that can do this, and even to scripts that you can copy and modify as needed. I don't mean to sound like your parent, but trust me on this one: it is much easier these days to find information and learn something in do-it-yourself fashion than it ever was before.
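As a starting point for the do-it-yourself route, here is a minimal sketch of a shotgun read simulator: uniform start positions on one strand and independent substitution errors only. Real simulators such as ART or wgsim model far more (indels, quality profiles, paired ends), so treat this as a teaching toy:

```python
import random

def simulate_reads(reference, num_reads, read_length=150,
                   error_rate=0.01, seed=None):
    """Minimal shotgun simulator: uniform single-strand sampling with
    independent substitution errors; no indels, no quality model."""
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        read = list(reference[start:start + read_length])
        for i in range(read_length):
            if rng.random() < error_rate:
                # substitute with a different base
                read[i] = rng.choice([b for b in "ACGT" if b != read[i]])
        reads.append("".join(read))
    return reads

# Usage: simulate 1000 reads from a random 10 kb "genome"
ref = "".join(random.Random(0).choice("ACGT") for _ in range(10_000))
reads = simulate_reads(ref, num_reads=1000, read_length=150,
                       error_rate=0.01, seed=1)
```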