I routinely expect a 1 GB data transfer across the internet to complete without a single bit flip.
As a bioinformatics newbie, I have no reference point for how repeatable genomic sequencing currently is. This raises a series of questions:
1) If I have my genome fully sequenced at two different, state-of-the-art sequencing centers (e.g. the best in America, the best in Europe), how similar would the results be (and how should we define and measure similarity)? (I'm interested in reference-genome-quality sequences, not e.g. clinical quality.)
2) For that matter, if I use the same center, but on two different occasions separated by a couple of weeks, what should I expect?
3) Instead of a human genome, suppose a strain of E. coli were being analyzed -- how does that affect the repeatability?
4) Naively, I would assume that I would receive back data organized in 46 contiguous chunks. Pragmatically, what should I expect? -- e.g. you'll get haploid data, and the avg # of contigs per chromosome will be x.
5) Genomic coordinate systems: For me, similarity between two sequencing runs means, first, that there is a computationally simple way to translate the coordinates of a region of interest in one data set to the other, and second, that the sequence data matches. For instance, how does this work for human genome GRCh37 vs GRCh38? How would I go about translating coordinates for, e.g., all the opsins in GRCh37 to their coordinates in GRCh38, and how likely would it be that there would be a single-nucleotide mismatch within this limited portion of DNA?
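For context on question 5: translating coordinates between assemblies like GRCh37 and GRCh38 is usually called "liftover", and real tools (UCSC liftOver, CrossMap, pyliftover) drive it with "chain files" describing aligned blocks between the two assemblies. The sketch below is a toy illustration of that block-mapping idea only; the chain data is invented, not real GRCh37/GRCh38 coordinates.

```python
# Toy illustration of assembly "liftover". Real tools use chain files that
# record aligned blocks between two assemblies; TOY_CHAIN below is made up
# purely to show the mechanics.

# Each entry maps a block of the "old" assembly onto the "new" one:
# (old_start, old_end, new_start). Coordinates are 0-based, half-open.
TOY_CHAIN = {
    "chr7": [
        (0, 1_000, 500),        # whole block shifted right by 500 bp
        (1_000, 5_000, 2_000),  # extra sequence in the new assembly before this block
    ]
}

def lift(chrom, pos, chain=TOY_CHAIN):
    """Translate one position from the old assembly to the new one.

    Returns the new coordinate, or None if the position falls outside every
    aligned block (i.e. the region was deleted or rearranged).
    """
    for old_start, old_end, new_start in chain.get(chrom, []):
        if old_start <= pos < old_end:
            return new_start + (pos - old_start)
    return None

print(lift("chr7", 250))    # 750
print(lift("chr7", 1_500))  # 2500
print(lift("chr7", 9_999))  # None (no aligned block covers this position)
```

Positions that fall between blocks simply fail to lift, which is why real liftover runs report a handful of "unmapped" regions alongside the converted coordinates.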
Hope those questions make sense. Thanks! nehal
thanks, that is very helpful!
Can you explain what you mean by "library", e.g. "are these two separate libraries (from the same sample)"? My naive model was that tissue is taken (e.g. some blood is drawn) and then sent to the two different labs. I'm sure it's more complicated than that; implicit in your comments, it seems, is that first I convert the sample into "a library" and then send that to the lab. It completely makes sense that this would make the results highly comparable, since much of the non-reproducibility I assumed would come from handling raw tissue would no longer be relevant -- but, alas, what is this "library" you speak of? (Apologies for the delayed response; I was unaware that Biostars does not send emails by default.) cheers
A library is a collection of sequencing-ready DNA fragments generated from a sample (e.g. DNA that came from blood, in your case). It is made by adding special oligonucleotide adapters to the ends of DNA fragments. The fragmentation can be done by enzymatic, sonication, or other means, and the fragmentation method determines the distribution of fragment sizes you get, which in turn affects the quality of the libraries made from them. If one technician creates two library preps from a single sample of DNA, they could be slightly different. Two independent labs (and thus two people) extracting DNA from a sample (divided into two blood tubes and ultimately made into libraries) would likewise produce different libraries.
That said, if the libraries are of good quality then at the end of the day you should get similar results if you are mapping to a known reference genome.
*De novo* analysis results may be affected to a greater extent by differences in the libraries.
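To tie this back to the "how should we define and measure similarity" part of the question: when both runs are mapped to the same reference, one common yardstick is concordance between the two runs' variant call sets. A minimal sketch, with entirely made-up variants, assuming each call is reduced to a (chromosome, position, ref, alt) tuple:

```python
# Compare two sequencing runs of the same individual by the variants each
# run calls against a shared reference. The variants here are invented for
# illustration; real comparisons would parse VCF files.

# Each variant: (chromosome, position, reference_allele, alternate_allele)
run_a = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}
run_b = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 77, "T", "C")}

shared = run_a & run_b                          # calls made by both runs
concordance = len(shared) / len(run_a | run_b)  # Jaccard index over the union

print(f"shared calls: {len(shared)}")             # 2
print(f"Jaccard concordance: {concordance:.2f}")  # 0.50
```

With good libraries, the discordant fraction for two runs of the same sample is dominated by low-coverage and hard-to-map regions rather than by the library prep itself.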