Newbie question about sequencing repeatability/genomic coordinate systems
6.0 years ago by nehal.alum

I routinely expect a 1 GB transfer of data across the internet to complete without a single bit flip.

As a bioinformatics newbie, I have no reference point for how repeatable genomic sequencing currently is. This raises a series of questions:

1) If I have my genome fully sequenced at two different, state-of-the-art sequencing centers (e.g. the best in America, the best in Europe), how similar would the results be (and how should we define and measure similarity)? (I'm interested in reference-genome-quality sequences, not e.g. clinical quality.)

2) For that matter, if I use the same center, but on two different occasions separated by a couple of weeks, what should I expect?

3) Instead of a human genome, suppose a strain of E. coli were being analyzed -- how does that affect the repeatability?

4) Naively, I would assume I would receive the data back organized in 46 contiguous chunks. Pragmatically, what should I expect? E.g., "you'll get haploid data, and the average number of contigs per chromosome will be x."

5) Genomic coordinate systems: For me, similarity between two sequencing runs means, first, that there is a computationally simple way to translate the coordinates of a region of interest in one data set to the other, and second, that the sequence data matches. For instance, how does this work for the human genome GRCh37 vs GRCh38 -- how would I go about translating the coordinates of, e.g., all the opsins in GRCh37 to their coordinates in GRCh38, and how likely would it be that there would be a single-nucleotide mismatch in this limited portion of DNA?

Hope those questions make sense. Thanks! nehal

gene sequencing
6.0 years ago by GenoMax
  1. Results would be similar but not identical because of the stochastic nature of sequencing. I assume the same library is being sequenced at both centers -- or are these two separate libraries (from the same sample)?
  2. Nothing specific. As long as the sequencing works well, the data should look comparable. Technical replication for Illumina sequencing is consistent as long as the libraries are appropriately handled/stored.
  3. No specific effect. DNA is DNA.
  4. You will get raw sequence data in FASTQ format, with a read length equal to the number of cycles used (provided the data was not pre-trimmed). That is the standard deliverable at most sequencing centers unless you make arrangements with them to do assembly/analysis. You will get one alignment file (per sample) when the data is aligned to a standard reference, and one VCF file if SNPs were called using that alignment (all chromosomes will generally be in that one file). If you ask them to do an assembly, then don't expect to get 46 files representing the diploid genome. It could be any number, depending on how good the data was and how well the assembly worked.
  5. LiftOver from UCSC; a rough sketch of how to use it follows below.
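To get a concrete feel for item 5, here is a minimal sketch of translating a GRCh37 coordinate to GRCh38 with the pyliftover package, a Python wrapper around the same UCSC chain files that the command-line liftOver tool uses on BED files. The position shown is a placeholder, not a verified opsin coordinate.

```python
from pyliftover import LiftOver

# Downloads the hg19 -> hg38 chain file from UCSC on first use.
lo = LiftOver('hg19', 'hg38')

# pyliftover converts single 0-based positions; the command-line liftOver
# tool does the same thing for whole BED files of intervals.
hits = lo.convert_coordinate('chr7', 128_000_000)  # placeholder position, not a real opsin locus

if hits:  # an empty list (or None) means the position did not lift over
    chrom, pos, strand, score = hits[0]
    print(f"GRCh38 position: {chrom}:{pos} ({strand})")
else:
    print("Position is unmapped in GRCh38")
```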

thanks, that is very helpful!

Can you explain what you mean by "library", e.g. "are these two separate libraries (from the same sample)"? My naive model was that tissue is taken (e.g. some blood is drawn) and then sent to the two different labs -- I'm sure it's more complicated than that. Implicit in your comments, it seems, is that I first convert the sample into "a library" and then send that to the lab. It makes sense that this would make the results highly comparable, since much of the non-reproducibility I assumed would come from handling raw tissue would then not be relevant -- but, alas, what is this "library" you speak of? (Apologies for the delayed response; I was unaware that Biostars did not send emails by default.) cheers


A library is a collection of sequencing-ready DNA fragments generated from a sample (e.g. the DNA that came from blood, in your case). It is made by adding special oligonucleotide adapters to the ends of DNA fragments. The fragmentation can be done by enzymatic, sonication, or other means. The fragmentation method determines the distribution of fragment sizes you get, which in turn affects the quality of the libraries made from them.

If one technician creates two library preps from a sample of DNA, they could be slightly different. Two independent labs (and thus two people) extracting DNA from a sample (divided into two blood tubes and ultimately made into libraries) would lead to different libraries as well.

That said, if the libraries are of good quality, then at the end of the day you should get similar results if you are mapping to a known reference genome. De novo analysis results may be affected to a greater extent by differences in the libraries.

6.0 years ago

"how likely would it be that there would be a single nucleotide mismatch for this limited portion of DNA?"

If you don't have annotations for your genome of interest and you use liftOver, one approach is to lift coordinates in both directions and see what matches on the return trip, keeping unambiguous matches.

That is, given B = liftOver(A, ref1>ref2) and C = liftOver(B, ref2>ref1), what concordance is there between sets A and C? You might take the portion of B where A and C map reciprocally, using set-operation tools (BEDOPS, etc.); a rough sketch of the round trip follows below.
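Here is a small Python sketch of that round-trip idea, using the pyliftover wrapper rather than the liftOver/BEDOPS command-line route (either works). Intervals are reduced to single points for brevity, and the coordinates are placeholders rather than real opsin positions.

```python
from pyliftover import LiftOver

# Chain files are downloaded from UCSC on first use.
to_hg38 = LiftOver('hg19', 'hg38')
to_hg19 = LiftOver('hg38', 'hg19')

def round_trips(chrom, pos):
    """True if a GRCh37 point maps to GRCh38 and back to the same place."""
    fwd = to_hg38.convert_coordinate(chrom, pos)
    if not fwd:
        return False
    back = to_hg19.convert_coordinate(fwd[0][0], fwd[0][1])
    return bool(back) and back[0][:2] == (chrom, pos)

# Hypothetical set A of GRCh37 points; keep only the reciprocal (unambiguous) ones.
points_a = [('chr7', 128_000_000), ('chrX', 153_000_000)]
reciprocal = [p for p in points_a if round_trips(*p)]
print(reciprocal)
```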

Similarity measurements between subsets (e.g., opsins) should tell you the odds of keeping a random SNP or point mutation in that subset.


Thanks -- I was unaware of liftOver -- with that pointer, I have a good starting point for understanding how the process works. cheers
