Question: Importance Of Consistency Of Downstream Analysis Of Sequence Data
Pi wrote (9.6 years ago):

Greetings

When a lab performs an investigation to sequence a population of individuals, is it typically the case that every individual in the population is sequenced on the same platform (e.g. SOLiD/Illumina)? I am wondering whether there are investigations in which some individuals are sequenced on a different platform.

I am assuming all individuals would be sequenced on the same instrument for consistency. But for true consistency, don't you also have to assume that all post-sequencing analysis is the same (e.g. that the reads are assembled with the same pipeline)?

My reason for asking is that I am interested in whether it is considered 'acceptable' to treat the individuals in a population differently, given how this affects subsequent calculations such as allele and genotype frequencies.

I've never worked directly on a sequencing project and have only been given the data to analyse after all this work has been done. The prior steps affect data quality (e.g. some pipelines are considered noisier than others), so they must affect the quality of the subsequent calculations. Or are the figures for allele and genotype frequencies just too imprecise for this to matter?
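To make the concern concrete, here is a minimal Python sketch (sample names and genotype codes invented for illustration) of how per-individual genotype calls feed straight into allele and genotype frequency estimates; any platform- or pipeline-biased miscall propagates directly into these numbers:

    # Minimal sketch: genotype calls -> allele/genotype frequencies.
    # Sample names and genotypes are invented for illustration.
    from collections import Counter

    # Genotypes at one biallelic site, coded as the number of ALT alleles (0/1/2).
    # Imagine the first three samples came from platform A, the last three from B.
    genotypes = {"s1": 0, "s2": 1, "s3": 1, "s4": 2, "s5": 1, "s6": 0}

    n = len(genotypes)                   # number of diploid individuals
    alt_count = sum(genotypes.values())  # total ALT alleles observed
    alt_freq = alt_count / (2 * n)       # allele frequency over 2n chromosomes

    geno_freqs = {g: c / n for g, c in Counter(genotypes.values()).items()}

    print(f"ALT allele frequency: {alt_freq:.3f}")  # 0.417 here
    print(f"genotype frequencies: {geno_freqs}")
    # Each miscalled allele shifts the frequency by 1/(2n), so in small
    # cohorts a platform-biased caller can move the estimate noticeably.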

So to summarise: how can you assess the quality of a variation study if all the data wasn't gathered using the same protocol? Are there guidelines for this?

Thank you for your time.

edit: Thanks for your answers. With regard to guidelines, I was also wondering whether there are guidelines for describing the pipeline used to sequence data and the associated parameters/thresholds that vary with the pipeline, or is it just a case of documenting it? Is there a need in the community for such guidelines if they don't exist, or is general documentation sufficient?

Tags: sequence quality variation
Casey Bergman (Athens, GA, USA) wrote (9.6 years ago):

Unless there is substantial platform-dependent systematic sequencing error (as a hypothetical example, SOLiD preferentially generating a high rate of C->T mistakes), my intuition is that the variance in the evolutionary process due to the random effects of mutation, genetic drift, and sampling will lead to greater variation than that introduced by sequencing individuals on different platforms. This is of course speculation and would need to be evaluated empirically. As a start, you could read the Harismendy et al. paper carefully to see if anything in there suggests platform-specific systematic errors greater than the mutation rate of your organism.
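A back-of-the-envelope way to weigh that intuition is to compare the number of sites where true polymorphism is expected against the number a recurrent platform artifact could fake; every rate in this Python sketch is an illustrative assumption, not a measurement from any paper:

    # Rough comparison: true polymorphism vs. platform-specific artifacts.
    # All rates below are invented ballpark figures.
    genome_size = 3.0e9        # assumed callable bases (human-sized genome)
    heterozygosity = 1.0e-3    # ~1 het site per kb, a common human ballpark
    systematic_error = 1.0e-5  # hypothetical rate of recurrent platform miscalls

    true_het_sites = genome_size * heterozygosity
    artifact_sites = genome_size * systematic_error

    print(f"expected true heterozygous sites: {true_het_sites:,.0f}")  # 3,000,000
    print(f"expected platform-artifact sites: {artifact_sites:,.0f}")  # 30,000
    print(f"artifact fraction: {artifact_sites / (true_het_sites + artifact_sites):.1%}")
    # With these numbers artifacts are ~1% of calls; if the systematic error
    # rate approached the heterozygosity, platform choice would dominate.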

Jorge Amigo replied (9.6 years ago):

Good point: human variability may be higher than the differences introduced by different platforms. I'll have to read up on that, so thank you for the paper suggestion. Addendum: I would say that using the same software pipeline would then be mandatory, in order not to introduce further differences into the process.


Casey Bergman replied (9.6 years ago):

@Jorge, I agree 100% about using the same analysis pipeline - at least some things are within our control to standardize as bioinformaticians.
Jorge Amigo (Santiago de Compostela, Spain) wrote (9.6 years ago):

Sure, using the same platform/protocol/pipeline would be desirable in order to minimize possible errors, but some big projects just can't afford it (take a look at the 1000 Genomes Project, for instance). The normalization process then becomes the main challenge, the most important one in my honest opinion, since once you merge results from different platforms you lose track of which one was more error prone, which one was more lax in SNP calling, and so on.

As far as I know, no guidelines have been written on this matter yet, although there is a general consensus among the people I've talked to who work with NGS data from different sources that, at a minimum, the default thresholds for each platform must be slightly raised before putting all the results on the table, along with normalizing some basic experimental variables such as coverage, base quality, and variant density.
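To illustrate the per-platform thresholding idea, here is a minimal Python sketch; the threshold values and variant-record fields are invented, and a real project would tune them against validation data:

    # Sketch of "raise the default thresholds per platform" before merging.
    # Threshold numbers are hypothetical, slightly stricter than imagined defaults.
    PLATFORM_FILTERS = {
        "illumina": {"min_depth": 8,  "min_qual": 30},
        "solid":    {"min_depth": 10, "min_qual": 40},
    }

    def passes(variant, platform):
        """Keep a call only if it clears the platform-specific thresholds."""
        f = PLATFORM_FILTERS[platform]
        return variant["depth"] >= f["min_depth"] and variant["qual"] >= f["min_qual"]

    calls = [
        {"pos": 101, "depth": 12, "qual": 45, "platform": "solid"},
        {"pos": 202, "depth": 6,  "qual": 50, "platform": "illumina"},
    ]

    merged = [v for v in calls if passes(v, v["platform"])]
    print(merged)  # only the SOLiD call survives; the shallow Illumina call is dropped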

EDIT: after reading Casey Bergman's answer I realized I maybe didn't make myself completely clear. It is true that the differences introduced by different platforms may not be as significant as the intrinsic differences among samples due to human variability (this makes sense to me, so I'll find some reading like the paper Casey suggested), but the way those variants are detected may vary with the software used. The suggestion I was trying to make is to use the same software pipeline for all platforms, so that you can at least be confident in the algorithms and the stringencies imposed on the results, which would be shared across all platforms' results. Since not all mapping and variant calling tools work with raw data from all the different platforms, you will probably have to put some effort into converting the raw data into a common format (FASTQ, for instance), plus some quality check to make sure things went fine at the lab, and then you should be able to process everything with reasonable confidence.
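As an example of the kind of pre-merge quality check meant above, here is a toy Python sketch that reports the mean base quality of a FASTQ file; it assumes Phred+33 encoding and uncompressed input, the file names in the usage comments are hypothetical, and a real pipeline would rather use a dedicated tool such as FastQC:

    # Toy QC: mean base quality of a FASTQ file (Phred+33 assumed).
    def mean_base_quality(path):
        total, bases = 0, 0
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 3:  # every 4th line of a record holds the qualities
                    quals = line.strip()
                    total += sum(ord(c) - 33 for c in quals)  # Phred+33 decoding
                    bases += len(quals)
        return total / bases if bases else 0.0

    # Hypothetical usage: compare per-platform runs before merging.
    # print(mean_base_quality("illumina_run1.fastq"))
    # print(mean_base_quality("solid_converted.fastq"))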

lh3 (United States) wrote (9.6 years ago):

ALWAYS try to use the same technology for consistent results. While there are fluctuations between individuals, those are unbiased. Artifacts caused by using different technologies are biased, which is far more damaging. I would not recommend the Harismendy et al. paper. It was good at the time of publication, but it is outdated now. I have seen a couple of papers/manuscripts misled by it.
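The biased-versus-unbiased distinction can be seen in a toy Python simulation (all parameters invented): random sampling noise averages out across replicates, while a directional platform artifact shifts every estimate the same way:

    # Toy simulation: sampling noise vs. a directional platform artifact.
    import random

    random.seed(0)
    true_freq, n, reps = 0.30, 100, 1000  # invented allele frequency and cohort size

    def estimate(bias):
        """Allele frequency from n diploids; bias flips REF->ALT miscalls only."""
        alt = 0
        for _ in range(2 * n):
            allele = 1 if random.random() < true_freq else 0
            if allele == 0 and random.random() < bias:
                allele = 1  # systematic REF->ALT miscall
            alt += allele
        return alt / (2 * n)

    clean = sum(estimate(0.0) for _ in range(reps)) / reps
    biased = sum(estimate(0.05) for _ in range(reps)) / reps
    print(f"true {true_freq:.3f}  clean mean {clean:.3f}  biased mean {biased:.3f}")
    # The clean mean stays near 0.30; the biased runs converge on ~0.335
    # (0.30 + 0.70 * 0.05), an offset that extra sampling never removes.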


Casey Bergman replied (9.6 years ago):

Can you provide evidence to support your claim of misleading results based on the Harismendy et al. paper?


lh3 replied (9.6 years ago):

I reviewed two manuscripts that used the data set from the Harismendy et al. paper. Both were rejected (not for using the Harismendy et al. data, of course). One manuscript assumed that the base error rate in Harismendy et al. is typical, but that error rate is quite high by today's standards. The other assumed that targeted sequencing and whole-genome resequencing lead to the same results. Note that in both cases there is nothing wrong with Harismendy et al. itself; it is just no longer the most representative data set.


Casey Bergman replied (9.6 years ago):

Thanks, Heng. Is there a better publication on error rates in NGS? Perhaps Evaluation Of High Throughput Sequencing Error Rates?
