Question: Targeted cancer sequencing: can't find a database with raw data and related publications
9 months ago, mariafirulevabio wrote:

Dear all,

I'm trying to find raw targeted sequencing data (BAM or FASTQ, plus VCF/gVCF) for any cancer type. I want the publications associated with these data, because I need a description of how the allele frequencies of the variants found in the VCF were confirmed (e.g., by digital PCR). However, the databases I know of don't provide biological validation of the stored NGS data.

Hope you can help me.

modified 9 months ago • written 9 months ago by mariafirulevabio

I am not sure if these raw data are available, because of privacy policies.

written 9 months ago by Benn

These data can be under controlled access (e.g., TCGA).

written 9 months ago by mariafirulevabio

Yes, indeed. Do you have access? If not, you can download raw FASTQ data from cancer studies at SRA, process them, and then produce your own BAMs and VCFs.
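A minimal sketch of the FASTQ-to-VCF route suggested here, written as a Python helper that only assembles shell commands. The accession `SRR0000000`, the file names, and the choice of fasterq-dump, bwa, samtools, and bcftools are assumptions for illustration; the thread does not name specific tools. Note that `bcftools call` is a germline-style caller, so a dedicated somatic caller would probably replace the last step for tumour data. The reference FASTA is assumed to be indexed already with `bwa index`.

```python
import shlex


def build_commands(accession, reference, out_dir):
    """Return shell command strings for a download -> align -> call sketch.

    All arguments are hypothetical placeholders; nothing is executed here.
    """
    q = shlex.quote
    r1 = f"{out_dir}/{accession}_1.fastq"
    r2 = f"{out_dir}/{accession}_2.fastq"
    bam = f"{out_dir}/{accession}.sorted.bam"
    vcf = f"{out_dir}/{accession}.vcf.gz"
    return [
        # Download paired-end reads from SRA as two FASTQ files.
        f"fasterq-dump {q(accession)} --split-files -O {q(out_dir)}",
        # Align reads and write a coordinate-sorted BAM.
        f"bwa mem {q(reference)} {q(r1)} {q(r2)} | samtools sort -o {q(bam)} -",
        # Index the BAM so that downstream tools can access it by region.
        f"samtools index {q(bam)}",
        # Call variants (germline-style caller, used only as a stand-in).
        f"bcftools mpileup -f {q(reference)} {q(bam)} | "
        f"bcftools call -mv -Oz -o {q(vcf)}",
    ]


if __name__ == "__main__":
    for command in build_commands("SRR0000000", "ref.fa", "out"):
        print(command)
```

Each string could be passed to `subprocess.run(command, shell=True, check=True)` once the tools are installed; printing them first makes it easy to review the pipeline before running anything.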

written 9 months ago by Kevin Blighe

I need annotated VCF files (or BAMs with a description of the variant calling, if it was done) as an in silico control for a bioinformatics pipeline, and I also need wet-lab confirmation of the observed variant allele frequencies. I guess my question is similar to this post.

written 9 months ago by mariafirulevabio

For NGS data, you may struggle to find a normal sample for which the variants have been confirmed in the wet lab. As you can imagine, validating all variants would be a costly and time-consuming task. GIAB (Genome in a Bottle) has samples for which variants have been confirmed in parallel by multiple variant calling methods, but these are not confirmed in the wet lab either.

If you search the online repositories (mainly SRA, the Sequence Read Archive), then you may find what you need.

Which part of the other post (by Cyriac) does not match what you need? Or does it fully address your question?

written 9 months ago by Kevin Blighe

Is biological validation a costly and time-consuming task for variants from targeted sequencing? Can validation be performed only for a subset of variants of interest (e.g., hot spots)?

Cyriac pointed to NCI's GDC Legacy Archive for validated BAM files; however, that validation is bioinformatic. I found another post by Cyriac, but I can't find the files related to the second point under the "How TCGA MAFs are made" header.

written 9 months ago by mariafirulevabio
9 months ago, mariafirulevabio wrote:

Since I haven't found the answer, I guess these links (post, paper) will be useful for someone with the same aims. I've decided to use a mixture of two Genome in a Bottle samples (a truth set is available) to validate somatic variant calling. There is an option to choose a desired gene panel and filter variants in both the truth set and the output of the alignment and variant calling pipeline.

I would appreciate any advice on my original question and on the strategy I described in this post.
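To make the mixture idea concrete, here is a rough sketch of two pieces it implies: the allele fraction expected at a site when two diploid samples are mixed, and a filter that keeps only variants inside a gene panel. The function names, the mixing fraction, and the example intervals are hypothetical. The sketch assumes both samples are diploid at the site, that the panel is given as non-overlapping BED-style intervals (0-based, half-open), and that variant positions are 1-based as in VCF.

```python
from bisect import bisect_right


def expected_vaf(alt_copies_a, alt_copies_b, fraction_a):
    """Expected alt allele fraction in a mixture of two diploid samples.

    alt_copies_* is 0, 1 or 2; fraction_a is the share of DNA from sample A.
    """
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must be between 0 and 1")
    mixed = fraction_a * alt_copies_a + (1.0 - fraction_a) * alt_copies_b
    return mixed / 2.0


def build_panel(bed_records):
    """Group (chrom, start, end) intervals by chromosome, sorted by start."""
    panel = {}
    for chrom, start, end in bed_records:
        panel.setdefault(chrom, []).append((start, end))
    for intervals in panel.values():
        intervals.sort()
    return panel


def in_panel(panel, chrom, pos):
    """True if a 1-based position lies in a 0-based half-open panel interval.

    Assumes the intervals on each chromosome do not overlap.
    """
    intervals = panel.get(chrom, [])
    # Find the last interval whose start is strictly below pos.
    i = bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][0] < pos <= intervals[i][1]


if __name__ == "__main__":
    # A heterozygous variant private to sample A, spiked in at 10%.
    print(expected_vaf(1, 0, 0.10))
    panel = build_panel([("chr7", 100, 200), ("chr7", 300, 400)])
    variants = [("chr7", 150), ("chr7", 250), ("chr1", 150)]
    print([v for v in variants if in_panel(panel, *v)])
```

Applying the same `in_panel` filter to both the truth set and the pipeline output keeps the comparison restricted to the panel regions.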

UPD: another useful link.

modified 9 months ago • written 9 months ago by mariafirulevabio

Just keep in mind that, despite Genome in a Bottle calling their datasets 'truth sets', they most likely still contain false positive and false negative calls. Their 'truth' sets were defined by processing the same samples multiple times with different sequencers; however, each sequencer has its own associated error.
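As a rough illustration of what a comparison against such a truth set looks like, the sketch below counts true positives, false positives, and false negatives by exact match on (chrom, pos, ref, alt). The function name and example variants are hypothetical. Exact matching is a simplifying assumption: it ignores differences in how the same variant can be represented, and any errors in the truth set itself end up in these counts.

```python
def compare_to_truth(called, truth):
    """Compare called variants to a truth set by exact key match.

    Both inputs are iterables of (chrom, pos, ref, alt) tuples.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


if __name__ == "__main__":
    truth = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"),
             ("chr2", 50, "C", "T")]
    called = [("chr1", 100, "A", "T"), ("chr2", 50, "C", "T"),
              ("chr3", 10, "G", "A")]
    print(compare_to_truth(called, truth))
```

A call counted as a false positive here may still be a real variant that the truth set missed, which is the caveat raised in this comment.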

written 9 months ago by Kevin Blighe

Thanks, Kevin! I suppose it is better for me to choose a target panel that doesn't overlap complicated regions (repeats, high GC-content sequences, etc.).

written 9 months ago by mariafirulevabio

Indeed, particularly repeat sequences and regions with sequence similarity (there are many!)
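One way to act on this, sketched under assumptions: given panel targets and a list of difficult regions (for example repeats or segments with high sequence similarity), drop every target that overlaps a difficult region. The function names and coordinates below are hypothetical, and both lists are assumed to be BED-style (chrom, start, end) tuples with 0-based, half-open coordinates.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def drop_difficult_targets(targets, difficult):
    """Keep only the targets that overlap none of the difficult regions."""
    return [t for t in targets
            if not any(overlaps(t, d) for d in difficult)]


if __name__ == "__main__":
    targets = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
    difficult = [("chr1", 150, 160), ("chr2", 200, 300)]
    print(drop_difficult_targets(targets, difficult))
```

The pairwise scan is fine for a small panel; for genome-wide region lists a sorted or indexed interval structure would be a better fit.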

written 9 months ago by Kevin Blighe