Question: Mimic WES on WGS data + A comprehensive list of coding regions in the most biologically relevant transcripts
0
8 months ago by
lamteva.vera190
Ukraine, Kyiv
lamteva.vera190 wrote:

Dear peers,

I have WGS results in FASTQ, BAM and VCF formats that are to be interpreted using a commercial analysis platform. For the sake of cost-effectiveness, I have to restrict the data to coding regions, i.e. sort of mimic WES. What would be the best way to do it?

So far I've come up with the preliminary solution to extract from VCF only those variants in coding exons of canonical transcripts ±12 intronic bp. A few questions:

  1. How do I create such a BED file? Does one already exist? Apart from the technical side of creating such a file, I'm confused by the lack of consensus on canonical transcripts, not to mention the difference in coordinates between UCSC and Ensembl. Should I use the MANE, LRG, APPRIS P1, Ensembl Golden or TSL:1 transcripts, or the ones at the intersection of these datasets?

  2. Can I use the same approach for extracting coding portion of a BAM file? How should I do it?

Thank you for any suggestions. Cheers, Vera

ADD COMMENTlink modified 8 months ago • written 8 months ago by lamteva.vera190

In general, this approach does not seem right to me. However, you can take all the transcripts from UCSC, intersect them using bedtools intersect, and "limit" your analysis to the regions of interest.

2 - Why would you do this? What will it change? Your VCF, in theory, should remain the same.
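For the BED side of the question, a minimal sketch of the "coding exons ±12 bp" idea: pad each exon interval and merge any overlaps so the result is a clean target list. It assumes the exon coordinates have already been obtained from some transcript set in BED convention (0-based, half-open); the function names and the toy intervals are hypothetical, and the choice of transcript set is left open.

```python
from collections import defaultdict


def pad_and_merge(exons, pad=12):
    """Pad (chrom, start, end) BED intervals by `pad` bp and merge overlaps.

    Assumption: coordinates are 0-based, half-open, as in BED.
    Note: the padded end is not clipped to the chromosome length here.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((max(0, start - pad), end + pad))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or directly adjacent: extend the current interval.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


def to_bed(intervals):
    """Render intervals as tab-separated BED lines."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in intervals)


if __name__ == "__main__":
    # Toy exons, purely illustrative.
    toy = [("chr1", 100, 200), ("chr1", 210, 300), ("chr2", 5, 50)]
    print(to_bed(pad_and_merge(toy)), end="")
```

Chromosome names must match those in the VCF/BAM ("chr1" vs "1"), which is one practical form of the UCSC/Ensembl difference mentioned in the question.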

ADD REPLYlink modified 8 months ago • written 8 months ago by German.M.Demidov1.8k

Several members have invested effort into this question, so it is bad practice to delete it. Others might benefit from it. Just leave it as it is.

ADD REPLYlink written 8 months ago by ATpoint38k
1
8 months ago by
Belgium
WouterDeCoster44k wrote:

It would not be correct, because the technologies are different. The coverage pattern in WES is not comparable to what you get from WGS if you just limit to certain intervals. This is not going to be a valid analysis.

That said, if you still want to do this, you need to (1) get the target file for the WES platform of interest (e.g. Agilent SureSelect, Roche SeqCap), which is often already in BED format. Making your own target file is going to be less accurate than using the one that is effectively targeted by the probes in your WES capture. And then (2) you can use samtools view to extract the reads corresponding to the targeted intervals.
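Step (2) could look roughly like the sketch below, which wraps samtools view from Python. It assumes samtools is installed and on the PATH and that the input BAM is coordinate-sorted; the file names and function names are placeholders, not anything from this thread.

```python
import subprocess


def samtools_region_command(in_bam, bed, out_bam):
    """Build the samtools call that keeps alignments overlapping the BED intervals.

    -b writes BAM output, -L restricts to regions in the BED file,
    -o names the output file.
    """
    return ["samtools", "view", "-b", "-L", bed, "-o", out_bam, in_bam]


def extract_reads(in_bam, bed, out_bam):
    """Run the extraction, then index the result (requires coordinate-sorted input)."""
    subprocess.run(samtools_region_command(in_bam, bed, out_bam), check=True)
    subprocess.run(["samtools", "index", out_bam], check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting command line.
    print(" ".join(samtools_region_command("sample.wgs.bam", "targets.bed", "sample.targets.bam")))
```

Reads that overlap a target interval are kept whole, so the output still covers some bases outside the intervals.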

ADD COMMENTlink written 8 months ago by WouterDeCoster44k

Thank you for your answer. Could you kindly explain it further? I'm aware of the difference in coverage depth uniformity between WGS and WES but is it a concern after the secondary analysis has been done? I mean, I have a list of variants and want to annotate and further interpret only coding and splice-site ones. How can pre-filtering by coordinates do any harm? Is your answer regarding only BAM processing, not VCF?

ADD REPLYlink modified 8 months ago • written 8 months ago by lamteva.vera190

The VCF will be derived from the BAM, so yes that also affects the variant calling. You cannot mimic how the coverage would turn out by probe capture, so your variant calling will not be truly reflective of how the exome sequencing would be.

If your coverage is different in WES vs WGS then also your variant calls (and quality scores) will be affected.

ADD REPLYlink written 8 months ago by WouterDeCoster44k

But I already have a VCF from WGS, so no variant calling is needed. Can't I annotate only the variants of interest, that is, those in coding regions?
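To make the idea concrete, here is a minimal sketch of pre-filtering an existing VCF by coordinates in plain Python. It assumes an uncompressed VCF and a BED file of target intervals; the function names are hypothetical. It only tests the POS of each record, so a deletion that starts just upstream of an interval and extends into it would be dropped.

```python
import bisect
from collections import defaultdict


def load_intervals(bed_lines):
    """Parse BED lines into per-chromosome sorted, merged (starts, ends) lists."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_targets(index, chrom, pos):
    """True if a 1-based VCF position falls inside a 0-based, half-open BED interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    pos0 = pos - 1  # convert VCF (1-based) to BED (0-based) coordinates
    i = bisect.bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < ends[i]


def filter_vcf(vcf_lines, index):
    """Yield header lines unchanged and only those records whose POS is on target."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_targets(index, chrom, int(pos)):
            yield line
```

A typical use would be to open the BED and VCF files, pass the file objects to `load_intervals` and `filter_vcf`, and write the yielded lines to a new VCF. Chromosome naming in the two files has to agree for any record to pass.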

ADD REPLYlink written 8 months ago by lamteva.vera190

To me, your request sounds as if you know how to annotate variants in non-coding regions but don't want to. In my opinion, whatever annotation you apply, 99% of it will be annotation of the coding part anyway. So overall this question does not make much sense to me.

ADD REPLYlink written 8 months ago by German.M.Demidov1.8k

We use a sequencing service provider who does wet-lab as well as secondary analysis. We get from them FASTQ, BAM and VCF files. For the further interpretation of cases, we use a commercial platform that charges per FASTQ size / VCF number of variants. Thus the analysis of the WGS data is not cost-effective because the patient paid for WES analysis only. We do not provide WGS, it was the initiative of our partner to do WGS instead of requested WES. Does it make more sense now?

ADD REPLYlink written 8 months ago by lamteva.vera190
1

OK, now this starts to sound really, really bad to me. The patient paid for WES (clinical-grade WES is usually around 100x), you did WGS (usually around 30-42x), and now you are trying to make the WGS data mimic WES without telling the patient. And since you use the word "patient", I suppose this is a medical procedure. Am I right?

ADD REPLYlink written 8 months ago by German.M.Demidov1.8k

I am not yet trying to mimic anything, I am trying to figure out the proper solution.

ADD REPLYlink written 8 months ago by lamteva.vera190

I think the proper solution in this situation is to communicate with either the partner or the patient. I know bioinformaticians from Kyiv and I know how problematic it is to do analysis there, but in your particular situation this has to be resolved by communicating with the others involved.

ADD REPLYlink written 8 months ago by German.M.Demidov1.8k

We do not provide WGS, it was the initiative of our partner to do WGS instead of requested WES.

Ah, but that makes the procedure a lot easier. You call your partner and demand that they do WES, especially if this is a clinical application. They/you shouldn't just use another technology.

ADD REPLYlink written 8 months ago by WouterDeCoster44k