Question

Mimic WES on WGS data + A comprehensive list of coding regions in the most biologically relevant transcripts

0

4.3 years ago

lamteva.vera ▴ 220

Dear peers,

I have WGS results in FASTQ, BAM and VCF formats that are to be interpreted on a commercial analysis platform. For the sake of cost-effectiveness, I have to restrict the data to coding regions, i.e. sort of mimic WES. What would be the best way to do it?

So far I've come up with a preliminary solution: extract from the VCF only those variants that fall in coding exons of canonical transcripts ±12 intronic bp. A few questions:

1. How do I create such a BED file? Does one already exist? Apart from the technical side of creating such a file, I'm confused by the lack of consensus on canonical transcripts, not to mention the difference in coordinates between UCSC and Ensembl. Should I use the MANE, LRG, APPRIS P1, Ensembl Golden or TSL:1 transcripts, or the ones at the intersection of these datasets?
2. Can I use the same approach to extract the coding portion of a BAM file? How should I do it?
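
For illustration, the "coding exons ±12 intronic bp" idea can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the exon coordinates are hypothetical, the intervals are assumed to be BED-style (0-based, half-open), and the function name `pad_and_merge` is made up for this example. It does not address which transcript set to use, and it does not clip intervals at chromosome ends.

```python
from typing import Iterable, List, Tuple

# (chrom, start, end), BED-style: 0-based start, half-open end.
Interval = Tuple[str, int, int]


def pad_and_merge(exons: Iterable[Interval], pad: int = 12) -> List[Interval]:
    """Extend each interval by `pad` bp on both sides, then merge overlaps."""
    # Pad every interval; never let the start go below 0.
    padded = sorted(
        (chrom, max(0, start - pad), end + pad) for chrom, start, end in exons
    )
    merged: List[Interval] = []
    for chrom, start, end in padded:
        # Merge with the previous interval if it is on the same chromosome
        # and the two overlap or touch.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical coding-exon coordinates, for illustration only.
example_exons = [("chr1", 100, 200), ("chr1", 215, 300), ("chr2", 5, 50)]
for chrom, start, end in pad_and_merge(example_exons):
    print(f"{chrom}\t{start}\t{end}")
```

Merging after padding keeps the resulting BED free of overlapping records, which simplifies any downstream interval lookup.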

Thank you for any suggestions. Cheers, Vera

BED canonical transcript VCF BAM • 1.3k views

ADD COMMENT • link 4.3 years ago by lamteva.vera ▴ 220

0

In general this approach does not seem right to me. However, you can take all the transcripts from UCSC, intersect them with your data using bedtools intersect, and "limit" your analysis to the regions of interest.

2 - Why would you do this? What will it change? Your VCF, in theory, should remain the same.
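
The intersection suggested in the first point is normally done with a dedicated tool such as bedtools. Purely to show the logic, here is a minimal pure-Python sketch that keeps VCF records whose POS falls inside a BED interval. It is not a reimplementation of bedtools: it tests only the POS column (not the full span of the REF allele), it assumes chromosome names match between the two files (e.g. "chr1" in both), and the function names and example records are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Regions = Dict[str, List[Tuple[int, int]]]


def load_bed(lines: Iterable[str]) -> Regions:
    """Parse BED lines (0-based, half-open) into sorted, merged intervals."""
    raw = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments and UCSC track/browser header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue
        raw[fields[0]].append((int(fields[1]), int(fields[2])))
    merged: Regions = {}
    for chrom, intervals in raw.items():
        out: List[Tuple[int, int]] = []
        for start, end in sorted(intervals):
            if out and start <= out[-1][1]:
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def position_in_regions(regions: Regions, chrom: str, pos: int) -> bool:
    """True if the 1-based VCF position `pos` lies inside any interval."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1  # VCF is 1-based, BED is 0-based
    # Index of the last interval whose start is <= the position.
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and zero_based < intervals[i][1]


def filter_vcf(vcf_lines: Iterable[str], regions: Regions) -> Iterator[str]:
    """Yield header lines plus the records whose POS is inside `regions`."""
    for line in vcf_lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if position_in_regions(regions, chrom, int(pos)):
            yield line


# Hypothetical inputs, for illustration only.
example_bed = ["chr1\t88\t312", "chr2\t0\t62"]
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t89\t.\tA\tG\t50\tPASS\t.",
    "chr1\t313\t.\tC\tT\t50\tPASS\t.",
    "chr3\t10\t.\tG\tA\t50\tPASS\t.",
]
for record in filter_vcf(example_vcf, load_bed(example_bed)):
    print(record)
```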

ADD REPLY • link 4.3 years ago by German.M.Demidov ★ 2.9k

0

Several members have invested effort into this question; it is therefore bad practice to delete it. Others might benefit from it. Just leave it as it is.

ADD REPLY • link 4.3 years ago by ATpoint 81k

score 1 · Answer 1 · 2020-01-03

1

4.3 years ago

WouterDeCoster 47k

It would not be correct, because the technologies are different. The coverage pattern in WES is not comparable to what you get from WGS if you just limit to certain intervals. This is not going to be a valid analysis.

That said, if you still want to do this, you need to (1) get the target file for the WES platform of interest (e.g. Agilent SureSelect, Roche SeqCap), which is often already in BED format. Making your own target file is going to be less accurate than using the one that is effectively targeted by the probes in your WES capture. And then (2) you can use samtools view to extract the reads corresponding to the targeted intervals.
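
As a rough sketch of step (2), the samtools call can be wrapped in Python as shown here. The file names are hypothetical placeholders, and the helper names `samtools_subset_cmd` and `subset_bam` are made up for this example. It assumes samtools is installed and on the PATH, and that the input BAM is coordinate-sorted so that the output can be indexed.

```python
import shlex
import subprocess
from typing import List


def samtools_subset_cmd(bam_in: str, bed: str, bam_out: str) -> List[str]:
    """Build a `samtools view` command that keeps reads overlapping a BED file."""
    # -b: write BAM output
    # -L: only output alignments overlapping the intervals in the BED file
    # -o: output file
    return ["samtools", "view", "-b", "-L", bed, "-o", bam_out, bam_in]


def subset_bam(bam_in: str, bed: str, bam_out: str) -> None:
    """Run the extraction and index the result (requires samtools on PATH)."""
    subprocess.run(samtools_subset_cmd(bam_in, bed, bam_out), check=True)
    subprocess.run(["samtools", "index", bam_out], check=True)


# Hypothetical file names; only the command is printed here, nothing is run.
cmd = samtools_subset_cmd("sample.wgs.bam", "capture_targets.bed", "sample.targets.bam")
print(shlex.join(cmd))
```

Note that this keeps whole reads that overlap a target, so the output still contains some bases outside the intervals.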

ADD COMMENT • link 4.3 years ago by WouterDeCoster 47k

0

Thank you for your answer. Could you kindly explain it further? I'm aware of the difference in coverage depth uniformity between WGS and WES but is it a concern after the secondary analysis has been done? I mean, I have a list of variants and want to annotate and further interpret only coding and splice-site ones. How can pre-filtering by coordinates do any harm? Is your answer regarding only BAM processing, not VCF?

ADD REPLY • link 4.3 years ago by lamteva.vera ▴ 220

0

The VCF will be derived from the BAM, so yes that also affects the variant calling. You cannot mimic how the coverage would turn out by probe capture, so your variant calling will not be truly reflective of how the exome sequencing would be.

If your coverage is different in WES vs WGS then also your variant calls (and quality scores) will be affected.

ADD REPLY • link 4.3 years ago by WouterDeCoster 47k

0

But I already have a VCF from WGS, so no variant calling is needed. Can't I annotate only the variants of interest, that is, those in coding regions?

ADD REPLY • link 4.3 years ago by lamteva.vera ▴ 220

0

For me your request sounds like you know how to annotate variants in non-coding regions, but don't want to do so. Basically, in my opinion, whatever annotation you apply, it will be a 99% annotation of your coding part. So this question overall does not make much sense IMO.

ADD REPLY • link 4.3 years ago by German.M.Demidov ★ 2.9k

0

We use a sequencing service provider who does wet-lab as well as secondary analysis. We get from them FASTQ, BAM and VCF files. For the further interpretation of cases, we use a commercial platform that charges per FASTQ size / VCF number of variants. Thus the analysis of the WGS data is not cost-effective because the patient paid for WES analysis only. We do not provide WGS, it was the initiative of our partner to do WGS instead of requested WES. Does it make more sense now?

ADD REPLY • link 4.3 years ago by lamteva.vera ▴ 220

1

OK, now it starts to sound really, really bad to me. The patient paid for WES (clinical-grade WES is usually around 100x), you did WGS (usually around 30-42x), and now you are trying to mimic WES with the WGS data without telling the patient. And given the word "patient", I suppose this is a medical procedure. Am I right?

ADD REPLY • link 4.3 years ago by German.M.Demidov ★ 2.9k

0

I am not yet trying to mimic anything, I am trying to figure out the proper solution.

ADD REPLY • link 4.3 years ago by lamteva.vera ▴ 220

0

I think the proper solution in this situation is to communicate with either the partner or the patient. I know bioinformaticians from Kyiv and I know how problematic it is to do analysis there, but in your particular situation it has to be resolved by communicating with the others involved.

ADD REPLY • link 4.3 years ago by German.M.Demidov ★ 2.9k

0

We do not provide WGS, it was the initiative of our partner to do WGS instead of requested WES.

Ah, but that makes the procedure a lot easier. You call your partner and demand that they do WES, especially if this is a clinical application. They (and you) shouldn't just use another technology.

ADD REPLY • link 4.3 years ago by WouterDeCoster 47k