I have WGS results in FASTQ, BAM and VCF formats that are to be interpreted using a commercial analysis platform. For the sake of cost-effectiveness, I have to restrict the data to coding regions, i.e. sort of mimic WES. What would be the best way to do this?
So far I've come up with a preliminary solution: extract from the VCF only those variants that fall in coding exons of canonical transcripts, plus 12 bp of flanking intronic sequence on each side. A few questions:
How do I create such a BED file? Does one already exist? Apart from the technical side of creating such a file, I'm confused by the lack of consensus on canonical transcripts, not to mention the difference in coordinates between UCSC and Ensembl. Should I use the MANE, LRG, APPRIS P1, Ensembl Golden or TSL:1 transcripts, or the ones at the intersection of these datasets?
Can I use the same approach to extract the coding portion of a BAM file? How should I do it?
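To make the ±12 bp idea concrete, here is a minimal sketch of the padding step I have in mind. It assumes the coding exons are already available as (chrom, start, end) tuples in BED-style coordinates (0-based start, exclusive end). The function names and the example coordinates are made up for illustration, and the choice of transcript set is left open.

```python
from collections import defaultdict

PAD = 12  # flanking bp to keep on each side of a coding exon


def pad_and_merge(exons, pad=PAD):
    """Pad each exon by `pad` bp on both sides and merge overlapping intervals.

    `exons` is an iterable of (chrom, start, end) in BED-style coordinates
    (0-based start, exclusive end). Starts are clamped at 0; ends are NOT
    clamped to the chromosome length, which this sketch does not know.
    Note that every exon is padded on both sides, so the outer ends of the
    first and last coding exon extend into UTR or flanking sequence.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((max(0, start - pad), end + pad))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlapping or touching the current interval: extend it
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


def to_bed(intervals):
    """Format intervals as tab-separated three-column BED text."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in intervals)


# Hypothetical coding exons; coordinates are invented for illustration only
example_exons = [
    ("chr1", 100, 200),
    ("chr1", 210, 300),
    ("chr1", 1000, 1100),
    ("chr2", 5, 50),
]
bed_text = to_bed(pad_and_merge(example_exons))
print(bed_text, end="")
```

I assume the resulting file could then be given to something like `bcftools view -R` for the VCF or `samtools view -L` for the BAM, but I'm not sure whether that is the recommended route.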
Thank you for any suggestions. Cheers, Vera