Trouble filtering CDS regions from Nextera Enrichment design
7.6 years ago
idedios ▴ 30

I designed a panel in Illumina DesignStudio covering over 200 genes, and each gene needs its non-coding exons trimmed. Doing this manually would take a couple of weeks, since the design has about 12,000 probes. For the probes I have filtered by hand so far, I used IGV with the UCSC hg19 reference genome to check whether each probe region covers amino acid (coding) sequence.

Is there an easier way to do this, for example a shell script that parses my probe regions file and compares it to the reference genome, without having to view the reference in IGV?

7.5 years ago
rbagnall ★ 1.7k

Hi Idedios,

You can get a BED file of coding regions for your genes from the UCSC Table Browser. Select the following options:

group - genes and gene predictions

track - RefSeq genes

table - refGene

region - genome

identifiers - paste list (and paste a list of gene names in the new window)

output format - BED (browser extensible data)

output file - coding_regions.bed

Click 'get output', select 'Coding Exons', then click 'get BED'.
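One thing worth checking on the downloaded file: refGene usually lists several transcripts per gene, so the same coding exon can show up more than once in the export. Below is a minimal sketch, assuming only standard sort and awk, that collapses overlapping or touching intervals into one. The function name `merge_bed` and the output name `coding_regions.merged.bed` are made up for illustration, and the output keeps only the chrom/start/end columns. If BEDTools is installed, `bedtools merge` on a sorted file should give the same kind of result.

```shell
# merge_bed FILE.bed
# Hypothetical helper: sorts a BED file by chromosome and start, then merges
# intervals that overlap or touch. Prints chrom, start, end (tab-separated).
merge_bed() {
    sort -k1,1 -k2,2n "$1" |
    awk 'BEGIN { OFS = "\t" }
         # First interval: start a new merged block.
         NR == 1 { chrom = $1; start = $2 + 0; stop = $3 + 0; next }
         # Same chromosome and starts inside (or right at the end of) the
         # current block: extend the block if this interval reaches further.
         $1 == chrom && $2 + 0 <= stop { if ($3 + 0 > stop) stop = $3 + 0; next }
         # Otherwise: print the finished block and start a new one.
         { print chrom, start, stop; chrom = $1; start = $2 + 0; stop = $3 + 0 }
         END { if (NR > 0) print chrom, start, stop }'
}

# Usage (file names are examples):
#   merge_bed coding_regions.bed > coding_regions.merged.bed
```

Running the later comparison against the merged file avoids getting the same coding interval reported once per transcript.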

The export gives a list of coding exons for each gene in BED format. You can then compare this list to a BED file of your Illumina probes using BEDTools, for example. Use intersectBed to retrieve only the coding portions of your Illumina probe regions:

intersectBed -a coding_regions.bed -b illumina_probe_regions.bed > coding_illumina_probe_regions.bed
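The command above reports the overlapping intervals themselves. If you would rather keep or drop whole probes, a possible variation is sketched below. It assumes a BEDTools version where `bedtools intersect` is the same tool as `intersectBed`, and the function names and output file names are invented for illustration. Both BED files need to use the same genome build (hg19 here) and the same chromosome naming.

```shell
# Hypothetical helpers. $1 = probe regions BED, $2 = coding exons BED.

# Probes that overlap at least one coding exon.
# -u reports each probe once, unclipped, no matter how many exons it hits.
probes_in_cds() {
    bedtools intersect -u -a "$1" -b "$2"
}

# Probes that overlap no coding exon at all.
# -v reports only the entries of -a with no overlap in -b.
probes_outside_cds() {
    bedtools intersect -v -a "$1" -b "$2"
}

# Usage (output names are examples):
#   probes_in_cds illumina_probe_regions.bed coding_regions.bed > probes_to_keep.bed
#   probes_outside_cds illumina_probe_regions.bed coding_regions.bed > probes_to_drop.bed
```

The second list is probably the more useful one for trimming the design, since it names the probes that sit entirely in non-coding sequence.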

