Question

How To Create A .Bed File With A Few Candidate Genes?

10.2 years ago

newDNASeqer ▴ 760

I am interested in creating a bed file that includes only a few genes (< 10) for use with the GATK variant calling pipeline. I am going to use it as a genome interval file, because I do not need to scan the entire genome/exome for variants. Can someone here give me some help with creating such a bed file? Thanks

gene • 8.9k views

updated 10.2 years ago by DG 7.3k • written 10.2 years ago by newDNASeqer ▴ 760

score 2 · Answer 1 · 2014-01-27

If it is fewer than 10 genes, it may be easier to build the file by hand than programmatically. All the file really needs is a tab-delimited format that looks like:

chr start_pos end_pos

So to make one manually, just look up the genomic coordinates of your genes. You should probably also include an ID as the fourth column; that helps if you also use DiagnoseTargets or DepthOfCoverage from the GATK. You can make these BED files per gene or per exon, depending on your preference and experiment.
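If typing tabs by hand feels error-prone, here is a minimal Python sketch of the manual route. One thing to watch: BED intervals use a 0-based start and an exclusive end, so coordinates written 1-based and inclusive (as genome browsers usually display them) need the start reduced by one. The function names `bed_line` and `bed_text` are my own, and the gene names and coordinates are hypothetical placeholders, not real annotations.

```python
def bed_line(chrom, start_1based, end_1based, name):
    """Format one 4-column BED record from 1-based inclusive coordinates."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError(f"bad interval for {name}: {start_1based}-{end_1based}")
    # BED start is 0-based; BED end is exclusive, so it equals the
    # 1-based inclusive end unchanged.
    return f"{chrom}\t{start_1based - 1}\t{end_1based}\t{name}"


def bed_text(records):
    """Join (chrom, start, end, name) tuples into BED file content."""
    return "".join(bed_line(*record) + "\n" for record in records)


# Hypothetical placeholder genes and coordinates, for illustration only.
CANDIDATES = [
    ("chr1", 1001, 2000, "GENE_A"),
    ("chr2", 501, 900, "GENE_B"),
]

if __name__ == "__main__":
    print(bed_text(CANDIDATES), end="")
```

Redirecting the printed output to a file (for example `candidate_genes.bed`, a name chosen here for illustration) gives you something you can pass as an interval file.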

Everything else is fairly optional as far as the GATK is concerned. That said, what I would probably do is generate a full list of, say, all CCDS genes from the UCSC tables in BED format to have "on hand", including gene names. You could then parse that file automatically any time you want to generate a new BED file for a subset of genes.

UCSC Table Browser

1. Select Group: Genes and Gene Predictions.
2. Select the gene definitions (track) you want to use. The default is UCSC Genes.
3. Make sure region is set to genome and output is BED.
4. Download to file.
5. You will then be asked how to split up the BED records: one record (line) per gene, one per exon, etc.

Whichever you choose is fine. You may want one record per gene, and you may want to include upstream/downstream bases, depending on the experiment. For an exome you probably want one record per exon, with a few bases on either side to capture the canonical splice sites.
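If you already downloaded a file without the extra flanking bases, padding can be added afterwards. The following is only a sketch: `pad_bed_line` and `pad_bed` are names I made up, it assumes simple one-interval-per-line records (not multi-block BED12 lines), and it does not clamp the end to the chromosome length because that would need a chromosome sizes file.

```python
HEADER_PREFIXES = ("track", "browser", "#")


def pad_bed_line(line, pad):
    """Widen one BED record by `pad` bases on each side, never below 0."""
    fields = line.rstrip("\n").split("\t")
    start = max(0, int(fields[1]) - pad)
    end = int(fields[2]) + pad
    # Columns after the third (ID, score, strand, ...) are kept unchanged.
    return "\t".join([fields[0], str(start), str(end)] + fields[3:])


def pad_bed(lines, pad):
    """Pad every record; pass header and blank lines through untouched."""
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith(HEADER_PREFIXES):
            yield stripped
        else:
            yield pad_bed_line(stripped, pad)


if __name__ == "__main__":
    # Hypothetical exon records, for illustration only.
    example = ["chr1\t5\t100\texon_1", "chr1\t300\t450\texon_2"]
    for record in pad_bed(example, 10):
        print(record)
```

Note that padded neighbouring exons can end up overlapping; whether that matters depends on the downstream tool.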

The fourth column of the BED file in this case will be the ID. If you selected CCDS, it will unfortunately contain only CCDS IDs and not gene names, but you can look up the CCDS IDs for your genes of interest. You can also generate an ID conversion table from Ensembl's BioMart interface if desired.

Because BED files are just tab-delimited text, if you have a list of the appropriate IDs you can easily filter the BED file down to the matching records with your scripting language of choice.
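As a rough illustration of that last step, here is a short Python sketch that keeps only the records whose fourth column is in a set of wanted IDs. The function name `filter_bed` and the IDs are hypothetical. It assumes an exact match on column 4, so any version suffix or per-exon suffix in the downloaded file has to be present in your ID list too (or the comparison adapted).

```python
def filter_bed(lines, wanted_ids):
    """Yield BED records whose 4th column (ID) is in `wanted_ids`."""
    for line in lines:
        record = line.rstrip("\n")
        # Skip blank lines and optional header lines.
        if not record or record.startswith(("track", "browser", "#")):
            continue
        fields = record.split("\t")
        if len(fields) >= 4 and fields[3] in wanted_ids:
            yield record


if __name__ == "__main__":
    # Hypothetical records and IDs, for illustration only.
    all_records = [
        "track name=example",
        "chr1\t100\t200\tID_1",
        "chr1\t300\t400\tID_2",
        "chr2\t50\t80\tID_3",
    ]
    for record in filter_bed(all_records, {"ID_1", "ID_3"}):
        print(record)
```

With a real file you would pass an open file handle as `lines` and write the yielded records to a new `.bed` file.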