You will not reach your goal without a little work. The hardest part is deciding whose definition of an exon you want to use: UCSC, NCBI, Ensembl, ...? And do you want only coding exons, or all of them?
Let's assume you want all exons as defined by NCBI.
- Go to the UCSC Table Browser
- Choose hg19 in the assembly field, "Genes and Gene Predictions" in group and "NCBI RefSeq" in track.
- Choose BED as the output format, and give the output file a name, e.g. exons.bed
- Click on "get output"
In the next dialog choose "Exons".
Now you have a file called exons.bed which contains the exon coordinates.
What we have to do now is sort this file by position and remove the "chr" prefix from the chromosome names, because the 1000 Genomes file names its chromosomes without it. You can do it like this:
cut -c4- exons.bed|sort -k1,1V -k2,2g -k3,3g > exons_sorted.bed
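To see what this does, here is a small sketch on a toy file. The file names toy_exons.bed and toy_exons_sorted.bed and all coordinates are made up for illustration, and it assumes GNU sort (for the V version-sort key) and a BED file without a header line:

```shell
# Toy BED file with made-up coordinates (tab-separated: chrom, start, end, name).
printf 'chr10\t500\t600\texonC\n'  >  toy_exons.bed
printf 'chr2\t300\t400\texonB\n'   >> toy_exons.bed
printf 'chr1\t200\t250\texonA2\n'  >> toy_exons.bed
printf 'chr1\t100\t150\texonA1\n'  >> toy_exons.bed
printf 'chrX\t50\t80\texonD\n'     >> toy_exons.bed

# Same pipeline as above: drop the first three characters ("chr") of every line,
# then sort by chromosome (version sort) and by start and end (numeric).
cut -c4- toy_exons.bed | sort -k1,1V -k2,2g -k3,3g > toy_exons_sorted.bed

# Chromosomes now come out in the order 1, 1, 2, 10, X.
cat toy_exons_sorted.bed
```

The version sort is what puts chromosome 2 before chromosome 10; a plain text sort would put 10 first.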
We can use this file to query the 1000 Genomes sites file directly on the FTP server using
tabix -R exons_sorted.bed ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz > exon_variants.vcf
If this is too slow, you have to download the compressed VCF file and its tabix index file to your PC and adapt the tabix command to point at the local copy.
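A rough sketch of the local variant could look like this. The function names download_sites_vcf and query_local are only placeholders I made up, and it assumes that wget is installed and that the index sits next to the VCF on the server under the same name plus ".tbi", which is where tabix looks for it:

```shell
# Location and name of the sites file, taken from the tabix command above.
base_url="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502"
vcf="ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz"

# Download the compressed VCF and its tabix index (assumed to be "$vcf.tbi").
# -c lets wget resume an interrupted download.
download_sites_vcf() {
    wget -c "$base_url/$vcf" "$base_url/$vcf.tbi"
}

# Same query as before, but against the local copy instead of the FTP URL.
query_local() {
    tabix -R exons_sorted.bed "$vcf" > exon_variants.vcf
}

# Nothing is downloaded until you call the functions, e.g.:
#   download_sites_vcf && query_local
```

The download is a large file, so this only pays off if you query it more than once or the remote access keeps timing out.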
finswimmer ♦ 6.2k