If you have a sorted BED file containing genes and their coordinates, you can use BEDOPS
bedmap to map positions in a VCF file to those genes.
The overall process is:
- Convert the VCF file to BED format with the
- Get genes from somewhere and put them into BED format
- Map the converted file to the genes
$ vcf2bed < positions.vcf > positions.bed
Now grab some genes and turn them into a BED file. Let's say you are working with mouse (build
mm10). We'll take Ensembl gene annotations:
$ wget -O - ftp://ftp.ensembl.org/pub/mnt2/release-71/gtf/mus_musculus/Mus_musculus.GRCm38.71.gtf.gz \
| gunzip -c - \
| gtf2bed - \
| grep -w gene \
You'd modify this step as needed for whichever organism and build you are working with. We use
gtf2bed here to convert the GTF-formatted Ensembl records to sorted BED, and we then filter the output specifically for genes.
We have two sorted, BED-formatted inputs: the VCF-sourced positions and our gene annotations. We're now ready to use
bedmap to do mapping:
$ bedmap --echo --echo-map-id positions.bed mm10.genes.bed > answer.bed
answer.bed is a sorted BED file that contains the mutation positions and a semi-colon-delimited list of gene IDs which overlap the position:
[ mutation-1 ] | [ gene-1-1 ] ; ... ; [ gene-1-i ]
[ mutation-2 ] | [ gene-2-1 ] ; ... ; [ gene-2-j ]
[ mutation-N ] | [ gene-N-1 ] ; ... ; [ gene-N-k ]
Where no genes overlap a mutation, nothing is printed. The default overlap parameter is one or more bases. You can make this setting more stringent by adjusting overlap criteria.
If you want the entire gene record — and not just the ID — use
--echo-map in place of
--echo-map-id. Other mapping
--echo-* options are available, depending on what you want
bedmap to report.