Where can I get a list of SNPs mapping overlapping genes in humans?
Given files genes.bed and snps.bed, you could do something like:
$ bedmap --echo --echo-map-id --delim '\t' genes.bed snps.bed > answer.bed
The file answer.bed will contain the gene annotation and a semicolon-delimited list of SNP identifiers that overlap each gene.
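As a follow-up, here is a minimal sketch of how the per-gene SNP list could be summarized as a count. It assumes the SNP identifiers sit in the last tab-separated column and the gene name in column 4, which is what the commands in this answer should produce. The toy file and the count_snps helper are hypothetical and only there for illustration.

```shell
# Hypothetical toy answer.bed: gene columns, then a tab, then the SNP ID list.
printf 'chr1\t100\t500\tGENE_A\t.\t+\trs1;rs2;rs3\n'  > toy.answer.bed
printf 'chr1\t900\t950\tGENE_B\t.\t-\t\n'            >> toy.answer.bed

# Print the gene name (column 4) and the number of SNP IDs in the last column.
# An empty last column means no SNP overlapped that gene, so the count is 0.
count_snps() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    { n = ($NF == "") ? 0 : split($NF, ids, ";"); print $4, n }' "$1"
}

count_snps toy.answer.bed > toy.counts.tsv
```

On a real run you would call count_snps answer.bed instead of using the toy file.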
To get genes.bed, you could use Gencode v44 (hg38):
$ wget -qO- https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "gene"' - \
| convert2bed -i gff --attribute-key="gene_name" - \
> genes.bed
To get snps.bed, you could use dbSNP (v151):
$ for chrom in `seq 1 22` X Y; do \
wget -qO- https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/bed_chr_${chrom}.bed.gz \
| gunzip --stdout - \
| sort-bed - \
> hg38.snp151.chr${chrom}.bed; \
done
$ bedops -u hg38.snp151.chr*.bed > snps.bed
Then run the bedmap statement above.
If you want SNPs that overlap regions only where genes overlap, then you can do the following, instead:
$ bedmap --echo --echo-map-id --delim '\t' <(bedops --intersect genes.bed) snps.bed > answer.bed
The file answer.bed will contain the intersection of regions where genes overlap and a semicolon-delimited list of SNP identifiers that overlap each intersection region.
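To make "regions where genes overlap" concrete, here is a rough awk sketch on a toy file. It is not a substitute for BEDOPS and makes no claim about how bedops works internally. It assumes the input is sorted by chromosome and start coordinate, as sort-bed would leave it. The file names and the shared_regions helper are hypothetical. When three or more genes stack on the same locus, the emitted pieces may themselves overlap.

```shell
# Hypothetical toy genes file, already sorted by chromosome and start.
printf 'chr1\t100\t500\tGENE_A\n'  > toy.genes.bed
printf 'chr1\t400\t800\tGENE_B\n' >> toy.genes.bed
printf 'chr1\t900\t950\tGENE_C\n' >> toy.genes.bed
printf 'chr2\t100\t200\tGENE_D\n' >> toy.genes.bed

# Emit the part of each gene that lies under an earlier gene on the same chromosome.
shared_regions() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    # First interval on a new chromosome: reset the furthest end seen so far.
    $1 != chrom { chrom = $1; max_end = $3 + 0; next }
    {
      if ($2 + 0 < max_end) {
        # This gene starts before an earlier one ends, so they share a region.
        stop = ($3 + 0 < max_end) ? $3 + 0 : max_end
        print $1, $2, stop
      }
      if ($3 + 0 > max_end) max_end = $3 + 0
    }' "$1"
}

shared_regions toy.genes.bed > toy.shared.bed
```

Here only GENE_A and GENE_B share any bases, so toy.shared.bed holds a single region on chr1.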
Note: The file snps.bed will be very large. You'll need sufficient disk space for this step.

OP is looking for overlapping genes, that is, genes with presumably different gene IDs that share some loci. I think the genes.bed creation logic needs to take that into account.

I modified the answer with an approach for that use case.