Question: how to assign SNPs to each gene with overlapping intervals in r?
2
gravatar for Ana
4 months ago by
Ana170
Ana170 wrote:

Hi all, I have ~900K SNPs and I want to assign them to genes based on their positions. here is a very simple example of what I am doing:

the first data set contains SNPs and their positions on each chromosome (here I am showing only chromosome 1, but I have 17 chromosomes in total)

  data.snps
    chrom   pos
    1   5
    1   11
    1   19
    1   39
    1   74
    1   77
    1   90

The second file contain genes, with start and stop positions on each chromosome (again just for simplicity I am showing only one chromosome)

data.genes
chrom   gene_id start   stop
1   g1  2   8
1   g2  14  20
1   g3  29  35
1   g4  37  46
1   g5  50  63
1   g6  70  75
1   g7  87  93

I want to assign each SNP on each chromosome to any of these genes on that chromosome, if it falls within the range of any of genes. I am doing this way:

data_snps$found <- ifelse(sapply(data_snps$pos, function(p) 
                 any(data_genes$start <= p & data_genes$stop >= p)),data_genes$gene_id, NA)

which gives me the output that I want

output
 chrom pos assigned
     1   5     1
     1  11    NA
     1  19     3
     1  39     4
     1  74     5
     1  77    NA
     1  90     7

However there is a problem that I am not sure how to solve it. Some of these genes overlap with each other and SNPs can fall within the range of 2 or more genes at the same time, for example in this example below g1 and g2 overlap or g2 and g5 overlap, the first SNP can fall within the range of both g1 and g2.

data.genes.with.overlap

chrom   gene_id start   stop
1   g1  2   8
1   g2  4   20
1   g3  29  35
1   g4  37  46
1   g5  18  63
1   g6  70  75
1   g7  87  93

I want want an output to assign these SNPs to each of these overlapping genes

this is my desired output for example: output.desired

chrom pos assigned.A.  assigned.B.    assigned.c  ...(if there are multiple genes that a sop falls within their range)
         1   5     1           2
         1  11    NA       NA
         1  19     3          5
         1  39     4.         4
         1  74     5          5
         1  77    NA       NA
         1  90     7          7

I sincerely appreciate you help. Thanks

R • 217 views
ADD COMMENTlink modified 4 months ago by Alex Reynolds28k • written 4 months ago by Ana170
3
gravatar for Alex Reynolds
4 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

It sounds like you want to assign genes (or gene IDs) to SNPs. This can be done with a mapping tool, and doing this on the command line may get you there faster:

$ bedmap --echo --echo-map-id --skip-unmapped --delim='\t' snps.bed genes.bed > A.bed
$ bedops --not-element-of 1 snps.bed A.bed | awk -vOFS="\t" '{ print $0, "NA" }' > B.bed
$ bedops --everything A.bed B.bed > answer.bed

You'll first need to put SNPs and genes into sorted and UCSC BED format to use this approach, but R's write.table() call will get you to tabular output and sort-bed will sort that file.

The file answer.bed can be brought back into R via read.table().

ADD COMMENTlink modified 4 months ago • written 4 months ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 782 users visited in the last hour