Question: Mapping a chromosome position with Gene ID
1
gravatar for Paul
27 days ago by
Paul10
India
Paul10 wrote:

I am trying to map chromosome positions with all the gene IDs available.

The position file1 I have looks like this

#CHROM      POSID       REF ALT
NC_008596.1 41669   .   CG  C
NC_008596.1 78925   .   GC  G
NC_008596.1 101262  .   GC  G
NC_008596.1 115332  .   GC  G
NC_008596.1 224262  .   A   AC

and the genome position file2 looks like the following

    ChrM        Start   Stop Strand GeneID
chr NC_008596.1 499     1692    +   4535615
chr NC_008596.1 1721    2614    +   4536178
chr NC_008596.1 2624    3778    +   4532650
chr NC_008596.1 3775    4359    +   4537198
chr NC_008596.1 4591    6618    +   4531510
chr NC_008596.1 6648    9176    +   4535988
chr NC_008596.1 9229    10011   +   4533293
chr NC_008596.1 10184   10276   +   4535541
chr NC_008596.1 50411   11211   +   4531499

Now I need to map the positions in file1 (POS ID) with the GeneID given in file2.

The POS ID in file 1 is the number which lies between start and stop codon number. For e.g.41669 lies between

chr NC_008596.1 50411   11211   +   4531499

and the output should be

chr NC_008596.1 50411   11211   +   4531499  41669

the last one is POSID

Please let me know how can I do it? I can work in R, so any solution in R or an excel trick would be helpful.

sequencing snp excel next-gen R • 190 views
ADD COMMENTlink modified 4 days ago • written 27 days ago by Paul10

did you look at merge in R? https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

ADD REPLYlink written 27 days ago by Santosh Anand1.5k

to use merge, I think we need to have common column in both the files. Here, POS ID in file1 is the number which lies between Start(499) and Stop(1692) in file 2.

ADD REPLYlink written 27 days ago by Paul10

So the POS ID of file 1 should lie in between the POS of file 2? What is the criteria of merge?

ADD REPLYlink written 27 days ago by Santosh Anand1.5k

I just updated with an example

ADD REPLYlink written 27 days ago by Paul10

have a look to reduce in R?

data<-Reduce(function(...) merge(..., all=TRUE, by=c("POSID", "GeneID")), list(file1, file2))

ADD REPLYlink written 27 days ago by Lila M 230
3
gravatar for Santosh Anand
27 days ago by
Santosh Anand1.5k
Santosh Anand1.5k wrote:

If you are working on range-intersection, genomic ranges are very cool and fast, although it requires a little bit of learning: See below posts:

http://stackoverflow.com/questions/11892241/merge-by-range-in-r-applying-loops http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window

If efficiency is not your issue ( => files are not big), then you can run a for-loop for each POSID in file1 to see if if it falls in the same range as file2.

ADD COMMENTlink modified 4 days ago by zx87543.6k • written 27 days ago by Santosh Anand1.5k

Thanks Santosh, It worked

ADD REPLYlink written 4 days ago by Paul10
1
gravatar for Pierre Lindenbaum
4 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum92k wrote:

convert your files to BED format using awk and the use bedtools intersect to compute the intersection

ADD COMMENTlink written 4 days ago by Pierre Lindenbaum92k
1
gravatar for Alex Reynolds
4 days ago by
Alex Reynolds18k
Seattle, WA USA
Alex Reynolds18k wrote:

Save some time and use BEDOPS.

First, clean up and sort your text files:

$ awk '(NR>1){print $1"\t"$2"\t"($2+1)"\t"$2}' file1.txt | sort-bed - > file1.bed
$ tail -n+2 file2.txt | sort-bed - > file2.bed

Then map the IDs from the first BED file file1.bed to the intervals in the second BED file file2.bed:

$ bedmap --echo --echo-map-id-uniq --delim '\t' file2.bed file1.bed > answer.bed

This should give you the exact format you want.

ADD COMMENTlink written 4 days ago by Alex Reynolds18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1802 users visited in the last hour