Question: Mapping a chromosome position with Gene ID
2
gravatar for Paul
7 months ago by
Paul50
India
Paul50 wrote:

I am trying to map chromosome positions with all the gene IDs available.

The position file1 I have looks like this

#CHROM      POSID       REF ALT
NC_008596.1 41669   .   CG  C
NC_008596.1 78925   .   GC  G
NC_008596.1 101262  .   GC  G
NC_008596.1 115332  .   GC  G
NC_008596.1 224262  .   A   AC

and the genome position file2 looks like the following

    ChrM        Start   Stop Strand GeneID
chr NC_008596.1 499     1692    +   4535615
chr NC_008596.1 1721    2614    +   4536178
chr NC_008596.1 2624    3778    +   4532650
chr NC_008596.1 3775    4359    +   4537198
chr NC_008596.1 4591    6618    +   4531510
chr NC_008596.1 6648    9176    +   4535988
chr NC_008596.1 9229    10011   +   4533293
chr NC_008596.1 10184   10276   +   4535541
chr NC_008596.1 50411   11211   +   4531499

Now I need to map the positions in file1 (POS ID) with the GeneID given in file2.

The POS ID in file 1 is the number which lies between start and stop codon number. For e.g.41669 lies between

chr NC_008596.1 50411   11211   +   4531499

and the output should be

chr NC_008596.1 50411   11211   +   4531499  41669

the last one is POSID

Please let me know how can I do it? I can work in R, so any solution in R or an excel trick would be helpful.

sequencing snp excel next-gen R • 450 views
ADD COMMENTlink modified 7 months ago • written 7 months ago by Paul50

did you look at merge in R? https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

ADD REPLYlink written 7 months ago by Santosh Anand3.0k

to use merge, I think we need to have common column in both the files. Here, POS ID in file1 is the number which lies between Start(499) and Stop(1692) in file 2.

ADD REPLYlink written 7 months ago by Paul50

So the POS ID of file 1 should lie in between the POS of file 2? What is the criteria of merge?

ADD REPLYlink written 7 months ago by Santosh Anand3.0k

I just updated with an example

ADD REPLYlink written 7 months ago by Paul50

have a look to reduce in R?

data<-Reduce(function(...) merge(..., all=TRUE, by=c("POSID", "GeneID")), list(file1, file2))

ADD REPLYlink written 7 months ago by Lila M 370
4
gravatar for Santosh Anand
7 months ago by
Santosh Anand3.0k
Santosh Anand3.0k wrote:

If you are working on range-intersection, genomic ranges are very cool and fast, although it requires a little bit of learning: See below posts:

http://stackoverflow.com/questions/11892241/merge-by-range-in-r-applying-loops http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window

If efficiency is not your issue ( => files are not big), then you can run a for-loop for each POSID in file1 to see if if it falls in the same range as file2.

ADD COMMENTlink modified 7 months ago by zx87543.7k • written 7 months ago by Santosh Anand3.0k

Thanks Santosh, It worked

ADD REPLYlink written 7 months ago by Paul50
1
gravatar for Pierre Lindenbaum
7 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:

convert your files to BED format using awk and the use bedtools intersect to compute the intersection

ADD COMMENTlink written 7 months ago by Pierre Lindenbaum101k
1
gravatar for Alex Reynolds
7 months ago by
Alex Reynolds21k
Seattle, WA USA
Alex Reynolds21k wrote:

Save some time and use BEDOPS.

First, clean up and sort your text files:

$ awk '(NR>1){print $1"\t"$2"\t"($2+1)"\t"$2}' file1.txt | sort-bed - > file1.bed
$ tail -n+2 file2.txt | sort-bed - > file2.bed

Then map the IDs from the first BED file file1.bed to the intervals in the second BED file file2.bed:

$ bedmap --echo --echo-map-id-uniq --delim '\t' file2.bed file1.bed > answer.bed

This should give you the exact format you want.

ADD COMMENTlink written 7 months ago by Alex Reynolds21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 739 users visited in the last hour