Question: Mapping a chromosome position with Gene ID
2
gravatar for Paul
3 months ago by
Paul40
India
Paul40 wrote:

I am trying to map chromosome positions with all the gene IDs available.

The position file1 I have looks like this

#CHROM      POSID       REF ALT
NC_008596.1 41669   .   CG  C
NC_008596.1 78925   .   GC  G
NC_008596.1 101262  .   GC  G
NC_008596.1 115332  .   GC  G
NC_008596.1 224262  .   A   AC

and the genome position file2 looks like the following

    ChrM        Start   Stop Strand GeneID
chr NC_008596.1 499     1692    +   4535615
chr NC_008596.1 1721    2614    +   4536178
chr NC_008596.1 2624    3778    +   4532650
chr NC_008596.1 3775    4359    +   4537198
chr NC_008596.1 4591    6618    +   4531510
chr NC_008596.1 6648    9176    +   4535988
chr NC_008596.1 9229    10011   +   4533293
chr NC_008596.1 10184   10276   +   4535541
chr NC_008596.1 50411   11211   +   4531499

Now I need to map the positions in file1 (POS ID) with the GeneID given in file2.

The POS ID in file 1 is the number which lies between start and stop codon number. For e.g.41669 lies between

chr NC_008596.1 50411   11211   +   4531499

and the output should be

chr NC_008596.1 50411   11211   +   4531499  41669

the last one is POSID

Please let me know how can I do it? I can work in R, so any solution in R or an excel trick would be helpful.

sequencing snp excel next-gen R • 315 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by Paul40

did you look at merge in R? https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

ADD REPLYlink written 3 months ago by Santosh Anand2.7k

to use merge, I think we need to have common column in both the files. Here, POS ID in file1 is the number which lies between Start(499) and Stop(1692) in file 2.

ADD REPLYlink written 3 months ago by Paul40

So the POS ID of file 1 should lie in between the POS of file 2? What is the criteria of merge?

ADD REPLYlink written 3 months ago by Santosh Anand2.7k

I just updated with an example

ADD REPLYlink written 3 months ago by Paul40

have a look to reduce in R?

data<-Reduce(function(...) merge(..., all=TRUE, by=c("POSID", "GeneID")), list(file1, file2))

ADD REPLYlink written 3 months ago by Lila M 280
4
gravatar for Santosh Anand
3 months ago by
Santosh Anand2.7k
Santosh Anand2.7k wrote:

If you are working on range-intersection, genomic ranges are very cool and fast, although it requires a little bit of learning: See below posts:

http://stackoverflow.com/questions/11892241/merge-by-range-in-r-applying-loops http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window

If efficiency is not your issue ( => files are not big), then you can run a for-loop for each POSID in file1 to see if if it falls in the same range as file2.

ADD COMMENTlink modified 3 months ago by zx87543.7k • written 3 months ago by Santosh Anand2.7k

Thanks Santosh, It worked

ADD REPLYlink written 3 months ago by Paul40
1
gravatar for Pierre Lindenbaum
3 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum96k wrote:

convert your files to BED format using awk and the use bedtools intersect to compute the intersection

ADD COMMENTlink written 3 months ago by Pierre Lindenbaum96k
1
gravatar for Alex Reynolds
3 months ago by
Alex Reynolds19k
Seattle, WA USA
Alex Reynolds19k wrote:

Save some time and use BEDOPS.

First, clean up and sort your text files:

$ awk '(NR>1){print $1"\t"$2"\t"($2+1)"\t"$2}' file1.txt | sort-bed - > file1.bed
$ tail -n+2 file2.txt | sort-bed - > file2.bed

Then map the IDs from the first BED file file1.bed to the intervals in the second BED file file2.bed:

$ bedmap --echo --echo-map-id-uniq --delim '\t' file2.bed file1.bed > answer.bed

This should give you the exact format you want.

ADD COMMENTlink written 3 months ago by Alex Reynolds19k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 465 users visited in the last hour