Question: Merge coordinates by gene ID
r.tor wrote, 5 weeks ago:

I have a .bed file obtained by concatenating two .bed files. This was done with the BEDOPS --everything option, so all four columns (chr, start, end, gene_ID) are preserved. For each gene ID there are a few rows of coordinates that may or may not overlap. I want to merge the coordinates belonging to each gene separately: intervals that overlap by at least one bp should be merged, and those that do not should remain separate. The merge should not treat all genes together in one pass.
I have tried bedtools merge and BEDOPS merge, but could not get this to work because they treat the whole file as one set of intervals.

> data
chr1   206721  208928  ENSG00000951
chr1   207322  208145  ENSG00000951
chr1   312006  314918  ENSG00000885
chr1   312077  312277  ENSG00000885
chr1   313423  314611  ENSG00000885
chr1   315128  315716  ENSG00000885
chr1   235826  238431  ENSG00000082
chr1   242929  244929  ENSG00000627
chr1   247107  249107  ENSG00000627
chr1   249284  252043  ENSG00000627

The expected output would be like this:

 > data.output
 chr1   206721  208928  ENSG00000951
 chr1   235826  238431  ENSG00000082
 chr1   312006  314918  ENSG00000885
 chr1   315128  315716  ENSG00000885
 chr1   242929  244929  ENSG00000627
 chr1   247107  249107  ENSG00000627
 chr1   249284  252043  ENSG00000627

Thank you.

bash bedops bed bedtools • 148 views

ATpoint wrote, 5 weeks ago:

Is it intended that ENSG00000082 is missing in data.output?

My idea would be to simply use the gene name as the chromosome identifier, so a bit of awk together with bedtools merge.

Something like this (a bit clunky):

awk 'BEGIN{OFS="\t"} {print $4, $2, $3, $1}' data.bed \
| sort -k1,1 -k2,2n \
| bedtools merge -i - -c 4 -o distinct \
| awk 'BEGIN{OFS="\t"} {print $4, $2, $3, $1}' \
| sort -k1,1 -k2,2n

I really like the rationale behind using the 4th column as the chromosome and keeping the real chromosome as mapping information with the -c option. I love finding different ways of using tools that were designed to do a particular job in order to achieve a different goal.

written 5 weeks ago by Jorge Amigo

The solution works well. By the way, I edited data.output to add ENSG00000082, which had been deleted by mistake. Could you briefly explain how you set this up for bedtools, and how it treats the different columns?

written 5 weeks ago by r.tor

Sure. In the first line I rearrange the BED file, putting column 4 ($4) in column 1 ($1), so the gene name becomes the chromosome name. That effectively makes every gene its own chromosome. bedtools merges by chromosome and position, so with the gene name as the chromosome name it only merges overlapping intervals within the same gene. I moved the actual chromosome to $4 to keep it as the name field (carried through the merge by -c 4 -o distinct), and after the merge simply swapped the columns again, moving $4 back to $1 and $1 back to $4. The first sort is necessary because bedtools expects sorted input; the last sort is optional. Does that make sense to you?
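
For comparison, the same per-gene logic can be sketched with sort and awk alone. This is only an illustration, not the pipeline above: merge_by_gene is a hypothetical helper name, and it assumes 4-column BED input (chr, start, end, gene_ID) with half-open coordinates. It merges only on a strict overlap of at least one bp, as the question asks, whereas bedtools merge with default settings also merges book-ended intervals.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): merge overlapping intervals per gene.
# Reads 4-column BED (chr, start, end, gene_ID) on stdin, writes merged BED to stdout.
merge_by_gene() {
  # Sort by gene, then chromosome, then numeric start, so that
  # candidate overlaps within a gene sit on adjacent lines.
  sort -k4,4 -k1,1 -k2,2n |
  awk 'BEGIN { OFS = "\t" }
    # Same gene and chromosome, and start lies before the current end:
    # extend the current block instead of printing a new row.
    $4 == cur_gene && $1 == cur_chr && ($2 + 0) < cur_end {
      if (($3 + 0) > cur_end) cur_end = $3 + 0
      next
    }
    # Otherwise flush the previous block (none exists on the first line)...
    NR > 1 { print cur_chr, cur_start, cur_end, cur_gene }
    # ...and open a new block from the current line.
    { cur_chr = $1; cur_start = $2 + 0; cur_end = $3 + 0; cur_gene = $4 }
    # Flush the final block, unless the input was empty.
    END { if (NR > 0) print cur_chr, cur_start, cur_end, cur_gene }'
}
```

It could be called as merge_by_gene < data.bed > data.output. Rows come out ordered by gene ID rather than by position, so a final sort -k1,1 -k2,2n would be needed to get coordinate order.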

written 5 weeks ago by ATpoint

Yup! Such a smart workaround :)

written 5 weeks ago by r.tor