Question: Merge overlapping CDS using bed tools
gravatar for Nathaniel
4.6 years ago by
Nathaniel90 wrote:

I have downloaded the genomic coordinates corresponding to coding sequences for the mm10 genome using BioMart, the file looks like this (the last two columns are the cds_start and cds_end, respectively)

enter image description here

I want to merge all overlapping coding regions for each gene SEPARATELY. That is, only merge overlapping regions between transcripts of the same gene. Do not overlap cds regions from different genes.

My current solution is the following, but is very slow:

for gene in `awk '{print $4}' FS="\t" OFS="\t" cds_genebody.bed | sort | uniq`; do
echo $gene
grep -w "$gene" cds_genebody.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk -v gene=$GENE '{print $0,gene}' FS="\t" OFS="\t" >> file.merged.bed

is it possible to do this in a fast way using bedtools or similar?

bedtools • 1.3k views
ADD COMMENTlink modified 4.6 years ago by Alex Reynolds31k • written 4.6 years ago by Nathaniel90

You can append the geneid to each of the chromosome names to create a pseudo-chromosome for each transcript. Merge based on the pseuod-chromosomes. Then remove the appended geneid to each pseudo-chromosome.

ADD REPLYlink written 4.6 years ago by Damian Kao15k

Hello Damian, thanks for the answer, but I don't understand what you propose. What is going to change from having a gene_id for each transcript to having an equivalent pseudo-chromosome id for each transcript? The procedure is going to be equally slow with my bash-based approach

ADD REPLYlink written 4.6 years ago by Nathaniel90

Maybe I am not fully understanding what you are trying to do. Right now you are extracting bed entries for each gene and running bedtools merge individually? If you append geneID to each chromosome to create pseudo-chromosomes, you will only have to run bedtools merge once on the modified file. Then just process the resulting file and remove the pseudo-chromosome.

For example, let's say your file is:

chr1 100 200 transcriptA gene01
chr1 150 200 transcriptB gene01
chr1 10 200 transcriptC gene01

chr1 100 200 transcriptA gene02
chr1 150 200 transcriptB gene02
chr1 10 200 transcriptC gene02

If you run bedtools merge on this file, you'll merge transcripts from both gene01 and gene02, which is not what you want. You want to merge transcriptA-C of gene1 and transcriptA-C of gene02 separately.

If you convert your bedfile to this:

chr1_gene01 100 200 transcriptA gene01
chr1_gene01 150 200 transcriptB gene01
chr1_gene01 10 200 transcriptC gene01

chr1_gene02 100 200 transcriptA gene02
chr1_gene02 150 200 transcriptB gene02
chr1_gene02 10 200 transcriptC gene02

And run bedtools merge. Now bedtools will properly merge the genes separately, because the chromosomes are different. From the resulting merged file, you'll just have to remove the _gene0X tags from the chromosomes.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Damian Kao15k

now I get it, very smart trick, thank you!!

ADD REPLYlink written 4.6 years ago by Nathaniel90
gravatar for Alex Reynolds
4.6 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

You don't need to sort every time, if your input is sorted; just sort once at the beginning. Since the output of grep will be in the same order as found in the input, there will be no need to pipe to sort on each loop iteration. The merged output will likely be out of order, so you can do a second sort at the end.

To speed up grep, you can also add the LANG variable and do a fixed string search, instead of searching against a regular expression pattern:

$ sort-bed cds_genebody.unsorted.bed > cds_genebody.bed
$ for gene in `cut -f4 cds_genebody.bed | sort | uniq`; do
    echo $gene
    grep -F "$gene" cds_genebody.bed | ... >> file.merged.unsorted.bed
$ sort-bed file.merged.unsorted.bed > file.merged.bed

You can also munge the chromosome name, but with a couple small changes the performance of your original approach can be improved a lot.

Another option is to parallelize the search work with GNU Parallel, since work done in each iteration of that for loop is independent of all the others. For example:

$ sort-bed cds_genebody.unsorted.bed > cds_genebody.bed
$ cut -f4 cds_genebody.bed \
    | sort \
    | uniq \
    | parallel --pipe "echo {} && grep -F {} cds_genebody.bed | bedops --merge - | awk -v gene={} '{print \$0, gene}' >> file.merged.unsorted.bed"
$ sort-bed file.merged.unsorted.bed > file.merged.bed

There's a great thread here with usage examples of GNU Parallel in bioinformatics settings.

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Alex Reynolds31k

Thanks a lot for the answer Alex, I ended up using Damian's approach mentioned in his answer and it worked perfectly. However, I also save your result because it is going to be useful for further analysis.

ADD REPLYlink written 4.6 years ago by Nathaniel90
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2011 users visited in the last hour