Question: Bedtools: merge all features with the same id
3
gravatar for cyril-cros
4.0 years ago by
cyril-cros890
France
cyril-cros890 wrote:

I would like to  use bedtools merge to collapse together all the features sharing a same gene_ id in my bed file (which contains the annotation of various genes - the name column (4th one) also corresponds to the gene_id). Due to splicing, elements may be quite distant....

chr14    49894259    49895806    ENSMUST00000053290    0.000000    -    mm10_ensGene    exon    .    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14    49894873    49894876    ENSMUST00000053290    0.000000    -    mm10_ensGene    stop_codon    .    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14    49894876    49895800    ENSMUST00000053290    0.000000    -    mm10_ensGene    CDS    0    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14    49895797    49895800    ENSMUST00000053290    0.000000    -    mm10_ensGene    start_codon    .    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14    49901908    49901941    ENSMUST00000053290    0.000000    -    mm10_ensGene    exon    .    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290"; 

bed annotation bedtools • 5.6k views
ADD COMMENTlink modified 4.0 years ago by dariober10k • written 4.0 years ago by cyril-cros890
1

What are you really trying to achieve in the end? For example, do you need coding gene lenghts because you want to calculate transcript-size corrected gene expression values from RNA sequencing (aka [FRC]PKM?) In that case, have a look at Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor post.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Irsan6.8k

Nice link. I wanted to collapse my annotation to a more convenient form to intersect a de novo annotation to a reference annotation with gene_ids (see How To Get Annotation For Bed File From Another Bed File). I do this to assign the official Ensembl ids to my new annotation. I don't really need the whole transcript...

However, some parts of my annotation have a rather insufficient coverage - your script will help me a lot to identify them.

ADD REPLYlink written 4.0 years ago by cyril-cros890

Are you starting out with a GTF or a bed file? I think I have done something similar starting out with a GTF and using bedtools merge.

ADD REPLYlink written 4.0 years ago by A. Domingues2.0k
9
gravatar for dariober
4.0 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Hi- You don't say what to do with the collapsed features. This piece of code collapses the gene ids and counts how many features have been grouped together along with the min and max coordinates. Note it uses bedtools groupby not merge.

Sample data:

cat genes.bed
chr14    49894259    49895806    ENSMUST00000053290    0.000000    ...
chr14    49894873    49894876    ENSMUST00000053290    0.000000    ...
chr14    49894876    49895800    ENSMUST00000053291    0.000000    ...
chr14    49895797    49895800    ENSMUST00000053291    0.000000    ...
chr14    49901908    49901941    ENSMUST00000053291    0.000000    ...

Example output:

sort -k4,4 genes.bed \
| groupBy -g 1,4 -c 4,2,3 -o count,min,max \
| awk -v OFS='\t' '{print $1, $4, $5, $2, $3}'

chr14    49894259    49895806    ENSMUST00000053290    2
chr14    49894876    49901941    ENSMUST00000053291    3

 

ADD COMMENTlink written 4.0 years ago by dariober10k
1

The code didn't work for me at first. The groupBy function summarized all lines of the bed file and the output file contained one entry which covered the whole genome. It took me a while to figure out that the problem was the recent bedtools version 2.26.0. When I switched back to version 2.25.0 everything worked as expected. Thanks for your solution anyway, it helped me a lot.

ADD REPLYlink written 18 months ago by akirbis10

Thanks, it does the job nicely.

ADD REPLYlink written 4.0 years ago by cyril-cros890

Thanks! That is exactly what I was looking for. Best

ADD REPLYlink written 2.7 years ago by floriandeckert0
1
gravatar for Alex Reynolds
4.0 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Here's an untested way to do this with BEDOPS. Merging intervals in a BED file is "destructive" to input regions, so there's no need to parse out anything beyond the first four fields.

#!/bin/bash +x

inFn=$1
outDir=$2
tmpDir="/tmp/split"
mkdir -p $tmpDir
mkdir -p $outDir
awk -v tmpDir=$tmpDir '
    BEGIN {
        while (getline < "'"$inFn"'") {
            n = split($0, elems, "\t");
            chr = elems[1];
            start = elems[2];
            stop = elems[3];
            id = elems[4];
            out_fn = tmpDir"/"id".id.bed";
            print chr"\t"start"\t"stop >> out_fn;
        }
        close("'"$inFn"'");
    }
'
for $perIdFn in `ls $tmpDir/*.id.bed`;
do
    perIdBaseFn=$(basename "$perIdFn")
    bedops --merge $perIdFn > $outDir/$perIdBaseFn.merged.bed
done
rm -rf $tmpDir

Usage:

$ ./splitByIdAndMerge.bash unmerged_features.bed merged_features_directory
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Alex Reynolds28k

Thanks for your help. My main issue was that I could not easily obtain a relation in the style of one gene, one large interval. I wanted to find which genes were included in a large bed file.
By only taking the start_codon feature, I have a unique location for each gene. Granted, I miss any small overlap... I will still try your solution and validate it later if it works. A simple python parser might also work if I sort by gene_id my entries...

ADD REPLYlink written 4.0 years ago by cyril-cros890

You might look into BEDOPS bedmap. Given an interval and a set of features (both sorted per BEDOPS sort-bed), bedmap will report all the overlapping features for a given interval. You can then chop up results with Python, Perl, etc.

ADD REPLYlink written 4.0 years ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1714 users visited in the last hour