Question

Mergebed Cannot Keep Exon Intron Structure

0

Entering edit mode

11.7 years ago

upendrakumar.devisetty ▴ 400

Hi, I have bed file and i am trying to merge the overlapping transcripts based on genomic coordinates into a single transcript. For this i used mergeBed but what i found was mergeBed basically make a long single contig without keeping the intron exon structure of the overlapping transcripts (see attached picture and example below). Is there a way to keep the structure? Also the output from the mergeBed basically is different from input bed file (see below again)![enter image description here][2] and so i am wondering is there a way to keep the structure of original bed file as well and still make an overalpping contig from the overalpped transcripts?

# Input file:
A01     1330942 1331155 comp293731_c0_seq1|m.7724       94      -       1330942 1331155 255,0,0 1       213     0
A01     1331135 1331557 comp94935_c0_seq2|m.8355        2       +       1331135 1331557 255,0,0 3       49,113,41       0,123,381

# output file
A01     1330942 1331557 comp293731_c0_seq1|m.7724;comp94935_c0_seq2|m.8355

Thanks Upendra

structure • 3.2k views

ADD COMMENT • link updated 10.7 years ago by Biostar 20 • written 11.7 years ago by upendrakumar.devisetty ▴ 400

0

Entering edit mode

Please post an example of "wanted output".

ADD REPLY • link 11.7 years ago by PoGibas 5.1k

0

Entering edit mode

Ok all. I have edited my post to give more information of what i wanted. Basically i have bed file that have four overlapping transcripts (see Denovo filtered track in the picture) and all i wanted is to have a consensus/overlapping contig. For this i have used mergedBed/bedops tools but both of them gave me a bed file that have a single contig without preserving intron-exon structure (See merged Track above). Am i not thinking well here? I don't know if it can be done with any tool.

ADD REPLY • link 11.7 years ago by upendrakumar.devisetty ▴ 400

0

Entering edit mode

Can you write up a sample BED file that hints at the format of what you need, given the example images provided? It's likely the use of bedops and bedmap alone won't do what you need, but perhaps those two apps with awk may work. Hopefully this helps clarify the request for clarification.

ADD REPLY • link 11.7 years ago by Alex Reynolds 36k

0

Entering edit mode

This is what want

# input
A03     15742539        15744798        BRME_15388|m.2560       20      +       15742539        15744798        255,0,0 6       64,448,143,287,76,166   0,177,985,1203,1934,2093
A03     15742539        15744798        TCONS_00018124|m.13266  20      +       15742539        15744798        255,0,0 6       64,448,143,287,76,166   0,177,985,1203,1934,2093
A03     15742539        15745142        TCONS_00018125|m.13273  19      +       15742539        15745142        255,0,0 8       64,448,143,287,76,92,83,175     0,177,985,1203,1934,2093,2258,2428
A03     15742539        15745142        comp263233_c0_seq2|m.10240      19      +       15742539        15745142        255,0,0 8       64,448,143,287,76,92,83,175     0,177,985,1203,1934,2093,2258,2428

# merged
A03     15742539        15745142        BRME_15388|m.2560;TCONS_00018124|m.13266;TCONS_00018125|m.13273;comp263233_c0_seq2|m.10240

# desired bed
A03     15742539        15745142        comp263233_c0_seq2|m.10240      19      +       15742539        15745142        255,0,0 8       64,448,143,287,76,92,83,175     0,177,985,1203,1934,2093,2258,2428

ADD REPLY • link updated 11.7 years ago by Alex Reynolds 36k • written 11.7 years ago by upendrakumar.devisetty ▴ 400

0

Entering edit mode

The desired bed category looks like the fourth line of input.

ADD REPLY • link 11.7 years ago by Alex Reynolds 36k

0

Entering edit mode

That is right in this case. Basically all i wanted is for those overlapped transcripts pick the one that is longest without merging the introns and exons.

ADD REPLY • link 11.7 years ago by upendrakumar.devisetty ▴ 400

score 1 · Answer 1 · 2014-04-04

Okay, start by making a contigs file:

$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --merge inputs.bed > contigs.bed

Now use bedmap with both --echo-map and --echo-map-size to report two pieces of information: a list of mapped transcripts (which overlap the contig) and their associated lengths:

$ bedmap --echo-map --echo-map-size contigs.bed inputs.bed > perContigTranscriptsWithSizes.txt

For each contig, this format is as follows:

transcript-1 ; transcript-2 ; ... ; transcript-N | length-1 ; length-2 ; ... ; length-N

Note that the transcripts and lengths are separated from each other by the pipe character (|), while the lists of individual transcripts and lengths are separated by semi-colons (;).

We feed this result to awk to split the transcripts and lengths into separate arrays. We find the index of the longest length and report that transcript. We pass those transcripts to sort-bed to report a sorted BED file containing the longest-length transcripts in sorted order:

$ awk '{ \
    split($0, components, "|"); \
    transcriptsStr = components[1]; \
    lengthsStr = components[2]; \
    max = 0; \
    maxIdx = -1; \
    numLengths = split(lengthsStr, lengths, ";"); \
    for (idx = 1; idx <= numLengths; idx++) { \
        if (lengths[idx] > max) { \
            maxIdx = idx; \
            max = lengths[idx]; \
        } \
    } \
    split(transcriptsStr, transcripts, ";"); \
    maxTranscript = transcripts[maxIdx]; \
    print maxTranscript; \
}' perContigTranscriptsWithSizes.txt \
| sort-bed - > answer.bed

Putting this all together into one pipeline:

$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --merge inputs.bed \
    | bedmap --echo-map --echo-map-size - inputs.bed \
    | awk '{ \
        split($0, components, "|"); \
        transcriptsStr = components[1]; \
        lengthsStr = components[2]; \
        max = 0; \
        maxIdx = -1; \
        numLengths = split(lengthsStr, lengths, ";"); \
        for (idx = 1; idx <= numLengths; idx++) { \
            if (lengths[idx] > max) { \
                maxIdx = idx; \
                max = lengths[idx]; \
            } \
        } \
        split(transcriptsStr, transcripts, ";"); \
        maxTranscript = transcripts[maxIdx]; \
        print maxTranscript; \
    }' - \
    | sort-bed - > answer.bed