Okay, start by making a contigs file:
$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --merge inputs.bed > contigs.bed
Now use bedmap with both --echo-map and --echo-map-size to report two pieces of information: a list of mapped transcripts (which overlap the contig) and their associated lengths:
$ bedmap --echo-map --echo-map-size contigs.bed inputs.bed > perContigTranscriptsWithSizes.txt
For each contig, this format is as follows:
transcript-1 ; transcript-2 ; ... ; transcript-N | length-1 ; length-2 ; ... ; length-N
Note that the transcripts and lengths are separated from each other by the pipe character (|), while the lists of individual transcripts and lengths are separated by semi-colons (;).
We feed this result to awk to split the transcripts and lengths into separate arrays. We find the index of the longest length and report that transcript. We pass those transcripts to sort-bed to report a sorted BED file containing the longest-length transcripts in sorted order:
$ awk '{ \
split($0, components, "|"); \
transcriptsStr = components[1]; \
lengthsStr = components[2]; \
max = 0; \
maxIdx = -1; \
numLengths = split(lengthsStr, lengths, ";"); \
for (idx = 1; idx <= numLengths; idx++) { \
if (lengths[idx] > max) { \
maxIdx = idx; \
max = lengths[idx]; \
} \
} \
split(transcriptsStr, transcripts, ";"); \
maxTranscript = transcripts[maxIdx]; \
print maxTranscript; \
}' perContigTranscriptsWithSizes.txt \
| sort-bed - > answer.bed
Putting this all together into one pipeline:
$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --merge inputs.bed \
| bedmap --echo-map --echo-map-size - inputs.bed \
| awk '{ \
split($0, components, "|"); \
transcriptsStr = components[1]; \
lengthsStr = components[2]; \
max = 0; \
maxIdx = -1; \
numLengths = split(lengthsStr, lengths, ";"); \
for (idx = 1; idx <= numLengths; idx++) { \
if (lengths[idx] > max) { \
maxIdx = idx; \
max = lengths[idx]; \
} \
} \
split(transcriptsStr, transcripts, ";"); \
maxTranscript = transcripts[maxIdx]; \
print maxTranscript; \
}' - \
| sort-bed - > answer.bed
Please post an example of "wanted output".
Ok all. I have edited my post to give more information of what i wanted. Basically i have bed file that have four overlapping transcripts (see Denovo filtered track in the picture) and all i wanted is to have a consensus/overlapping contig. For this i have used mergedBed/bedops tools but both of them gave me a bed file that have a single contig without preserving intron-exon structure (See merged Track above). Am i not thinking well here? I don't know if it can be done with any tool.
Can you write up a sample BED file that hints at the format of what you need, given the example images provided? It's likely the use of
bedopsandbedmapalone won't do what you need, but perhaps those two apps withawkmay work. Hopefully this helps clarify the request for clarification.This is what want
The
desired bedcategory looks like the fourth line ofinput.That is right in this case. Basically all i wanted is for those overlapped transcripts pick the one that is longest without merging the introns and exons.