Question

Best way to generate a bed file

0

Entering edit mode

8.2 years ago

elisheva ▴ 120

Hello everyone!

I have to do something and I kind of lost. I have 2 tab delimited text files which contains exons coordinates.

The first file contains the start coordinates (for example):

NM_032291   chr1    +   66999638    67091529    67098752    67101626
NM_001308203    chr1    +   66999251    66999928    67091529    67098752    67105459    67108492

and the second file - contains the end coordinates:

NM_032291   chr1    +   67000051    67091593    67098777    67101698
NM_001308203    chr1    +   66999355    67000051    67091593    67098777    67105516    67108547

I'd like to have multiple bed files for each exon(for example for the first exon):

 chr1    66999638     67000051     NM_032291   length    +
 chr1    66999251     66999355     NM_001308203    length    +

Each gene contains different number of exons - so the number of the columns is unknown. I believe there is a very simple way to do it, I've tried awk but without success.

Thanks!

bed • 2.5k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 8.2 years ago by elisheva ▴ 120

1

Entering edit mode

8.2 years ago

venu 7.1k

I don't understand why start coordinate is in one file and the end is in another. Albeit, you haven't provided what you've tried (which is not a best practice :) ), you can try the following

join <(sort -k1 start_file.txt) <(sort -k1 end_file.txt) | tr ' ' '\t' | awk -F'\t' '{print $1"\t"$2"\t"$4"\t"$10"\t"$10-$4"\t"$9}'

Explanation:

Join all columns in 2 files if first column in 2 files matches
Print required columns i.e. start and end of the NM_**

result

NM_032291       chr1    66999638        67000051        413     +

ADD COMMENT • link 8.2 years ago by venu 7.1k

0

Entering edit mode

Thanks! About your question: that's my data.... But I have one more problem - I just gave an example - the number of exons for each gene is different.

ADD REPLY • link 8.2 years ago by elisheva ▴ 120

0

Entering edit mode

the number of exons for each gene is different.

You mean you have same ids multiple times in first column in a same file?

ADD REPLY • link 8.2 years ago by venu 7.1k

0

Entering edit mode

I edited my post, so you can see now.... And no, I have one Id per line and then all the exons coordinates

ADD REPLY • link 8.2 years ago by elisheva ▴ 120

0

Entering edit mode

then the solution I provided should work fine. Did you get any error?

ADD REPLY • link 8.2 years ago by venu 7.1k

0

Entering edit mode

Maybe I didn't understand your answer... what does it mean $10? Isn't it simply column 10?

ADD REPLY • link 8.2 years ago by elisheva ▴ 120

0

Entering edit mode

Yes, but after combining two files into one. I added explanation.

ADD REPLY • link 8.2 years ago by venu 7.1k

score 2 · Accepted Answer · 2017-08-07

from itertools import izip

file1="file1.txt"
file2="file2.txt"

with open(file1) as f1, open(file2) as f2:
    for start_cords,end_cords in izip(f1,f2):
        start_cor = start_cords.strip().split("\t")
        end_cor = end_cords.strip().split("\t")
        for i in range(3,len(start_cor)):
            chr = start_cor[1]
            start = start_cor[i]
            end = end_cor[i]
            name = start_cor[0]
            length = int(end_cor[i])-int(start_cor[i])
            strand = start_cor[2]
            print '{0}\t{1}\t{2}\t{3}\t{4}\t{5}'.format(chr,start,end,name,length,strand)

Output:

chr1    66999638    67000051    NM_032291   413 +
chr1    67091529    67091593    NM_032291   64  +
chr1    67098752    67098777    NM_032291   25  +
chr1    67101626    67101698    NM_032291   72  +
chr1    66999251    66999355    NM_001308203    104 +
chr1    66999928    67000051    NM_001308203    123 +
chr1    67091529    67091593    NM_001308203    64  +
chr1    67098752    67098777    NM_001308203    25  +
chr1    67105459    67105516    NM_001308203    57  +
chr1    67108492    67108547    NM_001308203    55  +

Note: Its a python script.