Question: Best way to generate a bed file
0
gravatar for elisheva
2.9 years ago by
elisheva100
Israel
elisheva100 wrote:

Hello everyone!
I have to do something and I kind of lost.
I have 2 tab delimited text files which contains exons coordinates.

the first fie contains the start coordinates (for example):

NM_032291   chr1    +   66999638    67091529    67098752    67101626
NM_001308203    chr1    +   66999251    66999928    67091529    67098752    67105459    67108492

and the second file - contains the end coordinates:

NM_032291   chr1    +   67000051    67091593    67098777    67101698
NM_001308203    chr1    +   66999355    67000051    67091593    67098777    67105516    67108547

I'd like to have multiple bed files for each exon(for example for the first exon):

 chr1    66999638     67000051     NM_032291   length    +
 chr1    66999251     66999355     NM_001308203    length    +

each gene contains different number of exons - so the number of the columns is unknown.
I believe there is a very simple way to do it , I'v tried awk but without success.
Thanks!!

bed file exons • 772 views
ADD COMMENTlink modified 2.9 years ago by geek_y11k • written 2.9 years ago by elisheva100
2
gravatar for geek_y
2.9 years ago by
geek_y11k
Barcelona
geek_y11k wrote:
from itertools import izip

file1="file1.txt"
file2="file2.txt"

with open(file1) as f1, open(file2) as f2:
    for start_cords,end_cords in izip(f1,f2):
        start_cor = start_cords.strip().split("\t")
        end_cor = end_cords.strip().split("\t")
        for i in range(3,len(start_cor)):
            chr = start_cor[1]
            start = start_cor[i]
            end = end_cor[i]
            name = start_cor[0]
            length = int(end_cor[i])-int(start_cor[i])
            strand = start_cor[2]
            print '{0}\t{1}\t{2}\t{3}\t{4}\t{5}'.format(chr,start,end,name,length,strand)

Output:

chr1    66999638    67000051    NM_032291   413 +
chr1    67091529    67091593    NM_032291   64  +
chr1    67098752    67098777    NM_032291   25  +
chr1    67101626    67101698    NM_032291   72  +
chr1    66999251    66999355    NM_001308203    104 +
chr1    66999928    67000051    NM_001308203    123 +
chr1    67091529    67091593    NM_001308203    64  +
chr1    67098752    67098777    NM_001308203    25  +
chr1    67105459    67105516    NM_001308203    57  +
chr1    67108492    67108547    NM_001308203    55  +

Note: Its a python script.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by geek_y11k
1
gravatar for venu
2.9 years ago by
venu6.6k
Germany
venu6.6k wrote:

I don't understand why start coordinate is in one file and the end is in another. Albeit, you haven't provided what you've tried (which is not a best practice :) ), you can try the following

join <(sort -k1 start_file.txt) <(sort -k1 end_file.txt) | tr ' ' '\t' | awk -F'\t' '{print $1"\t"$2"\t"$4"\t"$10"\t"$10-$4"\t"$9}'

Explanation:

  • Join all columns in 2 files if first column in 2 files matches
  • Print required columns i.e. start and end of the NM_**

result

NM_032291       chr1    66999638        67000051        413     +
ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by venu6.6k

Thanks! About your question: that's my data.... But I have one more problem - I just gave an example - the number of exons for each gene is different.

ADD REPLYlink written 2.9 years ago by elisheva100

the number of exons for each gene is different.

You mean you have same ids multiple times in first column in a same file?

ADD REPLYlink written 2.9 years ago by venu6.6k

I edited my post, so you can see now.... And no, I have one Id per line and then all the exons coordinates

ADD REPLYlink written 2.9 years ago by elisheva100

then the solution I provided should work fine. Did you get any error?

ADD REPLYlink written 2.9 years ago by venu6.6k

Maybe I didn't understand your answer... what does it mean $10? Isn't it simply column 10?

ADD REPLYlink written 2.9 years ago by elisheva100

Yes, but after combining two files into one. I added explanation.

ADD REPLYlink written 2.9 years ago by venu6.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1592 users visited in the last hour