Question: Gencode Gtf To Bed12 Or Psl
4
gravatar for Geparada
7.4 years ago by
Geparada1.4k
Cambridge
Geparada1.4k wrote:

Hi!

I has been using gene models available in UCSC Table Browser for a long time. Those are in BED12 or PSL and I made many scripts to process this kind of format, but now I want to work with the last released version of Gencode (v9) which are only in GTF format.

Do you know a script to convert GTF in BED12 or PSL?

Thanks for you help!

gtf annotation ucsc bed • 10.0k views
ADD COMMENTlink modified 3.4 years ago by Czh3190 • written 7.4 years ago by Geparada1.4k
1
 

#!/usr/PATH/TO/python3

'''

gtf2bed.py converts GTF file to BED file.

Usage: gtf2bed.py {OPTIONS} [.GTF file]

History

Nov.5th 2012:

1. Allow conversion from general GTF files (instead of only Cufflinks supports).

2. If multiple identical transcript_id exist, transcript_id will be appended a string like "_DUP#" to separate.

'''



import sys;

import re;



if len(sys.argv)<2:

print('This script converts .GTF into .BED annotations.\n');

print('Usage: gtf2bed {OPTIONS} [.GTF file]\n');

print('Options:');

print('-c color\tSpecify the color of the track. This is a RGB value represented as "r,g,b". Default 255,0,0 (red)');

print('\nNote:');

print('1\tOnly "exon" and "transcript" are recognized in the feature field (3rd field).');

print('2\tIn the attribute list of .GTF file, the script tries to find "gene_id", "transcript_id" and "FPKM" attribute, and convert them as name and score field in .BED file.');

print('Author: Wei Li (li.david.wei AT gmail.com)');

sys.exit();



color='255,0,0'





for i in range(len(sys.argv)):

if sys.argv[i]=='-c':

color=sys.argv[i+1];





allids={};



def printbedline(estart,eend,field,nline):

try:

estp=estart[0]-1;

eedp=eend[-1];

# use regular expression to get transcript_id, gene_id and expression level

geneid=re.findall(r'gene_id \"([\w\.]+)\"',field[8])

transid=re.findall(r'transcript_id \"([\w\.]+)\"',field[8])

fpkmval=re.findall(r'FPKM \"([\d\.]+)\"',field[8])

if len(geneid)==0:

print('Warning: no gene_id field ',file=sys.stderr);

else:

geneid=geneid[0];

if len(transid)==0:

print('Warning: no transcript_id field',file=sys.stderr);

transid='Trans_'+str(nline);

else:

transid=transid[0];

if transid in allids.keys():

transid2=transid+'_DUP'+str(allids[transid]);

allids[transid]=allids[transid]+1;

transid=transid2;

else:

allids[transid]=1;

if len(fpkmval)==0:

#print('Warning: no FPKM field',file=sys.stderr);

fpkmval='100';

else:

fpkmval=fpkmval[0];

fpkmint=round(float(fpkmval));

print(field[0]+'\t'+str(estp)+'\t'+str(eedp)+'\t'+transid+'\t'+str(fpkmint)+'\t'+field[6]+'\t'+str(estp)+'\t'+str(eedp)+'\t'+color+'\t'+str(len(estart))+'\t',end='');

seglen=[eend[i]-estart[i]+1 for i in range(len(estart))];

segstart=[estart[i]-estart[0] for i in range(len(estart))];

strl=str(seglen[0]);

for i in range(1,len(seglen)):

strl+=','+str(seglen[i]);

strs=str(segstart[0]);

for i in range(1,len(segstart)):

strs+=','+str(segstart[i]);

print(strl+'\t'+strs);

except ValueError:

print('Error: non-number fields at line '+str(nline),file=sys.stderr);













estart=[];

eend=[];

# read lines one to one

nline=0;

prevfield=[];

prevtransid='';

for lines in open(sys.argv[-1]):

field=lines.strip().split('\t');

nline=nline+1;

if len(field)<9:

print('Error: the GTF should has at least 9 fields at line '+str(nline),file=sys.stderr);

continue;

if field[1]!='Cufflinks':

pass;

#print('Warning: the second field is expected to be \'Cufflinks\' at line '+str(nline),file=sys.stderr);

if field[2]!='exon' and field[2] !='transcript':

#print('Error: the third filed is expected to be \'exon\' or \'transcript\' at line '+str(nline),file=sys.stderr);

continue;

transid=re.findall(r'transcript_id \"([\w\.]+)\"',field[8]);

if len(transid)>0:

transid=transid[0];

else:

transid='';

if field[2]=='transcript' or (prevtransid != '' and transid!='' and transid != prevtransid):

#print('prev:'+prevtransid+', current:'+transid);

# A new transcript record, write

if len(estart)!=0:

printbedline(estart,eend,prevfield,nline);

estart=[];

eend=[];

prevfield=field;

prevtransid=transid;

if field[2]=='exon':

try:

est=int(field[3]);

eed=int(field[4]);

estart+=[est];

eend+=[eed];

except ValueError:

print('Error: non-number fields at line '+str(nline),file=sys.stderr);

# the last record

if len(estart)!=0:

printbedline(estart,eend,field,nline);
ADD REPLYlink modified 3.8 years ago by Gjain5.3k • written 3.8 years ago by xiachongjing10
3
gravatar for sjneph
6.2 years ago by
sjneph600
sjneph600 wrote:

Take a look at gtf2bed. As with all conversion utilities available in BEDOPS, there is no loss of information in the conversions to a BED-like format supported by that tool suite.

ADD COMMENTlink written 6.2 years ago by sjneph600
2
gravatar for Casey Bergman
7.4 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

The UCSC source tree has utilities for gtfToGenePred (convert a GTF file to a genePred) and genePredToPsl (Program to create fake psl alignments from genePred records). The latter can output BED with the -bedFormat option. These (like all source tree utilities) can be piped together do the conversion you are looking for. The source tree and instructions for compilation are here: http://hgdownload.cse.ucsc.edu/admin/jksrc.zip

ADD COMMENTlink written 7.4 years ago by Casey Bergman18k
2
gravatar for Czh3
3.4 years ago by
Czh3190
China
Czh3190 wrote:

This script may help you.

https://github.com/Czh3/NGSTools/blob/master/script/gtf2bed12.sh

ADD COMMENTlink written 3.4 years ago by Czh3190
1

Hi,

just a note that the script is a really nice idea based on UCSC tools. However, currently it's not converting to the correct format.

Your Column 11: exon start site relative to chromosome start

Bed12 Column 11: exon length

Your Column 12: exon end site relative to chromosome start

Bed12 Column 12: exon start site relative to feature start

Best wishes, Jakub

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by jakub30

Thank you for pointing this out! Furthermore, there is also an UCSC tool genePredToBed: GitHub link. Thus you could use the output of gtfToGenePred as input of genePredToBed. Something like

gtfToGenePred input.gtf output.tmp; genePredToBed output.tmp output.bed; rm output.tmp

ADD REPLYlink written 24 months ago by e.rempel760
1
gravatar for Gjain
7.4 years ago by
Gjain5.3k
Göttingen, Germany
Gjain5.3k wrote:

The gencode V9 GTF format is similar to their V3 format.

this link should help you with everything what you are looking for.

http://www.sanger.ac.uk/resources/databases/encode/gencodeformat.html

hope this helps.

ADD COMMENTlink written 7.4 years ago by Gjain5.3k

Thanks for the link, now I understand the fields meaning, but I asked for some script to convert this format to BED12. Well, I think I have to write it myself...

ADD REPLYlink written 7.4 years ago by Geparada1.4k

yup. all you need to do is parse the fields and then just rearrange it for the format you want. let me know if you need help in that.

ADD REPLYlink written 7.4 years ago by Gjain5.3k
1
gravatar for Statler
7.2 years ago by
Statler10
Statler10 wrote:

Actually, use the ucsc table browser in the Preview version of the genome browser, you select the genes/gene prediction table, then the appropriate dataset.

Select to download as bed format and you are done.

ADD COMMENTlink written 7.2 years ago by Statler10

Since not much time ago I'm using the Preview version of the genome browser... I didn't know about it when I did this question!

ADD REPLYlink written 7.2 years ago by Geparada1.4k
0
gravatar for Anna
7.4 years ago by
Anna40
Anna40 wrote:

hi, I am having the same problem to convert from gff to bed12 - since the latter is a requirement for coverageBed -split. any ideas? or should I write my own thing?

thanks!

Anna

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by Anna40
2

Sure, take a look at gff2bed, and the bedmap program from the same tool suite. No reason to convert to bed12 - the gff2bed script will fully convert all information for you to a BED-like format that bedmap can use to give you the coverage information you want.

ADD REPLYlink written 6.2 years ago by sjneph600

@anna, you should delete this answer and re-post as a new question.

ADD REPLYlink written 7.4 years ago by Casey Bergman18k

Could you please tell in more detail what problems you are facing? I have a script i wrote which does the same.

ADD REPLYlink written 7.4 years ago by Gjain5.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 835 users visited in the last hour