How To Convert GENCODE GTF Into BED Format?

9

12.7 years ago

biorepine ★ 1.5k

I have tried this script, but it did not work.

	#!/usr/bin/env python3
	'''
	gtf2bed.py converts GTF file to BED file.
	Usage: gtf2bed.py {OPTIONS} [.GTF file]

	History
	Nov.5th 2012:
	1. Allow conversion from general GTF files (instead of only Cufflinks supports).
	2. If multiple identical transcript_id exist, transcript_id will be appended a string like "_DUP#" to separate.
	'''

	import sys
	import re

	if len(sys.argv)<2:
	    print('This script converts .GTF into .BED annotations.\n')
	    print('Usage: gtf2bed {OPTIONS} [.GTF file]\n')
	    print('Options:')
	    print('-c color\tSpecify the color of the track. This is a RGB value represented as "r,g,b". Default 255,0,0 (red)')
	    print('\nNote:')
	    print('1\tOnly "exon" and "transcript" are recognized in the feature field (3rd field).')
	    print('2\tIn the attribute list of .GTF file, the script tries to find "gene_id", "transcript_id" and "FPKM" attribute, and convert them as name and score field in .BED file.')
	    print('Author: Wei Li (li.david.wei AT gmail.com)')
	    sys.exit()

	color='255,0,0'

	for i in range(len(sys.argv)):
	    if sys.argv[i]=='-c':
	        color=sys.argv[i+1]

	allids={}

	def printbedline(estart,eend,field,nline):
	    try:
	        estp=estart[0]-1
	        eedp=eend[-1]
	        # use regular expression to get transcript_id, gene_id and expression level
	        geneid=re.findall(r'gene_id \"([\w\.]+)\"',field[8])
	        transid=re.findall(r'transcript_id \"([\w\.]+)\"',field[8])
	        fpkmval=re.findall(r'FPKM \"([\d\.]+)\"',field[8])
	        if len(geneid)==0:
	            print('Warning: no gene_id field ',file=sys.stderr)
	        else:
	            geneid=geneid[0]
	        if len(transid)==0:
	            print('Warning: no transcript_id field',file=sys.stderr)
	            transid='Trans_'+str(nline)
	        else:
	            transid=transid[0]
	        if transid in allids.keys():
	            transid2=transid+'_DUP'+str(allids[transid])
	            allids[transid]=allids[transid]+1
	            transid=transid2
	        else:
	            allids[transid]=1
	        if len(fpkmval)==0:
	            #print('Warning: no FPKM field',file=sys.stderr)
	            fpkmval='100'
	        else:
	            fpkmval=fpkmval[0]
	        fpkmint=round(float(fpkmval))
	        print(field[0]+'\t'+str(estp)+'\t'+str(eedp)+'\t'+transid+'\t'+str(fpkmint)+'\t'+field[6]+'\t'+str(estp)+'\t'+str(eedp)+'\t'+color+'\t'+str(len(estart))+'\t',end='')
	        seglen=[eend[i]-estart[i]+1 for i in range(len(estart))]
	        segstart=[estart[i]-estart[0] for i in range(len(estart))]
	        strl=str(seglen[0])
	        for i in range(1,len(seglen)):
	            strl+=','+str(seglen[i])
	        strs=str(segstart[0])
	        for i in range(1,len(segstart)):
	            strs+=','+str(segstart[i])
	        print(strl+'\t'+strs)
	    except ValueError:
	        print('Error: non-number fields at line '+str(nline),file=sys.stderr)

	estart=[]
	eend=[]
	# read lines one to one
	nline=0
	prevfield=[]
	prevtransid=''
	for lines in open(sys.argv[-1]):
	    field=lines.strip().split('\t')
	    nline=nline+1
	    if len(field)<9:
	        print('Error: the GTF should has at least 9 fields at line '+str(nline),file=sys.stderr)
	        continue
	    if field[1]!='Cufflinks':
	        pass
	        #print('Warning: the second field is expected to be \'Cufflinks\' at line '+str(nline),file=sys.stderr)
	    if field[2]!='exon' and field[2] !='transcript':
	        #print('Error: the third filed is expected to be \'exon\' or \'transcript\' at line '+str(nline),file=sys.stderr)
	        continue
	    transid=re.findall(r'transcript_id \"([\w\.]+)\"',field[8])
	    if len(transid)>0:
	        transid=transid[0]
	    else:
	        transid=''
	    if field[2]=='transcript' or (prevtransid != '' and transid!='' and transid != prevtransid):
	        #print('prev:'+prevtransid+', current:'+transid)
	        # A new transcript record, write
	        if len(estart)!=0:
	            printbedline(estart,eend,prevfield,nline)
	        estart=[]
	        eend=[]
	    prevfield=field
	    prevtransid=transid
	    if field[2]=='exon':
	        try:
	            est=int(field[3])
	            eed=int(field[4])
	            estart+=[est]
	            eend+=[eed]
	        except ValueError:
	            print('Error: non-number fields at line '+str(nline),file=sys.stderr)
	# the last record
	if len(estart)!=0:
	    printbedline(estart,eend,field,nline)

Does anyone have a working method to convert GTF into BED format?

Thanks

gtf bed • 71k views

ADD COMMENT • link updated 2.6 years ago by leshaker • 0 • written 12.7 years ago by biorepine ★ 1.5k

18

12.4 years ago

Alex Reynolds 36k

BEDOPS includes a gtf2bed conversion utility, which is lossless in that it permits reconversion back to GTF after, for example, applying set and statistical operations with bedops, bedmap, etc.:

$ gtf2bed < foo.gtf > foo.bed

Apply some operations, perhaps to build a subset of elements that overlap some ad-hoc regions-of-interest, e.g.:

$ bedops --element-of 1 foo.bed regions_of_interest.bed > foo_subset.bed

To reconvert, a simple awk statement puts columns back into GTF-ordering, along with the correct, 1-based coordinate index adjustment:

$ awk '{ print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10))) }' foo_subset.bed > foo_subset.gtf
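
For anyone who prefers Python, the awk reordering above can be sketched roughly as follows. This is only an illustration: it assumes the BED rows came from gtf2bed with the column layout the awk command implies (chrom, start, end, ID, score, strand, source, feature, frame, attributes), and `bed_row_to_gtf` is a made-up helper name, not part of BEDOPS.

```python
def bed_row_to_gtf(line):
    """Reorder one tab-separated gtf2bed-style BED row into GTF column order."""
    cols = line.rstrip("\n").split("\t")
    # Assumed layout: columns 1-9 as unpacked below, attributes from column 10 on.
    chrom, start, end, _name, score, strand, source, feature, frame = cols[:9]
    attributes = "\t".join(cols[9:])
    # BED starts are 0-based and GTF starts are 1-based; the end is unchanged.
    return "\t".join([chrom, source, feature, str(int(start) + 1), end,
                      score, strand, frame, attributes])
```

Rows with fewer than nine columns raise a ValueError here, which is probably what you want for data that did not come from gtf2bed.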

ADD COMMENT • link 9.4 years ago by Alex Reynolds 36k

1

gtf2bed from BEDOPS does not work on the GENCODE comprehensive GTF file if there are features without a transcript ID in the attributes:

convert2bed -i gtf < gencode.v27lift37.annotation.gtf > gencode.v27lift37.annotation.bed    
Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

ADD REPLY • link 7.2 years ago by bounlu ▴ 270

0

This is a long-standing problem with research groups putting out malformed GTF for some still-unexplained reason. See A: BEDOPS gtf2bed conversion error with Ensembl GTF for a potential solution.

ADD REPLY • link 7.2 years ago by Alex Reynolds 36k

0

Had this today. The solution that worked for me was: cat input.gtf.gz | gunzip - | grep transcript_id | grep gene_id | convert2bed --do-not-sort --input=gtf - > output.bed
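
The same pre-filter can be written in Python when a shell pipeline is inconvenient. This is a rough sketch under a couple of assumptions: `has_required_ids` and `filter_gtf` are hypothetical helper names, and, unlike the grep version, the check looks only at the attribute column (column 9) rather than the whole line.

```python
def has_required_ids(line):
    """True if a GTF record carries both gene_id and transcript_id attributes."""
    if line.startswith("#"):
        return False  # header/comment lines have no attribute column
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return False
    attrs = cols[8]
    return "gene_id" in attrs and "transcript_id" in attrs


def filter_gtf(lines):
    """Yield only the records that carry both IDs, dropping everything else."""
    for line in lines:
        if has_required_ids(line):
            yield line
```

Note that this drops "gene" records in GENCODE-style files, because those rows have no transcript_id.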

ADD REPLY • link 6.1 years ago by russhh 5.8k

0

The gawk command for converting BED back to GTF is fine. However, the round trip would be more convenient if BEDOPS included a bed2gtf command :)

ADD REPLY • link 9.4 years ago by biocyberman ▴ 870

1

It would require some assumptions about how conversion was done. So long as the BED data were created with gtf2bed, it would be easier to make those assumptions, however.

ADD REPLY • link 9.4 years ago by Alex Reynolds 36k

14

9.0 years ago

endrebak ▴ 980

My solution, based on Ian's answer:

zcat ../../../data/annotations/gencode.v24.annotation.gtf.gz |  awk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,$6,$7}}' | tr -d '";' | head
chr1    11868   14408   ENSG00000223972.5       .       +
chr1    14403   29569   ENSG00000227232.5       .       -
chr1    17368   17435   ENSG00000278267.1       .       -
chr1    29553   31108   ENSG00000243485.3       .       +
chr1    30365   30502   ENSG00000274890.1       .       +
chr1    34553   36080   ENSG00000237613.2       .       -
chr1    52472   53311   ENSG00000268020.3       .       +
chr1    62947   63886   ENSG00000240361.1       .       +
chr1    69090   70007   ENSG00000186092.4       .       +
chr1    89294   133722  ENSG00000238009.6       .       -

Gives you all the genes, with their name, in bed format.

You can use the score field to store other info you are interested in, like the common gene name:

zcat ../../../data/annotations/gencode.v24.annotation.gtf.gz |  awk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,$16,$7}}' | tr -d '";' | head
chr1    11868   14408   ENSG00000223972.5       DDX11L1 +
chr1    14403   29569   ENSG00000227232.5       WASH7P  -
chr1    17368   17435   ENSG00000278267.1       MIR6859-1       -
chr1    29553   31108   ENSG00000243485.3       RP11-34P13.3    +
chr1    30365   30502   ENSG00000274890.1       MIR1302-2       +
chr1    34553   36080   ENSG00000237613.2       FAM138A -
chr1    52472   53311   ENSG00000268020.3       OR4G4P  +
chr1    62947   63886   ENSG00000240361.1       OR4G11P +
chr1    69090   70007   ENSG00000186092.4       OR4F5   +
chr1    89294   133722  ENSG00000238009.6       RP11-34P13.7    -
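
Picking the gene name with $16 relies on gene_name sitting at a fixed position in the attribute list, which may not hold for other releases or other annotation providers. As a hedged alternative, the sketch below looks attributes up by key instead. `gene_bed_line` and its `score_key` parameter are illustrative names of my own, and it falls back to "." when an attribute is absent.

```python
import re

# Matches key "value" pairs; unquoted attributes such as `level 2` are skipped.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_bed_line(gtf_line, score_key="gene_name"):
    """Return a BED6 line for a GTF 'gene' record, or None for other records."""
    if gtf_line.startswith("#"):
        return None
    cols = gtf_line.rstrip("\n").split("\t")
    if len(cols) < 9 or cols[2] != "gene":
        return None
    attrs = dict(ATTR_RE.findall(cols[8]))
    name = attrs.get("gene_id", ".")
    score = attrs.get(score_key, ".")
    # GTF is 1-based inclusive, BED is 0-based half-open: only the start shifts.
    return "\t".join([cols[0], str(int(cols[3]) - 1), cols[4],
                      name, score, cols[6]])
```

Storing a name in the score column follows the trick above; strict BED parsers expect a number there, so treat the output as a convenience table rather than validated BED.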

ADD COMMENT • link 8.6 years ago by endrebak ▴ 980

8

I think there's a mistake in your solution. GTF files are 1-based and inclusive on both sides of the interval; BED is 0-based and non-inclusive on the right. Thus to convert GTF interval directly to BED interval, you need to do ($4-1,$5) - not ($4-1,$5-1).
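
To make the arithmetic concrete, here is a minimal sketch of the two conversions (the function names are just for illustration). The quick check at the end shows why the end coordinate must stay untouched: the feature length is only preserved if the start alone is shifted.

```python
def gtf_to_bed_interval(gtf_start, gtf_end):
    """Convert a 1-based, closed GTF interval to a 0-based, half-open BED interval."""
    return gtf_start - 1, gtf_end


def bed_to_gtf_interval(bed_start, bed_end):
    """Inverse conversion: again, only the start coordinate moves."""
    return bed_start + 1, bed_end


# A GTF feature covering bases 100..200 is 101 bases long.
start, end = gtf_to_bed_interval(100, 200)
# BED length is end - start, GTF length is end - start + 1; they must agree.
assert end - start == 200 - 100 + 1
```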

ADD REPLY • link 8.6 years ago by predeus ★ 2.1k

1

Thanks, changed my answer :)

ADD REPLY • link 8.6 years ago by endrebak ▴ 980

0

good and easy solution

ADD REPLY • link 8.8 years ago by tiago211287 ★ 1.5k

4

12.7 years ago

Ian 6.1k

You could use a simple AWK one-liner (Linux):

$ cat file.gtf | awk '{print $1,$4,$5,"name",$6,$7}'

$1 is the first column of your TAB-delimited GTF file, $2 is the second column, $3 is the third, etc. I am not sure what you would use as a name; I guess you could use $3.

EDIT:

If you don't like the command line then Galaxy has a tool "ConvertFormats > GFF-to-BED". The tool does use $3 as the name.

ADD COMMENT • link updated 9.4 years ago by Alex Reynolds 36k • written 12.7 years ago by Ian 6.1k

11

GTF format uses a 1-based start: http://www.ensembl.org/info/website/upload/gff.html

BED format uses a 0-based start: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

So this solution will get all coordinates wrong by one base.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.8 years ago by dan.halligan ▴ 110

0

But I need the full BED (BED12), which includes exon information. The awk one-liner gives only the transcript start and end, not the exon starts and ends.

ADD REPLY • link 12.7 years ago by biorepine ★ 1.5k

0

If the information is delimited by tabs, you should be able to add to the awk command... I admit I am not overly familiar with GTF.

ADD REPLY • link 12.7 years ago by Ian 6.1k

0

GTF annotates transcript and exon information in separate rows. If you use awk to just print columns, you get the start and end of each exon or transcript separately, not together in one row as in the BED12 format: http://genome.ucsc.edu/FAQ/FAQformat.html
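
A rough Python sketch of what the grouping step would have to look like: collect the exon rows per transcript_id, then emit one BED12 row per transcript. This is not a tested converter, and several things are assumptions on my part: the function name `gtf_exons_to_bed12`, score and itemRgb set to 0, and thickStart/thickEnd simply spanning the whole transcript. Exons are sorted by start first, since some GTF files list minus-strand exons in descending order and BED12 blocks are expected in ascending order.

```python
import re
from collections import OrderedDict

TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def gtf_exons_to_bed12(lines):
    """Collapse GTF exon records into one BED12 line per transcript."""
    # transcript_id -> (chrom, strand, [(start, end), ...]) in file order
    transcripts = OrderedDict()
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        match = TRANSCRIPT_ID_RE.search(cols[8])
        if match is None:
            continue  # an exon without transcript_id cannot be grouped
        entry = transcripts.setdefault(match.group(1), (cols[0], cols[6], []))
        # Shift to 0-based, half-open coordinates.
        entry[2].append((int(cols[3]) - 1, int(cols[4])))

    bed_lines = []
    for tid, (chrom, strand, exons) in transcripts.items():
        exons.sort()  # ascending by start, regardless of strand
        tx_start = exons[0][0]
        tx_end = max(end for _start, end in exons)
        sizes = ",".join(str(end - start) for start, end in exons)
        starts = ",".join(str(start - tx_start) for start, _end in exons)
        bed_lines.append("\t".join([
            chrom, str(tx_start), str(tx_end), tid, "0", strand,
            str(tx_start), str(tx_end), "0", str(len(exons)), sizes, starts,
        ]))
    return bed_lines
```

Something like `print("\n".join(gtf_exons_to_bed12(open("file.gtf"))))` would then write the rows, with file.gtf standing in for your own annotation file.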

ADD REPLY • link 12.7 years ago by biorepine ★ 1.5k

1

7.5 years ago

samuel ▴ 260

I found this handy link for bed files: https://github.com/stevekm/reference-annotations

ADD COMMENT • link 7.5 years ago by samuel ▴ 260

2

It's worth noting that the Makefile there was developed based on the answers here.

ADD REPLY • link 7.5 years ago by steve ★ 3.5k

0

Does not work anymore.

$ make ensembl-hg38
wget ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.chr.gtf.gz
--2021-03-02 12:19:45--  ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.chr.gtf.gz
           => 'Homo_sapiens.GRCh38.91.chr.gtf.gz'
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.197.76
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.197.76|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-91/gtf/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.91.chr.gtf.gz ... 41854359
==> PASV ... done.    ==> RETR Homo_sapiens.GRCh38.91.chr.gtf.gz ... done.
Length: 41854359 (40M) (unauthoritative)

100%[=========================================================================>] 41,854,359  1.25MB/s   in 23s    

2021-03-02 12:20:10 (1.75 MB/s) - 'Homo_sapiens.GRCh38.91.chr.gtf.gz' saved [41854359]

zcat Homo_sapiens.GRCh38.91.chr.gtf.gz | grep -Ev '^#' | grep -w 'gene' | sed -e 's/^/chr/' -e 's/^chrMT/chrM/' > Homo_sapiens.GRCh38.91.chr.gtf
gtf2bed < Homo_sapiens.GRCh38.91.chr.gtf > Homo_sapiens.GRCh38.91.chr.bed
Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)
make: *** [Makefile:75: Homo_sapiens.GRCh38.91.chr.bed] Error 61
rm Homo_sapiens.GRCh38.91.chr.gtf Homo_sapiens.GRCh38.91.chr.gtf.gz

ADD REPLY • link updated 3.5 years ago by GenoMax 152k • written 4.4 years ago by cjgunase ▴ 50

0

Please file an issue on that GitHub repo and ping me there, thanks.

ADD REPLY • link 3.5 years ago by steve ★ 3.5k

1

3.6 years ago

D. Puthier ▴ 350

Hi. Alternatively, use pygtftk (here using the CLI):

gtftk get_example | gtftk convert -f bed -n feature,gene_id,transcript_id

There are additional arguments that may be helpful:

gtftk get_example | gtftk convert -f bed -n feature,gene_id,transcript_id -s '^' -m 'a_test'

Best

Disclosure: I'm the pygtftk developer.

ADD COMMENT • link 3.6 years ago by D. Puthier ▴ 350

1

Please provide a link for gtftk in answer above.

ADD REPLY • link 3.6 years ago by GenoMax 152k

0

I needed to google it myself, but I like the approach! https://github.com/dputhier/pygtftk

ADD REPLY • link 2.6 years ago by leshaker • 0
