Question: How to Convert GENCODE GTF into BED Format?
5
5.9 years ago by
biorepine1.4k
Spain
biorepine1.4k wrote:

I tried this script, but it did not work.

[embedded script, ID 1155568]
Do you have a working method to convert GTF into BED format?

Thanks

gtf bed • 24k views
modified 8 months ago by zoegward50 • written 5.9 years ago by biorepine1.4k
14
5.6 years ago by
Alex Reynolds25k
Seattle, WA USA
Alex Reynolds25k wrote:

BEDOPS includes a gtf2bed conversion utility, which is lossless in that it permits reconversion back to GTF after, for example, applying set and statistical operations with bedops, bedmap, etc.:

$ gtf2bed < foo.gtf > foo.bed

Apply some operations, perhaps to build a subset of elements that overlap some ad-hoc regions-of-interest, e.g.:

$ bedops --element-of 1 foo.bed regions_of_interest.bed > foo_subset.bed

To reconvert, a simple awk statement puts columns back into GTF-ordering, along with the correct, 1-based coordinate index adjustment:

$ awk '{ print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10))) }' foo_subset.bed > foo_subset.gtf
modified 2.6 years ago • written 5.6 years ago by Alex Reynolds25k

The awk command to convert BED back to GTF is fine. However, the round trip would be more convenient if BEDOPS included a bed2gtf command :)

written 2.6 years ago by biocyberman740
1

It would require some assumptions about how conversion was done. So long as the BED data were created with gtf2bed, it would be easier to make those assumptions, however.

written 2.6 years ago by Alex Reynolds25k

gtf2bed from BEDOPS does not work on the GENCODE comprehensive GTF file if there are features without a transcript ID in the attributes:

convert2bed -i gtf < gencode.v27lift37.annotation.gtf > gencode.v27lift37.annotation.bed    
Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)
modified 5 months ago • written 5 months ago by bounlu140

This is a long-standing problem with research groups putting out malformed GTF for some still-unexplained reason. See A: BEDOPS gtf2bed conversion error with Ensembl GTF for a potential solution.

modified 5 months ago • written 5 months ago by Alex Reynolds25k
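
One possible workaround for the missing-ID error, sketched here on a hypothetical two-record file (`example.gtf`, `GENE1` and `TX1` are made-up names): append an empty `transcript_id ""` attribute to every record that lacks one before handing the file to gtf2bed. Whether an empty ID is acceptable downstream is an assumption you should check for your own pipeline.

```shell
# Hypothetical input: a gene record without transcript_id and a transcript record with one.
printf 'chr1\tHAVANA\tgene\t1000\t2000\t.\t+\t.\tgene_id "GENE1"; gene_name "G1";\n' > example.gtf
printf 'chr1\tHAVANA\ttranscript\t1000\t2000\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";\n' >> example.gtf

# Pass header lines and records that already carry transcript_id through unchanged;
# append an empty transcript_id to everything else.
awk '/^#/ { print; next }
     /transcript_id/ { print; next }
     { print $0 " transcript_id \"\";" }' example.gtf > example.patched.gtf

# Run the conversion only if BEDOPS is actually installed.
if command -v gtf2bed >/dev/null 2>&1; then
    gtf2bed < example.patched.gtf > example.bed
fi
```

The patch step only touches the attributes column, so coordinates and all other fields are left exactly as they were.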
6
2.2 years ago by
endrebak680
endrebak680 wrote:

My solution, based on Ian's answer:

zcat ../../../data/annotations/gencode.v24.annotation.gtf.gz |  awk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,".",$7}}' | tr -d '";' | head
chr1    11868   14408   ENSG00000223972.5       .       +
chr1    14403   29569   ENSG00000227232.5       .       -
chr1    17368   17435   ENSG00000278267.1       .       -
chr1    29553   31108   ENSG00000243485.3       .       +
chr1    30365   30502   ENSG00000274890.1       .       +
chr1    34553   36080   ENSG00000237613.2       .       -
chr1    52472   53311   ENSG00000268020.3       .       +
chr1    62947   63886   ENSG00000240361.1       .       +
chr1    69090   70007   ENSG00000186092.4       .       +
chr1    89294   133722  ENSG00000238009.6       .       -

This gives you all the genes, with their IDs, in BED format.

You can use the score field to store other info you are interested in, like the common gene name:

zcat ../../../data/annotations/gencode.v24.annotation.gtf.gz |  awk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,$16,$7}}' | tr -d '";' | head
chr1    11868   14408   ENSG00000223972.5       DDX11L1 +
chr1    14403   29569   ENSG00000227232.5       WASH7P  -
chr1    17368   17435   ENSG00000278267.1       MIR6859-1       -
chr1    29553   31108   ENSG00000243485.3       RP11-34P13.3    +
chr1    30365   30502   ENSG00000274890.1       MIR1302-2       +
chr1    34553   36080   ENSG00000237613.2       FAM138A -
chr1    52472   53311   ENSG00000268020.3       OR4G4P  +
chr1    62947   63886   ENSG00000240361.1       OR4G11P +
chr1    69090   70007   ENSG00000186092.4       OR4F5   +
chr1    89294   133722  ENSG00000238009.6       RP11-34P13.7    -
modified 21 months ago • written 2.2 years ago by endrebak680
5

I think there's a mistake in your solution. GTF files are 1-based and inclusive on both sides of the interval; BED is 0-based and non-inclusive on the right. Thus to convert GTF interval directly to BED interval, you need to do ($4-1,$5) - not ($4-1,$5-1).

written 22 months ago by predeus620
1

Thanks, changed my answer :)

written 21 months ago by endrebak680

Good and easy solution.

written 2.0 years ago by tiago2112871.0k
3
5.9 years ago by
Ian5.2k
University of Manchester, UK
Ian5.2k wrote:

You could use a simple AWK one-liner (Linux):

$ cat file.gtf | awk '{print $1,$4,$5,"name",$6,$7}'

$1 is the first column of your TAB-delimited GTF file, $2 is the second column, $3 is the third, etc. I am not sure what you would use as a name; I guess you could use $3.

EDIT:

If you don't like the command line then Galaxy has a tool "ConvertFormats > GFF-to-BED". The tool does use $3 as the name.

modified 2.5 years ago by Alex Reynolds25k • written 5.9 years ago by Ian5.2k
8

gtf format is 1-based start: http://www.ensembl.org/info/website/upload/gff.html

bed format is 0-based start: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

So this solution will get every start coordinate wrong by one base (the end coordinate can stay as it is, because BED ends are exclusive).

written 3.0 years ago by dan.halligan80
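
For illustration, a minimal corrected version of the one-liner, assuming a tab-delimited GTF and using the feature type ($3) as the BED name as suggested above. The file `one_record.gtf` and the `GENE1` record are hypothetical. Only the start is shifted; splitting on tabs keeps the attributes column from being broken up on spaces.

```shell
# Hypothetical single GTF record covering bases 1000..2000 (1-based, inclusive).
printf 'chr1\tHAVANA\tgene\t1000\t2000\t.\t+\t.\tgene_id "GENE1";\n' > one_record.gtf

# Skip header lines, subtract 1 from the start only, and keep the end unchanged.
awk 'BEGIN { FS = OFS = "\t" } !/^#/ { print $1, $4 - 1, $5, $3, $6, $7 }' one_record.gtf > one_record.bed

cat one_record.bed
```

A quick sanity check is that the interval length is preserved: 2000 - 1000 + 1 in GTF terms equals 2000 - 999 in BED terms. Note that the GTF score is often ".", which some tools accept in BED and others do not.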

But I need the full BED (BED12) that includes exon information. The awk one-liner gives only the transcript start and end, not the exon starts and ends.

written 5.9 years ago by biorepine1.4k

If the information is delimited by tabs, you should be able to add to the awk command... I admit I am not overly familiar with GTF.

written 5.9 years ago by Ian5.2k

GTF annotates transcript and exon information in separate rows. If you use awk just to print columns, you get the start and end of each exon or transcript separately, not together in one record as in the BED12 format: http://genome.ucsc.edu/FAQ/FAQformat.html

written 5.9 years ago by biorepine1.4k
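
A rough sketch of how the separate exon rows could be collapsed into one BED12 record per transcript with plain awk and sort. The input file `exons.gtf` and the IDs `GENE1` and `TX1` are hypothetical. It assumes exons of a transcript do not overlap, ignores CDS records (thickStart and thickEnd are simply set to the transcript bounds), and writes 0 for score and itemRgb, so treat it as a starting point rather than a full converter.

```shell
# Hypothetical transcript TX1 with two exons (1-based, inclusive), deliberately out of order.
printf 'chr1\tHAVANA\texon\t1500\t1600\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";\n' > exons.gtf
printf 'chr1\tHAVANA\texon\t1000\t1100\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";\n' >> exons.gtf

# Step 1: reduce exon rows to: transcript_id, chrom, strand, 0-based start, end.
# Step 2: sort by transcript, then numerically by start.
# Step 3: accumulate block sizes and relative starts, one BED12 row per transcript.
awk 'BEGIN { FS = OFS = "\t" }
     $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
         id = substr($9, RSTART + 15, RLENGTH - 16)
         print id, $1, $7, $4 - 1, $5
     }' exons.gtf |
sort -t "$(printf '\t')" -k1,1 -k4,4n |
awk 'BEGIN { FS = OFS = "\t" }
     function emit() {
         if (n > 0) print chrom, tstart, tend, id, 0, strand, tstart, tend, 0, n, sizes, starts
     }
     $1 != id { emit(); id = $1; chrom = $2; strand = $3; tstart = $4; n = 0; sizes = ""; starts = "" }
     { n++; tend = $5; sizes = sizes ($5 - $4) ","; starts = starts ($4 - tstart) "," }
     END { emit() }' > transcripts.bed12

cat transcripts.bed12
```

Sorting before aggregating keeps the second awk pass simple: it only has to notice when the transcript ID changes, so it never needs to hold more than one transcript in memory.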
1
8 months ago by
zoegward50
zoegward50 wrote:

I found this handy link for bed files: https://github.com/stevekm/reference-annotations

written 8 months ago by zoegward50

It's worth noting that the Makefile there was developed based on the answers here.

written 8 months ago by steve1.7k