Question: Parsing GENCODE GTF to simpler BED files: am I reinventing the wheel?
Charles Plessy (Japan) wrote, 2.4 years ago:

Dear Biostars,

I am using a shell script (https://gist.github.com/charles-plessy/9dbc8bc98fb773bf71b6) to transform a GENCODE GTF file into smaller BED files, which I use to annotate transcriptome (CAGE) data with information such as promoter/intron/exon classification or gene name.

Just as a reminder, a GENCODE GTF file looks like this:

$ zcat gencode.v23.annotation.gtf.gz | cut -c -80 | head
##description: evidence-based annotation of the human genome (GRCh38), version 2
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2015-07-15
chr1    HAVANA    gene    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; gene_type "trans
chr1    HAVANA    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript
chr1    HAVANA    exon    11869    12227    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E
chr1    HAVANA    exon    12613    12721    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E
chr1    HAVANA    exon    13221    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "E

The BED files I produce look like this:

$ head gencode.v23.annotation.bed
chr1    11368    12369    promoter    0    +
chr1    11858    11879    boundary    0    +
chr1    11868    12227    exon    0    +
chr1    11868    14409    gene    0    +
chr1    11868    14409    transcribed_unprocessed_pseudogene_DDX11L1    0    +
chr1    11999    12020    boundary    0    +
chr1    12009    12057    exon    0    +
chr1    12046    12067    boundary    0    +
chr1    12168    12189    boundary    0    +
chr1    12178    12227    exon    0    +
$ head gencode.v23.annotation.genes.bed
chr1    11868    14409    DDX11L1    0    +
chr1    14403    29570    WASH7P    0    -
chr1    17368    17436    MIR6859-1    0    -
chr1    29553    31109    RP11-34P13.3    0    +
chr1    30365    30503    MIR1302-2    0    +
chr1    34553    36081    FAM138A    0    -
chr1    52472    53312    OR4G4P    0    +
chr1    62947    63887    OR4G11P    0    +
chr1    69090    70008    OR4F5    0    +
chr1    89294    133723    RP11-34P13.7    0    -
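
For illustration only, the gene-level conversion can be sketched in a few lines of awk. This is not the code from the gist above; it is a minimal sketch that assumes each GTF gene record carries a gene_name attribute, and the function name gtf_genes_to_bed is made up for this example. It shows the one step that is easy to get wrong: GTF coordinates are 1-based and closed, BED coordinates are 0-based and half-open, so the start is decremented and the end is left alone.

```shell
# Hypothetical helper: turn GTF gene records into BED6 lines named by gene_name.
gtf_genes_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       $3 == "gene" {
         name = "."                       # fallback when no gene_name is found
         if (match($9, /gene_name "[^"]+"/))
           # skip the 11 characters of: gene_name "  and drop the closing quote
           name = substr($9, RSTART + 11, RLENGTH - 12)
         # GTF start is 1-based, BED start is 0-based; the end is unchanged
         print $1, $4 - 1, $5, name, 0, $7
       }' "$@"
}

# Demo on a made-up, minimal gene line (real GENCODE lines carry more attributes).
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; gene_name "DDX11L1";\n' |
  gtf_genes_to_bed
```

On a real file this would presumably be run as `zcat gencode.v23.annotation.gtf.gz | gtf_genes_to_bed`, possibly followed by `sort -k1,1 -k2,2n` if sorted output is needed.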

Instead of maintaining a script myself, I would love to use a widely used, well-tested, well-maintained tool. Do you have anything to recommend?

Thanks!

Alex Reynolds (Seattle, WA, USA) wrote, 2.4 years ago:

You could use the GTF option in BEDOPS convert2bed, or the equivalent wrapper script gtf2bed:

$ convert2bed -i gtf -o bed < foo.gtf > foo.bed
$ gtf2bed < foo.gtf > foo.bed

If you need columns in a certain order, or only some subset of BED columns, you can pipe the result to common Unix tools like cut and awk:

$ gtf2bed < foo.gtf | cut -f1-6 > foo.bed6
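
As a hedged sketch of the awk route: the filter below keeps one feature type and trims the result to six columns. It assumes that gtf2bed writes the GTF feature type to column 8 and the original attributes to column 10, which is worth checking against your BEDOPS version. The function name bed_keep_feature and the two demo lines are made up for this example.

```shell
# Assumed gtf2bed column layout (verify with your BEDOPS version):
# 1 chrom, 2 start, 3 end, 4 id, 5 score, 6 strand,
# 7 source, 8 feature, 9 frame, 10 attributes
bed_keep_feature() {
  # $1: feature type to keep, for example "gene" or "exon"
  awk -v feature="$1" 'BEGIN { FS = OFS = "\t" }
       $8 == feature { print $1, $2, $3, $4, $5, $6 }'
}

# Demo on two made-up lines in the assumed layout; only the gene line is kept.
printf 'chr1\t11868\t14409\tENSG00000223972.5\t.\t+\tHAVANA\tgene\t.\tgene_id "ENSG00000223972.5";\nchr1\t11868\t12227\tENSG00000223972.5\t.\t+\tHAVANA\texon\t.\tgene_id "ENSG00000223972.5";\n' |
  bed_keep_feature gene
```

With BEDOPS installed, the equivalent pipeline would be `gtf2bed < foo.gtf | bed_keep_feature gene > foo.genes.bed6`.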