Question: Is It Possible To Map Transcripts To Genes Given That We Have A .Gtf File?
gravatar for Jordan
6.9 years ago by
Jordan1.1k wrote:


I'm doing RNA-seq analysis and I have a .gtf file produced by cufflinks which includes transcripts with their locations. I wanted to know if there is a way to map these transcript positions to gene id's?


gtf gene cufflinks rna-seq • 3.6k views
ADD COMMENTlink modified 6.9 years ago by Alex Reynolds29k • written 6.9 years ago by Jordan1.1k
gravatar for Alex Reynolds
6.9 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

First, to demonstrate the annotation process I will describe, we need some genes. Let's assume you're working with human data and that you visit the GENCODE site to grab GENCODE v15 records in GTF format. You can use BEDOPS gtf2bed and BEDOPS sort-bed to convert this to a sorted UCSC BED-formatted file containing GENCODE genes:

$ wget
$ gunzip --stdout gencode.v15.annotation.gtf.gz \
    | gtf2bed - \
    | sort-bed - \
    | grep "\tgene\t" \
    > gencode.v15.annotation.gtf.genes.bed

You can convert your Cufflinks-sourced GTF file to BED with the same process: Run it through BEDOPS gtf2bed, sorting it with BEDOPS sort-bed.

Now that you have both datasets in BED format, you can now map the Cuffinks elements against the GENCODE-sourced BED file with BEDOPS bedmap, which returns Cufflinks transcripts that overlap GENCODE genes by one or more bases, e.g.:

$ gtf2bed < cufflinks_transcripts.gtf \
    | sort-bed - \
    | bedmap --echo --echo-map-id - gencode.v15.annotation.gtf.genes.bed \
    > answer.bed

(To make this analysis faster and neater, I merged the Cufflinks-conversion and -mapping steps.)

The file answer.bed contains the annotation results. This file is a sorted BED file that contains lines of the format:

[ transcript1 ] | [ gene1-overlapping-transcript1 ] ; ... ; [ geneA-overlapping-transcript1 ]
[ transcript2 ] | [ gene1-overlapping-transcript2 ] ; ... ; [ geneB-overlapping-transcript2 ]
[ transcriptN ] | [ gene1-overlapping-transcriptN ] ; ... ; [ geneZ-overlapping-transcriptN ]

The pipe character (|) delimits BED-formatted transcript elements from the IDs or names of overlapping genes. The semi-colon (;) delimits names of overlapping genes for a given transcript.

You can replace these delimiters if you need to with the --delim and --multidelim operators, but basically this lets you easily post-process the results with Perl, Python, awk — whatever scripting or other language you prefer.

If you want the entire gene record and not just the gene name, use --echo-map in place of --echo-map-id. Other --echo-map-* operators are available to recover different attributes of the mapped element. For example, if you want a non-redundant list of gene name annotations, use --echo-map-id-uniq to remove duplicates.

The default criteria for overlap between transcript and mapped gene is one or more bases. If you want this to be more stringent, review the overlap criteria section of the BEDOPS bedmap documentation.

This demo assumes that we are working with human data, but the process for annotation is identical regardless of organism: Get the reference and mapping inputs (here, transcripts and genes, resp.) into sorted BED files, and then process them with BEDOPS bedmap.

ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Alex Reynolds29k

Thanks for the help. I get an error in the last step while using the bedmaps tool.

It says: dyld: lazy symbol binding failed: Symbol not found: __ZNSt8__detail15_List_node_base7_M_hookEPS0_.

I'm not sure what that means. Do you have any idea?

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Jordan1.1k

Others have reported seeing this error on OS X ( If you use OS X and you see this error, you would need to compile the source code with MacPorts-sourced gcc 4.7.2 until I have a fix available.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Alex Reynolds29k

I have created a beta v2.1 BEDOPS installer for Mac OS X Intel 10.5-10.8, which attempts to address this specific issue. If you are interested, I would welcome feedback on it. It is available here:

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Alex Reynolds29k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2248 users visited in the last hour