How to define intergenic regions from cufflinks .gtf-file (separate introns from intergenic)
1
0
Entering edit mode
8.6 years ago
jon.brate ▴ 290

It is easy to define intergenic regions if the gff/gtf file contains a "gene" line (see: http://davetang.org/muse/2013/01/18/defining-genomic-regions/). But I am using a gtf-file generated by cufflinks, and it only uses exon lines. How can I separate the intronic from the intergenic regions based on such a file?

cufflinks gtf intergenic • 3.8k views
ADD COMMENT
1
Entering edit mode
8.6 years ago

The general steps are as follows:

  1. Load GTF file into R (see the rtracklayer and GenomicRanges packages). This should result in a GRanges object.
  2. split() the result of step 1 by gene_id. You now have a GRangesList.
  3. lapply() a function to return a data.frame containing the following: chromosome, min(start), max(end).
  4. Convert the result of 3 to a GRanges object.
  5. reduce() the result of step 4.
  6. Run gaps() on the result of step 5. Congrats, you're done!

I have a script somewhere that does all of that, but I've sketched out enough to get you started.

ADD COMMENT
0
Entering edit mode

Thanks for the advice!

ADD REPLY
0
Entering edit mode

I can make lists of the start and end positions of each gene by e.g.: gene.start = lapply(object, start)

But I don't quite understand how to get the chromosome names. I tried lapply(object, seqnames) and seqnames(object) but how to combine with the start and end coordinates?

Edit: I found a sligthly different solution here: https://support.bioconductor.org/p/66003/

gtf = makeTxDbFromGFF("mygtf.gtf", format = "gtf")

gene = exonsBy(gtf, "gene")
intergenic = gaps(unlist(range(gene)))
export.gff(intergenic, "intergenic.gff", format="gff")
ADD REPLY

Login before adding your answer.

Traffic: 2665 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6