Question: Transform a GTF file into a data frame in R
cristian240 wrote:

Hi,

I would like to analyse the content of a GTF file. I am quite able with R and dplyr, so I would like to transform my GTF file into a data frame to facilitate my analysis. Does anybody know of any tool to do this?

Thanks. Best, C.

modified 15 days ago by D. Puthier330 • written 3.1 years ago by cristian240

"I am quite able with R"

Have you tried anything? If so, show it. Or, find a tutorial, try some code and come back with any errors and edit the OP with code and said errors.

written 3.1 years ago by st.ph.n2.5k

I have got this working:

Bash:

head celegans.gtf
#!genome-build WBcel235
#!genome-version WBcel235
#!genome-date 2012-12
#!genome-build-accession NCBI:GCA_000002985.3
#!genebuild-last-updated 2014-10
V   WormBase    gene    180 329 .   +   .   gene_id "WBGene00197333"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA";
V   WormBase    transcript  180 329 .   +   .   gene_id "WBGene00197333"; transcript_id "cTel3X.2"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA"; transcript_name "cTel3X.2"; transcript_source "WormBase"; transcript_biotype "ncRNA";

R:

gtf <- rtracklayer::import('celegans.gtf')

This returns a well-formatted GRanges object.

However, in R, I cannot import it with read.table():

gtf2 <- read.table('celegans.gtf', header = FALSE)

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 38 elements
modified 3.1 years ago • written 3.1 years ago by cristian240
cpad011214k wrote:

Try gtf_df <- as.data.frame(gtf) after importing the file with the import() function from rtracklayer.

The code would be:

gtf <- rtracklayer::import('celegans.gtf')
gtf_df <- as.data.frame(gtf)
modified 3.1 years ago • written 3.1 years ago by cpad011214k

Can you turn your comment into a formal answer by clicking on "moderate" and selecting the appropriate option, so that cristian can "accept" it? This will make the information clearer for future readers!

written 3.1 years ago by Charles Plessy2.7k

This is the best answer, thanks!

written 3.1 years ago by cristian240

This worked for me thanks! Really helpful :)

written 2.9 years ago by Gema Sanz70

Please note that a GTF file has a hierarchical structure, as indicated by the 3rd column (the type column), so you might need to subset/filter on this column to extract what you need.
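
As a minimal sketch of that filtering step: the toy data frame below is made up for illustration (the exon row in particular is hypothetical) and only mimics the column layout that as.data.frame() produces from an imported GTF, i.e. seqnames, start, end, width, strand, followed by source, type and the attribute columns. With a real gtf_df, the same subsetting on type should apply.

```r
# Hypothetical toy data frame standing in for as.data.frame(gtf)
gtf_df <- data.frame(
  seqnames = c("V", "V", "V"),
  start    = c(180L, 180L, 180L),
  end      = c(329L, 329L, 329L),
  width    = c(150L, 150L, 150L),
  strand   = c("+", "+", "+"),
  source   = "WormBase",
  type     = c("gene", "transcript", "exon"),
  gene_id  = "WBGene00197333",
  stringsAsFactors = FALSE
)

# How many records of each feature type are there?
type_counts <- table(gtf_df$type)

# Keep only the gene-level records
genes_only <- gtf_df[gtf_df$type == "gene", ]
```

The same idea works for any other feature type, for example "exon" or "transcript".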

written 23 months ago by kristoffer.vittingseerup3.4k
cristian240 wrote:

Hi,

This worked:

gtf2 <- read.table('celegans.gtf', header = FALSE, sep = '\t')

gff <- read.delim('wormbase.gff3', header = FALSE, sep = '\t', skip = 8)

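I forgot to specify the tab delimiter in the read.table() function. I thought that it was the default, but it isn't: by default read.table() splits on any whitespace, including the spaces inside the attributes column.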
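
To illustrate the delimiter issue, here is a small base-R sketch using two shortened records modelled on the head output above. The in-memory lines and variable names are illustrative assumptions, not the real file. Splitting on tabs gives nine fields per record, whereas splitting on any whitespace gives a count that depends on how many attributes each record carries, which is why read.table() complains when sep = '\t' is missing. Setting quote = "" is an extra precaution added here as an assumption, so that quote characters inside the attributes are kept literally.

```r
# Two shortened example records plus one header line (illustrative only)
gtf_lines <- c(
  '#!genome-build WBcel235',
  'V\tWormBase\tgene\t180\t329\t.\t+\t.\tgene_id "WBGene00197333"; gene_name "cTel3X.2";',
  'V\tWormBase\ttranscript\t180\t329\t.\t+\t.\tgene_id "WBGene00197333"; transcript_id "cTel3X.2"; gene_name "cTel3X.2";'
)

# Drop the header line before counting fields
records <- gtf_lines[!startsWith(gtf_lines, "#")]

# Number of fields per record under each splitting rule
fields_by_tab        <- lengths(strsplit(records, "\t", fixed = TRUE))
fields_by_whitespace <- lengths(strsplit(records, "[[:space:]]+"))

# Reading with an explicit tab separator yields the nine GTF columns
gtf2 <- read.table(text = gtf_lines, header = FALSE, sep = "\t",
                   quote = "", comment.char = "#", stringsAsFactors = FALSE)
```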

The answer from cpad0112 is much better though, because with that approach the meta information in the 9th column is split into separate columns, whereas with mine it all stays in one column.

Best, C.

modified 3.1 years ago • written 3.1 years ago by cristian240

No idea about the performance of rtracklayer::import; most likely it's pretty well optimized. But just in case you want to forgo that package, here's how I did it for educational purposes (getting to know the GTF format):

library(data.table)
genes <- fread("gencode.basic.gtf")
setnames(genes, names(genes), c("chr","source","type","start","end","score","strand","phase","attributes") )

# [optional] focus, for example, only on entries of type "gene", 
# which will drastically reduce the file size
genes <- genes[type == "gene"]

# the problem is the attributes column that tends to be a collection
# of the bits of information you're actually interested in
# in order to pull out just the information I want based on the 
# tag name, e.g. "gene_id", I have the following function:
extract_attributes <- function(gtf_attributes, att_of_interest){
  att <- strsplit(gtf_attributes, "; ")
  att <- gsub("\"","",unlist(att))
  if(!is.null(unlist(strsplit(att[grep(att_of_interest, att)], " ")))){
    return( unlist(strsplit(att[grep(att_of_interest, att)], " "))[2])
  }else{
    return(NA)}
}

# this is how to, for example, extract the values for the attributes of interest (here: "gene_id")
genes$gene_id <- unlist(lapply(genes$attributes, extract_attributes, "gene_id"))
modified 3.1 years ago • written 3.1 years ago by Friederike6.3k

Thanks Friederike, great answer, that was very useful. When I tried to extract the last field of the attributes it still had ; appended. To solve that and speed up the function a bit, I adjusted it as follows:

extract_attributes <- function(gtf_attributes, att_of_interest){
  att <- unlist(strsplit(gtf_attributes, " "))
  if(att_of_interest %in% att){
    return(gsub("\"|;","", att[which(att %in% att_of_interest)+1]))
  }else{
    return(NA)}
}

This can be used exactly as above with unlist and lapply.
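
As an alternative to calling the function once per row, a vectorised variant is sketched below. It is not from the posts above: the function name extract_attribute_vec is made up, and it assumes that attribute values are double-quoted (as in the WormBase example at the top) and that the attribute name contains no regular-expression metacharacters. Unquoted values would come back as NA.

```r
extract_attribute_vec <- function(gtf_attributes, att_of_interest) {
  # Match the key at the start of the string or right after a ';',
  # then capture whatever sits between the double quotes
  pattern <- paste0('(^|;[[:space:]]*)', att_of_interest, ' "([^"]*)"')
  matches <- regmatches(gtf_attributes, regexec(pattern, gtf_attributes))
  # Each match is c(full match, prefix, value); no match gives character(0)
  vapply(matches,
         function(m) if (length(m) == 3) m[3] else NA_character_,
         character(1))
}

# Illustrative attribute strings modelled on the example file
attributes_col <- c(
  'gene_id "WBGene00197333"; gene_name "cTel3X.2"; gene_biotype "ncRNA";',
  'gene_id "WBGene00197333"; transcript_id "cTel3X.2";'
)
gene_ids <- extract_attribute_vec(attributes_col, "gene_id")
biotypes <- extract_attribute_vec(attributes_col, "gene_biotype")
```

With the data.table example above, this could be used as genes$gene_id <- extract_attribute_vec(genes$attributes, "gene_id"), without unlist and lapply.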

modified 8 months ago • written 9 months ago by petervangalen70
shoujun.gu370 wrote:

Isn't GTF just a tsv file?

written 3.1 years ago by shoujun.gu370

Hi,

Yes, with some header lines at the top starting with '#', which read.table() should ignore by default. See my comment above for the problem I am having when importing the GTF as a data frame into R.

written 3.1 years ago by cristian240
D. Puthier330 wrote:

Hi, if you want to read the GTF with R, you first need to transform it into a table in which the attribute names are used as the header. You can use gtftk tabulate:

gtftk get_example | gtftk tabulate --key '*' --accept-undef  -o example.tsv

Then you can simply load this tsv file into R using read.table():

d <- read.table("example.tsv", header = TRUE, sep = "\t")
View(d)

Best

Disclaimer: I'm the developer of pygtftk.

written 15 days ago by D. Puthier330