Question: Transform a GTF file into a data frame in R
cristian wrote:

Hi,

I would like to analyse the content of a GTF file. I am quite able with R and dplyr, so I would like to transform my GTF file into a data frame to facilitate my analysis. Does anybody know of any tool to do this?

Thanks. Best, C.

modified 20 months ago by Biostar • written 2.8 years ago by cristian

"I am quite able with R"

Have you tried anything? If so, show it. Otherwise, find a tutorial, try some code, and edit the original post to include that code and any errors you get.

written 2.8 years ago by st.ph.n

I have got this working:

Bash:

head celegans.gtf
#!genome-build WBcel235
#!genome-version WBcel235
#!genome-date 2012-12
#!genome-build-accession NCBI:GCA_000002985.3
#!genebuild-last-updated 2014-10
V   WormBase    gene    180 329 .   +   .   gene_id "WBGene00197333"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA";
V   WormBase    transcript  180 329 .   +   .   gene_id "WBGene00197333"; transcript_id "cTel3X.2"; gene_name "cTel3X.2"; gene_source "WormBase"; gene_biotype "ncRNA"; transcript_name "cTel3X.2"; transcript_source "WormBase"; transcript_biotype "ncRNA";

R:

gtf <- rtracklayer::import('celegans.gtf')

and it returns a well-formatted GRanges object.

However, in R, I cannot import it with read.table():

gtf2 <- read.table('celegans.gtf', header = FALSE)

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 38 elements
modified 2.8 years ago • written 2.8 years ago by cristian
cpad0112 wrote:

Try as.data.frame(gtf) after importing the file with the import() function from rtracklayer.

The code would be:

gtf <- rtracklayer::import('celegans.gtf')
gtf_df <- as.data.frame(gtf)
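To see what the resulting data frame looks like, here is a minimal sketch on a toy file. The two records are shortened versions of the lines in the head output earlier in the thread, and the temporary file is purely illustrative. The import is skipped when rtracklayer (a Bioconductor package) is not installed.

```r
# Toy two-record GTF written to a temporary file (shortened, illustrative records).
gtf_lines <- c(
  paste("V", "WormBase", "gene", "180", "329", ".", "+", ".",
        "gene_id \"WBGene00197333\"; gene_name \"cTel3X.2\";", sep = "\t"),
  paste("V", "WormBase", "transcript", "180", "329", ".", "+", ".",
        "gene_id \"WBGene00197333\"; transcript_id \"cTel3X.2\";", sep = "\t")
)
gtf_path <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_path)

# Only run the import if rtracklayer is available.
if (requireNamespace("rtracklayer", quietly = TRUE)) {
  gtf <- rtracklayer::import(gtf_path)
  gtf_df <- as.data.frame(gtf)
  # Coordinate columns come first, followed by one column per attribute key.
  print(colnames(gtf_df))
  # Rows that lack an attribute (transcript_id on the gene row) hold NA there.
  print(gtf_df[, c("type", "gene_id", "transcript_id")])
}
```

From here the data frame can be passed to dplyr like any other.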
modified 2.8 years ago • written 2.8 years ago by cpad0112

Can you turn your comment into a formal answer, by clicking on "moderate" and selecting the appropriate option, so that cristian can "accept" it? This will make the information clearer for future readers!

written 2.8 years ago by Charles Plessy

This is the best answer, thanks!

written 2.8 years ago by cristian

This worked for me thanks! Really helpful :)

written 2.6 years ago by Gema Sanz

Please note that a GTF file has a hierarchical structure, as indicated by the 3rd column (the type column): gene rows, transcript rows, exon rows and so on all sit in the same table, so you may need to subset/filter on this column to extract what you need.
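A minimal sketch of that filtering step, using a small hand-made data frame in place of a real import. The column names follow what as.data.frame() on an rtracklayer import produces; the exon row and all values are hypothetical.

```r
# Toy data frame standing in for an imported GTF (hypothetical values).
gtf_df <- data.frame(
  seqnames = c("V", "V", "V"),
  start    = c(180L, 180L, 180L),
  end      = c(329L, 329L, 329L),
  type     = c("gene", "transcript", "exon"),
  gene_id  = "WBGene00197333",
  stringsAsFactors = FALSE
)

# How many rows of each feature type are there?
type_counts <- table(gtf_df$type)

# Keep one row per gene, or only the exon rows.
genes_only <- gtf_df[gtf_df$type == "gene", ]
exons_only <- subset(gtf_df, type == "exon")
```

With dplyr, the same subset can be written as dplyr::filter(gtf_df, type == "gene").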

written 20 months ago by kristoffer.vittingseerup
cristian wrote:

Hi,

This worked:

gtf2 <- read.table('celegans.gtf', header = FALSE, sep = '\t')

gff <- read.delim('wormbase.gff3', header = FALSE, sep = '\t', skip = 8)

I forgot to specify the tab delimiter in read.table(). I thought it was the default, but it isn't: the default, sep = "", splits on any whitespace, including the spaces inside the attributes column.

The answer by cpad0112 is much better, though: with that approach, each attribute in the 9th column gets its own column, whereas with mine all of the attributes stay in a single column.
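The effect of the separator can be checked on a toy file. This is a minimal sketch: the two records are shortened from the head output earlier in the thread, the column names are illustrative, and quote = "" is an added assumption so that the double quotes in the attributes column are kept as literal text.

```r
# Toy GTF: one '#' header line plus two tab-separated records.
gtf_lines <- c(
  "#!genome-build WBcel235",
  paste("V", "WormBase", "gene", "180", "329", ".", "+", ".",
        "gene_id \"WBGene00197333\"; gene_name \"cTel3X.2\";", sep = "\t"),
  paste("V", "WormBase", "transcript", "180", "329", ".", "+", ".",
        "gene_id \"WBGene00197333\"; transcript_id \"cTel3X.2\"; gene_name \"cTel3X.2\";",
        sep = "\t")
)
gtf_path <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_path)

# Default sep = "": every run of whitespace separates fields, so records with
# different numbers of attributes end up with different numbers of fields.
fields_default <- count.fields(gtf_path, sep = "", quote = "", comment.char = "#")

# sep = "\t": every record has the nine GTF columns.
fields_tab <- count.fields(gtf_path, sep = "\t", quote = "", comment.char = "#")

# Read the file with explicit settings; the column names are illustrative.
gtf_df <- read.table(gtf_path, header = FALSE, sep = "\t", quote = "",
                     comment.char = "#", stringsAsFactors = FALSE,
                     col.names = c("chr", "source", "type", "start", "end",
                                   "score", "strand", "phase", "attributes"))
```

The attributes column is still a single string per row here, which is the limitation described above.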

Best, C.

modified 2.8 years ago • written 2.8 years ago by cristian

I have no idea about the performance of rtracklayer::import; most likely it is well optimized. But in case you want to forgo that package, here is how I did it for educational purposes (to get to know the GTF format):

library(data.table)
genes <- fread("gencode.basic.gtf")
setnames(genes, names(genes), c("chr","source","type","start","end","score","strand","phase","attributes") )

# [optional] focus, for example, only on entries of type "gene", 
# which will drastically reduce the size of the table
genes <- genes[type == "gene"]

# the problem is the attributes column that tends to be a collection
# of the bits of information you're actually interested in
# in order to pull out just the information I want based on the 
# tag name, e.g. "gene_id", I have the following function:
extract_attributes <- function(gtf_attributes, att_of_interest){
  att <- strsplit(gtf_attributes, "; ")
  att <- gsub("\"","",unlist(att))
  if(!is.null(unlist(strsplit(att[grep(att_of_interest, att)], " ")))){
    return( unlist(strsplit(att[grep(att_of_interest, att)], " "))[2])
  }else{
    return(NA)}
}

# this is how to, for example, extract the values for the attributes of interest (here: "gene_id")
genes$gene_id <- unlist(lapply(genes$attributes, extract_attributes, "gene_id"))
modified 2.8 years ago • written 2.8 years ago by Friederike

Thanks Friederike, great answer, that was very useful. When I tried to extract the last field of the attributes it still had ; appended. To solve that and speed up the function a bit, I adjusted it as follows:

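extract_attributes <- function(gtf_attributes, att_of_interest){
  # split the attribute string into alternating keys and values
  att <- unlist(strsplit(gtf_attributes, " "))
  # match() gives the first occurrence only, so a repeated key
  # (e.g. "tag" in GENCODE files) still yields a single value
  idx <- match(att_of_interest, att)
  if(!is.na(idx)){
    # strip the quotes and the trailing semicolon from the value
    return(gsub("\"|;","", att[idx + 1]))
  }else{
    return(NA)}
}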

This can be used exactly as above with unlist and lapply.
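If the row-by-row lapply becomes slow on large files, a vectorized variant is possible. The following is only a sketch: extract_attribute_vec is a hypothetical helper name, it assumes the value is wrapped in double quotes (unquoted values such as numeric ones would return NA), and it assumes the attribute name contains no regular-expression metacharacters.

```r
extract_attribute_vec <- function(gtf_attributes, att_of_interest) {
  # The key must sit at the start of the string or right after a semicolon,
  # followed by whitespace and a double-quoted value.
  pattern <- paste0("(^|;\\s*)", att_of_interest, "\\s+\"([^\"]*)\"")
  m <- regmatches(gtf_attributes, regexec(pattern, gtf_attributes))
  # Each match holds the full match plus two capture groups; the third
  # element is the value. Strings without a match give NA.
  vapply(m, function(x) if (length(x) >= 3) x[3] else NA_character_,
         character(1))
}

# Usage on two toy attribute strings
attrs <- c("gene_id \"WBGene00197333\"; gene_name \"cTel3X.2\";",
           "gene_id \"G2\"; transcript_id \"T1\";")
gene_ids   <- extract_attribute_vec(attrs, "gene_id")
gene_names <- extract_attribute_vec(attrs, "gene_name")
```

Because the function takes the whole column at once, it can be assigned directly, e.g. genes$gene_id <- extract_attribute_vec(genes$attributes, "gene_id").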

modified 5 months ago • written 5 months ago by petervangalen
shoujun.gu wrote:

Isn't GTF just a tsv file?

written 2.8 years ago by shoujun.gu

Hi,

Yes, with some header lines at the top starting with '#', that should be ignored by default by read.table(). See my comment above to look at the problem I am having importing the GTF as a data frame into R.

written 2.8 years ago by cristian