I am trying to remove columns from a GTF file so that I can run a featureCounts analysis with it.
I only want to keep the columns containing the gene ID, chr, start, end and strand, and discard the rest. I have managed to do this in R by loading the file into a data frame, but I am unsure how to convert the data frame back into a GTF file:
install.packages('BiocManager')
library(BiocManager)
install()
# rtracklayer is needed for import() below, so install it alongside Rsubread
BiocManager::install(c("Rsubread", "rtracklayer"))
library(Rsubread)
# opening the gtf file as a dataframe
gtf <- rtracklayer::import('data.gtf')
gtf_df <- as.data.frame(gtf)
# drop the unwanted columns; seqnames holds the chromosome, so it is kept
df <- subset(gtf_df, select = -c(width,source,type,score,phase,
                                 transcript_id,gbkey,gene_biotype,
                                 locus_tag,old_locus_tag,protein_id,transl_table,
                                 exon_number,gene,Ontology_term,go_component,
                                 go_function,go_process,
                                 anticodon,transcript_biotype,partial,pseudo,
                                 note,db_xref,exception,product,inference))
Is there a way to edit the GTF file using Linux? Or is there a way to write my R data frame back out as a GTF file?
featureCounts will understand a properly formatted GTF file; are you not able to use it as is? What you are trying to create is a Simple Annotation Format (SAF) file. It will no longer be in GTF format. featureCounts can use SAF files as well.
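If you do want to go the SAF route from R, something along these lines might work. This is only a minimal sketch: the toy data frame stands in for `as.data.frame(rtracklayer::import('data.gtf'))`, its values are made up, and I am assuming your file has a `gene_id` attribute and `exon` rows in the `type` column. SAF expects five tab-separated columns named GeneID, Chr, Start, End and Strand.

```r
# Toy stand-in for as.data.frame(rtracklayer::import("data.gtf")).
# The values are invented for illustration; real data will have more columns.
gtf_df <- data.frame(
  seqnames = c("chr1", "chr1", "chr2"),
  start    = c(100L, 500L, 40L),
  end      = c(300L, 900L, 220L),
  width    = c(201L, 401L, 181L),
  strand   = c("+", "+", "-"),
  type     = c("exon", "exon", "exon"),
  gene_id  = c("geneA", "geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Assumption: count over exon features, as featureCounts does by default for GTF input.
exons <- gtf_df[gtf_df$type == "exon", ]

# Build the SAF table; seqnames is the chromosome column in the imported data.
saf <- data.frame(
  GeneID = exons$gene_id,
  Chr    = as.character(exons$seqnames),
  Start  = exons$start,
  End    = exons$end,
  Strand = as.character(exons$strand),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row and no quoting.
# The output path is a placeholder; point it wherever you like.
saf_file <- file.path(tempdir(), "annotation.saf")
write.table(saf, saf_file, sep = "\t", quote = FALSE, row.names = FALSE)

# With Rsubread installed, the SAF file can then be passed as an external annotation
# ("sample.bam" is a hypothetical input file):
# counts <- Rsubread::featureCounts(files = "sample.bam",
#                                   annot.ext = saf_file,
#                                   isGTFAnnotationFile = FALSE)
```

If you really need GTF output instead, `rtracklayer::export()` can write the imported GRanges object (not the data frame) back out as GTF, but for featureCounts the original GTF or a SAF file should be enough.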