Question: Duplicate gene names in count matrix from GTF file that can't be parsed to edgeR
3.6 years ago, Chloe (Queensland University of Technology) wrote:

Hi all,

I'm having an issue with duplicate gene names/IDs, which are preventing me from reading my count matrix into edgeR.

I have a GFF3 file which I converted to GTF using:

gffread my.gff3 -T -o my.gtf

I am able to produce a count matrix using this GTF file through HTSeq. However, when I try to read the file for edgeR (I have tried both on the command line and through the Galaxy website), it cannot be read because of duplicate gene names.

This is the specific error I get in edgeR if that helps (I get the same error putting row.names=1):

>  x <- read.delim("counts.txt",row.names="Contig")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
duplicate 'row.names' are not allowed

Looking at the GTF file, there are multiple lines with the same gene name/ID/transcript ID etc., but this is because they are different features of the same gene (the phase and position on the strand of each exon/part of the CDS).

Therefore it seems I either need to tell HTSeq to differentiate identical gene names based on position/feature, or I need to convert GFF3->GTF in such a way that each gene name is unique and there is only one line that carries all the information it needs.

Does anyone know which is the best way to do this, and how?

(I have a feeling I will need to do this by changing some settings in HTSeq, so if anyone knows how to do this in Galaxy that would be amazing)

Many thanks,



Hi Chloe,

I think there are several possible solutions. The best one is to find a GFF3 or GTF file with unique identifiers, such as Ensembl accessions.

Another (less optimal) option is to read your table into R without the row.names argument, and then assign the row names manually from your "Contig" column afterwards (although I expect problems when you use read.table with character and numeric columns mixed).
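A minimal sketch of that second option, assuming a tab-delimited count file whose first column is named "Contig" as in the question. The toy file, the sample names and the counts are made up for illustration. It reads the table without row names, lists the duplicated IDs, and then shows two possible ways of getting unique row names: summing the duplicated rows with rowsum(), or keeping every row and renaming with make.unique().

```r
# Toy count table written to a temporary file (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Contig\tsampleA\tsampleB",
  "gene1\t10\t20",
  "gene2\t5\t0",
  "gene1\t3\t7"
), tmp)

# Read without row.names so duplicated IDs do not raise an error.
x <- read.delim(tmp, stringsAsFactors = FALSE)

# IDs that occur more than once.
dup_ids <- unique(x$Contig[duplicated(x$Contig)])

# Option A: collapse duplicated IDs by summing their counts.
counts_summed <- rowsum(as.matrix(x[, -1]), group = x$Contig)

# Option B: keep every row but make the names unique (gene1, gene1.1, ...).
counts_unique <- as.matrix(x[, -1])
rownames(counts_unique) <- make.unique(x$Contig)
```

Whether summing is appropriate depends on why the IDs are duplicated in the first place, so it is worth inspecting the rows listed in dup_ids before choosing either option.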

Reply written 3.6 years ago by Benn

3.5 years ago, ytian wrote:

I think you can make a list of the duplicated genes with the code below (I assume your matrix has gene symbols as colnames):

n_occur <- data.frame(table(Data$Columns))

n_occur[n_occur$Freq > 1,]

Then you can use locus information or coding length to verify the genes in that list.
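As a self-contained illustration of the same idea, here is a sketch on a toy data frame. The column name "Contig" is taken from the question; the data and the extra dup_rows step are assumptions added for illustration, not part of the original answer.

```r
# Toy count table; "Contig" holds the gene IDs (hypothetical data).
Data <- data.frame(
  Contig  = c("gene1", "gene2", "gene1", "gene3"),
  sampleA = c(10, 5, 3, 8),
  stringsAsFactors = FALSE
)

# Count how often each ID occurs and keep those seen more than once.
n_occur <- data.frame(table(Data$Contig))
dups <- n_occur[n_occur$Freq > 1, ]

# Pull the full rows for the duplicated IDs to inspect them side by side.
dup_rows <- Data[Data$Contig %in% dups$Var1, ]
```

Looking at dup_rows next to the corresponding GTF lines should show whether the repeated IDs are genuinely the same gene or distinct loci sharing a name.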
