Question

Duplicate gene names in count matrix from GTF file that can't be parsed to edgeR

7.1 years ago

Chloe • 0

Hi all,

I'm having an issue with duplicate gene names/IDs that prevent me from reading my count matrix into edgeR.

I have a GFF3 file which I converted to GTF using:

gffread my.gff3 -T -o my.gtf

I am able to produce a count matrix from this GTF file with HTSeq. However, when I try to read the file for edgeR (I have tried both on the command line and through the Galaxy website), it cannot be read because of duplicate gene names.

This is the specific error I get in edgeR if that helps (I get the same error putting row.names=1):

>  x <- read.delim("counts.txt",row.names="Contig")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
duplicate 'row.names' are not allowed

Looking at the GTF file, there are multiple lines with the same gene name/ID/transcript ID etc., but this is because they are different features of the same gene (the phase and position on the strand of each exon/part of the CDS).

It therefore seems that I either need to tell HTSeq to distinguish identical gene names by position/feature, or convert GFF3 to GTF in such a way that each gene name is unique (i.e. a single line holds all the information for that gene).

Does anyone know which approach is best, and how to do it?

(I have a feeling I will need to do this by changing some settings in HTSeq, so if anyone knows how to do this in Galaxy that would be amazing)

Many thanks,

Chloe

RNA-Seq genome DGE count matrix edgeR • 3.8k views

Updated 7.0 years ago by ytian • written 7.1 years ago by Chloe

Hi Chloe,

I think there are several possible solutions. The best is to find a GFF3 or GTF file with unique identifiers, such as Ensembl accessions.

Another (less optimal) option is to read your table into R without the row.names argument and then assign the row names manually from your "Contig" column (although I expect problems if you use read.table with mixed character and numeric data).
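A minimal sketch of this second option in R, assuming a tab-delimited file whose first column is named "Contig" as in the question. The toy file and the sample names (sampleA, sampleB) are made up for illustration. Because row names must be unique, the duplicates have to be resolved before they can be assigned; summing the duplicate rows is only one possible way to do that and is an assumption here, not something suggested in the thread.

```r
# Toy stand-in for counts.txt (hypothetical values); note that geneB appears twice
tmp <- tempfile(fileext = ".txt")
writeLines(c("Contig\tsampleA\tsampleB",
             "geneA\t10\t12",
             "geneB\t3\t4",
             "geneB\t5\t1",
             "geneC\t0\t7"), tmp)

# Read without row.names so duplicate identifiers do not raise an error
x <- read.delim(tmp, stringsAsFactors = FALSE)

# List the identifiers that occur more than once
dup_ids <- unique(x$Contig[duplicated(x$Contig)])
print(dup_ids)

# One option (an assumption, check that it suits your data):
# sum the counts of rows sharing an identifier; row names become the identifiers
counts <- rowsum(x[, -1], group = x$Contig)

# Alternative that keeps every row separate:
# rownames(x) <- make.unique(x$Contig)
print(counts)
```

The resulting `counts` table has unique row names, so it should be usable as input for downstream steps. It is still worth looking at why the identifiers are duplicated before deciding how to merge them.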

Reply • 7.1 years ago by Benn • 8.3k

score 0 · Answer 1 · 2017-05-02

7.0 years ago

ytian • 0

I think you can list the duplicated genes with the code below (I assume the gene symbols are stored in a column of your matrix, here Data$Columns):

# Count how often each gene symbol occurs
n_occur <- data.frame(table(Data$Columns))

# Show only the symbols that occur more than once
n_occur[n_occur$Freq > 1, ]

You can then use locus information or coding length to verify the genes in that list.
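As a rough illustration of how the duplicated entries could be inspected, the sketch below applies the same table() approach to a toy data frame and then pulls out the full rows for every duplicated identifier. The data, the column name "Contig" (borrowed from the question) and the sample names are assumptions for the example only.

```r
# Toy count table (hypothetical values); "Contig" holds the gene identifiers
Data <- data.frame(Contig  = c("geneA", "geneB", "geneB", "geneC"),
                   sampleA = c(10, 3, 5, 0),
                   sampleB = c(12, 4, 1, 7))

# Frequency of each identifier, using table() as in the snippet above
n_occur <- data.frame(table(Data$Contig))
dups <- n_occur[n_occur$Freq > 1, ]

# Pull every row that belongs to a duplicated identifier for side-by-side inspection
dup_rows <- Data[Data$Contig %in% dups$Var1, ]
print(dup_rows)
```

Seeing the duplicated rows next to each other makes it easier to judge whether they are the same gene counted twice or distinct loci that happen to share a name.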
