I have produced *.refmap and fpkm_tracking files with the TopHat2 > Cufflinks > Cuffcompare pipeline, starting from RNA-seq fastq files aligned to hg19. Some of the *.refmap files have ~230K rows in total. The number of unique start/stop locus positions equals the number of rows, yet some gene names are duplicated up to ~57K times (counted with the table() function in R). I plotted the distribution of gene name duplicate counts (log transformed) for one of the samples. My question is: is this normal? My goal is to perform regression analysis on the FPKM values of consensus isoforms across samples, and to use PCA and clustering analysis to determine population differentiation. Given that there are only around 22K RefSeq genes, I would like to know how to process these RNA-seq data.
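For anyone who wants to reproduce the duplicate count, here is a minimal sketch in Python of what I did with table() in R. The rows and gene names are made up for illustration, and the (gene name, locus) layout is an assumption, not the actual column layout of a *.refmap file.

```python
import math
from collections import Counter

# Hypothetical rows standing in for a parsed *.refmap file.
# Each row is (gene name, locus); the real file has a different layout.
rows = [
    ("GENE_A", "chr1:100-200"),
    ("GENE_A", "chr1:150-260"),
    ("GENE_A", "chr1:300-420"),
    ("GENE_B", "chr2:500-900"),
]

# Equivalent of table() in R: how many rows carry each gene name.
gene_counts = Counter(gene for gene, _ in rows)

# Number of distinct start/stop loci, to compare with the total row count.
unique_loci = len({locus for _, locus in rows})

# Log-transformed duplicate counts, as used for the distribution plot.
log_counts = {gene: math.log10(n) for gene, n in gene_counts.items()}

print(gene_counts.most_common())
print(unique_loci, len(rows))
```

In this toy input every locus is unique while one gene name repeats, which is the same pattern I see in my files, just at a much smaller scale.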