I'm doing RNA-seq analysis to determine differential expression patterns. After running a custom R script that uses a kernel density plot to determine the cutoff between expressed and unexpressed genes, I get a list with 4 columns:
Ensembl ID       Short gene name  Locus                     10Log FPKM
ENSG00000201699  RNU1-59P         chr1:144534037-144534199  1.587315
ENSG00000215861  WI2-1896O14.1    chr1:144676873-144679969  0.975001
ENSG00000254539  RP4-791M13.3     chr1:144833167-144835867  -0.38691
ENSG00000225241  RP11-640M9.2     chr1:144593362-144621656  0.982576
ENSG00000203843  PFN1P2           chr1:144612265-144612683  -0.55001
ENSG00000235398  LINC00623        chr1:144275917-144311653  -0.44573   <-- dup
ENSG00000235398  LINC00623        chr1:144299757-144341756  0.524043   <-- dup
ENSG00000207106  RNVU1-4          chr1:144311212-144311376  0.949058
ENSG00000231360  AL592284.1       chr1:144339737-144521058  -0.37044
ENSG00000236943  RP11-640M9.1     chr1:144456137-144521970  0.434007
These lists contain about 25k genes on average. As you can see, there's a duplicate in this list (LINC00623). What I want to do is write a script (preferably Perl or R, or maybe something like awk on the command line) to find these duplicates based on the Ensembl ID and remove the line with the lowest FPKM (it is the same gene at roughly the same locus, but apparently Cufflinks decides that it is expressed twice). I can write a script that handles my example, but the problem is that there can be other genes in between the duplicates, so I need to find duplicates across the entire column. I haven't been able to figure this out, so I really hope someone can help me.
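To make the intended logic concrete, here is a minimal sketch in Python (not one of the languages named above, so treat it as an illustration of the idea only). It assumes the list is tab-separated with the Ensembl ID in the first column and the 10Log FPKM in the fourth. The function name `dedupe_keep_highest` is hypothetical. The sample rows are taken from the list above, reordered so that another gene sits between the two duplicates.

```python
import csv
import io


def dedupe_keep_highest(rows, key_col=0, value_col=3):
    """For each Ensembl ID, keep only the row with the highest log10 FPKM.

    Rows keep the position at which their ID first appeared. If two rows
    for the same ID tie, the first one seen is kept (an assumption).
    """
    best = {}  # Ensembl ID -> best row seen so far
    for row in rows:
        key = row[key_col]
        if key not in best or float(row[value_col]) > float(best[key][value_col]):
            best[key] = row
    return list(best.values())


# Assumed input layout: tab-separated, four columns, no header line.
sample = (
    "ENSG00000203843\tPFN1P2\tchr1:144612265-144612683\t-0.55001\n"
    "ENSG00000235398\tLINC00623\tchr1:144275917-144311653\t-0.44573\n"
    "ENSG00000207106\tRNVU1-4\tchr1:144311212-144311376\t0.949058\n"
    "ENSG00000235398\tLINC00623\tchr1:144299757-144341756\t0.524043\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
result = dedupe_keep_highest(rows)

for row in result:
    print("\t".join(row))
```

Because the dictionary is keyed on the Ensembl ID, duplicates are found anywhere in the column, not only on adjacent lines. If an ID occurs more than twice, only the single highest row survives, which goes slightly further than removing just the lowest one.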
Thanks in advance!