I have a very basic question. I would like to have a unique list of promoters.
Let's say we have RefSeq genes downloaded from the Table Browser (~54k).
If we extend the TSS by some number of kb upstream and downstream, how should we make the list unique: by gene name or by position?
e.g.
chr1 6052357 6161253 NM_001199861 KCNAB2 +
chr1 6086072 6161253 NM_003636 KCNAB2 +
chr1 6094347 6161253 NM_001199860 KCNAB2 +
chr1 6105980 6161253 NM_001199862 KCNAB2 +
chr1 6106173 6161253 NM_001199863 KCNAB2 +
If I make them unique by $5 (the official gene name), I end up with ~26k, but by chr, start, end, strand I end up with ~36k.
Making them unique by either start or end alone could also be an option, but sometimes the start is the same and sometimes the end!
I prefer to make them unique by official gene name.
I would like to know your suggestions.
Thanks
Have a look at the list of promoters defined by the FANTOM project (“CAGE peaks”, http://fantom.gsc.riken.jp/5/data/). Each promoter has a unique name that indicates whether it belongs to a known gene and, if so, its rank in expression level among that gene's promoters.
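
If you would rather stay with the RefSeq table, one option is to make the list unique by TSS position instead of by gene name, so that alternative promoters of the same gene are kept while transcripts sharing a TSS are merged. The TSS is the start coordinate for genes on the + strand and the end coordinate for genes on the - strand, which is why sometimes the start matches and sometimes the end. Below is a minimal sketch of that idea, not a definitive implementation. It assumes six whitespace-separated columns as in the example above (chr, start, end, accession, gene name, strand) and 0-based, half-open coordinates; the function name unique_promoters and the flank parameter are made up for illustration.

```python
def unique_promoters(lines, flank=1000):
    """Collapse transcripts that share a TSS into one promoter window each.

    Assumes columns: chrom, start, end, accession, gene, strand,
    with 0-based half-open coordinates (an assumption about the input).
    """
    promoters = {}  # (chrom, tss, strand) -> gene names and accessions
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, start, end, accession, gene, strand = fields[:6]
        start, end = int(start), int(end)
        # TSS is the start on the + strand and the last base on the - strand.
        tss = start if strand == "+" else end - 1
        entry = promoters.setdefault(
            (chrom, tss, strand), {"genes": [], "accessions": []}
        )
        if gene not in entry["genes"]:
            entry["genes"].append(gene)
        entry["accessions"].append(accession)

    rows = []
    for (chrom, tss, strand), entry in promoters.items():
        rows.append((
            chrom,
            max(0, tss - flank),   # clamp at the chromosome start
            tss + flank + 1,       # half-open end, window centred on the TSS
            ",".join(entry["genes"]),
            ",".join(entry["accessions"]),
            strand,
        ))
    return rows


if __name__ == "__main__":
    example = [
        "chr1 6052357 6161253 NM_001199861 KCNAB2 +",
        "chr1 6086072 6161253 NM_003636 KCNAB2 +",
        "chr1 6094347 6161253 NM_001199860 KCNAB2 +",
        "chr1 6105980 6161253 NM_001199862 KCNAB2 +",
        "chr1 6106173 6161253 NM_001199863 KCNAB2 +",
    ]
    for row in unique_promoters(example, flank=1000):
        print("\t".join(str(x) for x in row))
```

With the five KCNAB2 transcripts above, every start differs, so this keeps five promoters for the one gene. Whether that is what you want depends on the analysis: making the list unique by gene name keeps one (arbitrary) promoter per gene, while making it unique by TSS keeps every distinct start site. Windows from nearby TSSs can still overlap, so you may want to merge overlapping intervals afterwards.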