For an RNA-Seq data set, I performed transcription start site (TSS) usage analysis using Cufflinks. The workflow was as follows: multiple replicates of each of two genotypes were sequenced on an Illumina platform; the reads were aligned using STAR; transcripts were assembled using cufflinks; the assembled GTF files were merged using cuffmerge (with the reference annotation included); and isoform quantification was performed using cuffquant.
I wanted to analyse differential TSS usage. However, when I look at the differentially expressed TSSs, I see that they include several "novel" TSSs, many of which have a start position that differs by only 1 nucleotide from the reference. Is this just a mapping issue, and what can I do to systematically identify and remove these apparently false-positive novel TSSs?
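
To make the question concrete, the kind of filter I have in mind is sketched below. It is only a rough illustration, not something Cufflinks provides: it assumes the TSS coordinates have already been extracted as (chromosome, strand, position) tuples (transcript start on the + strand, transcript end on the - strand), and the function names and the `tolerance` parameter are my own hypothetical choices.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(reference_tss):
    """Group reference TSS positions by (chrom, strand) and sort them."""
    index = defaultdict(list)
    for chrom, strand, pos in reference_tss:
        index[(chrom, strand)].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def nearest_distance(index, chrom, strand, pos):
    """Distance to the closest reference TSS on the same chrom and strand.

    Returns None when there is no reference TSS on that chrom/strand.
    """
    positions = index.get((chrom, strand))
    if not positions:
        return None
    i = bisect_left(positions, pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = positions[max(i - 1, 0):i + 1]
    return min(abs(pos - p) for p in candidates)


def flag_near_reference(novel_tss, reference_tss, tolerance=5):
    """Return novel TSSs lying within `tolerance` nt of a reference TSS.

    `tolerance` is an arbitrary, hypothetical cutoff; it would need tuning.
    Each result is (chrom, strand, pos, distance_to_nearest_reference).
    """
    index = build_index(reference_tss)
    flagged = []
    for chrom, strand, pos in novel_tss:
        d = nearest_distance(index, chrom, strand, pos)
        if d is not None and d <= tolerance:
            flagged.append((chrom, strand, pos, d))
    return flagged
```

The TSSs flagged this way could then be dropped or collapsed onto the matching reference TSS before testing for differential usage, but I am not sure whether that is a sound approach or which cutoff would be reasonable.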