I assembled my RNA-seq data using Cufflinks. It computed an FPKM value for each assembled gene, and I converted these manually to TPM using:
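The conversion formula itself is not shown here. A commonly used FPKM-to-TPM conversion divides each gene's FPKM by the sum of all FPKM values in the sample and multiplies by one million. The sketch below assumes that formula, which may not be exactly what was used; the function name `fpkm_to_tpm` and the example values are hypothetical.

```python
def fpkm_to_tpm(fpkm):
    """Convert the FPKM values of one sample to TPM.

    Assumes the common conversion TPM_i = FPKM_i / sum(FPKM) * 1e6.
    """
    total = sum(fpkm)
    if total == 0:
        # No expression at all: avoid dividing by zero.
        return [0.0 for _ in fpkm]
    return [value / total * 1e6 for value in fpkm]


# Hypothetical FPKM values for four genes in a single replicate.
example_fpkm = [1500.0, 300.0, 150.0, 50.0]
example_tpm = fpkm_to_tpm(example_fpkm)

for fpkm_value, tpm_value in zip(example_fpkm, example_tpm):
    print(f"FPKM={fpkm_value:8.1f}  TPM={tpm_value:10.1f}")
```

With this conversion the TPM values of a sample always sum to one million, so a single gene with a very large FPKM pushes down the TPM of every other gene.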
I have a gene in all replicates that reaches 600k TPM, meaning that 60% of my transcripts would be coming from this gene. I checked its length: 64 nt in all of my replicates. I also checked the number of reads matching it: 20 in the first replicate.
Is it possible that this transcript is creating a bias in my data? Is it normal that only a few reads can produce such a high TPM? Could mismapped reads be creating a huge bias?
EDIT: Some more information about my data:
RNA-seq, polyA library, unstranded, 3 ovary replicates, 3 testis replicates from my favorite species.
I cleaned my data (removed adapters, low-quality reads, etc.). The FastQC report looks fine.
I just BLASTed my short sequence and got 2 hits, both only in my species. There are no predicted genes from Ensembl or NCBI at this position. In my 3 ovary replicates, this gene has an extremely high TPM (600k, 400k, 400k), and it is absent from my testis replicates. When I check my BAM files in IGV, I see reads mapping to this region in the testis replicates too (though fewer than in ovaries), even though they are not assembled in the Cufflinks output.
I just sorted my TPM table and saw many other genes with abnormally high TPM. I filtered out assembled genes shorter than 200 nt and rebuilt my TPM table. Now the highest gene is at 17k TPM. Is this a good thing to do in this case?
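To make the filtering step concrete, here is a minimal sketch of dropping short assembled genes and rebuilding TPM from the remaining FPKM values. It assumes the common conversion TPM = FPKM / sum(FPKM) * 1e6; the function name `filter_and_renormalize`, the gene IDs, and the FPKM values are hypothetical and only chosen so that the short gene takes about 60% of the total before filtering.

```python
def filter_and_renormalize(genes, min_length=200):
    """Drop genes shorter than min_length nt, then recompute TPM.

    genes: list of (gene_id, length_nt, fpkm) tuples for one sample.
    Returns a dict mapping gene_id to TPM over the kept genes only.
    """
    kept = [(gene_id, fpkm) for gene_id, length, fpkm in genes
            if length >= min_length]
    total = sum(fpkm for _, fpkm in kept)
    if total == 0:
        return {gene_id: 0.0 for gene_id, _ in kept}
    return {gene_id: fpkm / total * 1e6 for gene_id, fpkm in kept}


# Hypothetical sample: one 64 nt gene with an inflated FPKM plus two longer genes.
example_genes = [
    ("short_64nt", 64, 6000.0),
    ("geneA", 1500, 3000.0),
    ("geneB", 2200, 1000.0),
]

before = filter_and_renormalize(example_genes, min_length=0)  # nothing removed
after = filter_and_renormalize(example_genes)                 # genes < 200 nt removed

print("before:", before)
print("after: ", after)
```

Because TPM is a proportion of the sample total, removing one dominant short gene rescales every remaining gene upward by the same factor, so the relative ranking of the kept genes does not change.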