I assembled my RNA-seq data using Cufflinks. This computed an FPKM for each assembled gene, and I converted it manually to TPM using: [FPKM/sum(FPKM)]*1e6
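To make the conversion above concrete, here is a minimal sketch in Python. The gene names and FPKM values are made up for illustration, and `fpkm_to_tpm` is a hypothetical helper, not something Cufflinks provides.

```python
def fpkm_to_tpm(fpkm):
    """Rescale FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm.values())
    if total == 0:
        raise ValueError("sum of FPKM is zero; cannot rescale")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Hypothetical FPKM values for three genes in one sample.
example_fpkm = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
example_tpm = fpkm_to_tpm(example_fpkm)

for gene, value in example_tpm.items():
    print(gene, round(value, 1))
```

Because the values are only rescaled, the ranking of genes is the same in FPKM and TPM within one sample; only the total changes.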
I have a gene in all replicates that reaches 600k TPM, meaning that 60% of my transcripts are coming from this gene. I checked its length: 64 nt in all of my replicates. I also checked the number of reads mapping to it: 20 in the first replicate.
Is it possible that this transcript is creating a bias in my data? Is it normal that only a few reads can produce such a high TPM? Maybe mis-mapped reads are creating a huge bias?
EDIT: Some more information about my data:
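A rough sketch of why so few reads can give such a high value: TPM (like FPKM) is based on reads divided by transcript length, so a very short feature gets a large per-nucleotide rate from a handful of reads. If the quantifier divides by an effective length that is shorter than the raw length (as far as I understand, Cufflinks applies a fragment-length-based correction of this kind), the effect gets even stronger. In the code below, only the 20 reads and 64 nt come from the question; the other genes, their counts, and the effective length of 1 nt are hypothetical values chosen for illustration.

```python
def tpm_from_counts(counts, lengths):
    """TPM from read counts and (effective) lengths: rate = count / length."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# 20 reads on a 64 nt feature (from the question) plus two made-up genes.
counts = {"short_64nt": 20, "gene1": 2000, "gene2": 500}
raw_lengths = {"short_64nt": 64, "gene1": 2000, "gene2": 1000}

# Hypothetical: the short feature ends up with an effective length of 1 nt.
eff_lengths = dict(raw_lengths, short_64nt=1)

tpm_raw = tpm_from_counts(counts, raw_lengths)
tpm_eff = tpm_from_counts(counts, eff_lengths)

print("raw length:      ", round(tpm_raw["short_64nt"]))
print("effective length:", round(tpm_eff["short_64nt"]))
```

With the tiny effective length, the short feature takes almost all of the million, even though it has far fewer reads than the other two genes.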
RNA-seq, polyA library, unstranded, 3 ovary replicates, 3 testis replicates from my favorite species.
I cleaned my data: removed adapters, bad quality reads, etc. The FastQC report is nice.
I just BLASTed my short sequence and got 2 hits, both in my species. There are no predicted genes from Ensembl or NCBI at this position. In my 3 ovary replicates I have this gene with an extremely high TPM (600k, 400k, 400k), and it is absent from my testis replicates. When I check my BAMs with IGV, I see reads mapping to this region in the testis replicates too (though fewer than in ovaries), even though they are not assembled in the Cufflinks output.
I just sorted my TPM table and saw many other genes with abnormally high TPM. I filtered out assembled genes shorter than 200 nt and rebuilt my TPM table. Now the highest gene is at 17k TPM. Is that a good thing to do in this case?
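The filtering step I describe above could be sketched like this. The 200 nt cutoff is the one I used; the gene names, lengths, and FPKM values are invented, and `filter_and_renormalize` is a hypothetical helper. The point is that TPM has to be recomputed from the FPKM of the genes that are kept, not just subset from the old TPM table.

```python
def filter_and_renormalize(fpkm, lengths, min_length=200):
    """Drop genes shorter than min_length, then rescale the rest to TPM."""
    kept = {g: v for g, v in fpkm.items() if lengths[g] >= min_length}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no expression left after filtering")
    return {g: v / total * 1e6 for g, v in kept.items()}


# Hypothetical values: one very short, very high FPKM feature and two genes.
fpkm = {"short_frag": 6000.0, "gene1": 3000.0, "gene2": 1000.0}
lengths = {"short_frag": 64, "gene1": 1500, "gene2": 2200}

tpm_filtered = filter_and_renormalize(fpkm, lengths)
for gene, value in sorted(tpm_filtered.items()):
    print(gene, round(value))
```

Note that after removing the short features, every remaining gene's TPM goes up, because the same million is shared among fewer genes.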
What criteria were used to define "bad" quality? It is not necessary to have a "nice" FastQC report to do further analysis. One can be overzealous in the "cleanup", which can introduce other biases into the data. Since you are working with reproductive organs, perhaps the gene (fragment) you are seeing really is overexpressed?
I did not ask myself those questions. I ran FastQC on the raw reads and saw that the quality was not optimal and that adapters were present.
Then I used a cleaner (UrQt), and running FastQC on the cleaned reads showed no low-quality bases and no adapters anymore. I assumed that my data was OK for further analyses.
I expect overexpressed genes, but a single gene producing 50% of my mRNA is too much in my opinion. Having checked the actual number of reads, which was pretty low, I came here to understand what's going on.