Question

STAR alignment and HTseq-count

16 months ago

otieno43 ▴ 30

Hey guys,

I have RNA-seq data with five biological replicates per condition, all treated the same way. I performed quality checks and some trimming on the raw datasets, then mapped the reads with the STAR aligner. I then counted the reads mapped per gene using HTseq-count. Looking at the per-gene read counts, I realized that for some genes the counts differ greatly between replicates within a condition. Most of these genes come up as significantly differentially expressed when expression is compared between conditions. Example:

          Wyc_1      WYC_5      YRC_2
A1           23        100          1
A2           48         58          0
A3           30       2500          5
A4           70       3780          3
A5           12         70          2
B1           14        120          3
B2           45         76          7
B3           28         69          0
B4        12500         43      23352
B5        18340         64      17453
        UP in B  DOWN in A    UP in B

A and B are conditions, and the numbers next to them are the replicates. The values under each gene are the counts generated by HTseq-count. Using RSEM to count reads and estimate expression gave similar results. These genes are differentially expressed after DESeq2 analysis. I am not confident in calling them differentially expressed, yet they seem to be important candidates for functional analysis.

How is it possible for a single gene to show such large differences within the same condition? Is it the samples or the sequencing process/depth? The sequencing depth per biological replicate seems similar. Can these genes be trusted as differentially expressed? How can one correct for this?

Thanks for your help/insight.

and alignment HTseq-count STAR • 1.1k views

I think we would be used to seeing the data like this:

       A1 A2   A3   A4 A5  B1 B2 B3    B4    B5
Wyc_1  23 48   30   70 12  14 45 28 12500 18340
WYC_5 100 58 2500 3780 70 120 76 69    43    64
YRC_2   1  0    5    3  2   3  7  0 23352 17453

If the A samples and the B samples are replicates within each condition, you can see that some of your samples differ by 3 to 4 orders of magnitude for individual genes within a condition. This is a bit much. There's likely more going on here than sequencing depth. Are the sequencing depths comparable between your samples? Are the reads distributed similarly across all the genes when comparing the replicates? Can you find a central core of "housekeeping" genes with similar expression values, or do they also vary like these genes do? If you do pairs plots between all samples (i.e. plot log(counts+1) of all samples against each other), do the scatter plots between replicates within a condition look closer than between conditions? This could also be an error of some kind. Are you sure the samples were demultiplexed properly and are not mixed up? Is there a control gene (e.g. a knockout) or something known to be induced by the condition that you can check to make sure the samples weren't mixed up?
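To put a number on "orders of magnitude within a condition", here is a minimal Python sketch that takes the three example genes from the question, applies the log(counts+1) transform mentioned above, and reports how many orders of magnitude the replicates of each condition span. The helper name `log_spread` and the cut-off of one order of magnitude are assumptions made for illustration only; they do not come from HTseq-count, DESeq2, or any other tool.

```python
import math

# Counts copied from the example table in the question (replicates 1-5 per condition).
counts = {
    "Wyc_1": {"A": [23, 48, 30, 70, 12], "B": [14, 45, 28, 12500, 18340]},
    "WYC_5": {"A": [100, 58, 2500, 3780, 70], "B": [120, 76, 69, 43, 64]},
    "YRC_2": {"A": [1, 0, 5, 3, 2], "B": [3, 7, 0, 23352, 17453]},
}

def log_spread(values):
    """Range of log10(count + 1) across replicates (orders of magnitude spanned)."""
    logs = [math.log10(v + 1) for v in values]
    return max(logs) - min(logs)

# ASSUMPTION: an arbitrary illustrative cut-off, not a published threshold.
THRESHOLD = 1.0

# Spread per gene and condition.
spreads = {
    gene: {cond: log_spread(vals) for cond, vals in conds.items()}
    for gene, conds in counts.items()
}

# Conditions in which the replicates of a gene disagree by more than the cut-off.
flagged = {
    gene: [cond for cond, s in per_cond.items() if s > THRESHOLD]
    for gene, per_cond in spreads.items()
}

for gene, per_cond in spreads.items():
    summary = ", ".join(f"{cond}: {s:.2f}" for cond, s in per_cond.items())
    print(f"{gene}  spread in log10 units -> {summary}  flagged: {flagged[gene]}")
```

Run over a full count matrix, the same per-gene spread could be used to list the genes whose replicates disagree most, which are the ones worth inspecting by eye before trusting a differential expression call.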

ADD REPLY • link 16 months ago by seidel 11k

Thanks Seidel for your response.

Evaluating read counts per gene, I can say that 99% of genes have reads distributed evenly across replicates. These are just hypothetical examples of some genes that passed my cut-off of at least 10 reads in over 50% of the replicates per condition. They came up as significantly differentially expressed between conditions and seem to have functions worth evaluating.

I ran a Pearson correlation to see how replicates within conditions behave; they correlate very strongly (>0.9).
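As a rough illustration of how a sample-to-sample correlation this high can coexist with a handful of wildly discordant genes, here is a small Python sketch. The background of 5000 genes with identical counts in both samples is an invented idealization, not data from this thread; only the last three values per sample are the B3 and B4 counts from the example table. The `pearson` helper is a plain textbook implementation written for this sketch.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# ASSUMPTION: hypothetical background genes whose counts (a permutation of
# 10..5009) are identical in both samples. Real replicates are noisier.
background = [10 + (i * 37) % 5000 for i in range(5000)]

# Counts for Wyc_1, WYC_5 and YRC_2 in replicates B3 and B4 (from the example table).
sample_b3 = background + [28, 69, 0]
sample_b4 = background + [12500, 43, 23352]

# Same transform as suggested for the pairs plots: log(counts + 1).
log_b3 = [math.log10(c + 1) for c in sample_b3]
log_b4 = [math.log10(c + 1) for c in sample_b4]

r = pearson(log_b3, log_b4)
print(f"Pearson r on log10(count + 1): {r:.3f}")
```

Under these assumptions the correlation stays above 0.9 even though two of the three appended genes differ by several orders of magnitude, so a strong overall correlation on its own does not rule out a few outlier genes.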

The housekeeping genes exhibit similar expression between conditions and replicates.

The samples are not mixed up. The As are the control group and Bs are the experimental group.

I think it is just some spurious mapping error for a few genes? I don't know.

ADD REPLY • link 16 months ago by otieno43 ▴ 30