I am new to RNA-Seq bioinformatic analysis and I am having some trouble reconciling what I see in the visualization with the final Ballgown output.
I have 6 RNA-Seq samples with two replicates each, and each one was run in two different lanes (so I have 4 FASTQ files for each sample).
First I ran FastQC, trimmed the FASTQ files and reran FastQC, to check that all the replicates have more or less the same characteristics.
Then I ran HISAT2 to align the reads separately for each sample. I converted the SAM files to BAM files and decided to merge all BAM files belonging to the same sample into a single BAM file per sample, and then created a TDF file from it for visualization. My understanding is that merging the BAM files sums the read counts, so the total "expression" of each sample gets bigger (but I don't know whether this conclusion is wrong).
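To make my reasoning about the summing concrete, here is a toy sketch in Python. All numbers and file labels are made up (hypothetical, not my real data), and CPM is only used as a simple stand-in for a depth normalization.

```python
# Toy numbers only (hypothetical, not my real data): reads overlapping one
# locus in each of the 4 BAM files of a single sample (2 replicates x 2 lanes).
reads_at_locus = {"rep1_lane1": 3, "rep1_lane2": 2, "rep2_lane1": 4, "rep2_lane2": 3}
# Hypothetical total mapped reads per file.
total_mapped = {name: 5_000_000 for name in reads_at_locus}


def cpm(count, total):
    """Counts per million mapped reads: a simple depth normalization."""
    return count * 1_000_000 / total


# Merging the BAM files pools the reads, so raw counts and depth both add up.
merged_count = sum(reads_at_locus.values())
merged_total = sum(total_mapped.values())

per_file_cpm = {n: cpm(reads_at_locus[n], total_mapped[n]) for n in reads_at_locus}
merged_cpm = cpm(merged_count, merged_total)

print("raw reads per file:", reads_at_locus)
print("raw reads in merged BAM:", merged_count)
print("CPM per file:", per_file_cpm)
print("CPM in merged BAM:", merged_cpm)
```

If this picture is right, the raw signal of the merged track is several times higher than that of any single file, while the depth-normalized value stays in the same range as the individual files.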
Next I went back to the individual BAM files and used StringTie to assemble the transcripts, first with the reference genome. Then I merged the transcripts from all samples, and finally reran StringTie for transcript assembly, this time with the merged file instead of the reference genome.
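For reference, here is a sketch of these three StringTie passes as a small Python helper that only builds the command lines (it does not run anything). The file names are placeholders, the helper name `stringtie_workflow` is hypothetical, and I am assuming the usual StringTie/Ballgown options (`-G` with an annotation GTF, `--merge`, then `-e -B`) rather than quoting my exact commands.

```python
from pathlib import Path


def stringtie_workflow(bams, ref_gtf, merged_gtf="merged.gtf"):
    """Build (not run) the StringTie commands for the three passes.

    bams: sorted BAM files, one per replicate/lane (placeholder names).
    ref_gtf: reference annotation in GTF format (assumed).
    """
    cmds = []
    per_file_gtfs = []

    # Pass 1: assemble transcripts per BAM file, guided by the reference.
    for bam in bams:
        gtf = f"{Path(bam).stem}.gtf"
        per_file_gtfs.append(gtf)
        cmds.append(["stringtie", bam, "-G", ref_gtf, "-o", gtf])

    # Pass 2: merge all per-file assemblies into one set of transcripts.
    cmds.append(["stringtie", "--merge", "-G", ref_gtf, "-o", merged_gtf] + per_file_gtfs)

    # Pass 3: re-estimate abundances against the merged transcripts only (-e)
    # and write the Ballgown input tables (-B) next to each output GTF.
    for bam in bams:
        stem = Path(bam).stem
        out = f"ballgown/{stem}/{stem}.gtf"
        cmds.append(["stringtie", "-e", "-B", "-G", merged_gtf, "-o", out, bam])

    return cmds


if __name__ == "__main__":
    for cmd in stringtie_workflow(["s1_rep1.bam", "s1_rep2.bam"], "ref.gtf"):
        print(" ".join(cmd))
```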
And here, in the StringTie output, I see some differences with respect to the TDF visualization. Some transcripts have no coverage value in StringTie, yet show relatively high signal in the TDF file. I don't know whether this is because the TDF signal is the sum over all the files I merged previously, while in each individual file the signal is not enough for StringTie to report coverage.
Then, when I run Ballgown with all replicates (4 files) for each sample and look at the differential expression results, I see big differences for some genes that are not there when I compare against the TDF files. The results do not match.
So I have some questions about how to handle this, or what the best strategy for the analysis is.
- First, do I need to concatenate the FASTQ files or merge the BAM files of the same sample run in different lanes from the beginning, or is it better to do the analyses separately?
- Is it fine to merge the BAM files into a single BAM file per sample to convert to TDF and visualize in IGV, or is there a better way that overlays the tracks instead of merging or summing the reads? I don't know whether it is correct that the intensities are summed.
- For StringTie, is it a good idea to keep all the files separate, or is it better to have a single file per sample (that is, a summed coverage value for each sample, versus individual values for each replicate that are then averaged)?
- Does StringTie provide a normalization so that I can then compare all 6 samples with each other, or do I need to include another normalization step?
I don't know if I have managed to explain myself, but I would really appreciate it if someone could help me, because I am totally lost...
Also, I don't know if someone has had the same problem; I really didn't know how to search for it on Biostars. So sorry if this is a repeated question.
Thanks in advance,