Hi, I'm totally new to this field, so I'm grateful for any tips. Also, please excuse any mistakes.
I've received human RNA-seq raw data for analysis. The experiment was performed on an Illumina instrument in paired-end mode (hence two files per sample) with NuGEN reagents, and the libraries should be rRNA-depleted and directional.
First question: although I was told the libraries are directional, I ran a strandedness test, and it suggests an unstranded protocol.
The alignment with HISAT2 in paired-end mode with the "second strand" parameter (that's what I read should be used for NuGEN) yields:
16666260 reads; of these:
16666260 (100.00%) were paired; of these:
10321799 (61.93%) aligned concordantly 0 times
6052815 (36.32%) aligned concordantly exactly 1 time
291646 (1.75%) aligned concordantly >1 times
----
10321799 pairs aligned concordantly 0 times; of these:
31463 (0.30%) aligned discordantly 1 time
----
10290336 pairs aligned 0 times concordantly or discordantly; of these:
20580672 mates make up the pairs; of these:
19949833 (96.93%) aligned 0 times
473079 (2.30%) aligned exactly 1 time
157760 (0.77%) aligned >1 times
40.15% overall alignment rate
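For readers who want to see where the 40.15% comes from: the overall rate counts individual mates, not pairs. The short sketch below redoes the arithmetic from the counts in the summary above; the function name is only illustrative.

```python
def overall_alignment_rate(total_pairs: int, unaligned_mates: int) -> float:
    """Percentage of individual mates that aligned at least once."""
    total_mates = 2 * total_pairs
    return 100.0 * (total_mates - unaligned_mates) / total_mates


# Counts copied from the HISAT2 summary above.
total_pairs = 16666260
unaligned_mates = 19949833          # mates that "aligned 0 times"
concordant_pairs = 6052815 + 291646  # exactly 1 time + >1 times

rate = overall_alignment_rate(total_pairs, unaligned_mates)
concordant = 100.0 * concordant_pairs / total_pairs

print(f"overall: {rate:.2f}%  concordant pairs: {concordant:.2f}%")
```

So roughly 60% of all mates did not align anywhere, which is the number that needs explaining.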
With TopHat it yields:
Left reads:
Input : 16666260
Mapped : 6172883 (37.0% of input)
of these: 472887 ( 7.7%) have multiple alignments (97823 have >20)
Right reads:
Input : 16666260
Mapped : 6172883 (37.0% of input)
of these: 472887 ( 7.7%) have multiple alignments (97823 have >20)
37.0% overall read mapping rate.
Aligned pairs: 6172883
of these: 472887 ( 7.7%) have multiple alignments
230048 ( 3.7%) are discordant alignments
35.7% concordant pair alignment rate.
I used Trimmomatic for careful trimming in order to get better results, but the improvement is only 1%. Would you suggest stronger trimming?
Because of the strandedness test, I also tried running HISAT2 as unstranded:
16666260 reads; of these:
16666260 (100.00%) were paired; of these:
10321799 (61.93%) aligned concordantly 0 times
6052815 (36.32%) aligned concordantly exactly 1 time
291646 (1.75%) aligned concordantly >1 times
----
10321799 pairs aligned concordantly 0 times; of these:
31463 (0.30%) aligned discordantly 1 time
----
10290336 pairs aligned 0 times concordantly or discordantly; of these:
20580672 mates make up the pairs; of these:
19949833 (96.93%) aligned 0 times
473079 (2.30%) aligned exactly 1 time
157760 (0.77%) aligned >1 times
40.15% overall alignment rate
Now, do these percentages indicate that the data is poor? Or is there something else I can do?
Also: a month later I got a second batch of data, with the following message:
"...requested two lanes of sequencing. I actually stopped the second lane as after the first batch of sequences came out one smaple was massively over-represented. I remeasured everything (same results), and re-pooled with significantly lower mix of this one sample, it didn't seem to change things too much unfortunately. But this is why there was a delay in the delivery of the second lane."
Do you think these are replicates? Or do they belong together, so that I have to combine the data and 4 files represent one sample?
Many thanks for any help in this mess!
Since you have only ~40% alignment, it is very likely that you still have rRNA in your samples (assuming that you have no rRNA in your reference genome). If you do have rRNA in your reference, then you will want to grab a sample of the reads that do not align and BLAST them at NCBI to make sure you don't have contamination in your data. You can also align your data to the human rDNA repeat to see if you are able to detect rRNA.
If you do see rRNA, then the kit may not have worked as advertised, and you will be within your rights to contact the facility to see what they can do to help.
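As a rough illustration of "grab a sample of reads that do not align", here is a minimal Python sketch that draws a random subset from a FASTQ of unaligned reads and writes it as FASTA, which can be pasted into NCBI BLAST. It assumes you already have the unaligned reads in a FASTQ file (HISAT2 can write them with its `--un-conc` option); the function names and the demo records are made up for the example.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line
        next(it, "")  # quality line (not needed for BLAST)
        yield header.strip()[1:].split()[0], seq


def sample_to_fasta(lines: Iterable[str], n: int = 100, seed: int = 0) -> str:
    """Reservoir-sample up to n reads and return them as FASTA text."""
    rng = random.Random(seed)
    reservoir: List[Tuple[str, str]] = []
    for i, record in enumerate(read_fastq(lines)):
        if i < n:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = record
    return "".join(f">{rid}\n{seq}\n" for rid, seq in reservoir)


# Tiny in-memory demo; the records are invented for illustration.
demo = [
    "@r1 1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2 1\n", "GGCC\n", "+\n", "IIII\n",
    "@r3\n", "TTTT\n", "+\n", "IIII\n",
]
print(sample_to_fasta(demo, n=2, seed=1))
```

On real data you would pass a file handle instead of the list, for example one opened with `gzip.open(path, "rt")` for a gzipped FASTQ. A hundred or so reads is usually enough to see whether the hits are dominated by rRNA or by another organism.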
Which strandedness test did you run?
Are these the same samples run again, or biological replicates that were processed separately (which would be a bad idea)?
That does not sound very good. Your samples may not be of very good quality (I am going to give the facility the benefit of the doubt that they know what they are doing).
You will know that, or you will need to find it out from the facility.
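If the facility confirms that the second batch is the same libraries sequenced again on another lane (and only then), the lanes are usually combined per sample before alignment. A minimal sketch, assuming gzipped FASTQ files and hypothetical file names: gzip files can be joined byte for byte, and the R1 and R2 files must be joined in the same lane order so the mates stay in sync.

```python
import shutil
from typing import Sequence


def concat_lanes(lane_files: Sequence[str], out_path: str) -> None:
    """Concatenate gzipped FASTQ files from several lanes into one file.

    Joining gzip files byte for byte gives a valid multi-member gzip file,
    so nothing has to be decompressed.
    """
    with open(out_path, "wb") as out:
        for path in lane_files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Hypothetical usage: same lane order for both mates.
# concat_lanes(["s1_L001_R1.fastq.gz", "s1_L002_R1.fastq.gz"], "s1_R1.fastq.gz")
# concat_lanes(["s1_L001_R2.fastq.gz", "s1_L002_R2.fastq.gz"], "s1_R2.fastq.gz")
```

If the second batch turns out to be separate biological replicates, do not merge them; keep them as separate samples instead.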