Hello everybody.
I'm a newbie in RNA-seq Analysis, and I have this situation that I don't really understand.
While working with featureCounts for RNA-seq read quantification, I came across an intriguing issue. The rate of successfully assigned alignments turned out to be unexpectedly low, totalling just 15463270 (7.6%). This was rather surprising, especially considering that HISAT2 yielded a high mapping rate of 97.11% overall alignment rate. Upon scrutinizing the summary file.
Any insights or suggestions regarding this situation would be greatly appreciated.
The Human Reference genome (hg38) and annotation file was download from Ensembl and the index was created using hisat2-build
.
This is the output from HISAT2 from one example. Most of the samples have similar:
hisat2 -p 24 -x ~/Documents/Reference_genomes/H.sapiens_Ensembl/idx/genome -1 ./Data/fastq/P115_L3_1.fq.gz -2 ./Data/fastq/P115_L3_2.fq.gz | samtools sort > ./Results/bam/P115_1mes_EKRN230023385-1A_HFC7FDSX7_L3.bam
And Results
37521707 reads; of these:
37521707 (100.00%) were paired; of these:
1699858 (4.53%) aligned concordantly 0 times
12794982 (34.10%) aligned concordantly exactly 1 time
23026867 (61.37%) aligned concordantly >1 times
----
1699858 pairs aligned concordantly 0 times; of these:
62360 (3.67%) aligned discordantly 1 time
----
1637498 pairs aligned 0 times concordantly or discordantly; of these:
3274996 mates make up the pairs; of these:
2169082 (66.23%) aligned 0 times
623965 (19.05%) aligned exactly 1 time
481949 (14.72%) aligned >1 times
97.11% overall alignment rate
[bam_sort_core] merging from 16 files and 1 in-memory blocks...
The code I run for FeatureCounts is the following one:
cat ids.txt | parallel -j 1 echo "./Results/bam/{}.bam" | \
xargs featureCounts -p -a ~/Documents/Reference_genomes/H.sapiens_Ensembl/Homo_sapiens.GRCh38.110.gtf -o ./Results/counts.txt
========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v2.0.6
//========================== featureCounts setting ===========================\ || || || Input files : 164 BAM files
|| ||
|| ||
P115_L1.bam || ||
P115_L2.bam || ||
P115_L3.bam || ||
P115_L4.bam || |||| || Output file : counts.txt
|| || Summary : counts.txt.summary
|| || Paired-end : yes
|| || Count read pairs : no
|| || Annotation : Homo_sapiens.GRCh38.110.gtf (GTF)
|| || Dir for temp files : ./Results
|| ||
|| || Threads : 23
|| || Level : meta-feature level
|| || Multimapping reads : not counted
|| || Multi-overlapping reads : not counted
|| || Min overlapping bases : 1
|| ||
|| \============================================================================////================================= Running ==================================\ || || || Load annotation file Homo_sapiens.GRCh38.110.gtf ...
|| || Features : 1649677
|| || Meta-features : 62754
|| || Chromosomes/contigs : 47
|| ||
|| || Process BAM file P115_L1.bam... || || Paired-end reads are included.
|| || The reads are assigned on the single-end mode.
|| || Total alignments : 204560676
|| || Successfully assigned alignments : 15463270 (7.6%)
|| || Running time : 0.08 minutes
|| ||
|| || Process BAM file P115_L2.bam... || || Paired-end reads are included.
|| || The reads are assigned on the single-end mode.
|| || Total alignments : 204560676
|| || Successfully assigned alignments : 15463270 (7.6%)
|| || Running time : 0.08 minutes
|| ||
|| || Process BAM file P115_L3.bam... || || Paired-end reads are included.
|| || The reads are assigned on the single-end mode.
|| || Total alignments : 217887388
|| || Successfully assigned alignments : 18352057 (8.4%)
|| || Running time : 0.08 minutes
|| ||
|| || Process BAM file P115_L4.bam... || || Paired-end reads are included.
|| || The reads are assigned on the single-end mode.
|| || Total alignments : 217887388
|| || Successfully assigned alignments : 18352057 (8.4%)
|| || Running time : 0.08 minutes
||
Thank you in advance.
I think I found the possible problem. After carry out the fastQC, I have some FAIL in Sequence Duplication Levels and Overrepresented sequences.
I run a blast with the sequences and most of them belongs to globin.
Do you think that is because they didn't carry out depletion hgbRNA depletion?
Thank you for your reply. I how can I identify why I have soo many multimapping reads?
The samples comes from blood from patients under different treatment. Could this be the cause?
That is a characteristic of the libraries. Not much you can do at this point in process. Just having blood from patients with different treatments should not cause this.
I added the fastQC. What do you think of the possible cause I have proposed?
Almost all your samples have the over represented sequence, you can blast the top 10 recurring sequences and see what it could be.
Globin reads shown in your FastQC report are not going to add up to more than 7-8% of the data.
Since globins can account for 30-80% of blood RNA the depletion has certainly worked albeit not 100%.
What I do not really know if they have performed the depletion... These are samples that were sent to NovoGene for sequencing and I have received them now for analysis....
If you data is lnc-RNA which contain mRNA and non-coding RNA?