I am working with RNA-seq data from human cell lines treated with different drugs, sequenced on Illumina (2 x 150 bp chemistry). The data generated is 11-14 Gb.
I am using HISAT2 with default parameters to align the reads against the human genome hg38 build downloaded from the Ensembl database. To my surprise, the alignment percentage is very low (~2%) for all the samples.
I have the following observations regarding the data quality:
- Data quality (phred score) is good i.e. in the range of
- Illumina adapters were already removed using Trimmomatic. No other contaminants were found.
- The FastQC metric 'Per base sequence content' fails for all samples. More specifically, the A, T, G and C percentages deviate considerably over the first 10 base positions.

What could possibly be wrong? Is there any other QC I need to reconsider?

More importantly, this is cell-line data, so should I map to a cell-line-specific reference, or is it okay to use just the standard hg38 human reference?
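
For anyone who wants to look at the 'Per base sequence content' observation outside of FastQC, here is a minimal sketch of how the base fractions over the first positions of the reads could be tabulated. The function names `fastq_sequences` and `per_base_composition` and the toy reads are made up for illustration. It assumes plain 4-line FASTQ records, and it is not how FastQC itself is implemented.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of every record is the sequence
            yield line.strip().upper()


def per_base_composition(
    seqs: Iterable[str], n_positions: int = 10
) -> List[Dict[str, float]]:
    """Return, for each of the first n_positions, the fraction of A/C/G/T/N."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in seqs:
        # Reads shorter than n_positions only contribute to the positions they cover.
        for pos, base in enumerate(seq[:n_positions]):
            counts[pos][base] += 1

    result = []
    for c in counts:
        total = sum(c.values())
        # Characters other than ACGTN count toward the total but are not reported.
        result.append({b: (c[b] / total if total else 0.0) for b in "ACGTN"})
    return result


if __name__ == "__main__":
    # Toy records, purely illustrative (not real data).
    example = ["@r1", "ACGT", "+", "IIII", "@r2", "AGGT", "+", "IIII"]
    table = per_base_composition(fastq_sequences(example), n_positions=4)
    for pos, fractions in enumerate(table, start=1):
        print(pos, fractions)
```

With a real file, the lines could be supplied from an open handle instead of the toy list, for example one returned by `gzip.open(path, "rt")` for a gzipped FASTQ, where `path` is a placeholder for one of the sample files.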