Hi,
I am trying to run a bulk RNA-seq analysis, but my pipeline generates only 400,000-700,000 total counts per sample, whereas our core generates nearly 20,000,000-25,000,000 total counts per sample. We have tried several things to resolve this. We ran the pipeline both with and without adaptor trimming, but QC showed minimal loss of reads in either case. We then tried changing the reference files, with no improvement. My mapping rate (using hisat2) is consistently above 90%, so I do not think mapping is the problem.
Below is the pipeline I use:
- Run QC
- Adaptor trimming
- Run QC
- Mapping (I use hisat2, wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grcm38_tran.tar.gz)
- Running samtools (view, sort, flagstat, view)
- Counting the reads with htseq; we tried two reference files:
wget ftp://ftp.ensembl.org/pub/release-100/gtf/mus_musculus/Mus_musculus.GRCm38.100.gtf.gz and mm10_UCSC_genesymbolNochr.gtf
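To see where reads go at the counting step, it can help to split the htseq-count output into reads assigned to genes and the summary rows that htseq-count appends at the end (their names start with `__`, for example `__no_feature`). The following is a minimal sketch, not part of the pipeline above: the function name `summarize_htseq` and the example numbers are made up for illustration, and it assumes the usual tab-separated output with the count in the last column.

```python
import io

def summarize_htseq(handle):
    """Return (assigned_total, special_rows) from htseq-count output.

    Rows whose name starts with '__' are htseq-count's summary counters;
    every other row is a feature (gene) with its read count.
    """
    assigned = 0
    special = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        name, count = fields[0], int(fields[-1])
        if name.startswith("__"):
            special[name] = count
        else:
            assigned += count
    return assigned, special

# Hypothetical example data, NOT real results from this thread.
example = io.StringIO(
    "geneA\t120\n"
    "geneB\t0\n"
    "__no_feature\t900\n"
    "__ambiguous\t30\n"
    "__alignment_not_unique\t50\n"
)
assigned, special = summarize_htseq(example)
print("assigned to genes:", assigned)
for name, count in special.items():
    print(name, count)
```

For a real sample, open the counts file and pass the file handle instead of the `io.StringIO` object. If most reads end up in one of the `__` rows, that points to the counting step or the annotation rather than to trimming or mapping.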
I think the issue might be the reference file, but I do not know how to fix it. If someone can help me with this, I would really appreciate it.
Thank you
Hashan
You appear to be aligning against the transcripts file and not the genome. If you want to use the transcripts, then consider using a program like salmon instead. If you wish to align against the genome, then use https://cloud.biohpc.swmed.edu/index.php/s/grcm38/download
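Whichever reference you align against, it may be worth confirming that the sequence names in the alignment match the names in the GTF used for counting, since a read can only be assigned to a feature on a sequence with the same name. This is a rough sketch under that assumption, not a diagnosis of this dataset: the helper names and the example lines are hypothetical, and the SAM header lines would come from something like `samtools view -H` on your BAM file.

```python
def gtf_seqnames(lines):
    """Collect the first column (sequence name) of each GTF record."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def sam_header_seqnames(lines):
    """Collect the SN: values from the @SQ lines of a SAM header."""
    names = set()
    for line in lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names

# Hypothetical example lines, NOT taken from the files in this thread.
gtf_example = [
    "#!example comment line\n",
    '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "geneA";\n',
    '2\tsrc\tgene\t300\t400\t.\t-\t.\tgene_id "geneB";\n',
]
sam_example = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:chr2\tLN:2000\n",
]

gtf_names = gtf_seqnames(gtf_example)
sam_names = sam_header_seqnames(sam_example)
shared = gtf_names & sam_names
print("shared sequence names:", sorted(shared))
print("only in GTF:", sorted(gtf_names - sam_names))
print("only in alignment:", sorted(sam_names - gtf_names))
```

In this made-up example no names are shared ("1" versus "chr1"), so no read could be assigned to any feature. If your own files share all of their main chromosome names, this is not the cause and the problem lies elsewhere.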