Hi,
I'm somewhat new to bioinformatics, so please bear with me.
I'm running tophat2 on some fastq files using hg38 as the reference. This is the command that I ran:
tophat2 --b2-sensitive -G /home/fadhil/hg38_ref/lib/hg38.refGene.gtf -p 16 -o /home/data/mcf10_tophat_output /home/fadhil/Bowtie2Index/genome ./SRR925720_mcf10a.fastq
It takes about 8 hours, but in the end the mapping rate is almost 0%: it maps only 3997 out of 31898079 reads. I'm not sure why this is happening, although tophat repeatedly emitted the following warning as it was running:
Warning: Encountered reference sequence with only gaps
Ignoring any potential errors with the fastq files themselves, what could possibly be the problem here?
You should know that the old 'Tuxedo' pipeline of TopHat and Cufflinks is no longer the recommended toolset for RNA-seq analysis. The software is deprecated or in low maintenance and should be replaced by HISAT2, StringTie and Ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are also other alternatives, including alignment with STAR or BBMap, or pseudo-alignment using Salmon.
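For what it's worth, here is a rough sketch of what a HISAT2 run roughly equivalent to the tophat2 command above might look like. The index prefix and output path are hypothetical (I'm assuming you would first build a HISAT2 index from the same hg38 FASTA with hisat2-build); only the read file and thread count are taken from your post. The script just prints the command so you can check the paths before running it.

```shell
# Sketch only: the index prefix and output path below are assumptions, not
# paths from the original post. Adjust them to your own setup.
threads=16
index=/home/fadhil/hisat2_index/genome               # hypothetical HISAT2 index prefix
reads=./SRR925720_mcf10a.fastq                       # single-end reads from the question
out_sam=/home/data/mcf10_hisat2_output/aligned.sam   # hypothetical output file

# -p   number of threads
# --dta  report alignments tailored for transcript assemblers such as StringTie
# -x   index prefix, -U unpaired reads, -S SAM output file
cmd="hisat2 -p $threads --dta -x $index -U $reads -S $out_sam"

# Dry run: print the command for review instead of executing it.
echo "$cmd"
```

Once the printed command looks right, you can paste it into the terminal to run it. HISAT2 prints an alignment summary at the end, so you can see the overall mapping rate straight away.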