Why my Hisat2 result is 0% aligned concordantly and HTseq result is 100% not aligned.
1
0
Entering edit mode
3.8 years ago
guotingfeng ▴ 10

Hi dear all, I have several confused result for my RNAseq data alignment and analysis. I used Hisat2 to map my Bacillus 168 RNAseq data into downloaded genome (fasta) from NCBI. I found all none of alignment is concordantly. When I used this output to counting the number of read through HTseq, the report said I only got not-aligned read. I changed several parameters, but it didn't improve. Could you help me please? I will write down my scripts and process. Thank you very much.

1. I downloaded genome from NCBI and used Hisat2-building to make genome index.

module load HISAT2/2.1.0-intel-2017A-Python-2.7.12
pe_1="genomereference"
genome_index_prefix='/scratch/user/guotingfeng/RNAseq/genomereference/sequence.fasta'

**hisat2-build -p $threads -f -c$genome_index_prefix $pe_1**  Then I got these 8 files: genomereference.1.ht2 genomereference.2.ht2 genomereference.3.ht2 genomereference.4.ht2 genomereference.5.ht2 genomereference.6.ht2 genomereference.7.ht2 genomereference.8.ht2  2. I used my fastq.zg data and genome index to do the mapping through Hisat2 and samtools. module load HISAT2/2.1.0-intel-2017A-Python-2.7.12 module load SAMtools/1.6-GCCcore-6.3.0 pe_1='/scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R1_001.fastq.gz' pe_2='/scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R2_001.fastq.gz' genome_index_prefix='/scratch/user/guotingfeng/RNAseq/genomereference/genomereference' id='1Mggroup' library='bacillus' platform='ILLUMINA' sample='1Mg' output_bam="/scratch/user/guotingfeng/RNAseq/Hisat2/${sample}_pe_aln.sam"

**hisat2 --dta -p $threads --rg-id "$id" --rg "LB:$library" --rg "SM:$sample" --rg "PL:$platform" -x$genome_index_prefix -q -1 $pe_1 -2$pe_2 -S out.sam**

**samtools view -h -bS out.sam | samtools sort -m 2G -@ $threads -T$sample -n -o output_bam** rm out.sam  The output report is wird here:  **25854795 reads; of these:** 25854795 (100.00%) were paired; of these: **25854795 (100.00%) aligned concordantly 0 times,** 0 (0.00%) aligned concordantly exactly 1 time, 0 (0.00%) aligned concordantly >1 times.** ---- 25854795 pairs aligned concordantly 0 times; of these: 0 (0.00%) aligned discordantly 1 time, ---- 25854795 pairs aligned 0 times concordantly or discordantly; of these: 51709590 mates make up the pairs; of these: 51709590 (100.00%) aligned 0 times, 0 (0.00%) aligned exactly 1 time, 0 (0.00%) aligned >1 times. 0.00% overall alignment rate.  3. I still used this output to do the counting by HTseq-count. module load HTSeq/0.9.1-intel-2017A-Python-2.7.12 pe1_1='/scratch/user/guotingfeng/RNAseq/Hisat2/1_pe_aln.sam' pe1_2='/scratch/user/guotingfeng/RNAseq/genomereference/Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.37.gtf' sample='/scratch/user/guotingfeng/RNAseq/HTseq/Hisat2/bwamem_1sam.txt' **htseq-count -f sam -s yes -m union -r name -t exon -i gene_id --additional-attr=gene_namepe1_1 $pe1_2 >$sample**


The output is weird: I didn't get any read in each genes and it said all the reads are not aligned.

**gene:BSU38850   yxkC    0**

**gene:BSU38860   galE    0**

**gene:BSU38870   yxkA    0**

**gene:BSU38880   yxjO    0**

**gene:BSU38890   yxjN    0**

__no_feature            0

__ambiguous             0

__too_low_aQual         0

**__not_aligned           24291367**

__alignment_not_unique          0


So anyone have any ideas? Thank you very much. Have a good day.

RNA-Seq alignment Hisat2 HTseq • 2.9k views
0
Entering edit mode

The HTSeq part makes sense; that it's not reporting anything if 0% of reads aligned. Can you try to rerun HISAT2 with the most basic parameters? e.g. something like this to start:

hisat2 -p $threads -x$genome_index_prefix -q -1 $pe_1 -2$pe_2 -S out.sam

If that works, I would try adding your other parameters one by one (--dta, then read groups, platform etc.) and see what causes the issue.

0
Entering edit mode

I did as what you said to delete all other parameters and rerun the program. The result is totally same.

hisat2 -p $threads -x$genome_index_prefix -q -1 $pe_1 -2$pe_2 -S out.sam

samtools view -h -bS out.sam | samtools sort -m 2G -@ $threads -T$sample -n -o output_bam The output of Sam is here: @HD VN:1.0 SO:coordinate @SQ SN:0 LN:33 @PG ID:hisat2 PN:hisat2 VN:2.1.0 CL:"/sw/eb/software/HISAT2/2.1.0-intel-2017A-Python-2.7.12/bin/hisat2-align-s --wrapper basic-0 -p 20 -x /scratch/user/guotingfeng/RNAseq/genomereference/genomereference -q -S out.sam -1 /tmp/2611.inpipe1 -2 /tmp/2611.inpipe2" K00333:90:H3TMTBBXY:2:1101:20070:1947 77 * 0 0 * * 0 0 CCTGTGTAATCTGCCCGCGGGCGGGTCATCCGGAGATTGTTAAACTGCTCAAGAATATTAAAGAAGCAGGCTGAC YT:Z:UP K00333:90:H3TMTBBXY:2:1101:20070:1947 141 * 0 0 * * 0 0 NGCATTACTGAATAAATCGGTGATGTTAAAGGGACTCTTCACGTAGGAACTTATAAAAGGGTAAGTACAATTTTG YT:Z:UP Report is still same, 25854795 reads; of these: 25854795 (100.00%) were paired; of these: **25854795 (100.00%) aligned concordantly 0 times** 0 (0.00%) aligned concordantly exactly 1 time 0 (0.00%) aligned concordantly >1 times  I found SN:0 in the Sam file, is that normal? Does it mean there is some problem for my genome reference data or index file? How can I check that? Thank you very much. Have a good night. ADD REPLY 1 Entering edit mode 3.8 years ago I see two possible options. I think most likely there's an issue with your hisat2-build command. The -c switch expects the sequences to be provided by the standard input, not from a file. It looks like you're providing a FASTA file, so that would not be needed. Here's my hisat2-build command: "HISAT_PATH"/"hisat2-build" "$HISAT_REFERENCE_FASTA" "$ASSEMBLY_OUT"/"transcriptome"


In your case, you probably want just

hisat2-build -p $threads$genome_index_prefix \$pe_1


Otherwise when you run hisat, I'm not sure but I think the -q switch is not essential.

-q Reads (specified with <m1>, <m2>, <s>) are FASTQ files. FASTQ files usually have extension .fq or .fastq. FASTQ is the default format. See also: --solexa-quals and --int-quals.

The documentation isn't too clear, but I believe by default HISAT2 expects .fastq.gz files (which is what you have), and I'm not sure but -q may imply that your file is not gzipped. If you find out, could you let me know whether it matters or not? :)

I would try remaking the index first and see if that fixes it, then try aligning with/without -q and compare (though definitely I've processed .fasta.gz files without it before).

1
Entering edit mode

Awesome, manuel.belmadani. Thank you very much. I will change them and see.

0
Entering edit mode

Hi manuel.belmadani, I have tried either of changes. -c is the key factors. When I delete it, my reference index size changes a lot and it works now. Thank you for your help. I really appreciate it.

0
Entering edit mode

Good to hear! You should mark the answer as accepted so others who find this post can know it's resolved.