Question

Why my Hisat2 result is 0% aligned concordantly and HTseq result is 100% not aligned.

0

Entering edit mode

5.1 years ago

guotingfeng ▴ 10

Hi dear all, I have several confused result for my RNAseq data alignment and analysis. I used Hisat2 to map my Bacillus 168 RNAseq data into downloaded genome (fasta) from NCBI. I found all none of alignment is concordantly. When I used this output to counting the number of read through HTseq, the report said I only got not-aligned read. I changed several parameters, but it didn't improve. Could you help me please? I will write down my scripts and process. Thank you very much.

1. I downloaded genome from NCBI and used Hisat2-building to make genome index.

module load HISAT2/2.1.0-intel-2017A-Python-2.7.12
pe_1="genomereference"
genome_index_prefix='/scratch/user/guotingfeng/RNAseq/genomereference/sequence.fasta'

**hisat2-build -p $threads -f -c $genome_index_prefix $pe_1**

Then I got these 8 files:

genomereference.1.ht2
genomereference.2.ht2
genomereference.3.ht2
genomereference.4.ht2
genomereference.5.ht2
genomereference.6.ht2
genomereference.7.ht2
genomereference.8.ht2

2. I used my fastq.zg data and genome index to do the mapping through Hisat2 and samtools.

module load HISAT2/2.1.0-intel-2017A-Python-2.7.12
module load SAMtools/1.6-GCCcore-6.3.0

pe_1='/scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R1_001.fastq.gz'
pe_2='/scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R2_001.fastq.gz'
genome_index_prefix='/scratch/user/guotingfeng/RNAseq/genomereference/genomereference'
id='1Mggroup'
library='bacillus'
platform='ILLUMINA'
sample='1Mg'
output_bam="/scratch/user/guotingfeng/RNAseq/Hisat2/${sample}_pe_aln.sam"

**hisat2 --dta -p $threads --rg-id "$id" --rg "LB:$library" --rg "SM:$sample" --rg "PL:$platform" -x $genome_index_prefix -q -1 $pe_1 -2 $pe_2 -S out.sam**

**samtools view -h -bS out.sam | samtools sort -m 2G -@ $threads -T $sample -n -o $output_bam**

rm out.sam

The output report is wird here:

  **25854795 reads; of these:**

     25854795 (100.00%) were paired; of these:

      **25854795 (100.00%) aligned concordantly 0 times,**

    0 (0.00%) aligned concordantly exactly 1 time,

    0 (0.00%) aligned concordantly >1 times.**

    ----
    25854795 pairs aligned concordantly 0 times; of these:

      0 (0.00%) aligned discordantly 1 time,

    ----
    25854795 pairs aligned 0 times concordantly or discordantly; of these:

      51709590 mates make up the pairs; of these:

        51709590 (100.00%) aligned 0 times,

        0 (0.00%) aligned exactly 1 time,

        0 (0.00%) aligned >1 times.

0.00% overall alignment rate.

3. I still used this output to do the counting by HTseq-count.

module load HTSeq/0.9.1-intel-2017A-Python-2.7.12

pe1_1='/scratch/user/guotingfeng/RNAseq/Hisat2/1_pe_aln.sam'
pe1_2='/scratch/user/guotingfeng/RNAseq/genomereference/Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.37.gtf'
sample='/scratch/user/guotingfeng/RNAseq/HTseq/Hisat2/bwamem_1sam.txt'

**htseq-count -f sam -s yes -m union -r name -t exon -i gene_id --additional-attr=gene_name $pe1_1 $pe1_2 > $sample**

The output is weird: I didn't get any read in each genes and it said all the reads are not aligned.

**gene:BSU38850   yxkC    0**

**gene:BSU38860   galE    0**

**gene:BSU38870   yxkA    0**

**gene:BSU38880   yxjO    0**

**gene:BSU38890   yxjN    0**


__no_feature            0

__ambiguous             0

__too_low_aQual         0

**__not_aligned           24291367**

__alignment_not_unique          0

So anyone have any ideas? Thank you very much. Have a good day.

RNA-Seq alignment Hisat2 HTseq • 3.9k views

ADD COMMENT • link updated 5.1 years ago by GenoMax 141k • written 5.1 years ago by guotingfeng ▴ 10

0

Entering edit mode

The HTSeq part makes sense; that it's not reporting anything if 0% of reads aligned. Can you try to rerun HISAT2 with the most basic parameters? e.g. something like this to start:

hisat2 -p $threads -x $genome_index_prefix -q -1 $pe_1 -2 $pe_2 -S out.sam

If that works, I would try adding your other parameters one by one (--dta, then read groups, platform etc.) and see what causes the issue.

ADD REPLY • link 5.1 years ago by manuel.belmadani ★ 1.3k

0

Entering edit mode

Hi manuel.belmadani, Thank you for your reply.

I did as what you said to delete all other parameters and rerun the program. The result is totally same.

hisat2 -p $threads -x $genome_index_prefix -q -1 $pe_1 -2 $pe_2 -S out.sam

samtools view -h -bS out.sam | samtools sort -m 2G -@ $threads -T $sample -n -o $output_bam

The output of Sam is here:

@HD VN:1.0 SO:coordinate @SQ SN:0 LN:33 @PG ID:hisat2 PN:hisat2
VN:2.1.0 CL:"/sw/eb/software/HISAT2/2.1.0-intel-2017A-Python-2.7.12/bin/hisat2-align-s --wrapper basic-0 -p 20 -x /scratch/user/guotingfeng/RNAseq/genomereference/genomereference -q -S out.sam -1 /tmp/2611.inpipe1 -2 /tmp/2611.inpipe2"

K00333:90:H3TMTBBXY:2:1101:20070:1947 77 * 0 0 * * 0 0 CCTGTGTAATCTGCCCGCGGGCGGGTCATCCGGAGATTGTTAAACTGCTCAAGAATATTAAAGAAGCAGGCTGAC YT:Z:UP

K00333:90:H3TMTBBXY:2:1101:20070:1947 141 * 0 0 * * 0 0 NGCATTACTGAATAAATCGGTGATGTTAAAGGGACTCTTCACGTAGGAACTTATAAAAGGGTAAGTACAATTTTG YT:Z:UP

Report is still same,

25854795 reads; of these:

25854795 (100.00%) were paired; of these:

**25854795 (100.00%) aligned concordantly 0 times**

0 (0.00%) aligned concordantly exactly 1 time

0 (0.00%) aligned concordantly >1 times

I found SN:0 in the Sam file, is that normal? Does it mean there is some problem for my genome reference data or index file? How can I check that? Thank you very much. Have a good night.

ADD REPLY • link 5.1 years ago by guotingfeng ▴ 10

score 1 · Answer 1 · 2019-02-26

1

Entering edit mode

5.1 years ago

manuel.belmadani ★ 1.3k

I see two possible options. I think most likely there's an issue with your hisat2-build command. The -c switch expects the sequences to be provided by the standard input, not from a file. It looks like you're providing a FASTA file, so that would not be needed. Here's my hisat2-build command:

"$HISAT_PATH"/"hisat2-build" "$HISAT_REFERENCE_FASTA" "$ASSEMBLY_OUT"/"transcriptome"

In your case, you probably want just

hisat2-build -p $threads $genome_index_prefix $pe_1

Otherwise when you run hisat, I'm not sure but I think the -q switch is not essential.

-q Reads (specified with <m1>, <m2>, <s>) are FASTQ files. FASTQ files usually have extension .fq or .fastq. FASTQ is the default format. See also: --solexa-quals and --int-quals.

The documentation isn't too clear, but I believe by default HISAT2 expects .fastq.gz files (which is what you have), and I'm not sure but -q may imply that your file is not gzipped. If you find out, could you let me know whether it matters or not? :)

I would try remaking the index first and see if that fixes it, then try aligning with/without -q and compare (though definitely I've processed .fasta.gz files without it before).

ADD COMMENT • link 5.1 years ago by manuel.belmadani ★ 1.3k

1

Entering edit mode

Awesome, manuel.belmadani. Thank you very much. I will change them and see.

ADD REPLY • link 5.1 years ago by guotingfeng ▴ 10

0

Entering edit mode

Hi manuel.belmadani, I have tried either of changes. -c is the key factors. When I delete it, my reference index size changes a lot and it works now. Thank you for your help. I really appreciate it.

ADD REPLY • link 5.1 years ago by guotingfeng ▴ 10

0

Entering edit mode

Good to hear! You should mark the answer as accepted so others who find this post can know it's resolved.

ADD REPLY • link 5.1 years ago by manuel.belmadani ★ 1.3k