Why my Hisat2 result is 0% aligned concordantly and HTseq result is 100% not aligned.
1
0
Entering edit mode
5.1 years ago
guotingfeng ▴ 10

Hi dear all, I have several confused result for my RNAseq data alignment and analysis. I used Hisat2 to map my Bacillus 168 RNAseq data into downloaded genome (fasta) from NCBI. I found all none of alignment is concordantly. When I used this output to counting the number of read through HTseq, the report said I only got not-aligned read. I changed several parameters, but it didn't improve. Could you help me please? I will write down my scripts and process. Thank you very much.

1. I downloaded genome from NCBI and used Hisat2-building to make genome index.

module load HISAT2/2.1.0-intel-2017A-Python-2.7.12
pe_1="genomereference"
genome_index_prefix='/scratch/user/guotingfeng/RNAseq/genomereference/sequence.fasta'

**hisat2-build -p $threads -f -c $genome_index_prefix $pe_1**

Then I got these 8 files:

genomereference.1.ht2
genomereference.2.ht2
genomereference.3.ht2
genomereference.4.ht2
genomereference.5.ht2
genomereference.6.ht2
genomereference.7.ht2
genomereference.8.ht2

2. I used my fastq.zg data and genome index to do the mapping through Hisat2 and samtools.

module load HISAT2/2.1.0-intel-2017A-Python-2.7.12
module load SAMtools/1.6-GCCcore-6.3.0

pe_1='/scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R1_001.fastq.gz'
pe_2='/scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R2_001.fastq.gz'
genome_index_prefix='/scratch/user/guotingfeng/RNAseq/genomereference/genomereference'
id='1Mggroup'
library='bacillus'
platform='ILLUMINA'
sample='1Mg'
output_bam="/scratch/user/guotingfeng/RNAseq/Hisat2/${sample}_pe_aln.sam"

**hisat2 --dta -p $threads --rg-id "$id" --rg "LB:$library" --rg "SM:$sample" --rg "PL:$platform" -x $genome_index_prefix -q -1 $pe_1 -2 $pe_2 -S out.sam**

**samtools view -h -bS out.sam | samtools sort -m 2G -@ $threads -T $sample -n -o $output_bam**

rm out.sam

The output report is wird here:

  **25854795 reads; of these:**

     25854795 (100.00%) were paired; of these:

      **25854795 (100.00%) aligned concordantly 0 times,**

    0 (0.00%) aligned concordantly exactly 1 time,

    0 (0.00%) aligned concordantly >1 times.**

    ----
    25854795 pairs aligned concordantly 0 times; of these:

      0 (0.00%) aligned discordantly 1 time,

    ----
    25854795 pairs aligned 0 times concordantly or discordantly; of these:

      51709590 mates make up the pairs; of these:

        51709590 (100.00%) aligned 0 times,

        0 (0.00%) aligned exactly 1 time,

        0 (0.00%) aligned >1 times.

0.00% overall alignment rate.

3. I still used this output to do the counting by HTseq-count.

module load HTSeq/0.9.1-intel-2017A-Python-2.7.12

pe1_1='/scratch/user/guotingfeng/RNAseq/Hisat2/1_pe_aln.sam'
pe1_2='/scratch/user/guotingfeng/RNAseq/genomereference/Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.37.gtf'
sample='/scratch/user/guotingfeng/RNAseq/HTseq/Hisat2/bwamem_1sam.txt'

**htseq-count -f sam -s yes -m union -r name -t exon -i gene_id --additional-attr=gene_name $pe1_1 $pe1_2 > $sample**

The output is weird: I didn't get any read in each genes and it said all the reads are not aligned.

**gene:BSU38850   yxkC    0**

**gene:BSU38860   galE    0**

**gene:BSU38870   yxkA    0**

**gene:BSU38880   yxjO    0**

**gene:BSU38890   yxjN    0**


__no_feature            0

__ambiguous             0

__too_low_aQual         0

**__not_aligned           24291367**

__alignment_not_unique          0

So anyone have any ideas? Thank you very much. Have a good day.

RNA-Seq alignment Hisat2 HTseq • 3.9k views
ADD COMMENT
0
Entering edit mode

The HTSeq part makes sense; that it's not reporting anything if 0% of reads aligned. Can you try to rerun HISAT2 with the most basic parameters? e.g. something like this to start:

hisat2 -p $threads -x $genome_index_prefix -q -1 $pe_1 -2 $pe_2 -S out.sam

If that works, I would try adding your other parameters one by one (--dta, then read groups, platform etc.) and see what causes the issue.

ADD REPLY
0
Entering edit mode

Hi manuel.belmadani, Thank you for your reply.

I did as what you said to delete all other parameters and rerun the program. The result is totally same.

hisat2 -p $threads -x $genome_index_prefix -q -1 $pe_1 -2 $pe_2 -S out.sam

samtools view -h -bS out.sam | samtools sort -m 2G -@ $threads -T $sample -n -o $output_bam

The output of Sam is here:

@HD VN:1.0 SO:coordinate @SQ SN:0 LN:33 @PG ID:hisat2 PN:hisat2
VN:2.1.0 CL:"/sw/eb/software/HISAT2/2.1.0-intel-2017A-Python-2.7.12/bin/hisat2-align-s --wrapper basic-0 -p 20 -x /scratch/user/guotingfeng/RNAseq/genomereference/genomereference -q -S out.sam -1 /tmp/2611.inpipe1 -2 /tmp/2611.inpipe2"

K00333:90:H3TMTBBXY:2:1101:20070:1947 77 * 0 0 * * 0 0 CCTGTGTAATCTGCCCGCGGGCGGGTCATCCGGAGATTGTTAAACTGCTCAAGAATATTAAAGAAGCAGGCTGAC YT:Z:UP

K00333:90:H3TMTBBXY:2:1101:20070:1947 141 * 0 0 * * 0 0 NGCATTACTGAATAAATCGGTGATGTTAAAGGGACTCTTCACGTAGGAACTTATAAAAGGGTAAGTACAATTTTG YT:Z:UP

Report is still same,

25854795 reads; of these:

25854795 (100.00%) were paired; of these:

**25854795 (100.00%) aligned concordantly 0 times**

0 (0.00%) aligned concordantly exactly 1 time

0 (0.00%) aligned concordantly >1 times

I found SN:0 in the Sam file, is that normal? Does it mean there is some problem for my genome reference data or index file? How can I check that? Thank you very much. Have a good night.

ADD REPLY
1
Entering edit mode
5.1 years ago

I see two possible options. I think most likely there's an issue with your hisat2-build command. The -c switch expects the sequences to be provided by the standard input, not from a file. It looks like you're providing a FASTA file, so that would not be needed. Here's my hisat2-build command:

"$HISAT_PATH"/"hisat2-build" "$HISAT_REFERENCE_FASTA" "$ASSEMBLY_OUT"/"transcriptome"

In your case, you probably want just

hisat2-build -p $threads $genome_index_prefix $pe_1

Otherwise when you run hisat, I'm not sure but I think the -q switch is not essential.

-q Reads (specified with <m1>, <m2>, <s>) are FASTQ files. FASTQ files usually have extension .fq or .fastq. FASTQ is the default format. See also: --solexa-quals and --int-quals.

The documentation isn't too clear, but I believe by default HISAT2 expects .fastq.gz files (which is what you have), and I'm not sure but -q may imply that your file is not gzipped. If you find out, could you let me know whether it matters or not? :)

I would try remaking the index first and see if that fixes it, then try aligning with/without -q and compare (though definitely I've processed .fasta.gz files without it before).

ADD COMMENT
1
Entering edit mode

Awesome, manuel.belmadani. Thank you very much. I will change them and see.

ADD REPLY
0
Entering edit mode

Hi manuel.belmadani, I have tried either of changes. -c is the key factors. When I delete it, my reference index size changes a lot and it works now. Thank you for your help. I really appreciate it.

ADD REPLY
0
Entering edit mode

Good to hear! You should mark the answer as accepted so others who find this post can know it's resolved.

ADD REPLY

Login before adding your answer.

Traffic: 1583 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6