Question

hisat2 alignment showing 0% aligned but sam file is 6GB in size.

0

Entering edit mode

2.7 years ago

Priyanka ▴ 10

Hello, I am trying to align my fastq files with the reference genome assembly (MRSA ATCC 33591). The sam file formed is some 5-6 GB in size whereas hisat2 output is showing 0.0% alignment rate.

I am confused why the alignment to reference is zero.

Any suggestions on what I am doing wrong?

hisat2 alignment • 2.9k views

ADD COMMENT • link updated 2.7 years ago by Istvan Albert 100k • written 2.7 years ago by Priyanka ▴ 10

0

Entering edit mode

Hard to tell at this point. Please add details, what is the data, what is the reference, what is the read length, what is the command you used? The sam file contains all reads (no matter if unmapped or not), so file size is not an indicator for anything.

ADD REPLY • link 2.7 years ago by ATpoint 82k

0

Entering edit mode

The data is RNA-seq fastq files paired end reads 1_R1_001.fastq.gz and 1_R2_001.fastq.gz with read 150 b read length.

hisat2 indexing cmd:

hisat2-build Staphylococcus_aureus_subsp_aureus_ATCC_33591.fasta MRSA_index

the hisat2 cmd that I have been using:

hisat2 -p 8 -x ../../MRSA_33591_ref/MRSA_index -1 1_R1_001.fastq.gz -2 2_R2_001.fastq.gz -S S1.sam

the output of this cmd is as follows

> 9557190 reads; of these:   9557190 (100.00%) were paired; of these:
>     9372275 (98.07%) aligned concordantly 0 times
>     59908 (0.63%) aligned concordantly exactly 1 time
>     125007 (1.31%) aligned concordantly >1 times
>     ----
>     9372275 pairs aligned concordantly 0 times; of these:
>       19 (0.00%) aligned discordantly 1 time
>     ----
>     9372256 pairs aligned 0 times concordantly or discordantly; of these:
>       18744512 mates make up the pairs; of these:
>         18662286 (99.56%) aligned 0 times
>         33977 (0.18%) aligned exactly 1 time
>         48249 (0.26%) aligned >1 times
> 2.37% overall alignment rate

ADD REPLY • link 2.7 years ago by Priyanka ▴ 10

0

Entering edit mode

run a fastqc report on your data, see if the data is correct.

also, verify the alignments, create a subset of your reads

    seqkit sample -p 0.1 seqs.fq.gz | seqkit head -n 100000 > subset.fq

then run bwa mem on it.

  bwa index ref.fa
  bwa mem ref.fa subset.fq | samtools sort > aln.bam
  samtools index aln.bam
  samtools flagstat aln.bam

I believe bwa mem, by default it is more sensitive than hisat (even though does not create the proper junctions) but it will tell you

ADD REPLY • link 2.7 years ago by Istvan Albert 100k

0

Entering edit mode

This is Staph aureus so there should be no splicing.

Priyanka : Take some of the reads and blast them at NCBI to make sure your data is what you think it is. Since this is bacterial data close hits may be to other Staph species.

ADD REPLY • link 2.7 years ago by GenoMax 142k

0

Entering edit mode

Hi GenoMax,

I did BLAST a few random reads from different samples but none of them are mapping to S. aureus. In fact they are mapping to Bacillus strain. But the experiment has been done on pure cultures of S. aureus and so finding different species is unusual result.

ADD REPLY • link 2.7 years ago by Priyanka ▴ 10

0

Entering edit mode

This may indeed be your answer then. Either the sample itself was from wrong organism or there was a sample swap or something during sequencing.

There is no point in continuing this analysis until this issue is addressed.

ADD REPLY • link 2.7 years ago by GenoMax 142k

0

Entering edit mode

Yes, Thank you for the help and useful advice.

ADD REPLY • link 2.7 years ago by Priyanka ▴ 10

0

Entering edit mode

Hello Albert,

I have performed Fastqc of all of my samples and all the reads are high quality without any adapter contamination.

I tried what you suggested. Following is my flagstat output.

100057 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
57 + 0 supplementary
0 + 0 duplicates
21433 + 0 mapped (21.42% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLY • link 2.7 years ago by Priyanka ▴ 10

0

Entering edit mode

As you can see you get a higher alignment (still low all things considered).

Now visualize the alignments in IGV (or eyeball the CIGAR strings and MD tags) how good are these alignments? Do you perhaps have adapters on reads, or the reads are too short in general, or there is some other funny business.

The next step is to find out what could your data be. As GenoMax says, take a few dozen (hundred) of your unmapped reads and blast online.

ADD REPLY • link 2.7 years ago by Istvan Albert 100k

0

Entering edit mode

My alignment has no adapter contamination. Neither does the QC report show any adapter contamination. My per base sequence quality is >30. Even after running Trimmomatic there is no update in alignment rate.

Here is just one region with good reads alignment. There seem to be a lot of mutations in the alignment as well.

This is a RNA-seq study so I am only concerned about the read count. I have also tried pseudo-alignment with Salmon but that too doesn't give any useful result. this is my IGV visualization

ADD REPLY • link 2.7 years ago by Priyanka ▴ 10

0

Entering edit mode

the image above is quite unusual, you have what seem to be only multi mapped reads (light background) plus you have a very high number of variations per read, and the variations are very close to one another indicating that you are most likely mapping against the wrong genome.

ADD REPLY • link 2.7 years ago by Istvan Albert 100k

0

Entering edit mode

I will try to perform mapping of the reads to all the genus of bacteria to see which other genus they may be mapping to.

Thank you GenoMax, Istvan Albert and ATpoint for all the help. I will let you know if the genus was the same or different once my analysis is complete.

ADD REPLY • link 2.7 years ago by Priyanka ▴ 10

0

Entering edit mode

This is still low, but the slightly increased rate might indicate that this is simply not the genus of bacteria you think it is, and some homology in combination with bwa mems local alignments lead to this alignment %. I am not much into microbiology these days, but if you created these data, did you ensure that the strain is indeed what it is supposed to be?

ADD REPLY • link 2.7 years ago by ATpoint 82k

0

Entering edit mode

I have reverted back to the person who has done the experiment to check if they have used some other strain or genus. They seem to be sure that it was done on pure cultures of S. aureus.

ADD REPLY • link 2.7 years ago by Priyanka ▴ 10