Question: Hisat2 overall alignment rate below 50%
gravatar for Neuls
21 months ago by
Neuls0 wrote:

Dear colleagues,

I am working on transcriptome reassembly of four bacterial strains from a study which has been published recently. I have downloaded the .sra files and converted them to .fq using fastq-dump --split-files, since they are pair-end reads 75pb each. In the paper the authors mention:

Read quality check: Quality analysis of RNA reads was through FastQC tool. From QC reports, data were of very good quality.The manual check of reads showed that some of them still carry Adapters, which were removable through Trimmomatic software [version 0.32]. This software was used to: (i) remove adapters leading and trailing bases of quality less than 3; (ii) scan the read with a 4-base wide sliding window and cutting, when the average quality per base drops below 20; and (iii) filter reads with less than 50 bases after the above steps, and reads with average quality less than 30. Paired-data with both reads surviving the QC process were taken for further analysis.

Expression analysis. BAM files obtained from Tophat were used in Cufflinks [version 2.2.0] program for getting transcript abundance estimates. Differential expression analysis was carried out for all pair-wise comparisons using Cuffdiff (Cufflinks version 2.2.0). To avoid biases, which might had been caused by differential amounts of rRNA and over-represented sequences, transcripts related to rRNA, tRNA, RNAse P and tmRNA were removed.

Accordingly, I used trimmomatic to do some quality pre-processing and filtering as follows:

java -jar ../Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 SRR7276132_1.fastq SRR7276132_2.fastq paired.fq unpaired.fq rev_paired.fq rev_unpaired.fq ILLUMINACLIP:../Trimmomatic-0.38/adapters/TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50

Then I decided to use Hisat2 to perform the alignment to a reference genome downloaded from NCBI. I created the genome indexes and i mapped paired.fq and rev_paired.fq as follows:

hisat2-build genome.fa lpl_ndx
hisat2 -x ../ndx/lpl_ndx -1 paired.fq -2 rev_paired.fq -S out.sam

However, any of the four samples has an alignment rate avobe 50% (10-45%). I have also tried tophat, but i get kind of the same rates.

23412494 reads; of these:
  23412494 (100.00%) were paired; of these:
    14340932 (61.25%) aligned concordantly 0 times
    9060649 (38.70%) aligned concordantly exactly 1 time
    10913 (0.05%) aligned concordantly >1 times
    14340932 pairs aligned concordantly 0 times; of these:
      169987 (1.19%) aligned discordantly 1 time
    14170945 pairs aligned 0 times concordantly or discordantly; of these:
      28341890 mates make up the pairs; of these:
        25861567 (91.25%) aligned 0 times
        2478255 (8.74%) aligned exactly 1 time
        2068 (0.01%) aligned >1 times
44.77% overall alignment rate

I ran FastQC on the raw and trimmed data but the report is quite similar. Sequence Duplication Levels and Overrepresented sequences FAIL but I think that is OK for a RNA-seq analysis. Kmer content also fails but the kmers align on the overrepresented sequences. I posted the full FastQC report from a .fq, all of them look moreless the same.

img01 img02 img03 img04 img05 img06 img07 img08 img09

I was not expecting mapping ranging from 10 to 45%, i though it should be around 90%. I'm quite lost on how I should proceed. I mapped random reads against BLAST ref_genome and it gave me the expected bacterial genome. I also tried different tools to pre-process the data like TrimGalore, or mapping as SE reads, but results do not vary significantly. Do you have any idea of what could be the issue of this low mapping? how should I proceed?

Thank you~

sequencing rna-seq assembly • 1.5k views
ADD COMMENTlink modified 20 months ago by h.mon31k • written 21 months ago by Neuls0

Some things you might check:

1) Did you use the same reference the authors used?

2) Did you find in the paper any statistics on read mapping?

3) Did you try blasting unmapped reads (or a large number of random reads) to nt or nr and see which organism they belong to? Maybe there is a large amount of contamination (human, phage)?

ADD REPLYlink modified 20 months ago • written 20 months ago by Fabio Marroni2.6k

I carried on the analysis as similar as I could according to the paper info, which was very limited. The information regarding to the preproceesing and mapping is what you can read in the post 1, 2) they do not mention the reference genome as well as the mapping results of tophat. I blasted like 10 random reads againts NCBI ref_genome and i got the expected genome, the one I am using to map againts. 3) The samples are pure cultures of a bacterial strain, however maybe the phage is possible but such a high degre of phage contamination feels strange

ADD REPLYlink written 20 months ago by Neuls0

Instead of blasting random reads, why not blast the 5 most common unmapped reads?

ADD REPLYlink written 20 months ago by swbarnes28.6k

Yes, blasting the unmapped reads sounds like the best option. However I cannot understand what do you mean by the "5 most common unmapped reads". Something like clustering together identical unmapped reads and counting how many times they occur in the read file? That's an interesting approach.

ADD REPLYlink written 20 months ago by Fabio Marroni2.6k

Which paper? Which SRA dataset?

edit: nevermind, I didn't see the SRR number on your Trimmomatic command-line.

edit 2: I edited your post to show one figure within the post. You can further edit your post to show the other images, but I would advise you to shrink the images a bit before doing so.

ADD REPLYlink modified 20 months ago • written 20 months ago by h.mon31k

The dataset you are using consists of several strains of Lactobacillus plantarum, so it is possible more than one reference genome is needed, one for each isolation source. In addition, extracts from several organisms were used to prepare the culture medium used, so it is possible there are contaminants from these organisms. You could, for example, try to map the Drososophila melanogaster sample to the D. melanogaster genome to see if it accounts for a proportion of the reads.

A side note is bacteria don't have spliced genes and Tophat is not needed, you can map directly with Bowtie2. I never used HISAT, but remember seeing here on BioStars reports of lower mapping rates than Bowtie2.

ADD REPLYlink written 20 months ago by h.mon31k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1105 users visited in the last hour