Hisat2 overall alignment rate below 50%
0
0
Entering edit mode
2.5 years ago
Neuls • 0

Dear colleagues,

I am working on transcriptome reassembly of four bacterial strains from a study which has been published recently. I have downloaded the .sra files and converted them to .fq using fastq-dump --split-files, since they are pair-end reads 75pb each. In the paper the authors mention:

Read quality check: Quality analysis of RNA reads was through FastQC tool. From QC reports, data were of very good quality.The manual check of reads showed that some of them still carry Adapters, which were removable through Trimmomatic software [version 0.32]. This software was used to: (i) remove adapters leading and trailing bases of quality less than 3; (ii) scan the read with a 4-base wide sliding window and cutting, when the average quality per base drops below 20; and (iii) filter reads with less than 50 bases after the above steps, and reads with average quality less than 30. Paired-data with both reads surviving the QC process were taken for further analysis.

Expression analysis. BAM files obtained from Tophat were used in Cufflinks [version 2.2.0] program for getting transcript abundance estimates. Differential expression analysis was carried out for all pair-wise comparisons using Cuffdiff (Cufflinks version 2.2.0). To avoid biases, which might had been caused by differential amounts of rRNA and over-represented sequences, transcripts related to rRNA, tRNA, RNAse P and tmRNA were removed.

Accordingly, I used trimmomatic to do some quality pre-processing and filtering as follows:

java -jar ../Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 SRR7276132_1.fastq SRR7276132_2.fastq paired.fq unpaired.fq rev_paired.fq rev_unpaired.fq ILLUMINACLIP:../Trimmomatic-0.38/adapters/TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50


Then I decided to use Hisat2 to perform the alignment to a reference genome downloaded from NCBI. I created the genome indexes and i mapped paired.fq and rev_paired.fq as follows:

hisat2-build genome.fa lpl_ndx
hisat2 -x ../ndx/lpl_ndx -1 paired.fq -2 rev_paired.fq -S out.sam


However, any of the four samples has an alignment rate avobe 50% (10-45%). I have also tried tophat, but i get kind of the same rates.

output:
23412494 (100.00%) were paired; of these:
14340932 (61.25%) aligned concordantly 0 times
9060649 (38.70%) aligned concordantly exactly 1 time
10913 (0.05%) aligned concordantly >1 times
----
14340932 pairs aligned concordantly 0 times; of these:
169987 (1.19%) aligned discordantly 1 time
----
14170945 pairs aligned 0 times concordantly or discordantly; of these:
28341890 mates make up the pairs; of these:
25861567 (91.25%) aligned 0 times
2478255 (8.74%) aligned exactly 1 time
2068 (0.01%) aligned >1 times
44.77% overall alignment rate


I ran FastQC on the raw and trimmed data but the report is quite similar. Sequence Duplication Levels and Overrepresented sequences FAIL but I think that is OK for a RNA-seq analysis. Kmer content also fails but the kmers align on the overrepresented sequences. I posted the full FastQC report from a .fq, all of them look moreless the same.

I was not expecting mapping ranging from 10 to 45%, i though it should be around 90%. I'm quite lost on how I should proceed. I mapped random reads against BLAST ref_genome and it gave me the expected bacterial genome. I also tried different tools to pre-process the data like TrimGalore, or mapping as SE reads, but results do not vary significantly. Do you have any idea of what could be the issue of this low mapping? how should I proceed?

Thank you~

RNA-Seq rna-seq assembly sequencing • 2.1k views
1
Entering edit mode

Some things you might check:

1) Did you use the same reference the authors used?

2) Did you find in the paper any statistics on read mapping?

3) Did you try blasting unmapped reads (or a large number of random reads) to nt or nr and see which organism they belong to? Maybe there is a large amount of contamination (human, phage)?

0
Entering edit mode

I carried on the analysis as similar as I could according to the paper info, which was very limited. The information regarding to the preproceesing and mapping is what you can read in the post 1, 2) they do not mention the reference genome as well as the mapping results of tophat. I blasted like 10 random reads againts NCBI ref_genome and i got the expected genome, the one I am using to map againts. 3) The samples are pure cultures of a bacterial strain, however maybe the phage is possible but such a high degre of phage contamination feels strange

1
Entering edit mode

1
Entering edit mode

Yes, blasting the unmapped reads sounds like the best option. However I cannot understand what do you mean by the "5 most common unmapped reads". Something like clustering together identical unmapped reads and counting how many times they occur in the read file? That's an interesting approach.

0
Entering edit mode

Which paper? Which SRA dataset?

edit: nevermind, I didn't see the SRR number on your Trimmomatic command-line.

edit 2: I edited your post to show one figure within the post. You can further edit your post to show the other images, but I would advise you to shrink the images a bit before doing so.

0
Entering edit mode

The dataset you are using consists of several strains of Lactobacillus plantarum, so it is possible more than one reference genome is needed, one for each isolation source. In addition, extracts from several organisms were used to prepare the culture medium used, so it is possible there are contaminants from these organisms. You could, for example, try to map the Drososophila melanogaster sample to the D. melanogaster genome to see if it accounts for a proportion of the reads.

A side note is bacteria don't have spliced genes and Tophat is not needed, you can map directly with Bowtie2. I never used HISAT, but remember seeing here on BioStars reports of lower mapping rates than Bowtie2.

https://onlinelibrary.wiley.com/doi/full/10.1111/1462-2920.14372