I am working on transcriptome reassembly of four bacterial strains from a study which has been published recently. I have downloaded the .sra files and converted them to .fq using
fastq-dump --split-files, since they are pair-end reads 75pb each. In the paper the authors mention:
Read quality check: Quality analysis of RNA reads was through FastQC tool. From QC reports, data were of very good quality.The manual check of reads showed that some of them still carry Adapters, which were removable through Trimmomatic software [version 0.32]. This software was used to: (i) remove adapters leading and trailing bases of quality less than 3; (ii) scan the read with a 4-base wide sliding window and cutting, when the average quality per base drops below 20; and (iii) filter reads with less than 50 bases after the above steps, and reads with average quality less than 30. Paired-data with both reads surviving the QC process were taken for further analysis.
Expression analysis. BAM files obtained from Tophat were used in Cufflinks [version 2.2.0] program for getting transcript abundance estimates. Differential expression analysis was carried out for all pair-wise comparisons using Cuffdiff (Cufflinks version 2.2.0). To avoid biases, which might had been caused by differential amounts of rRNA and over-represented sequences, transcripts related to rRNA, tRNA, RNAse P and tmRNA were removed.
Accordingly, I used trimmomatic to do some quality pre-processing and filtering as follows:
java -jar ../Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 SRR7276132_1.fastq SRR7276132_2.fastq paired.fq unpaired.fq rev_paired.fq rev_unpaired.fq ILLUMINACLIP:../Trimmomatic-0.38/adapters/TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50
Then I decided to use Hisat2 to perform the alignment to a reference genome downloaded from NCBI. I created the genome indexes and i mapped paired.fq and rev_paired.fq as follows:
hisat2-build genome.fa lpl_ndx hisat2 -x ../ndx/lpl_ndx -1 paired.fq -2 rev_paired.fq -S out.sam
However, any of the four samples has an alignment rate avobe 50% (10-45%). I have also tried tophat, but i get kind of the same rates.
output: 23412494 reads; of these: 23412494 (100.00%) were paired; of these: 14340932 (61.25%) aligned concordantly 0 times 9060649 (38.70%) aligned concordantly exactly 1 time 10913 (0.05%) aligned concordantly >1 times ---- 14340932 pairs aligned concordantly 0 times; of these: 169987 (1.19%) aligned discordantly 1 time ---- 14170945 pairs aligned 0 times concordantly or discordantly; of these: 28341890 mates make up the pairs; of these: 25861567 (91.25%) aligned 0 times 2478255 (8.74%) aligned exactly 1 time 2068 (0.01%) aligned >1 times 44.77% overall alignment rate
I ran FastQC on the raw and trimmed data but the report is quite similar. Sequence Duplication Levels and Overrepresented sequences FAIL but I think that is OK for a RNA-seq analysis. Kmer content also fails but the kmers align on the overrepresented sequences. I posted the full FastQC report from a .fq, all of them look moreless the same.
I was not expecting mapping ranging from 10 to 45%, i though it should be around 90%. I'm quite lost on how I should proceed. I mapped random reads against BLAST ref_genome and it gave me the expected bacterial genome. I also tried different tools to pre-process the data like TrimGalore, or mapping as SE reads, but results do not vary significantly. Do you have any idea of what could be the issue of this low mapping? how should I proceed?