FASTQ alignment result is really bad
1
0
Entering edit mode
3.1 years ago
bharata1803 ▴ 530

I have tried salmon and bwa+htseq-count to align and get readcount. Both of the tools are failed to get any result. The salmon result is only 2% mapped and htseq-count cannot even map a single reads to transcript.

My question is, what do you think happen here? Does the sequencing machine affect this? I noticed the platform to sequence is not illumina. The original author use Lifescope to to do the RNA-seq workflow. Anyway to investigate this? I attached Fastqc result

Update:

I have tried to use bowtie to index reference genome in colorspace. The command is this:

bowtie-build -C --threads 18 HG38_90.fa hg38_90_idx


I aligned the file with this command:

bowtie --chunkmbs 500 -C -p 18 -S hg38_90_idx -1 fastq/SRR5644749_1.fastq -2 fastq/SRR5644749_2.fastq SRR5644749.sam


And the result is really bad.

# reads processed: 20826355
# reads with at least one reported alignment: 17671 (0.08%)
# reads that failed to align: 20808684 (99.92%)


Any suggestion how to align in colorspace?

RNA-Seq htseq-count salmon • 1.7k views
0
Entering edit mode

There must something seriously wrong with your setup, it's just hard to tell what. Even with bwa you should be able to achieve >60% aligned reads. However, please change to a a state-of-the-art pipeline first, using either Hisat(2) or STAR as an aligner, then HTseq-count or FeatureCount. Please check that you have the correct and matching versions of the genome or transcriptome reference (for using salmon) and genome annotation.

If you need more help, please also do and report QC at each step using FastQC and MultiQC, and please provide the exact commands run as well as relevant output of fastqc and multiQC.

0
Entering edit mode

I have provided the Fastqc report. It seems there are problems with the fastq file itself.

0
Entering edit mode

It is rare for current RNA-seq protocols to see failed sequence quality and N content. Check how the fastq extraction went well, just download and extract again using SRA toolkit. Then possibly you need to apply quality trimming or select a better dataset.

0
Entering edit mode

I have the sra file. It seems there were no problem when I use sra toolkit fastqdump.

0
Entering edit mode

Oh, these could be colorspace data, they need to be analysed differently! Try option -c with bwa and see if that improves the alignment. See http://seqanswers.com/forums/showthread.php?t=16621

0
Entering edit mode

I think the latest bwa options are different. In bwa sampe, -c is about: -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]

I am not familiar with colorspace data.

0
Entering edit mode

Neither am I, sorry, but maybe you can download an older version of bwa that supports colorspace.

0
Entering edit mode

I will try bowtie. It seems bowtie support colorspace

0
Entering edit mode

Please read more about Solid data conversion to fastq here. You probably lose information with the conversion to fastq.

0
Entering edit mode

I am still confused. I have tried using bowtie to make colorspace index and align in colorspace, but it is still almost 0% reads that can be mapped to the reference. It is really weird.

0
Entering edit mode

Can you show with head how your fastq file looks like? Is it really in color space?

0
Entering edit mode

It is in colorspace. I have checked it. I will post the head result later.

0
Entering edit mode

Just to put aside anything related with SRA, you can directly download fastq files for your project here: https://www.ebi.ac.uk/ena/data/view/PRJNA389279][1]

I have looked at one file and indeed it looks like color-space:

@SRR5644749.11183773 11183773/2
T22033022222231202002113112222221030032031212322000
+
!33333333333.--------------------------------------
@SRR5644749.11183774 11183774/2
T0320120300100101212000302000232130232310..........
+
!=/////////....--------------------------**********
@SRR5644749.11183775 11183775/2
T001111001023332010010.............................
+
!@?8886666............*****************************

0
Entering edit mode
3.1 years ago
h.mon 33k

I think the problem is the data is really crappy, and Bowtie is discarding the reads because of too many errors at the seed region. I mapped the reads with Subread and got an average mapping rate of 65.9% for the same file (SRR5644749) you mapped. Even so, there were a lot of orphaned pairs:

//================================= Summary ==================================\\
||                                                                            ||
||     Total fragments : 20,826,355                                           ||
||              Mapped : 13,718,668 (65.9%)                                   ||
||     Uniquely mapped : 11,529,522                                           ||
||       Multi-mapping : 2,189,146                                            ||
||                                                                            ||
||            Unmapped : 7,107,687                                            ||
||                                                                            ||
||    Correctly paired : 6,067,553                                            ||
|| Not mapped in pairs : 7,651,115                                            ||
|| Only one end mapped : 7,109,909                                            ||
||   Multi-chromosomes : 265,002                                              ||
||   Different strands : 290,108                                              ||
||  Not in PE distance : 60,605                                               ||
||      Abnormal order : 33,402                                               ||
||                                                                            ||
||              Indels : 0                                                    ||
||                                                                            ||
||        Running time : 23.6 minutes                                         ||
||                                                                            ||


subread-buildindex -B -c -F -o genome genome.fasta
subread-align -b -t 0 -i genome --multiMapping -B 2 \
-o SRR5644749.bam -r SRR5644749_1.fastq.gz -R SRR5644749_2.fastq.gz


Probably tweaking Subread settings should result in a higher mapping rate.

You can use FastQC on the bam file and use the -f bam_mapped flag to evaluate only mapped reads, this will produce a report ignoring unmapped read. The problem with unmapped reads is FastQC converts colorspace to basespace, but this conversion is unreliable without a reference. Mapped reads should be of good quality, and already corrected by the mapper.

P.S.: note that your bowtie mapping command is not outputting a sam file, you need to use the -S flag with Bowtie, because it has a custom default output. Initially, I didn't see the -S in your command.

0
Entering edit mode

It is a bit hard to check the quality because FastQC cannot be used for colorspace. Is there any fastq quality control software that can handle the colorspace data? If you got 65% reads, I think it is good enough. I will try using subread. Do you mind posting you command with subread so that I can reproduce it?

0
Entering edit mode

0
Entering edit mode

Thanks, I will try your method. I have tried subread but I got 20% mapped reads which is quite bad. If I can get 60% like you probably the data can be used.

0
Entering edit mode

If you got 20% from the same sample, then there is something odd - we should get the same mapping rate. But if you got 20% from other samples, this would be an indication there are samples with even worst quality than the one you posted about (and in fact, for a second sample, I got 16% mapping rate).

0
Entering edit mode

I realized I aligned it to cdna/transcript reference, not whole genome reference. I aligned to cdna reference because I wanted try to count the reads with salmon. I will try to align it to the genome then.

0
Entering edit mode

I gave up with this dataset, the fastq quality seems really bad. I have checked their processed data too and try to use DESeq2 to get the differential expression. The result is bad too with only small number of genes are giving siginificant result.