FASTQ alignment result is really bad
1
0
Entering edit mode
5.4 years ago
bharata1803 ▴ 560

So, I downloaded sra and then converted to fastq from this link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99671

I have tried salmon and bwa+htseq-count to align and get readcount. Both of the tools are failed to get any result. The salmon result is only 2% mapped and htseq-count cannot even map a single reads to transcript.

My question is, what do you think happen here? Does the sequencing machine affect this? I noticed the platform to sequence is not illumina. The original author use Lifescope to to do the RNA-seq workflow. Anyway to investigate this? I attached Fastqc result

Update:

I have tried to use bowtie to index reference genome in colorspace. The command is this:

bowtie-build -C --threads 18 HG38_90.fa hg38_90_idx

I aligned the file with this command:

bowtie --chunkmbs 500 -C -p 18 -S hg38_90_idx -1 fastq/SRR5644749_1.fastq -2 fastq/SRR5644749_2.fastq SRR5644749.sam

And the result is really bad.

# reads processed: 20826355
# reads with at least one reported alignment: 17671 (0.08%)
# reads that failed to align: 20808684 (99.92%)

Any suggestion how to align in colorspace?

enter image description here enter image description here enter image description here enter image description here

RNA-Seq htseq-count salmon • 3.4k views
ADD COMMENT
0
Entering edit mode

There must something seriously wrong with your setup, it's just hard to tell what. Even with bwa you should be able to achieve >60% aligned reads. However, please change to a a state-of-the-art pipeline first, using either Hisat(2) or STAR as an aligner, then HTseq-count or FeatureCount. Please check that you have the correct and matching versions of the genome or transcriptome reference (for using salmon) and genome annotation.

If you need more help, please also do and report QC at each step using FastQC and MultiQC, and please provide the exact commands run as well as relevant output of fastqc and multiQC.

ADD REPLY
0
Entering edit mode

I have provided the Fastqc report. It seems there are problems with the fastq file itself.

ADD REPLY
0
Entering edit mode

It is rare for current RNA-seq protocols to see failed sequence quality and N content. Check how the fastq extraction went well, just download and extract again using SRA toolkit. Then possibly you need to apply quality trimming or select a better dataset.

ADD REPLY
0
Entering edit mode

I have the sra file. It seems there were no problem when I use sra toolkit fastqdump.

ADD REPLY
0
Entering edit mode

Oh, these could be colorspace data, they need to be analysed differently! Try option -c with bwa and see if that improves the alignment. See http://seqanswers.com/forums/showthread.php?t=16621

ADD REPLY
0
Entering edit mode

I think the latest bwa options are different. In bwa sampe, -c is about: -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]

I am not familiar with colorspace data.

ADD REPLY
0
Entering edit mode

Neither am I, sorry, but maybe you can download an older version of bwa that supports colorspace.

ADD REPLY
0
Entering edit mode

I will try bowtie. It seems bowtie support colorspace

ADD REPLY
0
Entering edit mode

Please read more about Solid data conversion to fastq here. You probably lose information with the conversion to fastq.

ADD REPLY
0
Entering edit mode

I am still confused. I have tried using bowtie to make colorspace index and align in colorspace, but it is still almost 0% reads that can be mapped to the reference. It is really weird.

ADD REPLY
0
Entering edit mode

Can you show with head how your fastq file looks like? Is it really in color space?

ADD REPLY
0
Entering edit mode

It is in colorspace. I have checked it. I will post the head result later.

ADD REPLY
0
Entering edit mode

Just to put aside anything related with SRA, you can directly download fastq files for your project here: https://www.ebi.ac.uk/ena/data/view/PRJNA389279][1]

I have looked at one file and indeed it looks like color-space:

@SRR5644749.11183773 11183773/2
T22033022222231202002113112222221030032031212322000
+
!33333333333.--------------------------------------
@SRR5644749.11183774 11183774/2
T0320120300100101212000302000232130232310..........
+
!=/////////....--------------------------**********
@SRR5644749.11183775 11183775/2
T001111001023332010010.............................
+
!@?8886666............*****************************
ADD REPLY
0
Entering edit mode
5.4 years ago
h.mon 35k

I think the problem is the data is really crappy, and Bowtie is discarding the reads because of too many errors at the seed region. I mapped the reads with Subread and got an average mapping rate of 65.9% for the same file (SRR5644749) you mapped. Even so, there were a lot of orphaned pairs:

//================================= Summary ==================================\\
||                                                                            ||
||     Total fragments : 20,826,355                                           ||
||              Mapped : 13,718,668 (65.9%)                                   ||
||     Uniquely mapped : 11,529,522                                           ||
||       Multi-mapping : 2,189,146                                            ||
||                                                                            ||
||            Unmapped : 7,107,687                                            ||
||                                                                            ||
||    Correctly paired : 6,067,553                                            ||
|| Not mapped in pairs : 7,651,115                                            ||
|| Only one end mapped : 7,109,909                                            ||
||   Multi-chromosomes : 265,002                                              ||
||   Different strands : 290,108                                              ||
||  Not in PE distance : 60,605                                               ||
||      Abnormal order : 33,402                                               ||
||                                                                            ||
||              Indels : 0                                                    ||
||                                                                            ||
||        Running time : 23.6 minutes                                         ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Subread was run with:

subread-buildindex -B -c -F -o genome genome.fasta
subread-align -b -t 0 -i genome --multiMapping -B 2 \
    -o SRR5644749.bam -r SRR5644749_1.fastq.gz -R SRR5644749_2.fastq.gz

Probably tweaking Subread settings should result in a higher mapping rate.

You can use FastQC on the bam file and use the -f bam_mapped flag to evaluate only mapped reads, this will produce a report ignoring unmapped read. The problem with unmapped reads is FastQC converts colorspace to basespace, but this conversion is unreliable without a reference. Mapped reads should be of good quality, and already corrected by the mapper.

P.S.: note that your bowtie mapping command is not outputting a sam file, you need to use the -S flag with Bowtie, because it has a custom default output. Initially, I didn't see the -S in your command.

ADD COMMENT
0
Entering edit mode

It is a bit hard to check the quality because FastQC cannot be used for colorspace. Is there any fastq quality control software that can handle the colorspace data? If you got 65% reads, I think it is good enough. I will try using subread. Do you mind posting you command with subread so that I can reproduce it?

ADD REPLY
0
Entering edit mode

I updated my answer to include the subread command, also included further comments on FastQC.

ADD REPLY
0
Entering edit mode

Thanks, I will try your method. I have tried subread but I got 20% mapped reads which is quite bad. If I can get 60% like you probably the data can be used.

ADD REPLY
0
Entering edit mode

If you got 20% from the same sample, then there is something odd - we should get the same mapping rate. But if you got 20% from other samples, this would be an indication there are samples with even worst quality than the one you posted about (and in fact, for a second sample, I got 16% mapping rate).

ADD REPLY
0
Entering edit mode

I realized I aligned it to cdna/transcript reference, not whole genome reference. I aligned to cdna reference because I wanted try to count the reads with salmon. I will try to align it to the genome then.

ADD REPLY
0
Entering edit mode

I gave up with this dataset, the fastq quality seems really bad. I have checked their processed data too and try to use DESeq2 to get the differential expression. The result is bad too with only small number of genes are giving siginificant result.

ADD REPLY

Login before adding your answer.

Traffic: 2049 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6