Question: FASTQ alignment result is really bad
0
gravatar for bharata1803
4 months ago by
bharata1803420
Japan
bharata1803420 wrote:

So, I downloaded sra and then converted to fastq from this link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99671

I have tried salmon and bwa+htseq-count to align and get readcount. Both of the tools are failed to get any result. The salmon result is only 2% mapped and htseq-count cannot even map a single reads to transcript.

My question is, what do you think happen here? Does the sequencing machine affect this? I noticed the platform to sequence is not illumina. The original author use Lifescope to to do the RNA-seq workflow. Anyway to investigate this? I attached Fastqc result

Update:

I have tried to use bowtie to index reference genome in colorspace. The command is this:

bowtie-build -C --threads 18 HG38_90.fa hg38_90_idx

I aligned the file with this command:

bowtie --chunkmbs 500 -C -p 18 -S hg38_90_idx -1 fastq/SRR5644749_1.fastq -2 fastq/SRR5644749_2.fastq SRR5644749.sam

And the result is really bad.

# reads processed: 20826355
# reads with at least one reported alignment: 17671 (0.08%)
# reads that failed to align: 20808684 (99.92%)

Any suggestion how to align in colorspace?

enter image description here enter image description here enter image description here enter image description here

rna-seq salmon htseq-count • 375 views
ADD COMMENTlink modified 4 months ago by h.mon24k • written 4 months ago by bharata1803420

There must something seriously wrong with your setup, it's just hard to tell what. Even with bwa you should be able to achieve >60% aligned reads. However, please change to a a state-of-the-art pipeline first, using either Hisat(2) or STAR as an aligner, then HTseq-count or FeatureCount. Please check that you have the correct and matching versions of the genome or transcriptome reference (for using salmon) and genome annotation.

If you need more help, please also do and report QC at each step using FastQC and MultiQC, and please provide the exact commands run as well as relevant output of fastqc and multiQC.

ADD REPLYlink written 4 months ago by Michael Dondrup46k

I have provided the Fastqc report. It seems there are problems with the fastq file itself.

ADD REPLYlink written 4 months ago by bharata1803420

It is rare for current RNA-seq protocols to see failed sequence quality and N content. Check how the fastq extraction went well, just download and extract again using SRA toolkit. Then possibly you need to apply quality trimming or select a better dataset.

ADD REPLYlink modified 4 months ago • written 4 months ago by Michael Dondrup46k

I have the sra file. It seems there were no problem when I use sra toolkit fastqdump.

ADD REPLYlink written 4 months ago by bharata1803420

Oh, these could be colorspace data, they need to be analysed differently! Try option -c with bwa and see if that improves the alignment. See http://seqanswers.com/forums/showthread.php?t=16621

ADD REPLYlink modified 4 months ago • written 4 months ago by Michael Dondrup46k

I think the latest bwa options are different. In bwa sampe, -c is about: -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]

I am not familiar with colorspace data.

ADD REPLYlink written 4 months ago by bharata1803420

Neither am I, sorry, but maybe you can download an older version of bwa that supports colorspace.

ADD REPLYlink written 4 months ago by Michael Dondrup46k

I will try bowtie. It seems bowtie support colorspace

ADD REPLYlink written 4 months ago by bharata1803420

Please read more about Solid data conversion to fastq here. You probably lose information with the conversion to fastq.

ADD REPLYlink written 4 months ago by Benn6.6k

I am still confused. I have tried using bowtie to make colorspace index and align in colorspace, but it is still almost 0% reads that can be mapped to the reference. It is really weird.

ADD REPLYlink written 4 months ago by bharata1803420

Can you show with head how your fastq file looks like? Is it really in color space?

ADD REPLYlink written 4 months ago by Benn6.6k

It is in colorspace. I have checked it. I will post the head result later.

ADD REPLYlink written 4 months ago by bharata1803420

Just to put aside anything related with SRA, you can directly download fastq files for your project here: https://www.ebi.ac.uk/ena/data/view/PRJNA389279][1]

I have looked at one file and indeed it looks like color-space:

@SRR5644749.11183773 11183773/2
T22033022222231202002113112222221030032031212322000
+
!33333333333.--------------------------------------
@SRR5644749.11183774 11183774/2
T0320120300100101212000302000232130232310..........
+
!=/////////....--------------------------**********
@SRR5644749.11183775 11183775/2
T001111001023332010010.............................
+
!@?8886666............*****************************
ADD REPLYlink written 4 months ago by jomo018470
0
gravatar for h.mon
4 months ago by
h.mon24k
Brazil
h.mon24k wrote:

I think the problem is the data is really crappy, and Bowtie is discarding the reads because of too many errors at the seed region. I mapped the reads with Subread and got an average mapping rate of 65.9% for the same file (SRR5644749) you mapped. Even so, there were a lot of orphaned pairs:

//================================= Summary ==================================\\
||                                                                            ||
||     Total fragments : 20,826,355                                           ||
||              Mapped : 13,718,668 (65.9%)                                   ||
||     Uniquely mapped : 11,529,522                                           ||
||       Multi-mapping : 2,189,146                                            ||
||                                                                            ||
||            Unmapped : 7,107,687                                            ||
||                                                                            ||
||    Correctly paired : 6,067,553                                            ||
|| Not mapped in pairs : 7,651,115                                            ||
|| Only one end mapped : 7,109,909                                            ||
||   Multi-chromosomes : 265,002                                              ||
||   Different strands : 290,108                                              ||
||  Not in PE distance : 60,605                                               ||
||      Abnormal order : 33,402                                               ||
||                                                                            ||
||              Indels : 0                                                    ||
||                                                                            ||
||        Running time : 23.6 minutes                                         ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Subread was run with:

subread-buildindex -B -c -F -o genome genome.fasta
subread-align -b -t 0 -i genome --multiMapping -B 2 \
    -o SRR5644749.bam -r SRR5644749_1.fastq.gz -R SRR5644749_2.fastq.gz

Probably tweaking Subread settings should result in a higher mapping rate.

You can use FastQC on the bam file and use the -f bam_mapped flag to evaluate only mapped reads, this will produce a report ignoring unmapped read. The problem with unmapped reads is FastQC converts colorspace to basespace, but this conversion is unreliable without a reference. Mapped reads should be of good quality, and already corrected by the mapper.

P.S.: note that your bowtie mapping command is not outputting a sam file, you need to use the -S flag with Bowtie, because it has a custom default output. Initially, I didn't see the -S in your command.

ADD COMMENTlink modified 4 months ago • written 4 months ago by h.mon24k

It is a bit hard to check the quality because FastQC cannot be used for colorspace. Is there any fastq quality control software that can handle the colorspace data? If you got 65% reads, I think it is good enough. I will try using subread. Do you mind posting you command with subread so that I can reproduce it?

ADD REPLYlink written 4 months ago by bharata1803420

I updated my answer to include the subread command, also included further comments on FastQC.

ADD REPLYlink written 4 months ago by h.mon24k

Thanks, I will try your method. I have tried subread but I got 20% mapped reads which is quite bad. If I can get 60% like you probably the data can be used.

ADD REPLYlink written 4 months ago by bharata1803420

If you got 20% from the same sample, then there is something odd - we should get the same mapping rate. But if you got 20% from other samples, this would be an indication there are samples with even worst quality than the one you posted about (and in fact, for a second sample, I got 16% mapping rate).

ADD REPLYlink written 4 months ago by h.mon24k

I realized I aligned it to cdna/transcript reference, not whole genome reference. I aligned to cdna reference because I wanted try to count the reads with salmon. I will try to align it to the genome then.

ADD REPLYlink written 4 months ago by bharata1803420

I gave up with this dataset, the fastq quality seems really bad. I have checked their processed data too and try to use DESeq2 to get the differential expression. The result is bad too with only small number of genes are giving siginificant result.

ADD REPLYlink written 4 months ago by bharata1803420
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1106 users visited in the last hour