Question: FastQC results and using FASTQ files converted from SRA format
3.4 years ago by
Rahul30 wrote:


I downloaded SRA data from NCBI, converted it to FASTQ files (paired-end reads), and analysed the sequences with FastQC. I got the following results, which I think are OK, but I am still confused. Can anybody shed some light on this (per base sequence quality)?

##FastQC	0.11.2
>>Basic Statistics	pass
#Measure	Value
Filename	PPlf.fastq
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	24000000
Sequences flagged as poor quality	0
Sequence length	75
%GC	44

Can I use FASTQ sequences derived from SRA format directly for assembly and scaffolding, or do I first have to do preprocessing such as removing low-quality reads, trimming low-quality bases, and removing adapters?



ncbi_now • 1.9k views
ADD COMMENTlink modified 3.4 years ago by pevsner420 • written 3.4 years ago by Rahul30

There are many people doing assembly. Be aware that even if it runs, your results are probably largely trash, because assembly really is more complicated than alignment. If your work goes easily, that is an indication that you are doing something very wrong.

And yes, do the filtering, read the papers, and ask questions, preferably by emailing the software authors. Aim for software that is widely used, so you can actually achieve something meaningful.

ADD REPLYlink written 3.4 years ago by stolarek.ir580

Thank you very much for your valuable suggestions.

ADD REPLYlink written 3.4 years ago by Rahul30
3.4 years ago by
Kennedy Krieger Institute (Baltimore, MD)
pevsner420 wrote:

Yes, you can use the FASTQ files you download using SRA Toolkit (or FASTQ files you get from any source) for a variety of purposes such as assembly or alignment to a reference genome. Many people consider it a good idea to assess the quality of the reads in your file, and FastQC is a very popular tool for doing this. In the NCBI NOW workshop we show you how to use FastQC (first in Galaxy then on the command line).

In your question you showed the basic statistics output of FastQC. I find it's helpful to see the sequence length (or, if there are multiple lengths, the shortest and longest are reported) and the GC content. The "Sequences flagged as poor quality" field reports sequences flagged for filtering and removal by Illumina's Casava, so it doesn't reflect analysis by a FastQC module and often isn't very informative.
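If you want to pull those basic statistics out of a report programmatically, something like the following minimal Python sketch could work. It assumes the text follows the tab-separated layout shown in the question; the function name `parse_basic_statistics` and the `>>END_MODULE` terminator line in the sample are my own assumptions for illustration and are not taken from the post.

```python
def parse_basic_statistics(report_text):
    """Return the Basic Statistics module of a FastQC text report as a dict.

    Hypothetical helper: assumes tab-separated "Measure<TAB>Value" lines
    between the ">>Basic Statistics" header and the next ">>" line.
    """
    stats = {}
    in_module = False
    for line in report_text.splitlines():
        if line.startswith(">>Basic Statistics"):
            in_module = True
            continue
        if in_module:
            if line.startswith(">>"):
                break  # next module marker: stop reading
            if line and not line.startswith("#"):
                key, _, value = line.partition("\t")
                stats[key] = value
    return stats


# Values copied from the question; the final line is an assumed terminator.
EXAMPLE_REPORT = (
    "##FastQC\t0.11.2\n"
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "Filename\tPPlf.fastq\n"
    "File type\tConventional base calls\n"
    "Encoding\tSanger / Illumina 1.9\n"
    "Total Sequences\t24000000\n"
    "Sequences flagged as poor quality\t0\n"
    "Sequence length\t75\n"
    "%GC\t44\n"
    ">>END_MODULE\n"
)

stats = parse_basic_statistics(EXAMPLE_REPORT)
print(stats["Sequence length"], stats["%GC"])
```

All values come back as strings, so convert them with `int()` if you need to compare numbers.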

You ask about the plot of quality scores across all bases. The image you link to shows a typical box-and-whisker plot. (The red line is the median value, the yellow box spans the inter-quartile range (25-75%), the lower and upper whiskers correspond to the 10% and 90% points, and the blue line is the mean quality score.)
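To make the elements of that plot concrete, here is a rough Python sketch that computes the same six summary values for each read position from a handful of quality strings. It is only an illustration: the function names are hypothetical, the percentile rule (linear interpolation) is my assumption and may not match what FastQC does internally, and Phred+33 encoding is assumed because your report says "Sanger / Illumina 1.9".

```python
from statistics import mean, median


def percentile(sorted_values, fraction):
    """Percentile of an already sorted list, using linear interpolation."""
    position = (len(sorted_values) - 1) * fraction
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def summarize_position(qualities):
    """Summary values drawn for one column of the box-and-whisker plot."""
    ordered = sorted(qualities)
    return {
        "mean": mean(ordered),                         # blue line
        "median": median(ordered),                     # red line
        "lower_quartile": percentile(ordered, 0.25),   # bottom of yellow box
        "upper_quartile": percentile(ordered, 0.75),   # top of yellow box
        "p10": percentile(ordered, 0.10),              # lower whisker
        "p90": percentile(ordered, 0.90),              # upper whisker
    }


def per_base_summary(quality_strings, offset=33):
    """Summarize each read position; stops at the shortest read's length."""
    decoded = [[ord(char) - offset for char in q] for q in quality_strings]
    return [summarize_position(column) for column in zip(*decoded)]


summary = summarize_position([30, 32, 34, 36, 38])
positions = per_base_summary(["II", "55"])
```

With real data you would feed `per_base_summary` the fourth line of every FASTQ record.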

In your example the boxes all occur in the "green zone" of excellent quality scores (on the y-axis, range 28-34). The main idea of such plots is that they show the position along each read on the x-axis, and typically the quality drops as one approaches the end of the read. This reflects the inherent limitation of the sequencing chemistry in which calling bases gets more difficult later in reads. For examples of very good and very bad reads see a tutorial here.
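For a feel of what those y-axis numbers mean, a Phred quality score Q corresponds to an estimated base-calling error probability of 10^(-Q/10). The short Python sketch below decodes a quality string and converts scores to probabilities. It assumes the Phred+33 encoding reported in your basic statistics ("Sanger / Illumina 1.9"); the function names are hypothetical.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 encoded quality string to integer scores."""
    return [ord(char) - 33 for char in quality_string]


def phred_to_error_probability(quality):
    """Estimated probability that a base with this Phred score is wrong."""
    return 10 ** (-quality / 10)


scores = decode_phred33("I5!")
q20_error = phred_to_error_probability(20)
q30_error = phred_to_error_probability(30)
```

So a score of 20 means roughly a 1 in 100 chance of a wrong call, and a score of 30 roughly 1 in 1000.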

If the last position(s) in your reads are of dramatically lower quality then you may wish to trim them. Various programs are available to do this. You can find some in Galaxy in the Tools panel in the section "NGS: QC and manipulation". These are derived from FASTX-Toolkit, a collection of tools for FASTA and FASTQ preprocessing. You can also use FASTX-Toolkit on the command line. For de novo assembly (which is not something I'm experienced in) many people remove low-quality reads and trim adapter sequences. I suggest you have a look at the FastQC documentation, which discusses the outputs of its various analysis modules, including adapter content.
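To show the idea behind trimming the low-quality end of a read, here is a minimal Python sketch. It is not the algorithm of FASTX-Toolkit or any other tool; the function name, the threshold of 20, and the Phred+33 offset are assumptions chosen for illustration.

```python
def trim_low_quality_tail(sequence, quality, min_quality=20, offset=33):
    """Remove trailing bases whose quality score is below min_quality.

    Hypothetical helper: scans from the 3' end and stops at the first
    base that meets the threshold, so low-quality bases inside the
    read are left untouched.
    """
    keep = len(quality)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_quality:
        keep -= 1
    return sequence[:keep], quality[:keep]


# 'I' encodes a score of 40 and '#' a score of 2 in Phred+33.
trimmed = trim_low_quality_tail("ACGTACGT", "IIIII###")
all_bad = trim_low_quality_tail("ACGT", "####")
```

After trimming you would normally also discard reads that end up shorter than some minimum length, and keep paired-end mates in sync, which dedicated tools handle for you.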

ADD COMMENTlink written 3.4 years ago by pevsner420

Dear sir,

Thanks for showing interest in my post. I am grateful that you took the time to write this answer and guide me. I have trimmed my sequences to between 63 and 75 bp and run FastQC on them. I think I am done with the trimming process, but I am still confused about adapter trimming. Judging from the FastQC report, the adapter content seems OK. Any help on this problem would be greatly appreciated.

Fastqc report



ADD REPLYlink written 3.4 years ago by Rahul30