Question: NGS Fastqc explanation
3.2 years ago, GHanumanth404 wrote:

Hello friends! I am new to the Biostars community and to NGS, and I am running into a lot of problems analysing my NGS data. Please correct me if the following definitions are wrong:

  • Read length is the number of sequencing cycles that were run.

  • Total sequence is the actual length of the genome or target that needs to be sequenced.

  • Reads are the bases that are sequenced.

If the above is correct, why is the read length in my FastQC file given as 32-151? If read length means the number of cycles, why is it reported as a range?

Also, can anyone explain these sections of the FastQC report to me: Per base sequence content, Per sequence quality scores, Sequence length distribution, Kmer content?


Welcome to Biostars!

  • Read length: the length of the read (DNA fragment) that has been sequenced.

  • Read length 32-151: the shortest read is 32 bases long and the longest is 151 (BTW, which instrument was used to generate the data?)

  • FastQC report: explained here
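To see where a range such as 32-151 comes from, the shortest and longest read can be pulled straight from the FASTQ file. The following is only a minimal sketch: it assumes a standard four-line-per-record FASTQ (no wrapped sequence lines), and the read names and sequences are made up so the example is self-contained.

```shell
# Build a tiny three-read FASTQ file (illustrative data, not real reads).
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'ACGT' '+' 'IIII' \
  '@read3' 'ACGTACG' '+' 'IIIIIII' > "$fq"

# Line 2 of every 4-line record is the sequence.
# Track the shortest and longest sequence seen.
length_range=$(awk 'NR % 4 == 2 {
    n = length($0)
    if (min == "" || n < min) min = n
    if (n > max) max = n
} END { print min "-" max }' "$fq")

echo "$length_range"   # prints 4-10 for the toy file above
rm -f "$fq"
```

On real data, replace the temporary file with your own FASTQ file. If every read has the same length, both numbers come out equal.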

modified 3.2 years ago • written 3.2 years ago by venu

If it means the length of the DNA fragment sequenced, then what is total sequence? Does total sequence mean DNA + adapters?

written 3.2 years ago by GHanumanth404

I was confused by 'Total sequence'; it is actually 'Total Sequences'. From the FastQC manual linked above:

Total Sequences: A count of the total number of sequences processed. There are two values reported, actual and estimated. At the moment these will always be the same. In the future it may be possible to analyse just a subset of sequences and estimate the total number, to speed up the analysis, but since we have found that problematic sequences are not evenly distributed through a file we have disabled this for now.

So it is the total number of reads present in your FASTQ file. To check it yourself, take the first 4-5 characters of a read ID (the part that is the same in all read IDs) and count the lines that start with them, which gives the total number of reads:

grep -c '^@HWI' foo.fastq
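The `@HWI` prefix depends on the instrument that named the reads, and a quality line can also start with `@`, so a prefix that is too short can over-count. A hedged alternative, assuming a standard four-line-per-record FASTQ, is to divide the line count by four. The file below is a made-up toy example:

```shell
# Toy FASTQ with three reads; the second read's quality line starts with '@'.
fq=$(mktemp)
printf '%s\n' \
  '@HWI-demo:1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@HWI-demo:2' 'ACGT' '+' '@@@@' \
  '@HWI-demo:3' 'ACGTACG' '+' 'IIIIIII' > "$fq"

# Each read occupies exactly 4 lines, so reads = lines / 4.
n_reads=$(awk 'END { print NR / 4 }' "$fq")

# For comparison: counting every line that starts with '@' also
# picks up the '@@@@' quality line.
n_at=$(grep -c '^@' "$fq")

echo "reads by line count: $n_reads"   # 3
echo "lines starting with @: $n_at"    # 4
rm -f "$fq"
```

For a gzip-compressed file, the same count can be taken with `gzip -dc foo.fastq.gz | awk 'END { print NR / 4 }'`.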
modified 3.2 years ago • written 3.2 years ago by venu

I think it means the total number of sequences. The sequences can have different lengths (here the sequence length is 31-151) or all the same length (for example, a sequence length of 150). Am I correct?

written 3.2 years ago by GHanumanth404

And what is the meaning of "The overall %GC of all bases in all sequences"? If %GC means the GC content of the entire genome, then what does "all bases in all sequences" mean?

written 3.2 years ago by GHanumanth404

"All bases in all sequences" refers to the bases that are actually present in your sequence file.

That number should match the value for your genome (unless the sampling was non-uniform or you have contamination).

modified 3.2 years ago • written 3.2 years ago by genomax

%GC means the GC content of my sample, I mean of my sequences. Then what does "all bases" mean here? Is it comparing every base at every position of my sequences?

written 3.2 years ago by GHanumanth404

Out of the total bases present (A/C/G/T) in your file, %GC is the percentage that are G or C (with no consideration of their position/location).
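As a rough illustration of that definition, the overall %GC can be computed by pooling the bases of every read and ignoring position. This is a sketch under stated assumptions, not FastQC's own code: it assumes a four-line-per-record FASTQ, counts only A/C/G/T in the denominator (so N is ignored), and uses made-up reads.

```shell
# Toy FASTQ: 20 bases in total, 8 of them G or C.
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'GGCCAATT' '+' 'IIIIIIII' \
  '@read2' 'GCGC' '+' 'IIII' \
  '@read3' 'ATATATAT' '+' 'IIIIIIII' > "$fq"

# gsub() returns the number of substitutions it made; replacing each
# match with itself ("&") turns it into a character counter.
gc_percent=$(awk 'NR % 4 == 2 {
    s = toupper($0)
    acgt += gsub(/[ACGT]/, "&", s)   # all called bases
    gc   += gsub(/[GC]/, "&", s)     # G and C only
} END { printf "%.1f\n", 100 * gc / acgt }' "$fq")

echo "$gc_percent"   # prints 40.0 for the toy file above
rm -f "$fq"
```

The bases are pooled across all reads before the division, which is why the position of a base within a read plays no role in the result.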

modified 3.2 years ago • written 3.2 years ago by genomax

Hi,

I have an Illumina MiSeq dataset for a parasite genome. The machine itself gave the paired-end reads as two separate datasets, one forward (R1) and the other reverse (R2). When using the FastQC tool on one set, e.g. when filtering reads <70 bp in the R1 dataset, should we treat R1 as paired-end or not?

Thanks

written 3.0 years ago by sumudu_rangika