Question: Coverage of a FASTQ file
23 months ago by
jianzheng934963534 10 wrote:

Hi all. I downloaded an Illumina fastq file with the NCBI SRA Toolkit, aligned it, and calculated the coverage as the average depth with samtools.

The result is something like 16.12 or 18.07, so I guess the coverage may be 20x.

Is there any way to get the coverage information directly from the accession number (like SRR065390)?

Thanks for your help!

modified 23 months ago by Istvan Albert ♦♦ 80k • written 23 months ago by jianzheng934963534 10

You can check whether the information is available at NCBI's SRA (https://www.ncbi.nlm.nih.gov/sra). If not, you have to align the data and then calculate coverage with e.g. BEDtools genomecov or SAMtools depth. Simply taking the number of reads in the fastq, together with the fragment length and genome size, to get an average "theoretical" coverage is not really safe, as the run might contain a notable number of unmappable reads, which do not add coverage information.
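As a rough illustration of the "align, then calculate coverage" route, here is a minimal Python sketch that averages the per-position depth reported by SAMtools depth. It assumes the usual three tab-separated columns (reference name, position, depth), for example from `samtools depth -a aln.bam > depth.txt`, where `aln.bam` and `depth.txt` are hypothetical file names. Without `-a`, positions with zero depth are omitted, so the mean would only cover positions that have at least one read.

```python
import io


def mean_depth(lines):
    """Return the mean depth over all rows of `samtools depth`-style output.

    Each row is assumed to be: reference name, 1-based position, depth,
    separated by tabs. Blank lines are skipped.
    """
    total = 0
    positions = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _chrom, _pos, depth = line.split("\t")[:3]
        total += int(depth)
        positions += 1
    return total / positions if positions else 0.0


# Hypothetical three-position example, not real data.
example = io.StringIO("chr1\t1\t16\nchr1\t2\t18\nchr1\t3\t20\n")
print(mean_depth(example))  # 18.0
```

For a real file, pass an open file handle instead, e.g. `with open("depth.txt") as fh: print(mean_depth(fh))`.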

modified 23 months ago • written 23 months ago by ATpoint 17k

I know there may be problems with doing so, but I really cannot find the coverage information in NCBI's SRA. When I calculate the coverage myself I only get an average depth as a floating-point number, but many papers describe the data as e.g. 20x, where the coverage is an integer. So I am quite confused about it.

written 23 months ago by jianzheng934963534 10

If coverage has been mentioned without any alignments then it is most likely "theoretical" coverage relative to genome size in bases.

written 23 months ago by genomax 68k

Is there any way to get the coverage information directly from the accession number (like SRR065390)?

Short answer is no. Data in SRA is generally going to be raw fastq reads.

written 23 months ago by genomax 68k

But some papers mention the coverage of these fastq reads, and it is usually an integer. I wonder how they get it.

written 23 months ago by jianzheng934963534 10

...by alignment followed by coverage calculation, as stated above, or by naively calculating it based on read count, fragment length and genome size (which is not precise at all).

written 23 months ago by ATpoint 17k

The main problem is that those papers give the coverage as an integer (such as 20x), but by calculating it I only get a floating-point number, which is not exact. If it is 18.75 (for example), I can guess that the coverage may be 20x. So I think there must be information about the coverage of the fastq reads somewhere at NCBI, but I went through the whole page and cannot find it. Does anyone know where I can find such information? Thanks!

written 23 months ago by jianzheng934963534 10

Coverage is a fuzzy entity to some extent. It may hold more credibility for prokaryotic genomes, where the entire genome is sequenced. For many eukaryotes, parts of the genome (centromeres, telomeres, etc.) are still not known.

written 23 months ago by genomax 68k

If it is 18.75 (for example), I can guess that the coverage may be 20x.

If it's 18.75 then it's 18.75 and not 20x. Strange that you think your guess is better than a calculated number.

written 23 months ago by WouterDeCoster 39k

I mean that some papers say the file is 20x, but when I calculate it I get 18.75. So how do they determine that the fastq file is 20x?

written 23 months ago by jianzheng934963534 10

They might mean that the genome was covered at a minimum of 20x. Or that they only consider the regions that were covered at 20x or more and discard the rest. Or that they aimed for 20x.

Also: a fastq file doesn't have a coverage. You can only calculate that after aligning the file to the genome.
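To make the difference between these readings of "20x" concrete, here is a small Python sketch that summarises a list of per-position depths in three ways: mean depth, minimum depth, and the fraction of positions at or above a threshold. The function name `coverage_summary` and the four depth values are hypothetical and chosen only for illustration.

```python
def coverage_summary(depths, threshold=20):
    """Summarise per-position depths as mean, minimum, and the fraction
    of positions whose depth is at least `threshold`."""
    n = len(depths)
    if n == 0:
        raise ValueError("no positions given")
    return {
        "mean": sum(depths) / n,
        "min": min(depths),
        "fraction_at_threshold": sum(d >= threshold for d in depths) / n,
    }


# Made-up depths: three positions at 20x or more, one poorly covered.
summary = coverage_summary([25, 22, 20, 8])
print(summary["mean"])                   # 18.75
print(summary["min"])                    # 8
print(summary["fraction_at_threshold"])  # 0.75
```

In this toy case the mean is 18.75 even though three quarters of the positions reach 20x, so a statement like "covered at 20x" depends on which of these quantities is meant.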

written 23 months ago by WouterDeCoster 39k
23 months ago by
Istvan Albert ♦♦ 80k (University Park, USA) wrote:

You can't accurately predict the coverage before aligning reads, but if the genome size is known you can estimate the value:

(number of reads) x (length of each read) / (length of the genome)

For paired-end reads, double the number of reads (that is, account for both reads of each pair).

For data of reasonable quality, this estimate is usually accurate to within 10-15%.
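The estimate above can be written as a short Python function. This is only a sketch of the formula; the function name `estimated_coverage`, its `paired` flag, and the numbers in the example are hypothetical, and `n_reads` is taken to mean the number of read pairs when `paired=True`.

```python
def estimated_coverage(n_reads, read_length, genome_size, paired=False):
    """Theoretical coverage: (number of reads) x (read length) / (genome length).

    With paired=True, n_reads is interpreted as the number of read pairs,
    so it is doubled to account for both reads of each pair.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    total_reads = n_reads * 2 if paired else n_reads
    return total_reads * read_length / genome_size


# Made-up example: 1,000,000 single-end reads of 100 bp on a 5 Mb genome.
print(estimated_coverage(1_000_000, 100, 5_000_000))  # 20.0

# The same number of read pairs gives twice the estimate.
print(estimated_coverage(1_000_000, 100, 5_000_000, paired=True))  # 40.0
```

The result is a theoretical value only; the depth measured after alignment will differ depending on how many reads actually map.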

written 23 months ago by Istvan Albert ♦♦ 80k