Question

coverage of fastq file

1

6.8 years ago

jianzheng934963534 ▴ 20

Hi all. I downloaded an Illumina fastq file with the NCBI SRA Toolkit and then calculated the coverage as the average depth with samtools.

But the result is something like 16.12 or 18.07, so I guess the coverage may be 20x.

Is there any way to get the coverage information directly from the accession number (like SRR065390)?

Thanks for your help!

genome next-gen sequencing alignment coverage • 8.8k views

ADD COMMENT • link updated 6.8 years ago by Istvan Albert 100k • written 6.8 years ago by jianzheng934963534 ▴ 20

0

You can check whether the information is available at NCBI's SRA (https://www.ncbi.nlm.nih.gov/sra). If not, you have to align the data and then calculate coverage with, e.g., BEDtools genomecov or SAMtools depth. Simply taking the number of reads in the fastq, together with the fragment length and genome size, to get an average "theoretical" coverage is not really safe, as the run might contain a notable number of unmappable reads, which do not add coverage information.
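As a rough illustration of the alignment-based route, the sketch below averages the per-position depth from the text output of `samtools depth`, assuming the usual three tab-separated columns (reference name, position, depth). The function name `mean_depth` and the optional `genome_size` argument are hypothetical and only serve this example. Without the `-a` option, positions with zero depth are not reported, so dividing by the number of reported lines can overestimate the genome-wide mean.

```python
import io


def mean_depth(lines, genome_size=None):
    """Average depth from `samtools depth`-style lines (chrom, pos, depth)."""
    total = 0
    positions = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Only the third column (depth) is needed for the average.
        depth = line.split("\t")[2]
        total += int(depth)
        positions += 1
    # If a genome size is given, use it so unreported zero-depth
    # positions still count in the denominator.
    denominator = genome_size if genome_size is not None else positions
    return total / denominator if denominator else 0.0


# Tiny made-up input: position 3 is missing, i.e. it had zero depth.
example = "chr1\t1\t10\nchr1\t2\t20\nchr1\t4\t30\n"
print(mean_depth(io.StringIO(example)))                 # mean over reported positions
print(mean_depth(io.StringIO(example), genome_size=4))  # mean over the whole 4 bp "genome"
```

In practice the lines would come from a file written by something like `samtools depth -a aln.bam > depth.txt`, where `aln.bam` is a placeholder for your own sorted alignment file.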

ADD REPLY • link 6.8 years ago by ATpoint 81k

0

I know there may be problems with doing so, but I really cannot find the coverage information in NCBI's SRA. By calculating the coverage I only get the average depth as a floating-point number, but many papers describe the data as, e.g., 20x, so the coverage is an integer. So I am quite confused about it.

ADD REPLY • link 6.8 years ago by jianzheng934963534 ▴ 20

0

If coverage has been mentioned without any alignments then it is most likely "theoretical" coverage relative to genome size in bases.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

Is there any way to get the coverage information directly from the accession number (like SRR065390)?

Short answer is no. Data in SRA is generally going to be raw fastq reads.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

But some papers mention the coverage of these fastq reads, and it is usually an integer. I wonder how they get it.

ADD REPLY • link 6.8 years ago by jianzheng934963534 ▴ 20

0

...by alignment followed by coverage calculation, as stated above, or by naively calculating it based on read count, fragment length and genome size (which is not precise at all).
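For the naive route, the read count and total number of sequenced bases can be taken straight from the fastq file. The sketch below assumes an uncompressed fastq with exactly four lines per record (no wrapped sequences); `fastq_totals` is a hypothetical helper name. Dividing the total bases by the genome size then gives the rough, alignment-free estimate.

```python
import io


def fastq_totals(handle):
    """Return (read_count, total_bases) for a 4-line-per-record fastq stream."""
    reads = 0
    bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            reads += 1
            bases += len(line.strip())
    return reads, bases


# Two made-up records of length 4 and 6.
example = "@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"
reads, bases = fastq_totals(io.StringIO(example))
genome_size = 5  # illustrative value only
print(reads, bases, bases / genome_size)
```

Summing actual sequence lengths rather than multiplying by a fixed read length also copes with trimmed reads of varying length.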

ADD REPLY • link 6.8 years ago by ATpoint 81k

0

The main problem is that those papers give the coverage as an integer (such as 20x), but by calculating I only get a floating-point number that does not match exactly. If it is 18.75 (for example), I can guess that the coverage may be 20x. So I think there must be information about the coverage of the fastq reads somewhere (through NCBI), but I went through the whole page and cannot find it. Does anyone know where I can find such information? Thanks!

ADD REPLY • link 6.8 years ago by jianzheng934963534 ▴ 20

0

Coverage is a fuzzy entity to some extent. It may hold more credibility for prokaryotic genomes, where the entire genome is sequenced. For many eukaryotes, parts of the genome (centromeres, telomeres, etc.) are still not known.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

if it is 18.75 (for example), I can guess that the coverage may be 20x

If it's 18.75 then it's 18.75 and not 20x. Strange that you think your guess is better than a calculated number.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

I mean that some papers mention the file is 20x, but by calculating I get 18.75. So how do they determine that the fastq file is 20x?

ADD REPLY • link 6.8 years ago by jianzheng934963534 ▴ 20

0

They might mention that the genome was covered minimally at 20x. Or that they only consider the regions which were covered at at least 20x and discard the rest. Or that they aimed for 20x.

Also: a fastq file doesn't have a coverage. You can only calculate that after aligning the file to the genome.
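To make the difference between a mean depth and an "at least 20x" statement concrete, here is a small sketch that reports both from a list of per-position depths. The depth values are invented for illustration, and `coverage_summary` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
def coverage_summary(depths, threshold=20):
    """Return (mean depth, fraction of positions with depth >= threshold)."""
    n = len(depths)
    if n == 0:
        return 0.0, 0.0
    mean = sum(depths) / n
    breadth = sum(1 for d in depths if d >= threshold) / n
    return mean, breadth


# Made-up depths for eight positions.
depths = [25, 30, 22, 5, 0, 28, 21, 19]
mean, breadth = coverage_summary(depths)
print(f"mean depth: {mean:.2f}x")
print(f"fraction of positions at >= 20x: {breadth:.3f}")
```

With these numbers the mean depth is 18.75x even though only part of the positions reach 20x, so a rounded "20x" in a paper may describe a target or a threshold rather than a measured average.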

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

score 5 · Answer 1 · 2017-07-15

You can't accurately predict the coverage before aligning reads - but if the genome size is known you could estimate the value:

(number of reads) x (length of each read) / (length of the genome)

For paired-end reads, double the number of reads (that is, account for both mates of each pair).

For data of reasonable quality, this estimate is usually accurate to within 10-15%.
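The formula can be written as a small function. This is only a sketch: the name `estimated_coverage` and the `paired` flag are hypothetical, and the numbers in the example are made up rather than taken from any real run.

```python
def estimated_coverage(num_reads, read_length, genome_size, paired=False):
    """Theoretical coverage: reads x read length / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    # For paired-end data, num_reads is taken as the number of pairs,
    # so both mates are counted by doubling it.
    total_reads = num_reads * 2 if paired else num_reads
    return total_reads * read_length / genome_size


# Illustrative values: 1,000,000 read pairs of 100 bp on a 10 Mb genome.
print(estimated_coverage(1_000_000, 100, 10_000_000, paired=True))
```

Because unmapped or low-quality reads still count here, the value should be treated as an upper-leaning estimate and checked against the depth computed from an actual alignment.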