Question

How to properly measure the length of a reference genome

0

Entering edit mode

4.6 years ago

elcortegano ▴ 200

I'm working in a non-model organism for which we do have a reference genome. I was recently learning about the k-mer sprectrum and learned that it could be used to estimate the length of a genome following a pretty straightforward tutorial.

This method however requires to set a length k for the k-mers, and according to some authors, a proper measure should itself depend on the genome length (Chor et la 2009), which is reasonable.

Thus, I run jellyfish following the previous tutorial and estimate the genome length trying both k=10 and k=14, based on the genome length of 120Mb from a closely related species.

The conflict comes with the different outputs. With k=10 the spectrum looks great, as a typical unimodal distribution. The peak of the distribution is also close to the real coverage, so again this is great news. But the genome length estimate is about 3-4Mb!! if I use k=14 the opposite happens: the distribution is all skewed to the left, but the estimate is roughly 128Mb, which is not weird given the organism.

How would you double check which of these outcomes is correct? Another question is, if I wouldn't have this previous idea on the genome length (~120Mb) how would one deal with this problem?

It is not simpler to get a VCF file with base pair resolution and count the number of total positions? isn't this adequate? how do you measure the genome length in practice?

next-gen genome k-mer • 1.9k views

ADD COMMENT • link updated 4.3 years ago by John Marshall 3.0k • written 4.6 years ago by elcortegano ▴ 200

2

Entering edit mode

Here is one more guide for genome size estimation. It says that

The k should be sufficiently large that most of the genome can be distinguished. For most eukaryotic genomes at least 17 are usually used and calculation upto 31 is easiliy doable with Jellyfish.

so perhaps the k you used is too small.

This is a new mapping based approach.

ADD REPLY • link 4.6 years ago by GenoMax 141k

0

Entering edit mode

we do have a reference genome.

What about counting the number of bases in the reference?

It is not simpler to get a VCF file with base pair resolution and count the number of total positions?

How do you think that VCF file was obtained in the first place? Could it be based on the reference genome?

ADD REPLY • link 4.6 years ago by WouterDeCoster 47k

0

Entering edit mode

The reference genome is given in FASTA format, thus it is not that straightforward to count the number of bases.

The VCF is an option for sure, but it also depends on the quality of alignments, and the full reference genome could not be completely represented in the alignment files and consequently in the VCF.

ADD REPLY • link 4.6 years ago by elcortegano ▴ 200

2

Entering edit mode

4.3 years ago

John Marshall 3.0k

Even better and easier to use a tool that understands what's in a FASTA file. For example, index the FASTA file using

samtools faidx yourgenome.fa

and add up the second column of the resulting yourgenome.fa.fai file.

ADD COMMENT • link 4.3 years ago by John Marshall 3.0k

1

Entering edit mode

4.3 years ago

elcortegano ▴ 200

I have noted that answer above counts the end line delimiters as characters, thus inflating genome length.

Code below generates a table of lengths by contig size. The sum of these lengths gives a more accurate estimation of the genome length:

#/bin/sh
FASTA=$1
OUT=$2
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' $FASTA | sed ':a;N;$!ba;s/\n/ /g' | sed 's/>/\n/g' | sed '/^[[:space:]]*$/d' | sed 's/[[:space:]]*$//' | tr -s ' ' ',' | sed '1i CHROM,LENGTH' > $OUT

ADD COMMENT • link 4.3 years ago by elcortegano ▴ 200

score 3 · Accepted Answer · 2019-09-26

3

Entering edit mode

4.6 years ago

WouterDeCoster 47k

The reference genome is given in FASTA format, thus it is not that straightforward to count the number of bases.

What?

grep -v '^>' yourgenome.fa | wc -m

ADD COMMENT • link 4.6 years ago by WouterDeCoster 47k

0

Entering edit mode

Well, that's a nice approach, it is very appreciated. I was thinking of a FASTA file with some redundancy between reads, but it's obviously not the case since it is an already reference genome. Sorry for the confusion, and thank you again

ADD REPLY • link 4.6 years ago by elcortegano ▴ 200