Question: How to properly measure the length of a reference genome
0
gravatar for elcortegano
16 months ago by
elcortegano60
elcortegano60 wrote:

I'm working in a non-model organism for which we do have a reference genome. I was recently learning about the k-mer sprectrum and learned that it could be used to estimate the length of a genome following a pretty straightforward tutorial.

This method however requires to set a length k for the k-mers, and according to some authors, a proper measure should itself depend on the genome length (Chor et la 2009), which is reasonable.

Thus, I run jellyfish following the previous tutorial and estimate the genome length trying both k=10 and k=14, based on the genome length of 120Mb from a closely related species.

The conflict comes with the different outputs. With k=10 the spectrum looks great, as a typical unimodal distribution. The peak of the distribution is also close to the real coverage, so again this is great news. But the genome length estimate is about 3-4Mb!! if I use k=14 the opposite happens: the distribution is all skewed to the left, but the estimate is roughly 128Mb, which is not weird given the organism.

How would you double check which of these outcomes is correct? Another question is, if I wouldn't have this previous idea on the genome length (~120Mb) how would one deal with this problem?

It is not simpler to get a VCF file with base pair resolution and count the number of total positions? isn't this adequate? how do you measure the genome length in practice?

next-gen k-mer genome • 367 views
ADD COMMENTlink modified 12 months ago by John Marshall2.2k • written 16 months ago by elcortegano60
2

Here is one more guide for genome size estimation. It says that

The k should be sufficiently large that most of the genome can be distinguished. For most eukaryotic genomes at least 17 are usually used and calculation upto 31 is easiliy doable with Jellyfish.

so perhaps the k you used is too small.

This is a new mapping based approach.

ADD REPLYlink modified 16 months ago • written 16 months ago by GenoMax94k

we do have a reference genome.

What about counting the number of bases in the reference?

It is not simpler to get a VCF file with base pair resolution and count the number of total positions?

How do you think that VCF file was obtained in the first place? Could it be based on the reference genome?

ADD REPLYlink written 16 months ago by WouterDeCoster45k

The reference genome is given in FASTA format, thus it is not that straightforward to count the number of bases.

The VCF is an option for sure, but it also depends on the quality of alignments, and the full reference genome could not be completely represented in the alignment files and consequently in the VCF.

ADD REPLYlink written 16 months ago by elcortegano60
3
gravatar for WouterDeCoster
16 months ago by
Belgium
WouterDeCoster45k wrote:

The reference genome is given in FASTA format, thus it is not that straightforward to count the number of bases.

What?

grep -v '^>' yourgenome.fa | wc -m
ADD COMMENTlink written 16 months ago by WouterDeCoster45k

Well, that's a nice approach, it is very appreciated. I was thinking of a FASTA file with some redundancy between reads, but it's obviously not the case since it is an already reference genome. Sorry for the confusion, and thank you again

ADD REPLYlink written 16 months ago by elcortegano60
2
gravatar for John Marshall
12 months ago by
John Marshall2.2k
Glasgow, Scotland
John Marshall2.2k wrote:

Even better and easier to use a tool that understands what's in a FASTA file. For example, index the FASTA file using

samtools faidx yourgenome.fa

and add up the second column of the resulting yourgenome.fa.fai file.

ADD COMMENTlink modified 12 months ago • written 12 months ago by John Marshall2.2k
1
gravatar for elcortegano
12 months ago by
elcortegano60
elcortegano60 wrote:

I have noted that answer above counts the end line delimiters as characters, thus inflating genome length.

Code below generates a table of lengths by contig size. The sum of these lengths gives a more accurate estimation of the genome length:

#/bin/sh
FASTA=$1
OUT=$2
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' $FASTA | sed ':a;N;$!ba;s/\n/ /g' | sed 's/>/\n/g' | sed '/^[[:space:]]*$/d' | sed 's/[[:space:]]*$//' | tr -s ' ' ',' | sed '1i CHROM,LENGTH' > $OUT
ADD COMMENTlink written 12 months ago by elcortegano60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2605 users visited in the last hour
_