Q: GenomeScope input and how to interpret the results
1
0
Entering edit mode
2.2 years ago
brooklyn • 0

Hello, I am using GenomeScope to characterize (determine size, ploidy, quality etc) of a de novo green algae genome assembly. My first question is regarding the input to GenomeScope, because the input is using Illumina reads there are both Forward and Reverse sequencing reads but all examples online seem to only be using one input file. Does GenomeScope require both F and R read files as input? Or can you just select one to use? Or do the reads need to be paired as one input together? My second question is if there are any useful tutorials or manuals online on how to interpret the outputted graph profile? Specifically, what does the number of peaks mean? and what is considered a bad amount of error?

Thank you in advance for any assistance to these questions!

Illumina GenomeScope Assembly Genome • 3.0k views
ADD COMMENT
2
Entering edit mode
2.2 years ago
shelkmike ★ 1.2k

The more reads you provide to GenomeScope, the more accurately it will estimate the genome size and other features. GenomeScope doesn't use paired-end information, so you can just combine two files into one

cat reads_R1.fastq reads_R2.fastq >reads.fastq

and then calculate k-mer frequencies for this file.

Regarding the usage of GenomeScope, there is information at https://github.com/schatzlab/genomescope. Also, you can have a look in the paper https://www.nature.com/articles/s41467-020-14998-3

There is always a peak with k-mers that have very low frequency, this peak corresponds to sequencing errors. Also, for a diploid genome you'll see two large peaks that correspond to homozygous k-mers and heterozygous k-mers. For a haploid genome there will be one such peak instead of two. Also, on the far right side of the spectrum you may see small peaks for mitochondrial and plastid genomes. If there were duplications in the genome, you'll see additional peaks.

There is no definite threshold for "bad" amount of errors. Usually, 0.1%-1% of bases in Illumina reads originate from sequencing errors.

ADD COMMENT
0
Entering edit mode

Thank you for your response. I did as suggested and combined my F and R reads to calculate the K-mer frequencies (following the Jellyfish and GenomeScope pipeline on GitHub). I expected my organism to be haploid however I received 2 peaks in the outputted graph. The first bump is noticeably smaller at 20X coverage and the second is at 40X coverage. If I understand correctly, other reasons for the first smaller peak could include 1) sequencing errors, 2) duplication in the genome, or 3) a contaminant organism? GenomeScope also gave me the value uniq:49.9% and dup:0.808%. I received a similar output in BUSCO: C: 94.3%[S:19.8%, D:74.5%] showing I have more duplicated copy of genes than single copies, and my genome size is larger than expected. Trying to figure out if this duplication is worrisome because of technical errors or if it is plausible biological duplication in the genome. Any further advice is appreciated, thank you!

ADD REPLY
1
Entering edit mode

The peak at 20x is definitely not because of sequencing errors. The peak of k-mers that originate from sequencing errors is always at 1x-2x.

I advise to do the following things:

  1. Align reads to the genome and plot a histogram of sequencing coverage. If there is a single peak, this would mean that there is a real duplication in the genome.
  2. If there are two peaks, this would mean that either there is contamination or there were misassemblies that created false duplications.

To check whether the reason is contamination, you can align contigs that have different coverage to the NCBI nt database by Blastn or Megablast and examine the best matches. Maybe you'll find that contigs that have 20x coverage belong to one species and contigs that have 40x coverage to another.

Also, you can try other genome assemblers. I always assemble genomes using at least 5 assemblers and then find which one assembled the genome in the best way.

ADD REPLY
0
Entering edit mode

Thank you again for a great response and for a useful solution! I aligned my reads to the genome however I am having a hard time plotting a histogram to view the entire genome - I tried using 'samtools depth' on my sorted bam files and then used R to plot but it uses a lot of memory and once a plot was generated the file was too large to open. Using BAMstats I am able to view each scaffold separately as a histogram of coverage distribution. A few new questions for you:

  1. Can each scaffold histogram (separately) be used to interpret the same duplication/contamination question? Would we expect the same organism to have equal coverage across the entire genome (can I compare the mean coverage of each scaffold histogram)?
  2. If there are two peaks or uneven coverage within one scaffold histogram - does the same theory apply and two peaks means potential contamination?
  3. Do you have a suggestion how to plot/visualize the whole genome as a histogram?

Thank you again!

ADD REPLY
0
Entering edit mode

You can quickly calculate the coverage histogram for the whole genome with the command

samtools stats --coverage 1,1000,1 alignment.sam | grep ^COV | cut -f 2- >coverage_histogram.txt

samtools stats is a useful tool to calculate various statistics for an alignment.

ADD REPLY

Login before adding your answer.

Traffic: 1771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6