Question

How to select the right reference genome

19 months ago

rj.rezwan • 0

Hi, I have Illumina paired-end sequencing data for several accessions of a particular plant species, and I want to map the reads to an available genome and then call variants. Two files are available as the reference genome: one contains scaffold data and is 1.3 GB (https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/176265/), while the other contains the genome assembly and is ~339 MB (http://www.pitayagenomic.com/download.php). Which one should I use as the reference genome for mapping the reads?

genome reads mapping diploid • 1.0k views

score 1 · Answer 1 · 2022-09-22

339 MB is not a genome length; it's the size of a FASTA file compressed with gzip. I suppose that after you decompress it, the genome size you'll see will be approximately 1.3 Gbp.
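To check the assembly length directly rather than inferring it from the file size, one can sum the sequence lengths in the FASTA file. The following is a minimal sketch; the function names `total_bases` and `fasta_size` are made up for illustration, and it assumes a standard FASTA layout where header lines start with `>`.

```python
import gzip
import io


def total_bases(handle):
    """Sum the sequence lengths over all records in a FASTA text handle."""
    total = 0
    for line in handle:
        line = line.strip()
        # Header lines start with ">" and are not part of the sequence.
        if line and not line.startswith(">"):
            total += len(line)
    return total


def fasta_size(path):
    """Return the total length in bp of a plain or gzip-compressed FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return total_bases(handle)


# Tiny in-memory demonstration with a made-up two-record FASTA.
demo = io.StringIO(">scaffold1\nACGTACGT\nACGT\n>scaffold2\nNNNNAC\n")
print(total_bases(demo))  # 18
```

Calling `fasta_size` on each downloaded file (the path is whatever you saved it as) gives two numbers that can be compared directly. Note that this count includes `N` gap characters.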

These two genome assemblies were produced by different laboratories and published in the same year in the same journal: https://www.nature.com/articles/s41438-021-00501-6 and https://www.nature.com/articles/s41438-021-00612-0. The assemblies have similar scaffold N50 values, but the one from http://www.pitayagenomic.com/download.php has a contig N50 that is 19 times larger, so I suppose it's more accurate, since scaffolding of short contigs is error-prone.
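For readers unfamiliar with the two metrics, here is a rough sketch of how scaffold N50 and contig N50 can be computed. It assumes that contigs are obtained by splitting each scaffold at runs of `N`; the helper names and the toy sequences are hypothetical, and published assembly statistics may use a minimum gap length or other conventions.

```python
import re


def n50(lengths):
    """Return the length L such that sequences of length >= L
    make up at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def contig_lengths(scaffold, min_gap=1):
    """Split a scaffold at runs of N (gaps) and return the contig lengths."""
    pieces = re.split(r"[Nn]{%d,}" % min_gap, scaffold)
    return [len(piece) for piece in pieces if piece]


# Toy example: one gapped scaffold and one short ungapped scaffold.
scaffolds = ["ACGT" * 5 + "N" * 10 + "ACGT" * 2, "ACGTAC"]
scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([n for s in scaffolds for n in contig_lengths(s)])
print(scaffold_n50, contig_n50)  # 38 20
```

The toy example shows why the two values can diverge: a long scaffold stitched together from shorter contigs raises the scaffold N50 without raising the contig N50.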

score 0 · Answer 2 · 2022-09-22

19 months ago

Istvan Albert 100k

Note: this question should be titled "How do I select the proper reference genome", so I have changed the title as a moderator.

The answer is that you have to assess the completeness and quality of each genome and then decide which one is more suitable for your needs. Read the publications that discuss the differences and tradeoffs.

In the next step, create a modular pipeline so you can rerun your analysis with minimal fuss with both genomes.

Now you can evaluate and characterize the anticipated differences and the observed ones.
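A modular pipeline of the kind suggested above could be sketched as follows. This is only an illustration, and it rests on assumptions the answer does not make: it uses `bwa`, `samtools` and `bcftools` as the mapping and variant-calling tools, and the file names, the `results` directory and the function `build_commands` are all hypothetical. The sketch only builds and prints the shell commands; it does not run them.

```python
from pathlib import PurePosixPath


def build_commands(reference, reads_1, reads_2, sample, outdir="results", threads=4):
    """Return the shell commands for mapping one sample to one reference
    and calling variants. Tool choice (bwa, samtools, bcftools) is an assumption."""
    # Keep the outputs for each reference in their own directory.
    tag = PurePosixPath(reference).name.split(".")[0]
    out = PurePosixPath(outdir) / tag
    bam = out / f"{sample}.bam"
    stats = out / f"{sample}.flagstat.txt"
    vcf = out / f"{sample}.vcf.gz"
    return [
        f"mkdir -p {out}",
        f"bwa index {reference}",
        f"bwa mem -t {threads} {reference} {reads_1} {reads_2} | samtools sort -o {bam} -",
        f"samtools index {bam}",
        f"samtools flagstat {bam} > {stats}",
        f"bcftools mpileup -f {reference} {bam} | bcftools call -mv -Oz -o {vcf}",
    ]


# Hypothetical file names: the same reads are processed against both references.
references = ["assembly_A.fa", "assembly_B.fa"]
for ref in references:
    for cmd in build_commands(ref, "sample1_R1.fastq.gz", "sample1_R2.fastq.gz", "sample1"):
        print(cmd)
```

Because the reference is the only argument that changes, rerunning the analysis with the other genome is a one-line change, and the per-reference `flagstat` reports and VCF files can then be compared side by side.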