Complete genome and simulated reads for any organism whose complete genome has already been determined
7.2 years ago
saranpons3 ▴ 70

Hello all, I am trying to take the complete genome of any organism and give it as input to my Perl script, which randomly generates reads of equal length from that input. Once the reads are generated, I will give them as input to a prototype assembler that I designed and try to assemble them. My assembler does not yet remove tips and bubbles from the de Bruijn graph constructed from the reads (I will try to add tip and bubble removal later).

To test my assembler, I first downloaded the complete genome of Enterobacteria phage lambda from https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/lambda_virus.fa and generated random reads from it using my Perl script. I then ran my assembler on the simulated reads, and it successfully reassembled the original genome.
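The Perl script is not shown in this post. As a rough illustration of this kind of sampling, here is a minimal Python sketch that draws fixed-length reads uniformly at random from a single sequence. The function name `simulate_reads` and its parameters are hypothetical, and the sketch assumes error-free, forward-strand reads, which is far simpler than what real sequencers produce.

```python
import random


def simulate_reads(genome, read_len, n_reads, seed=0):
    """Sample n_reads substrings of length read_len uniformly at random.

    Assumptions (not from the original post): reads are error-free,
    taken from the forward strand only, and windows containing 'N'
    are skipped because they carry no usable sequence.
    """
    genome = genome.upper()  # treat soft-masked (lowercase) bases as normal bases
    if read_len <= 0 or read_len > len(genome):
        raise ValueError("read_len must be between 1 and the genome length")

    rng = random.Random(seed)  # seeded so that runs are reproducible
    max_start = len(genome) - read_len
    reads = []
    attempts = 0
    max_attempts = 100 * n_reads  # guard against genomes that are mostly N

    while len(reads) < n_reads:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("too few N-free windows to sample from")
        start = rng.randint(0, max_start)  # inclusive on both ends
        read = genome[start:start + read_len]
        if "N" in read:
            continue
        reads.append(read)
    return reads


if __name__ == "__main__":
    toy_genome = "ACGTACGTTAGCnnnnACGTTGCA"  # made-up sequence for illustration
    for read in simulate_reads(toy_genome, read_len=5, n_reads=3):
        print(read)
```

Uniform sampling like this gives uneven coverage by chance, so the number of reads has to be large relative to genome length divided by read length before every position is likely to be covered.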

In the lambda virus complete genome file (https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/lambda_virus.fa), only one line starts with the character '>'.

For another experiment, I wanted to test my assembler on a larger complete genome. So I downloaded the complete genome of Drosophila melanogaster, the file GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz).

As with the lambda virus file, I expected GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna to have only one line starting with the character '>'. But many lines start with '>'. I also see that many of the A, T, C, G bases are in lower case, and that there are many 'N's.

Now I would like to know why, in the file ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz, many lines start with the character '>', many A, T, C, G bases are in lower case, and there are many N's.

How can I generate simulated reads from any organism's complete genome? Or where can I freely download simulated reads for any organism?

Thanks in advance.

Assembly genome • 1.8k views

This software is very popular for simulating reads.


Thanks for the answer

7.2 years ago
Michael 54k

First, don't reinvent the wheel (or the read simulator); see "NGS reads simulation".

Second, why are there multiple '>' entries in a genome FASTA file? Most genomes consist of multiple replicons or chromosomes; only viral and some bacterial/archaeal genomes consist of a single replicon. In addition, in practice no shotgun assembly reconstructs even those replicons perfectly. Instead it generates smaller fragments, a.k.a. contigs, which can be joined into scaffolds. Those may or may not be placed on the chromosomes.
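To check how many records a FASTA file holds and how much of each record is lowercase or 'N', a small summary pass is enough. The sketch below is one way to do it in Python; `summarize_fasta` and the toy records are made up for illustration. As far as I know, lowercase in NCBI genomic FASTA files marks soft-masked repeats and runs of N mark unknown bases or gaps, but check the README that accompanies the download to be sure.

```python
def summarize_fasta(lines):
    """Return {record_name: {"length", "lowercase", "n"}} for FASTA lines.

    The record name is taken as the first whitespace-delimited token
    after '>'. Lowercase 'n' is counted in both "lowercase" and "n".
    """
    summary = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            summary[name] = {"length": 0, "lowercase": 0, "n": 0}
        elif name is not None:
            stats = summary[name]
            stats["length"] += len(line)
            stats["lowercase"] += sum(1 for base in line if base.islower())
            stats["n"] += line.upper().count("N")
    return summary


if __name__ == "__main__":
    # Toy input standing in for a real multi-record genome file.
    example = [
        ">chr_demo_1 hypothetical record",
        "ACGTacgtNNNN",
        "ACGT",
        ">chr_demo_2",
        "nnACGT",
    ]
    for record, stats in summarize_fasta(example).items():
        print(record, stats)
```

For a real gzipped file, the same function can be fed `gzip.open(path, "rt")` from the standard library, since it only needs an iterable of text lines.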


Thanks for the answer.
