Question

What do lowercase a, t, c, g refer to in a complete genome? Should lowercase a, t, c, g be converted to uppercase A, T, C, G before simulation?

1

7.2 years ago

saranpons3 ▴ 70

Hello All,

In the complete genome file of the fruit fly (GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz), A, T, C, G appear in lowercase in many places. The reason given in the README.txt is that repetitive sequences in eukaryotes are masked to lowercase. If I want to generate random/simulated reads from this complete genome, should I convert the lowercase a, t, c, g to uppercase A, T, C, G before simulation? Thanks in advance.

genome Assembly • 7.4k views

ADD COMMENT • link 7.2 years ago by saranpons3 ▴ 70

1

Lower case letters can indicate low-confidence calls or masked bases. Sounds like in this case they are masked due to being repetitive. You will most closely approximate real data by converting them to upper case before generating simulated reads.
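For anyone who wants to do the conversion in a script, here is a minimal sketch in Python. The function names `uppercase_fasta` and `convert_file` are made up for illustration and are not part of any existing tool. It assumes a standard FASTA layout in which header lines start with `>`, and it leaves those headers untouched while uppercasing only the sequence lines.

```python
import gzip


def uppercase_fasta(lines):
    """Yield FASTA lines with sequence lines uppercased.

    Header lines (starting with '>') are passed through unchanged so that
    sequence names and descriptions keep their original case.
    """
    for line in lines:
        yield line if line.startswith(">") else line.upper()


def convert_file(in_path, out_path):
    """Write an uppercased copy of a FASTA file (hypothetical helper).

    Assumption: a path ending in '.gz' is gzip-compressed; anything else is
    read as plain text. The output is always written as plain text.
    """
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        dst.writelines(uppercase_fasta(src))
```

Something like `convert_file("genome.fna.gz", "genome.upper.fna")` (placeholder file names) would then give a file to hand to the read simulator. Note that `N` and other ambiguity codes are simply uppercased along with everything else.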

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Hello Brian, Thanks for the answer.

"You will most closely approximate real data by converting them to upper case before generating simulated reads". Does this statement mean that I can convert lowercase a, t, c, g to uppercase A, T, C, G before generating simulated reads?

Can you tell me, in practice, what would be the sequencing depth for sequencing the human genome?

Thanks in advance.

ADD REPLY • link 7.2 years ago by saranpons3 ▴ 70

1

"Does this statement mean that i can convert lower case a,t,c,g to uppercase A,T,C,G before generating simulation reads?"

Yes.

"Can you tell me, in practice, what would be the sequencing depth for sequencing the human genome?"

It can be any depth, but often 30x is targeted for variant-calling.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

What about for de novo assembly?

ADD REPLY • link 7.2 years ago by saranpons3 ▴ 70

0

Again, it varies. I recommend at a minimum 30x per ploidy, and preferably at least 100x per ploidy, so 200x for a diploid, for real unamplified 2x150bp Illumina reads. With simulated data you don't need as much because there's no bias, so the coverage is uniform. Also, the longer the reads are, the less coverage you generally need.

Typically when people are assembling a complicated genome like human purely from Illumina reads, they use multiple libraries with different insert sizes: mostly short insert (potentially overlapping reads) with a smaller amount of coverage (say, 5x) of long-mate-pair libraries for scaffolding.

These days there are other options like 10X (meaning 10X Genomics the company, not "10-fold coverage") linked reads and PacBio very long reads, which can have different coverage requirements and assembly approaches. You can, incidentally, simulate PacBio data with the BBMap package, but I don't know of any existing 10X simulators.
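To turn a target depth into a number of reads to simulate, the usual arithmetic is depth times genome size divided by bases per read (or per pair). Here is a rough sketch in Python; `reads_needed` is a hypothetical helper, and the 100 Mbp genome size in the example is a made-up round number, not the actual size of any genome discussed here.

```python
import math


def reads_needed(genome_size_bp, depth, read_length_bp, paired=True):
    """Estimate how many reads (or read pairs) give the requested depth.

    depth is the total fold coverage over the haploid genome size, so
    "100x per ploidy" on a diploid means depth=200.
    Returns read pairs when paired=True, single reads otherwise.
    """
    total_bases = genome_size_bp * depth
    bases_per_unit = read_length_bp * (2 if paired else 1)
    return math.ceil(total_bases / bases_per_unit)


# Illustrative only: a hypothetical 100 Mbp genome, 2x150bp reads.
pairs_30x = reads_needed(100_000_000, 30, 150)        # 30x total depth
pairs_200x = reads_needed(100_000_000, 100 * 2, 150)  # 100x per ploidy, diploid
print(pairs_30x, pairs_200x)
```

This ignores real-world losses (adapter trimming, duplicates, uneven coverage), which is part of why real data needs more depth than simulated data.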