Question

De Novo Assembly With 1X Coverage?

12.1 years ago

Biomonika (Noolean) 3.2k

Hi guys,

I am working with 454 reads at 1X coverage. To find centromeric repeats, I was thinking about assembling the reads and then looking for repetitive regions in the assembly. However, I am not sure how much sense it makes to assemble reads at such low coverage.

Any help will be appreciated. Thanks.

EDIT: Thank you very much for the answers. I have decided to reformulate the problem. I have 454 reads at 1X coverage from Cardamine rivularis, and my aim is to find the centromeric repeats. There are two basic approaches: looking for repeats in the raw reads, or looking for repeats in an assembly (whose usefulness is doubtful at this coverage).

I have also had the idea of using centromeric repeats from other species as reference sequences and trying to map the reads onto them. If the repeat from Cardamine were similar (which is not likely :), this could work.

I will hold off on accepting an answer until I have tried these :) Comments and discussion are still more than welcome, and thanks a lot for the answers and comments so far.

assembly clustering • 3.8k views

ADD COMMENT • link updated 12.1 years ago by lexnederbragt ★ 1.3k • written 12.1 years ago by Biomonika (Noolean) 3.2k

At 1X coverage, working with raw reads is probably better than assembly. At least for human, centromeres are mostly imperfect satellite repeats. I do not think at 1X you can get more from assembly.

ADD REPLY • link 12.1 years ago by lh3 33k
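To put rough numbers on the comment above, here is a minimal sketch of the idealized Lander-Waterman expectations for shotgun reads. It assumes uniformly sampled, non-repetitive sequence, a read length of 300 bp, and a minimum detectable overlap of 60 bp; the function name and parameter names are made up for illustration, and the two values are only borrowed from figures mentioned elsewhere in this thread.

```python
import math

def lander_waterman(coverage, read_len, min_overlap):
    """Idealized Lander-Waterman expectations for random shotgun reads.

    Assumes reads are sampled uniformly from unique (non-repetitive) sequence.
    """
    theta = min_overlap / read_len      # fraction of a read needed to detect an overlap
    sigma = 1.0 - theta
    frac_covered = 1.0 - math.exp(-coverage)           # bases covered by at least one read
    p_overlap = 1.0 - math.exp(-coverage * sigma)      # a read is joined to a following read
    reads_per_contig = math.exp(coverage * sigma)      # expected reads per contig (singletons included)
    return frac_covered, p_overlap, reads_per_contig

covered, p_overlap, reads_per_contig = lander_waterman(
    coverage=1.0, read_len=300, min_overlap=60
)
print(f"fraction of genome covered:        {covered:.2f}")
print(f"chance a read joins the next read: {p_overlap:.2f}")
print(f"expected reads per contig:         {reads_per_contig:.2f}")
```

Under these assumptions a typical contig in unique sequence holds only about two reads, which fits the view that assembly adds little at 1X. Repeats violate the model, so contigs built from many more reads than this are the ones worth a closer look.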

De novo assembly is not the right method for detecting highly repetitive regions. Try clustering the sequence reads instead, e.g. with cd-hit or cd-hit-454; see the link below.

ADD REPLY • link 12.1 years ago by Michael 54k

http://biostar.stackexchange.com/questions/1968/how-to-cluster-454-reads/1969#1969

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 12.1 years ago by Michael 54k

Here is a relevant paper in which k-mer frequency spectra from 454 reads were used to characterize centromeric regions in rice: http://bioinformatics.oxfordjournals.org/content/26/17/2101.full

ADD REPLY • link 12.1 years ago by SES 8.6k
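The k-mer idea in the comment above can be sketched roughly as follows. This is not the method from the linked paper, just a minimal illustration of counting k-mers directly in raw reads and building a frequency spectrum. The toy reads, the choice of `k=8`, and the function names are all assumptions; real data would be parsed from a FASTA file, and this version does not merge reverse complements.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts

def frequency_spectrum(counts):
    """Map 'number of occurrences' to 'how many distinct k-mers occur that often'."""
    return Counter(counts.values())

# Toy data: two reads from a 4 bp tandem repeat and one unrelated read.
reads = ["ACGTACGTACGTACGT", "CGTACGTACGTACGTA", "TTGACCGATTGCAAGC"]
counts = kmer_counts(reads, k=8)
spectrum = frequency_spectrum(counts)

for kmer, n in counts.most_common(4):
    print(kmer, n)
print(dict(spectrum))
```

At 1X coverage most k-mers from unique sequence should be seen about once, so the k-mers in the high-count tail of the spectrum are the repeat candidates.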

Which species do you have? Have you considered getting the centromeric repeats directly from the reads?

ADD REPLY • link 12.1 years ago by Christof Winter ★ 1.0k

It is from Cardamine rivularis. The thing is that centromeric repeats tend to be quite long (180 bp in Arabidopsis) and are therefore hard to find in 454 data.

ADD REPLY • link 12.1 years ago by Biomonika (Noolean) 3.2k

Have you tried running RepeatMasker on your reads?

ADD REPLY • link 12.1 years ago by Jeremy Leipzig 22k

454 reads have been reaching 300 bp for several years now. Your (alpha?) satellite unit should fit within a single read.

ADD REPLY • link 12.1 years ago by lh3 33k

score 2 · Answer 1 · 2012-03-05

12.1 years ago

SES 8.6k

I'm guessing you study maize (by your picture), and the centromeric tandem repeats in this species consist of monomers of ~156 bp, which is much shorter than your 454 reads. I agree with others that searching the reads would be a better approach. Also, active maize centromeres contain a specific clade of retrotransposons (appropriately called centromeric retrotransposons in maize, or CRM elements) which you can easily find in repeat databases or in GenBank for searching your reads.

ADD COMMENT • link 12.1 years ago by SES 8.6k

It looks like I guessed wrong about your study system, but Cardamine is a cool genus for studying chromosome evolution. RepeatMasker may be slow if you have a lot of reads, but it is worth trying for identifying known repeats, as Jeremy suggested. Also, give TRF (http://tandem.bu.edu/trf/trf.html) a try, as you might be more likely to identify centromeric repeats specific to your species.

ADD REPLY • link 12.1 years ago by SES 8.6k
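To illustrate why a tandem repeat can be spotted inside a single read, here is a naive self-comparison sketch. It is not TRF's algorithm: it only compares a read with a shifted copy of itself, ignores indels, and uses hypothetical function names and thresholds, so treat it as a way to see the idea rather than a replacement for TRF.

```python
def match_fraction(seq, shift):
    """Fraction of positions where seq[i] equals seq[i + shift]."""
    pairs = len(seq) - shift
    hits = sum(1 for i in range(pairs) if seq[i] == seq[i + shift])
    return hits / pairs

def best_period(seq, min_period=2, max_period=None, min_identity=0.9):
    """Return (period, identity) for the smallest shift whose self-match
    fraction reaches min_identity, or None if no shift qualifies."""
    if max_period is None:
        max_period = len(seq) // 2   # need at least two copies of the unit
    for shift in range(min_period, max_period + 1):
        identity = match_fraction(seq, shift)
        if identity >= min_identity:
            return shift, identity
    return None

# Toy read: five copies of a 10 bp unit with one substitution at position 25.
unit = "ACGTTGCAAT"
read = unit * 5
read = read[:25] + "T" + read[26:]

period, identity = best_period(read)
print(period, identity)
```

With a monomer of roughly 180 bp and reads of about 300 bp, a read would hold fewer than two full copies, so the signal is weaker than in this toy case and longer reads or read clusters would help.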

score 1 · Answer 2 · 2012-03-06

It could actually be worth your while to try an assembly with these reads. At that low coverage, any alignment between reads that the assembler can find indicates that the reads are from repeats. So any of the (probably few) contigs produced by the assembler is a candidate repeat, which drastically reduces the search space relative to searching all the reads. If you intend to use Newbler from 454, I recommend more stringent overlap settings to prevent Newbler from spending a lot of time looking for spurious overlaps (for example, minimum overlap length -ml 60, or 80, or even higher, and a minimum overlap identity of 98%).