Question

De novo assembly of paired end reads for small separate genes

4.2 years ago

sloustub1 • 0

Was certainty remaining engrossed applauded sir how discovery. Settled opinion how enjoyed greater joy adapted too shy. Now properly surprise expenses interest nor replying she she. Bore tall nay many many time yet less. Doubtful for answered one fat indulged margaret sir shutters together. Ladies so in wholly around whence in at. Warmth he up giving oppose if. Impossible is dissimilar entreaties oh on terminated. Earnest studied article country ten respect showing had. But required offering him elegance son improved informed.

Out too the been like hard off. Improve enquire welcome own beloved matters her. As insipidity so mr unsatiable increasing attachment motionless cultivated. Addition mr husbands unpacked occasion he oh. Is unsatiable if projecting boisterous insensible. It recommend be resolving pretended middleton.

EDIT by @RamRS:

OP edited their post on 25-Feb-2020 and replaced actual content with the random gibberish above. Original content retrieved from Google Cache:

Hey, I need advice on which assemblers or assembly methods could be used for a specific de novo read assembly problem. I have paired-end reads (2x300) from hundreds of unique genes (see image); each gene is 1500 bp long. The first read of each pair always starts at the beginning of the gene, but the second read starts at a random location. The sequence of the first read may be identical for a couple of genes, but differences exist further along the gene, and these differences should be identified by the second read. Theoretically, the paired reads should allow the full gene sequence to be reconstructed, as the second reads cover the whole gene.

I have made an in silico library of paired-end reads that should simulate the scenario outlined above (no errors were introduced into the sequences). I used this library with spades and abyss assemblers, however these weren't able to reconstruct the initial library completely. Another question is whether separate genes could be identified if they differ by only 10-100 nucleotides. If anyone has advice on assemblers to use or parameters to try, you are very welcome to share.
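The read layout described above can be sketched in a few lines of Python. This is only an illustration of the scenario, not the OP's actual simulation script: the function name `simulate_anchored_pairs`, the choice to report the second read as a reverse complement, and the uniform choice of fragment end are all assumptions.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_anchored_pairs(gene, n_pairs, read_len=300, seed=0):
    """Simulate error-free read pairs for one gene (hypothetical helper).

    Read 1 always covers the first `read_len` bases of the gene.
    Read 2 covers the last `read_len` bases of a fragment whose end is
    chosen uniformly at random (assumption), reported on the opposite strand.
    """
    if len(gene) < read_len:
        raise ValueError("gene must be at least read_len long")
    rng = random.Random(seed)
    read1 = gene[:read_len]
    pairs = []
    for _ in range(n_pairs):
        # The fragment end can fall anywhere from read_len to the gene end.
        end = rng.randint(read_len, len(gene))
        read2 = reverse_complement(gene[end - read_len:end])
        pairs.append((read1, read2))
    return pairs


# Example: one random 1500 bp "gene" and 50 read pairs.
gene_rng = random.Random(42)
gene = "".join(gene_rng.choice("ACGT") for _ in range(1500))
pairs = simulate_anchored_pairs(gene, n_pairs=50)
```

Every pair shares the same first read, so only the second reads carry information about the rest of the gene.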

Notice to sloustub1

If this behavior is repeated, your account will be suspended.

ChIP-Seq • 1.4k views

ADD COMMENT • link 2.0 years ago by sloustub1 • 0

I used this library with spades and abyss assemblers, however these weren't able to reconstruct the initial library completely.

How much sequence did you use for the reconstruction? Generally you don't want more than 100x coverage. Since you are doing something unusual, even less may work better.

Other than the assembly, what is the second most important objective? To find rare SNPs?

ADD REPLY • link 4.2 years ago by GenoMax 141k

Regarding sequence coverage, I've tried different combinations, mainly changing two factors: the number of unique paired reads per gene (varied in the range 10-150) and the number of times each unique paired read is repeated in the dataset (1-100). I've mainly noticed that the number of identifiable genes rises with increased coverage. I've had the most luck with spades, as it gives larger contigs; however, it is still far from ideal (in the best-case scenario I was able to reconstruct 73 out of 100 genes).
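One way to score a simulated run like this is to count how many of the input genes appear in full, on either strand, inside some contig. The sketch below assumes that definition of "reconstructed"; the thread does not say how the 73 out of 100 figure was obtained, and `count_recovered` is a hypothetical helper name.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_recovered(genes, contigs):
    """Count genes that occur completely, on either strand, in any contig."""
    recovered = 0
    for gene in genes:
        rc = reverse_complement(gene)
        if any(gene in contig or rc in contig for contig in contigs):
            recovered += 1
    return recovered


# Toy example: the second gene differs from the first near its end
# and is missing from the contigs.
genes = ["ACGTACGTTT", "ACGTACGAAA", "GGGGCCCCAT"]
contigs = ["TTACGTACGTTTGG", "ATGGGGCCCC"]
print(count_recovered(genes, contigs))  # prints 2
```

Exact substring matching is strict on purpose: with error-free simulated reads, a contig that merges two near-identical genes will fail the test for at least one of them.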

I think I can get better results; I'm just not sure what the right tool or method for contig assembly would be. I guess that, theoretically, this problem is more similar to de novo assembly of RNA-seq data than to genome assembly?

The main objective is to identify the number and sequences of unique genes in the pool. Sequence-to-sequence similarity will vary, with some sequences differing by hundreds of nucleotides and others by only a few (five or more).

ADD REPLY • link 4.2 years ago by sloustub1 • 0

Is that 10-150 reads or 10-150x coverage (in terms of raw bases)? How will these reads be generated in practice (PCR?)? Have you only done a simulation for now?

The main objective is to identify the number and sequences of unique genes in the pool. Sequence-to-sequence similarity will vary, with some sequences differing by hundreds of nucleotides and others by only a few (five or more).

Sounds to me like you need to look at a pan-genome analysis tool as an option rather than assembly.

ADD REPLY • link 4.2 years ago by GenoMax 141k

It's 10-150 reads (the number of read pairs with unique second-read positions; the location and sequence of the first read don't change). Yes, in practice the reads will be generated by PCR, with a constant location for the first primer and random locations for the second primer, so one read of each pair is always the same and the second read varies. Yes, only simulations for now.

Right now I'm not sure how a pan-genome analysis tool would help. For subsequent analysis I'll need full gene sequences, and for that I'll need to assemble them from shorter reads.

ADD REPLY • link 4.2 years ago by sloustub1 • 0

It may be worth trying 10x (and up) reads (in terms of "X" coverage based on the length of the gene). Assemblers may be suffering from data that is too sparse.
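To relate the read counts mentioned earlier in the thread to "X" coverage, the usual arithmetic is total sequenced bases divided by gene length. A minimal sketch with the thread's numbers (2x300 reads, 1500 bp genes) as defaults; the function names are made up for illustration.

```python
def fold_coverage(n_pairs, read_len=300, gene_len=1500):
    """Raw-base coverage: both reads of every pair, divided by gene length."""
    return n_pairs * 2 * read_len / gene_len


def second_read_coverage(n_pairs, read_len=300, gene_len=1500):
    """Average coverage contributed by the second reads alone."""
    return n_pairs * read_len / gene_len


for n in (10, 25, 150):
    print(n, fold_coverage(n), second_read_coverage(n))
# 10 4.0 2.0
# 25 10.0 5.0
# 150 60.0 30.0
```

So 10 unique read pairs per gene is only about 4x in raw bases, and 25 pairs are needed to reach 10x. Because every first read lands on the same 300 bp, the second-read figure is probably the more relevant one for the rest of the gene.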

If you are going to do this with PCR, then think about incorporating a UMI so you can sort the reads based on that (if you are going to pool for sequencing) and then do an alignment/assembly.
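Sorting reads by UMI could look roughly like the sketch below. It assumes, purely for illustration, that the UMI is the first 8 bases of read 1 and that UMIs are read without errors; neither detail comes from the thread, and `group_by_umi` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_umi(pairs, umi_len=8):
    """Bin (read1, read2) pairs by the UMI at the start of read 1.

    The UMI is trimmed from read 1 before the pair is stored, so each
    bin can be aligned or assembled on its own afterwards.
    """
    groups = defaultdict(list)
    for read1, read2 in pairs:
        umi = read1[:umi_len]
        groups[umi].append((read1[umi_len:], read2))
    return dict(groups)


# Toy example with two UMIs.
pairs = [
    ("AAAAAAAAGGGT", "TTTACG"),
    ("AAAAAAAAGGGT", "CCGTAA"),
    ("CCCCCCCCGGGT", "AACGTT"),
]
groups = group_by_umi(pairs)
print({umi: len(members) for umi, members in groups.items()})
```

In real data, UMIs with sequencing errors would need to be merged (for example by allowing one mismatch) before each bin is assembled.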

For now I suggest trying multiple sequence alignments to see if you are able to assemble the reads that way.

ADD REPLY • link 4.2 years ago by GenoMax 141k

score 0 · Answer 1 · 2020-02-11

4.2 years ago

Brice Sarver ★ 3.8k

Check out the approach implemented in ARC: Assembly by Reduced Complexity.

ADD COMMENT • link 4.2 years ago by Brice Sarver ★ 3.8k

Sadly, I won't have a reference to use directly with ARC.

ADD REPLY • link 4.2 years ago by sloustub1 • 0

Ah, misread what you had. Thanks for clarifying!

ADD REPLY • link 4.2 years ago by Brice Sarver ★ 3.8k