I am working on the reconstruction of a ~2 Mb plant genomic region. I have several BAC sequences from this region and PacBio reads from the whole genome.
I would like to extend the non-overlapping BAC sequences with PacBio reads until they can be merged into a single reference sequence (per haplotype).
I read this post (A: Extending ends of sequences with the help of reads?) and decided to start with tadpole. As a preliminary test, I merged the reads from 3 overlapping BACs and tried the following command:
tadpole.sh in=bac.fasta extra=reads.fasta out=extended_bac.fa extendleft=10000 extendright=10000 ibb=f mode=extend k=62
The output sequence was 3 nt longer. That is not as much as expected, but it worked.
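To make clearer what I understand by k-mer-based extension, here is a toy sketch in Python. It is not Tadpole's actual implementation, only a simplified illustration under my own assumptions: forward strand only, no error handling, and the hypothetical helper names `count_kmers` and `extend_right`. The sequence is extended one base at a time as long as exactly one next k-mer is supported by the reads, so it stops at a dead end or at a branch, which may be one reason an extension ends after only a few bases.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seq, kmer_counts, k, max_extend, min_count=1):
    """Greedily extend seq to the right, one base at a time.

    Extension stops when no next k-mer is supported (dead end), when more
    than one is supported (branch), or after max_extend added bases.
    Assumes k >= 2 and len(seq) >= k - 1.
    """
    added = 0
    while added < max_extend:
        suffix = seq[-(k - 1):]
        candidates = [
            base for base in "ACGT"
            if kmer_counts.get(suffix + base, 0) >= min_count
        ]
        if len(candidates) != 1:
            break  # dead end (0 candidates) or branch (2 or more)
        seq += candidates[0]
        added += 1
    return seq


if __name__ == "__main__":
    # Made-up toy reads; the second one introduces a branch after "TAC".
    unambiguous = count_kmers(["CGTACCTG"], k=4)
    print(extend_right("ACGTA", unambiguous, k=4, max_extend=100))

    branching = count_kmers(["CGTACCTG", "GTACGG"], k=4)
    print(extend_right("ACGTA", branching, k=4, max_extend=100))
```

In this toy model a single conflicting read is enough to stop the extension, which is why the `min_count` threshold is there as a parameter.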
Then I tried the same command but using the whole genome dataset. Unfortunately, it ran out of memory, even when using the -Xmx20g or -Xmx200m options.
It should be said that in the latter case I used reads that were already error-corrected and trimmed by Canu.
I also wanted to normalize the data with BBNorm to decrease coverage, until I read that it is only suited to short reads. However, I found no alternative for that purpose.
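The only fallback I can think of is plain random subsampling, which is not normalization (it lowers coverage uniformly rather than flattening it). A minimal sketch in Python, assuming a FASTA input and a rough coverage estimate; the function names and the coverage values in the usage comment are hypothetical placeholders, not measured figures:

```python
import random


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def subsample(records, fraction, seed=0):
    """Stream records, keeping each one with probability `fraction`.

    To go from an estimated coverage to a target coverage, use
    fraction = target_coverage / estimated_coverage.
    """
    rng = random.Random(seed)
    for name, seq in records:
        if rng.random() < fraction:
            yield name, seq


def write_fasta(records, handle):
    """Write (name, sequence) pairs as single-line FASTA records."""
    for name, seq in records:
        handle.write(f">{name}\n{seq}\n")


# Usage sketch (file names and coverage values are placeholders):
# with open("reads.fasta") as fin, open("reads.sub.fasta", "w") as fout:
#     write_fasta(subsample(read_fasta(fin), fraction=20 / 80), fout)
```

Because the reads are streamed one record at a time, memory use stays low regardless of the size of the dataset.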
Now here are my questions:
Is there a way to work on the whole-genome dataset? And, provided that it is possible, is there a trick to get longer extensions?
Thanks in advance!