I have paired-end Illumina reads from several BACs of an unsequenced plant species. After assembling them with SOAPdenovo, I realized that the total assembled size for each BAC ranged from roughly 1 Mb to 2 Mb, which is far too large. So I BLASTed some scaffolds and concluded that the whole BAC clones were sequenced, host included (some scaffolds have 100% identity with E. coli).
I can think of two ways to handle this:
- 1) Align the reads against the E. coli genome (and perhaps also against BAC sequences downloaded from NCBI?), then keep only the unaligned reads for de novo assembly.
- 2) Assemble all reads de novo into contigs and scaffolds. Then I'd put the scaffolds of all BACs together into a single FASTA file and remove redundancy with some tool (e.g. CAP3; do you know of a better one?) to reduce the time needed for BLAST. Scaffolds with hits would be discarded.
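To make the last step of option 2 concrete, here is a minimal Python sketch of how scaffolds with hits could be dropped. It assumes the scaffolds were BLASTed against E. coli with tabular output (`-outfmt 6`), where column 1 is the query ID, column 3 the percent identity and column 4 the alignment length. The function names and the identity/length thresholds are hypothetical and only for illustration; they would need tuning on real data.

```python
from typing import Dict, Iterable, Set


def contaminated_ids(blast_lines: Iterable[str],
                     min_identity: float = 95.0,   # assumed threshold
                     min_length: int = 200) -> Set[str]:  # assumed threshold
    """Return query IDs with at least one hit passing both thresholds.

    Expects BLAST tabular lines (-outfmt 6): query ID in column 1,
    percent identity in column 3, alignment length in column 4.
    """
    flagged = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a valid tabular hit line
        query = fields[0]
        identity = float(fields[2])
        length = int(fields[3])
        if identity >= min_identity and length >= min_length:
            flagged.add(query)
    return flagged


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {ID: sequence}; the ID is the first word of the header."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def drop_contaminated(records: Dict[str, str], flagged: Set[str]) -> Dict[str, str]:
    """Keep only the scaffolds whose ID was not flagged as contaminated."""
    return {name: seq for name, seq in records.items() if name not in flagged}
```

Note that this discards a flagged scaffold entirely, so a chimeric scaffold that joins E. coli sequence to insert sequence would be lost along with the contamination.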
In your experience, what would you do?