Question: Extending ends of sequences with the help of reads?
5.3 years ago by
manekineko130 wrote:


I have several hundred sequences and a library of reads. How can I extend these sequences at both ends using the reads? I do not want the software to merge the sequences, since they are separate things; I only want each one extended as far as the reads allow.

assembly • 3.9k views
ADD COMMENTlink modified 14 months ago by Biostar ♦♦ 20 • written 5.3 years ago by manekineko130

What sort of format would you like the output to be? I assume you're using SAM or BAM as input.

ADD REPLYlink written 5.3 years ago by Devon Ryan97k

The input sequences I want to extend are in FASTA, and the reads are in FASTA too.

(I also have a GFF for the sequences and the reads mapped in BAM, if needed.)

I would like the extended sequences as FASTA and/or GFF.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by manekineko130

Ah, now I understand what you're trying to do. Stated somewhat more clearly, you have incomplete contigs (presumably from another attempt at assembly) and want to extend them given an alignment to them. Perhaps you also want to do some scaffolding but that's unclear. If that's the case, you might mention that and one of the more assembly-experienced folks here can chime in.

ADD REPLYlink written 5.3 years ago by Devon Ryan97k

It is a bit different: I am trying to assemble mRNA transcripts. I already have part of each one (somewhere in the middle of it) and have a lot of reads.

ADD REPLYlink written 5.3 years ago by manekineko130
5.3 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

I recently developed an assembler, Tadpole, for this problem. It's part of the BBMap package. Usage: tadpole.sh in=sequences.fa extra=reads.fq out=extended.fa extendleft=200 extendright=200 ibb=f mode=extend

It will extend the sequences using kmer counts.  You can set the length to extend as long as you want, but that's just an upper limit; it will stop earlier if it hits a branch.
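To make the "extend until a branch" idea concrete, here is a toy sketch in Python. It is not Tadpole's code: the function names, the forward-strand-only k-mer counting, and the `min_count` threshold are all simplifications assumed for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer in the reads (forward strand only, for simplicity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def extend_right(seq, counts, k, max_extend, min_count=1):
    """Append bases while exactly one next k-mer is supported by the reads.

    Assumes len(seq) >= k - 1. max_extend is only an upper limit.
    """
    added = 0
    while added < max_extend:
        suffix = seq[-(k - 1):]
        candidates = [b for b in "ACGT" if counts[suffix + b] >= min_count]
        if len(candidates) != 1:  # dead end (0) or branch (>1): stop here
            break
        seq += candidates[0]
        added += 1
    return seq

# Hypothetical toy data: the last two reads disagree after "AGGT".
reads = ["GATTACAGG", "ACAGGTC", "ACAGGTA"]
counts = count_kmers(reads, k=4)
print(extend_right("GATTAC", counts, k=4, max_extend=100))  # GATTACAGGT
print(extend_right("GATTAC", counts, k=4, max_extend=2))    # GATTACAG
```

With a large limit the extension stops at the branch (GGTC versus GGTA); with a small limit it stops at the limit. Extending to the left would be the same procedure on the reverse complement.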

(edit: modified command line to include "mode=extend" which I had forgotten)

(edit2: modified for updated syntax in v37.33+)

ADD COMMENTlink modified 3.3 years ago • written 5.3 years ago by Brian Bushnell17k

Hi, this seems to be what I need. Will it work if my reads are FASTA rather than FASTQ? And since each sequence may be extended by a different amount depending on the reads, how do I choose the extendleft and extendright values?

ADD REPLYlink written 5.3 years ago by manekineko130

It accepts fasta and fastq.  And you can just set the extendleft and extendright numbers to something high like 2000 if you want.  They will only be extended until a branch is encountered in the de Bruijn graph, which depends on the organism.  So, if you set them to 10, almost all the sequences will be extended by 10 bp.  If you set them to 1 million, none of them will be extended by one million - they'll all stop somewhere before that, since the extension only continues as long as there is a single unambiguous best path (according to the thresholds you set).  Therefore, just set them to a number X such that you don't want anything to extend by more than X.
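Since each sequence ends up extended by a different amount, one way to see what happened is to compare the input and output FASTA files afterwards. The Python sketch below assumes that the output keeps the same sequence names as the input and that the original sequence appears unchanged inside the extended one; both are assumptions to check on your own data, and the helper names are made up for this example.

```python
def read_fasta(path):
    """Parse a FASTA file into {name: sequence}; name = first word of the header."""
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

def extension_lengths(original, extended):
    """Return {name: (left, right)} bases gained, or None if no match is found."""
    result = {}
    for name, seq in original.items():
        ext = extended.get(name)
        pos = ext.upper().find(seq.upper()) if ext is not None else -1
        result[name] = None if pos < 0 else (pos, len(ext) - pos - len(seq))
    return result

# Hypothetical in-memory example; with files, use
# extension_lengths(read_fasta("sequences.fa"), read_fasta("extended.fa")).
original = {"t1": "ACGT", "t2": "GGCC"}
extended = {"t1": "TTACGTA", "t2": "GGCC"}
print(extension_lengths(original, extended))  # {'t1': (2, 1), 't2': (0, 0)}
```

A result of (0, 0) means the sequence hit a branch or dead end immediately at both ends.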

ADD REPLYlink written 5.3 years ago by Brian Bushnell17k

Is it possible to use long reads in Tadpole?

ADD REPLYlink written 2.2 years ago by BioGeek150

Hi all, I need to clarify this. The syntax has changed a little for extending existing contigs using new reads... I'll update it later today.

ADD REPLYlink written 3.4 years ago by Brian Bushnell17k

Hi! I have two sets of reads, mit_1.fastq and mit_2.fastq, and I want to use them to extend my contigs. I want to use tadpole, but the parameter extra=reads.fq confuses me. How should I pass my two sets of reads (mit_1.fastq and mit_2.fastq)? Perhaps: extra=mit_1.fastq,mit_2.fastq ?

Thank you in advance for your help :)

ADD REPLYlink written 15 months ago by macielrodriguez230

I am using this tool this week. May I ask whether there is a way to make it dump the unused reads in mode=extend? I saw that there is an outd option for "Write discarded reads, if using junk-removal flags", but I'm not sure that's what I am actually looking for. I'd like to retrieve a FASTA file with the reads that have not been used to extend the template.

ADD REPLYlink written 2.7 years ago by Macspider3.2k
5.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Interesting challenge. I just wrote a tool that takes as input

  • an indexed fasta REFERENCE sequence
  • some indexed BAM


The (clipped) overlapping reads are used to extend the REF sequence at the 5' and 3' ends and in the contigs containing 'N'.

$  java   -jar dist/extendrefwithreads.jar \
     -R human_g1k_v37.fasta -f 0.3 \
     f1.bam f2.bam f3.bam 2> /dev/null |\
  cat -n | grep -E '(>|[atgc])' 

     1  >1
   168  cctaaccctcnccctntnccnncnncccnncttcttccgaTAACCCTAACCCTAACCCTA
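The general idea of extending a reference from clipped reads can be sketched in a few lines of Python. This is not the tool's implementation: the function name, the `min_fraction` threshold, and the input (soft-clipped prefixes of reads whose alignment starts at the first reference base) are assumptions made for illustration. As in the output above, added bases are lowercase and uncertain positions become 'n'.

```python
from collections import Counter

def extend_five_prime(reference, clipped_prefixes, min_fraction=0.5):
    """Prepend a lowercase consensus of soft-clipped read prefixes to reference."""
    longest = max((len(p) for p in clipped_prefixes), default=0)
    consensus = []
    for offset in range(1, longest + 1):  # offset 1 = base just left of the reference
        column = Counter(p[-offset].upper() for p in clipped_prefixes if len(p) >= offset)
        base, count = column.most_common(1)[0]
        total = sum(column.values())
        # Keep the most frequent base only if enough reads agree on it.
        consensus.append(base.lower() if count / total >= min_fraction else "n")
    return "".join(reversed(consensus)) + reference

# Hypothetical clipped prefixes, right-aligned against the reference start.
prefixes = ["CCGA", "GA", "TCGA", "CGT"]
print(extend_five_prime("TAACCC", prefixes, min_fraction=0.6))  # ncgaTAACCC
```

The 3' end would be handled the same way with clipped suffixes, reading the columns left to right.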





ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Pierre Lindenbaum131k
5.3 years ago by
h.mon31k wrote:

You may also try Mapsembler2.

ADD COMMENTlink modified 2.3 years ago • written 5.3 years ago by h.mon31k