Question: assembly on a single DNA sequence
gravatar for gbl1
7 months ago by
gbl160 wrote:


I wish to assemble sequences around a given one.

To explain my goal, I have an about complete genome with frangments 300-500 bp, I have sequences of what I am actually working on 500-1000 bp. I am looking to expand my sequences, by adding what I miss on each side, as far as possible. so when it is not safe anymore (repetitive sequence for exemple) it just stops… and actually, if I obtain something 5 000-10 000 pb, it is more than enough for my purpose.

Any clue of the best thing to do?

assembly • 368 views
ADD COMMENTlink modified 7 months ago • written 7 months ago by gbl160

So the easiest:

Usage: extendrefwithreads [options] Files

So, after instalation, I ran "extendrefwithreads my file" "commande not found", then a "locate extendrefwithreads" gives nothing back…

Then… Mapsembler2, can't tell from home what was the command I put in, for I was university…

I started the command at 10 in the morning… it finished at 16… and just added in the best case 10 pb, with a lot of alternative ends…

Sorry if I'm on the nerves… I came to this forum reading a lot from it after many search that ended here, with contents that was related to my work… but none actually answered! The bioinformaticist of the team I am currently in was supposed to give me primers from the data, but he said "I'won't search on a local scale for I am planing to make a whole assembly of the genome" … he was never fine with an assembly… but did not wanted to "waste time" in just looking for what I needed to… id est only 1000 FBP around a dozen of sequences! He left for holidays, I just have 6 weeks left… and I'm wasting my time in order to obtain results that he could have computed in one night over! Sorry if I may look angry now a day, for I'm a biologist asked to make programming a computer guy hasn't succeed in… Sorry for being sensitive when I ask to someone of my faculty that is paid for computer issues I get as an answer "I don't know, but you should ask to the guy in the 3rd floor that is used to help other PhD student with bioinformatic issues, he may find a solution" … when the 3rd floor guy is actually myself, unpaid PhD student, suffering from a "STD" (Syndrome of Thesis Decay). Sorry… I don't even ask for a GUI… I just ask for a "MagicCommand -input MyData -data FuckingHugeDataBase" that answers me "SuitableResults.fasta"

When I had class of computer science in bachelor degree, I was told "If you feel like doing something pointless, repetitive and boring, someone had already program something to do it automatically before" If you keep doing it, you suffer from the "Woody woodpicker syndrom"! I look for something actually basic… why is it hidden? Why is a function that anyone may need at some point, about impossible to obtain? I know NCBI blast since my first year master degree in 2005, I was advised to use the NCBI workbench in this forum… Why this FS does not have a huge "Blast" bouton? Why when I google Gblast and blast, I just get the description, tellin' me the Gblast do do blast, and nothing about how to get it? Why when I asked on researchgate "what software I could use in order to color part of a 3D structure of a protein", I was told any software, with a list of program I tried in vain, that, indeed, had the function… really hidden somewhere, accecible with a shortcut you had to know first?

When I, biologist, make a OS code, I put a clear description, proof here: … I expect you, bioinformaticist, to do the same and provide clear instructions, and minimal working exemple on your work… else, no one will ever use your work, but, that's not of my buisiness!

ADD REPLYlink written 7 months ago by gbl160

I understand you are stressed, and I sympathize with you for your problems, so I will start with a complete walk-through with tadpole, from downloading to running it:

cd /path/to/sequences/
tar -xzf BBMap_38.11.tar.gz
export PATH=$PWD/bbmap:$PATH in=sequences.fa extra=reads.fq.gz out=extended.fa extendleft=500 extendright=500 ibb=f mode=extend

However, I have to state a couple of points:

1) you are assuming the local bioinformatician is not helping out of spite or laziness, but he may be over-worked, or with some late work his boss is demanding results, and he is stressed like you. You don’t know how many people reach out to him with all kinds of new problems, everybody expecting him to solve quickly because “it is easy” for him.

2) you still didn’t tell exactly the commands you tried.

3) from your other thread ( How to start with blast and a local database ), it seems you have limited computing resources. How much memory, CPUs and disk you have available?

4) also from the other thread, I am not even certain you have an assembled genome, instead I think you have raw sequencing reads. Could you clarify?

ADD REPLYlink written 7 months ago by h.mon23k

Hi, thanks for your answer, on this wonderful sunny saturday, from my lab, here is the command:

1) I don't say he is lazy, I know he is hard working, and competant. I just say he, like many informaticists, want to do things in the most elegant way, that is time and resource costly, and sometime error proof, whereas just making things the crapy way was actually way enough.

295 ./ -s ~/Desktop/Leymus-genome-test/Leymus-seq-clean.fasta -r ~/Desktop/part-of-sequences/Leymus-all.fasta -t 1

3) memory: 15 GiB, 4 processors 3,20GHz 75 GiB availlable space

4) I never said I had an " assembled genome" I have a "complete genome with frangments 300-500 bp" then, it is indeed a lot of reads, enough to cover everything at least twice and that my bioinformatician tried, in vain to assemble, whereas... I only needed a simple thing...

ADD REPLYlink written 7 months ago by gbl160

I never said I had an " assembled genome" I have a "complete genome with frangments 300-500 bp"

This is a terminology misunderstanding then, I don't know what you mean by "complete genome with frangments 300-500 bp". What you mean is you have sequencing reads? Which technology, seems like Illumina? Strictly speaking, there are no "complete genomes", either when we refer to assembled genomes (always incomplete, even when of very high quality), or sequencing reads.

I woyld try to assemble with MEGAHIT, create a blast db with the "final.contigs.fa" file, and then blast the sequences of interest against this db. MEGAHIT is fast and may be able to assemble a draft genome of sufficient quality in one or two days - but 75Gib of disk space may be too few.

ADD REPLYlink written 7 months ago by h.mon23k

yes, there is 16 GO of reads, that covers normally the whole genome, I think it is illumina, but not sure...

Assembly is an issue in itself, I tried with MECAT, and it said for such an assembly, it would need one TB of RAM to proceed... basically, I want:

let's say my sequences I want to extend are: MFS.fa and the 16GO of reads FHDB.fa

0) It takes the bases of a sequence of MFS.fa =seqE=seqC

1) secE between 10-110 position, blast it in FHDB.fa

if no answer then return secE and secC && STOP

if single answer then Merge answer with secE and secC then GO TO 3

If number of answer >1 then GO TO 2

2) compare answer 1 and 2

If match merge answer1 with secE and answer2 with secC

if mismatch then return secE and secC && STOP

3) compare secE and secC

if mismatch then return secE and secC && STOP

If match GOTO1

I'm not a good coder, I don't know how to program it, and of course it should do it in both direction

ADD REPLYlink modified 7 months ago • written 7 months ago by gbl160
gravatar for h.mon
7 months ago by
h.mon23k wrote:

There are three suggestions on Extending ends of sequences with the help of reads?

  • Tadpole
  • jvarkit's ExtendReferenceWithReads
  • Mapsembler2
ADD COMMENTlink written 7 months ago by h.mon23k


Tadpole -> impossible to find the software
jvarkit's ExtendReferenceWithReads -> looks great, but undocumented, therefore impossible to use... after installation, I even could not  find the executable files! 
Mapsembler2 -> has run the whole day... has not  Extend anything! final file was crapy
ADD REPLYlink written 7 months ago by gbl160

Tadpole is part of the BBMap package, so when you install BBMap, you install Tadpole. The link to the thread I provided above has:

1) a link to BBMap

2) an example on how to use jvarkit.

has run the whole day... has not Extend anything! final file was crapy

This is not feedback, it is just a rant. If you want to provide feedback and / or get help, describe how you run the software. Mapsempler2 and jvarkit authors post here, and if you provide a detailed description of what you did, they can help you overcome the problems you may have along the way.

ADD REPLYlink written 7 months ago by h.mon23k

For the use of jvarkit, Here are my commands:

306 git clone "" 307 cd jvarkit 308 make extendrefwithreads 309 ls 310 cat 311 ls 312 extendrefwithreads 313 ./extendrefwithreads 314 locate extendrefwithreads

I think it is funny...

ADD REPLYlink written 7 months ago by gbl160
gravatar for gbl1
7 months ago by
gbl160 wrote:


So, I managed to extend my elements by a long manual method (Blast and assembly), I do not have to rush anymore, but I still would like to ba able to get an assembly on a sequence from reads. So, I give the results I obtain:

Initial sequence:

>2-1-u Leymus arenarius predicted callose synthase 

result of: -s input.fa -r ~/Desktop/output.fasta -t 2


and results of in=input.fra extra=~/Desktop/part-of-sequences/Leymus-all.fasta out=extended-seq.fa extendleft=500 extendright=500 ibb=f mode=extend

>2-1-u Leymus arenarius predicted callose synthase 

I would like to not that mapsembler, instead of proposing additional ends, proposed alternative ends for the provided sequences..

ADD COMMENTlink written 7 months ago by gbl160
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1242 users visited in the last hour