Question

Assembly of two ends of the sequence

0

Entering edit mode

7.0 years ago

332843636 • 0

Hi,

Does any one know whether a assembler can get the accurate assembly of the two ends of the sequence?

Thanks!

Sorry for not making myself clear. Actually, I want to know how the assemblers deal with the end of sequence. How do they know they have reached the end of the sequence while doing assembly? Can they give out accurate end of the sequence?

sequencing • 1.3k views

ADD COMMENT • link 7.0 years ago by 332843636 • 0

1

Entering edit mode

Why would you need an assembler if you are referring to stitching together R1/R2 reads from a fragment, if they overlap in the middle. bbmerge.sh from BBMap suite or FLASH are appropriate tools to do that read merge.

If you are referring to assembling two ends of a contig to get a closed circle then that is a different question. Please add some additional explanation by editing this post.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

I edited this post. Actually, I sequenced the whole genome of a virus. I want to know whether de novo assembly can give out accurate sequence at the end of the genome.

ADD REPLY • link 7.0 years ago by 332843636 • 0

0

Entering edit mode

Besides, the two ends of the genome have relative low coverage than the middle part.

ADD REPLY • link 7.0 years ago by 332843636 • 0

1

Entering edit mode

You may want to take a look at tadpole.sh from BBMap suite for doing the assembly (denovo-assembly with novel genes inserted ).

Tagging: Brian Bushnell

Brian can provide additional insight.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

Tadpole seems to do a very good job of viral assembly, so I recommend it for that purpose. Your post's title is misleading, though:

"Assembly of two ends of the sequence"

As far as I can tell, this has nothing to do with your goal, so I suggest you change it. Tadpole can typically generate a good viral assembly from shotgun reads, but your title is completely unrelated to this task.

Edit - I misinterpreted your question. It appears that you want to assemble the outermost ends of your genome... I interpreted as being related to read pairs. Anyway, Tadpole has no problem assembling the ends of viral contigs, as long as they are concordant.

ADD REPLY • link 7.0 years ago by Brian Bushnell 20k

0

Entering edit mode

While the ends part is true if OP is willing to try tadpole then it would not hurt to do a second assembly with BBtools. Who knows OP may get a better assembly.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Entering edit mode

please provide more details regarding your data.

ADD REPLY • link 7.0 years ago by st.ph.n ★ 2.7k

score 0 · Answer 1 · 2017-04-28

You may use the code in my project (https://github.com/OpenGene/MutScan, Detect and visualize target mutations by scanning FastQ files directly).

It is based on minimum hamming/edit distance searching.

https://github.com/OpenGene/MutScan/blob/master/src/read.cpp

Read* ReadPair::fastMerge(){
    Read* rcRight = mRight->reverseComplement();
    int len1 = mLeft->length();
    int len2 = rcRight->length();
    // use the pointer directly for speed
    const char* str1 = mLeft->mSeq.mStr.c_str();
    const char* str2 = rcRight->mSeq.mStr.c_str();
    const char* qual1 = mLeft->mQuality.c_str();
    const char* qual2 = rcRight->mQuality.c_str();

    // we require at least 30 bp overlapping to merge a pair
    const int MIN_OVERLAP = 30;
    bool overlapped = false;
    int olen = MIN_OVERLAP;
    int diff = 0;
    // the diff count for 1 high qual + 1 low qual
    int lowQualDiff = 0;

    while(olen <= min(len1, len2)){
        diff = 0;
        lowQualDiff = 0;
        bool ok = true;
        int offset = len1 - olen;
        for(int i=0;i<olen;i++){
            if(str1[offset+i] != str2[i]){
                diff++;
                // one is >= Q30 and the other is <= Q15
                if((qual1[offset+i]>='?' && qual2[i]<='0') || (qual1[offset+i]<='0' && qual2[i]>='?')){
                    lowQualDiff++;
                }
                // we disallow high quality diff, and only allow up to 3 low qual diff
                if(diff>lowQualDiff || lowQualDiff>=3){
                    ok = false;
                    break;
                }
            }
        }
        if(ok){
            overlapped = true;
            break;
        }
        olen++;
    }

    if(overlapped){
        int offset = len1 - olen;
        int mergedLen = offset + len2;
        stringstream ss;
        ss << mLeft->mName << " merged offset:" << offset << " overlap:" << olen << " diff:" << diff;
        string mergedName = ss.str();
        string mergedSeq = mLeft->mSeq.mStr.substr(0, offset) + rcRight->mSeq.mStr;
        string mergedQual = mLeft->mQuality.substr(0, offset) + rcRight->mQuality;
        // quality adjuction and correction for low qual diff
        for(int i=0;i<olen;i++){
            if(str1[offset+i] != str2[i]){
                if(qual1[offset+i]>='?' && qual2[i]<='0'){
                    mergedSeq[offset+i] = str1[offset+i];
                    mergedQual[offset+i] = qual1[offset+i];
                } else {
                    mergedSeq[offset+i] = str2[i];
                    mergedQual[offset+i] = qual2[i];
                }
            } else {
                // add the quality of the pair to make a high qual
                mergedQual[offset+i] =  qual1[offset+i] + qual2[i] - 33;
            }
        }
        delete rcRight;
        return new Read(mergedName, mergedSeq, "+", mergedQual);
    }

    delete rcRight;
    return NULL;
}