Assembly of two ends of the sequence
1
0
Entering edit mode
7.0 years ago
332843636 • 0

Hi,

Does any one know whether a assembler can get the accurate assembly of the two ends of the sequence?

Thanks!

Sorry for not making myself clear. Actually, I want to know how the assemblers deal with the end of sequence. How do they know they have reached the end of the sequence while doing assembly? Can they give out accurate end of the sequence?

sequencing • 1.3k views
ADD COMMENT
1
Entering edit mode

Why would you need an assembler if you are referring to stitching together R1/R2 reads from a fragment, if they overlap in the middle. bbmerge.sh from BBMap suite or FLASH are appropriate tools to do that read merge.

If you are referring to assembling two ends of a contig to get a closed circle then that is a different question. Please add some additional explanation by editing this post.

ADD REPLY
0
Entering edit mode

I edited this post. Actually, I sequenced the whole genome of a virus. I want to know whether de novo assembly can give out accurate sequence at the end of the genome.

ADD REPLY
0
Entering edit mode

Besides, the two ends of the genome have relative low coverage than the middle part.

ADD REPLY
1
Entering edit mode

You may want to take a look at tadpole.sh from BBMap suite for doing the assembly (denovo-assembly with novel genes inserted ).

Tagging: Brian Bushnell

Brian can provide additional insight.

ADD REPLY
0
Entering edit mode

Tadpole seems to do a very good job of viral assembly, so I recommend it for that purpose. Your post's title is misleading, though:

"Assembly of two ends of the sequence"

As far as I can tell, this has nothing to do with your goal, so I suggest you change it. Tadpole can typically generate a good viral assembly from shotgun reads, but your title is completely unrelated to this task.

Edit - I misinterpreted your question. It appears that you want to assemble the outermost ends of your genome... I interpreted as being related to read pairs. Anyway, Tadpole has no problem assembling the ends of viral contigs, as long as they are concordant.

ADD REPLY
0
Entering edit mode

While the ends part is true if OP is willing to try tadpole then it would not hurt to do a second assembly with BBtools. Who knows OP may get a better assembly.

ADD REPLY
0
Entering edit mode

please provide more details regarding your data.

ADD REPLY
0
Entering edit mode
7.0 years ago
chen ★ 2.5k

You may use the code in my project (https://github.com/OpenGene/MutScan, Detect and visualize target mutations by scanning FastQ files directly).

It is based on minimum hamming/edit distance searching.

https://github.com/OpenGene/MutScan/blob/master/src/read.cpp

Read* ReadPair::fastMerge(){
    Read* rcRight = mRight->reverseComplement();
    int len1 = mLeft->length();
    int len2 = rcRight->length();
    // use the pointer directly for speed
    const char* str1 = mLeft->mSeq.mStr.c_str();
    const char* str2 = rcRight->mSeq.mStr.c_str();
    const char* qual1 = mLeft->mQuality.c_str();
    const char* qual2 = rcRight->mQuality.c_str();

    // we require at least 30 bp overlapping to merge a pair
    const int MIN_OVERLAP = 30;
    bool overlapped = false;
    int olen = MIN_OVERLAP;
    int diff = 0;
    // the diff count for 1 high qual + 1 low qual
    int lowQualDiff = 0;

    while(olen <= min(len1, len2)){
        diff = 0;
        lowQualDiff = 0;
        bool ok = true;
        int offset = len1 - olen;
        for(int i=0;i<olen;i++){
            if(str1[offset+i] != str2[i]){
                diff++;
                // one is >= Q30 and the other is <= Q15
                if((qual1[offset+i]>='?' && qual2[i]<='0') || (qual1[offset+i]<='0' && qual2[i]>='?')){
                    lowQualDiff++;
                }
                // we disallow high quality diff, and only allow up to 3 low qual diff
                if(diff>lowQualDiff || lowQualDiff>=3){
                    ok = false;
                    break;
                }
            }
        }
        if(ok){
            overlapped = true;
            break;
        }
        olen++;
    }

    if(overlapped){
        int offset = len1 - olen;
        int mergedLen = offset + len2;
        stringstream ss;
        ss << mLeft->mName << " merged offset:" << offset << " overlap:" << olen << " diff:" << diff;
        string mergedName = ss.str();
        string mergedSeq = mLeft->mSeq.mStr.substr(0, offset) + rcRight->mSeq.mStr;
        string mergedQual = mLeft->mQuality.substr(0, offset) + rcRight->mQuality;
        // quality adjuction and correction for low qual diff
        for(int i=0;i<olen;i++){
            if(str1[offset+i] != str2[i]){
                if(qual1[offset+i]>='?' && qual2[i]<='0'){
                    mergedSeq[offset+i] = str1[offset+i];
                    mergedQual[offset+i] = qual1[offset+i];
                } else {
                    mergedSeq[offset+i] = str2[i];
                    mergedQual[offset+i] = qual2[i];
                }
            } else {
                // add the quality of the pair to make a high qual
                mergedQual[offset+i] =  qual1[offset+i] + qual2[i] - 33;
            }
        }
        delete rcRight;
        return new Read(mergedName, mergedSeq, "+", mergedQual);
    }

    delete rcRight;
    return NULL;
}
ADD COMMENT
0
Entering edit mode

Can you explain a bit more?

BTW, I was wondering whether anyone can tell me how the assemblers deal with the end of sequence. How do they know they have reached the end of the sequence and can they give out accurate end sequence?

ADD REPLY
0
Entering edit mode

Have you benchmarked this against other read-merging programs, to see how well it performs, in terms of true-positive and false-positive merges? Or, in terms of speed? From reading the code, it does not appear that it would perform very well by either metric...

ADD REPLY
0
Entering edit mode

It's not a critical part of my project, so I didn't benchmark it. But it is usually as fast as I/O speed.

ADD REPLY

Login before adding your answer.

Traffic: 1980 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6