Question: Mismatches When Merging Paired-End Reads with the FLASH Algorithm
4.9 years ago, clear.choi (United States) wrote:

I am planning to merge two FASTQ files from an NGS run, and I have a question about the FLASH algorithm.

I have the forward and reverse reads below.

1. Forward

@CP000143_994500_994663_0:0:0_0:0:0_0/1
GCTTCTGCGACCGCGCCCTCGTCGTCTACCGCGGCACGCTGAACGGCGAGTTCGCGGGCGAGACGCTCGACAGCGACCTGCTCCTGGCCGCCGCCTCGGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

 

2. Reverse

@CP000143_994500_994663_0:0:0_0:0:0_0/2
GCTCTCGCTGGAGGACGGGGACAGGGCCATGGTCATTCAGGCCTCCTTTCGTTGGGCCCGCGCGCCCGAGGCGGCGGCTAGGAGCAGGTCGCTGTCGCGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

 

If I reverse-complement the reverse read and merge the two reads, the result is:

GCTTCTGCGACCGCGCCCTCGTCGTCTACCGCGGCACGCTGAACGGCGAGTTCGCGGGCGAGACTCGCGACAGCGACCTGCTCCTAGCCGCCGCCTCGGGCGCGCGGGCCCAACGAAAGGAGGCCTGAATGACCATGGCCCTGTCCCCGTCCTCCAGCGAGAGC

Comparing the forward read with the reverse complement of the reverse read:

GCTTCTGCGACCGCGCCCTCGTCGTCTACCGCGGCACGCTGAACGGCGAGTTCGCGGGCGAGACGCTCGACAGCGACCTGCTCCTGGCCGCCGCCTCGGG 

TCGCGACAGCGACCTGCTCCTAGCCGCCGCCTCGGGCGCGCGGGCCCAACGAAAGGAGGCCTGAATGACCATGGCCCTGTCCCCGTCCTCCAGCGAGAGC 
 

 

There are 3 mismatches in the overlap between the two sequences.

In the merged result, the mismatched positions follow the reverse-complemented read. Does anyone have an idea why this happens?
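
A minimal C sketch of the comparison above, for anyone who wants to reproduce it. It is not taken from FLASH, and the function names are made up for illustration. It reverse-complements a read and counts mismatches over an overlap whose length is assumed to be known already, whereas FLASH itself searches for the overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Complement of one base; anything other than A, C, G, T becomes 'N'. */
static char complement(char base)
{
    switch (base) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N';
    }
}

/* Write the reverse complement of seq (length len) into out.
 * out must have room for len + 1 bytes. */
static void reverse_complement(const char *seq, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = complement(seq[len - 1 - i]);
    out[len] = '\0';
}

/* Count positions where the last `overlap` bases of read_1 differ from
 * the first `overlap` bases of read_2_rc (read 2, already
 * reverse-complemented). */
static size_t count_overlap_mismatches(const char *read_1,
                                       const char *read_2_rc,
                                       size_t overlap)
{
    size_t len_1 = strlen(read_1);
    size_t mismatches = 0;

    assert(overlap <= len_1 && overlap <= strlen(read_2_rc));

    for (size_t i = 0; i < overlap; i++) {
        if (read_1[len_1 - overlap + i] != read_2_rc[i])
            mismatches++;
    }
    return mismatches;
}
```

For the two 100-base reads above, which overlap by 36 bases, the count comes out as 3 when checked by hand, which matches the three mismatches described here.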

Thank you!

paired fastq merge flash rna • 2.3k views
modified 4.9 years ago by Devon Ryan • written 4.9 years ago by clear.choi

It's a little difficult to tell exactly what your question is. Are you asking why the reads disagree in the overlapped region (this does happen on occasion)? Are you asking why FLASH merges them the way it does, in which case is the 3rd sequence the output from FLASH or something else?

written 4.9 years ago by Devon Ryan

Sorry for the confusion. I understand that the two sequences can be combined based on the mismatch ratio. However, the results show that at positions where the reads differ, the base from the reverse-complemented read was used in the combined sequence. I want to know why the reverse-complemented read is used there.

written 4.9 years ago by clear.choi

BTW, it looks like the quality scores are fake. That's generally a bad idea, since having actual base quality scores can help resolve ambiguous base calls like the ones you're running into here.

written 4.9 years ago by Devon Ryan

I see. Actually, that was just sample data. Do you know what algorithm is used to check quality, and how I can resolve an ambiguous base using the quality scores?

written 4.9 years ago by clear.choi
4.9 years ago, Devon Ryan (Freiburg, Germany) wrote:

In the case of mismatches between the two sequences, FLASH will use the nucleotide with the higher base quality. Of course, with fake qualities you're going to get weird results on occasion. I should note that mismatches like this only happen in something like 0.01% of alignments in bisulfite sequencing datasets, so in RNAseq they should be even less common (i.e., not common enough to care about).
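
To make "base quality" concrete: each quality character in a FASTQ record encodes a Phred score Q, where Q = -10 * log10(p) and p is the estimated probability that the base call is wrong. The sketch below is only an illustration, not FLASH code. It assumes the common Phred+33 encoding, and the function names are hypothetical. It decodes one quality character and checks whether a quality string is a single repeated character, which is a hint that the scores are placeholders.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one FASTQ quality character, assuming Phred+33 encoding
 * (an assumption; some older Illumina files used Phred+64). */
static int phred_from_ascii(char q)
{
    assert(q >= 33 && q <= 126);   /* printable range used by Phred+33 */
    return q - 33;
}

/* Return 1 if every character in qual is identical (e.g. all 'I'),
 * which suggests simulated or placeholder qualities; 0 otherwise.
 * An empty string counts as uniform. */
static int is_uniform_quality(const char *qual)
{
    for (size_t i = 1; qual[0] != '\0' && qual[i] != '\0'; i++) {
        if (qual[i] != qual[0])
            return 0;
    }
    return 1;
}
```

Under that encoding, 'I' decodes to Q40, an estimated error rate of 1 in 10,000. When every base in both reads carries the same score, every mismatch in the overlap is a tie on quality.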

BTW, in the source code for FLASH (combine_reads.c, around line 400), this is how mismatches, including those with equal Phred scores, are handled:

if (*qual_1 > *qual_2) {
    *combined_seq = *seq_1;
} else if (*qual_1 < *qual_2) {
    *combined_seq = *seq_2;
} else {
    /* Same quality value; take the base from the
     * first read if the base from the second read
     * is an 'N'; otherwise take the base from the
     * second read. */
    if (*seq_2 == 'N')
        *combined_seq = *seq_1;
    else
        *combined_seq = *seq_2;
}

One could argue about their choice there at the end, but that's why you get the results that you do.
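
The rule in the excerpt above can be tried out in isolation with a small self-contained sketch. This is a re-statement for illustration, not the actual FLASH source: `merge_base` and `merge_overlap` are hypothetical names, and the second read is assumed to be reverse-complemented already and aligned position by position with the first.

```c
#include <assert.h>
#include <stddef.h>

/* Choose the merged base for one overlapping position:
 * higher quality wins; on a tie, take read 2 unless it is 'N'. */
static char merge_base(char base_1, char qual_1, char base_2, char qual_2)
{
    if (qual_1 > qual_2)
        return base_1;
    if (qual_1 < qual_2)
        return base_2;
    return (base_2 == 'N') ? base_1 : base_2;
}

/* Merge n overlapping positions into out (needs n + 1 bytes).
 * Quality characters are compared directly, which is valid when both
 * reads use the same encoding. */
static void merge_overlap(const char *seq_1, const char *qual_1,
                          const char *seq_2, const char *qual_2,
                          size_t n, char *out)
{
    assert(seq_1 && qual_1 && seq_2 && qual_2 && out);

    for (size_t i = 0; i < n; i++)
        out[i] = merge_base(seq_1[i], qual_1[i], seq_2[i], qual_2[i]);
    out[n] = '\0';
}
```

For example, merging "GCT" with "TCG" when both have qualities "III" gives "TCG": every mismatch is a tie and resolves to the second read, which is the behaviour seen in the question.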

modified 4.9 years ago • written 4.9 years ago by Devon Ryan

Awesome! Thanks!

written 4.9 years ago by clear.choi

Oh, and could you tell me what "the base from the second read is an 'N'" means?

written 4.9 years ago by clear.choi

That's asking if the base seen is an N (a position the sequencer could not call) rather than one of A, T, C, or G. Obviously if one read has an N at a position, then using the other read at that position is the right idea.

written 4.9 years ago by Devon Ryan