Question

How To Calculate Overlap Read Lengths

1

Entering edit mode

10.6 years ago

am ▴ 60

Hello, I used Pear to merge my Illumina (MiSeq sequencer) pair-end reads. I'd like to know the overlap length in each of the merged reads in order to calculate min length, max, average, mode, median etc... Can you suggest my how to do? I tried with a perl script (length R1 + length R2 - length MergedR1R2) but I am not very good in programming.... Can anybody help me and tell me how to do this? Thanks!!

overlap length reads • 6.5k views

ADD COMMENT • link updated 10.6 years ago by Neilfws 49k • written 10.6 years ago by am ▴ 60

1

Entering edit mode

can you post an example of input file, pls?

ADD REPLY • link 10.6 years ago by User000 ▴ 710

0

Entering edit mode

I have 3 files in fastq format. R1.fastq, R2.fastq, pear_R1R2.assembled.fastq.

This is an example of what I am talking about. I show only the sequence (2nd line in the fastq file) R1 A G T G C A C T A G T G G T R2 C A C T A G T G G T A T A A Pear R1/R2 A G T G C A C T A G T G G T A T A A I want to extract the length of the overlap region C A C T A G T G G T ( 10 in this case) , for every read contained in the pear_R1R2.assembled.fastq file.

Thank you very much!

ADD REPLY • link 10.6 years ago by am ▴ 60

score 3 · Answer 1 · 2014-03-19

3

Entering edit mode

10.6 years ago

User000 ▴ 710

You can simply consider the two seq-s and find the overlapping region. Using python:

import difflib
s1 = "AGTGCACTAGTGGT"
s2 = "CACTAGTGGTATAA"

def overlap(s1, s2):
    s = difflib.SequenceMatcher(None, s1, s2)
    a, b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
    return s1[a:a+size]

print len((overlap(s1, s2)))

ADD COMMENT • link 10.6 years ago by User000 ▴ 710

0

Entering edit mode

Thank you! this very useful but I think I wasn't clear about what I was asking. I would like to know if Pear respected the parameters i gave it (min overlap length= 50 etc..) inspecting the length of the overlap region contained in every merged read contained in pear_R1R2.assembled.fastq file. Sorry for my english. :)

ADD REPLY • link 10.6 years ago by am ▴ 60

0

Entering edit mode

Sorry, I dont understand very well your point. The problem is your pear file is going to output an assembled sequence not the overlapping one, how can you extract from it the overlap region if there are no input seq-s? May be I am missing something. But I think you can first find overlapping regions separately and then calculate min length, max, average, mode, median etc.

ADD REPLY • link 10.6 years ago by User000 ▴ 710

0

Entering edit mode

You are so rigth. This is the reason why I was thinking about doing something like length of read1 + length of read 2 - length of assembled read 1/2. So if R1 A G T G C A C T A G T G G T length is 14 and R2 C A C T A G T G G T A T A A length is 14 and their sum is 14+14= 28 while Pear R1/R2 A G T G C A C T A G T G G T A T A A is 18 bp long, the overlap region ( ~ shared by the the two reads) would be 28 -18 = 14 bp long C A C T A G T G G T. I am not looking at single base identity comparing the sequences because I know that during the assembly Pear chose the base with the best quality score according to its position. "Now, assume two reads x and y can be merged and have an overlapping region C. PEAR will correct errors in the overlapping region and compute updated quality scores for the overlap. For every pair of corresponding observed bases X and Y in C and their quality scores eX and eY,, respectively, we distinguish four cases—X and Y are identical, different, one of them is uncalled or both of them are uncalled. When the two bases are identical, PEAR simply inserts this base into the corresponding position in the merged sequence and assigns the product of the quality scores: because errors are independent from each other (see Section 2.1). When the base pairs are different, PEAR inserts the base with the highest quality score and the corresponding quality score. If (only) one of the two bases is uncalled (N), PEAR uses the called base and its quality score. Finally, if both bases are uncalled, we arbitrarily use the lower of the two quality scores, as a quality score is required to obtain a valid FASTQ output file. " This reasoning maybe doesn't make sense. I'll think about it.

ADD REPLY • link 10.6 years ago by am ▴ 60

0

Entering edit mode

can you post one of your fastq file example? I want to see the format

ADD REPLY • link 10.6 years ago by User000 ▴ 710

0

Entering edit mode

R1

@M01598:61:000000000-A723A:1:1101:21164:1373 1:N:0:1 CTGGCTGAATGAGACTGGTGTCGACACTAGTGGTACTGATGAAGTTCAGAGGAGCCAAGATGGCCGAATAGGAACAGCTCCGGTCTACAGTTCCCAGCCCGAGCGACGCAGAAGACTGGTGATTTCTGCATTTCCATCTGAGGTACTGGCTTCATCTCACTACGTAGTGCCAGACAGTGGGCGCACTCCAGTGGGTGCGCGCACCGGGCACGAGCAGAAGCAGGGCAAGGCATTGCCTCACCTCGGAAGCGCCAG + -8A<cgeffefc<fgg9@ccff,8c;ff@fc<,efggcaf9aa9eee<,@d,@@fffgd?f9<c<:c@ccf8@<fgefgfgd7@@befa8cffgf9<efffd@c@@cccc++,,,:a,,c,,@eaa@,c,b@??e9@?e,:,4a,cf98,afcdff?fea?,,+,,,8,8>,,,8,8C>D7F+@C+6@BFF?8EE5C>E7>BC8CDG788C88?F?F6?:75C:CF:7CG97+:<9/:7C*=5

##########################################################################################################################

R2

@M01598:61:000000000-A723A:1:1101:21164:1373 2:N:0:1 CTTGCGCTTCCCAGGNGAGGCAATGCCTTGCCCTGCTTCTGCTCGTGCCCGGTGCGCGCACCCACTGGCCTGCGCCCANTTTCTGTCACTCCCTAGTGAGATGAACNCAGTACCTCAGATGGAAATGCAGAAATCACNCGTCTTCTGCGTCCCTCGGGCTGGGAACTGTAGACCGGAGCTGTTCCTATTCGGCCATCTTGGCTCCTCTGAACTTCATCAGTNNNNCTAGNN + 888-8+@FGGGGF8,#,,,,:,-C;EFGF,CEFC<fefgg,cff7c6ffgc:fcfegdggeeggggg8fggffffgg:#4,:c=,,<?cefgffg8eaa9fc<ffg#:b+dbfdbfc,,:<<,,,:,9,,,,,ae,e#++8@e=efade@c+b=d7+3be@+8,@@d953,3@c<em>*4C,=2=CE,=<c,< em="">3:)2;1=,*29?DDD7,1:8::C+@F7FA ####22))##

Complement reverse R2

NNCTAGNNNNACTGATGAAGTTCAGAGGAGCCAAGATGGCCGAATAGGAACAGCTCCGGTCTACAGTTCCCAGCCCGAGGGACGCAGAAGACGNGTGATTTCTGCATTTCCATCTGAGGTACTGNGTTCATCTCACTAGGGAGTGACAGAAANTGGGCGCAGGCCAGTGGGTGCGCGCACCGGGCACGAGCAGAAGCAGGGCAAGGCATTGCCTCNCCTGGGAAGCGCAAG

###########################################################################################################################

PEAR Paired End reAd meRger

pear_assembled.fastq

@M01598:61:000000000-A723A:1:1101:21164:1373 1:N:0:1 CTGGCTGAATGAGACTGGTGTCGACACTAGTGGTACTGATGAAGTTCAGAGGAGCCAAGATGGCCGAATAGGAACAGCTCCGGTCTACAGTTCCCAGCCCGAGCGACGCAGAAGACTGGTGATTTCTGCATTTCCATCTGAGGTACTGGGTTCATCTCACTAGGGAGTGCCAGACAGTGGGCGCAGGCCAGTGGGTGCGCGCACCGGGCACGAGCAGAAGCAGGGCAAGGCATTGCCTCACCTGGGAAGCGCAAG + -8A<cgeffefc<fgg9@ccff,8c;nnqwc<,efl]h<code>P[ZZP^UPROcgJXQOQbW^PNRNLG\^\NghTQXQiXOPOPfVRKTY^dWbQ]Qe]]WP\iaCb_gfchOGPKCDA,PNPLKPLLKDNEMKJZRKJgQ[OYbOMgR8Gfk^fk^_fiXeQFQFPNVG8IHNE8?8\dj\kPeiGFMfhlle\ik[fdi\O[eP]OXf]]ZN^^djdaXdNLE[hNTeREPNRDB+ES^GPPPTIDNO8TL

ADD REPLY • link 10.6 years ago by am ▴ 60

0

Entering edit mode

this seems to be so simple that I dont know if I should write it, but is this what you are looking for?

s1 = "AGTGCACTAGTGGT"
s2 = "CACTAGTGGTATAA"
s3 = "AGTGCACTAGTGGTATAA"
overlap = len(s1)+len(s2)-len(s3)
print overlap

ADD REPLY • link 10.6 years ago by User000 ▴ 710