Question: How To Calculate Overlap Read Lengths
1
gravatar for am
7.0 years ago by
am20
European Union
am20 wrote:

Hello, I used Pear to merge my Illumina (MiSeq sequencer) pair-end reads. I'd like to know the overlap length in each of the merged reads in order to calculate min length, max, average, mode, median etc... Can you suggest my how to do? I tried with a perl script (length R1 + length R2 - length MergedR1R2) but I am not very good in programming.... Can anybody help me and tell me how to do this? Thanks!!

overlap reads length • 4.6k views
ADD COMMENTlink modified 7.0 years ago by Neilfws49k • written 7.0 years ago by am20
1

can you post an example of input file, pls?

ADD REPLYlink written 7.0 years ago by User000440

I have 3 files in fastq format. R1.fastq, R2.fastq, pear_R1R2.assembled.fastq.

This is an example of what I am talking about. I show only the sequence (2nd line in the fastq file) R1 A G T G C A C T A G T G G T R2 C A C T A G T G G T A T A A Pear R1/R2 A G T G C A C T A G T G G T A T A A I want to extract the length of the overlap region C A C T A G T G G T ( 10 in this case) , for every read contained in the pear_R1R2.assembled.fastq file.

Thank you very much!

ADD REPLYlink written 7.0 years ago by am20
3
gravatar for User000
7.0 years ago by
User000440
User000440 wrote:

You can simply consider the two seq-s and find the overlapping region. Using python:

import difflib
s1 = "AGTGCACTAGTGGT"
s2 = "CACTAGTGGTATAA"

def overlap(s1, s2):
    s = difflib.SequenceMatcher(None, s1, s2)
    a, b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
    return s1[a:a+size]

print len((overlap(s1, s2)))
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by User000440

Thank you! this very useful but I think I wasn't clear about what I was asking. I would like to know if Pear respected the parameters i gave it (min overlap length= 50 etc..) inspecting the length of the overlap region contained in every merged read contained in pear_R1R2.assembled.fastq file. Sorry for my english. :)

ADD REPLYlink written 7.0 years ago by am20

Sorry, I dont understand very well your point. The problem is your pear file is going to output an assembled sequence not the overlapping one, how can you extract from it the overlap region if there are no input seq-s? May be I am missing something. But I think you can first find overlapping regions separately and then calculate min length, max, average, mode, median etc.

ADD REPLYlink written 7.0 years ago by User000440

You are so rigth. This is the reason why I was thinking about doing something like length of read1 + length of read 2 - length of assembled read 1/2. So if R1 A G T G C A C T A G T G G T length is 14 and R2 C A C T A G T G G T A T A A length is 14 and their sum is 14+14= 28 while Pear R1/R2 A G T G C A C T A G T G G T A T A A is 18 bp long, the overlap region ( ~ shared by the the two reads) would be 28 -18 = 14 bp long C A C T A G T G G T. I am not looking at single base identity comparing the sequences because I know that during the assembly Pear chose the base with the best quality score according to its position. "Now, assume two reads x and y can be merged and have an overlapping region C. PEAR will correct errors in the overlapping region and compute updated quality scores for the overlap. For every pair of corresponding observed bases X and Y in C and their quality scores eX and eY,, respectively, we distinguish four cases—X and Y are identical, different, one of them is uncalled or both of them are uncalled. When the two bases are identical, PEAR simply inserts this base into the corresponding position in the merged sequence and assigns the product of the quality scores: because errors are independent from each other (see Section 2.1). When the base pairs are different, PEAR inserts the base with the highest quality score and the corresponding quality score. If (only) one of the two bases is uncalled (N), PEAR uses the called base and its quality score. Finally, if both bases are uncalled, we arbitrarily use the lower of the two quality scores, as a quality score is required to obtain a valid FASTQ output file. " This reasoning maybe doesn't make sense. I'll think about it.

ADD REPLYlink written 7.0 years ago by am20

can you post one of your fastq file example? I want to see the format

ADD REPLYlink written 7.0 years ago by User000440

R1

@M01598:61:000000000-A723A:1:1101:21164:1373 1:N:0:1 CTGGCTGAATGAGACTGGTGTCGACACTAGTGGTACTGATGAAGTTCAGAGGAGCCAAGATGGCCGAATAGGAACAGCTCCGGTCTACAGTTCCCAGCCCGAGCGACGCAGAAGACTGGTGATTTCTGCATTTCCATCTGAGGTACTGGCTTCATCTCACTACGTAGTGCCAGACAGTGGGCGCACTCCAGTGGGTGCGCGCACCGGGCACGAGCAGAAGCAGGGCAAGGCATTGCCTCACCTCGGAAGCGCCAG + -8A<cgeffefc<fgg9@ccff,8c;ff@fc&lt;,efggcaf9aa9eee&lt;,@d,@@fffgd?f9<c&lt;:c@ccf8@<fgefgfgd7@@befa8cffgf9<efffd@c@@cccc++,,,:a,,c,,@eaa@,c,b@??e9@?e,:,4a,cf98,afcdff?fea?,,+,,,8,8>,,,8,8C>D7F+@C+6@BFF?8EE5C>E7>BC8CDG788C88?F?F6?:75C:CF:7CG97+:<9/:7C*=5

##########################################################################################################################

R2

@M01598:61:000000000-A723A:1:1101:21164:1373 2:N:0:1 CTTGCGCTTCCCAGGNGAGGCAATGCCTTGCCCTGCTTCTGCTCGTGCCCGGTGCGCGCACCCACTGGCCTGCGCCCANTTTCTGTCACTCCCTAGTGAGATGAACNCAGTACCTCAGATGGAAATGCAGAAATCACNCGTCTTCTGCGTCCCTCGGGCTGGGAACTGTAGACCGGAGCTGTTCCTATTCGGCCATCTTGGCTCCTCTGAACTTCATCAGTNNNNCTAGNN + 888-8+@FGGGGF8,#,,,,:,-C;EFGF,CEFC<fefgg,cff7c6ffgc:fcfegdggeeggggg8fggffffgg:#4,:c=,,<?cefgffg8eaa9fc<ffg#:b+dbfdbfc,,:&lt;&lt;,,,:,9,,,,,ae,e#++8@e=efade@c+b=d7+3be@+8,@@d953,3@c<em>*4C,=2=CE,=<c,< em="">3:)2;1=,*29?DDD7,1:8::C+@F7FA ####22))##

Complement reverse R2

NNCTAGNNNNACTGATGAAGTTCAGAGGAGCCAAGATGGCCGAATAGGAACAGCTCCGGTCTACAGTTCCCAGCCCGAGGGACGCAGAAGACGNGTGATTTCTGCATTTCCATCTGAGGTACTGNGTTCATCTCACTAGGGAGTGACAGAAANTGGGCGCAGGCCAGTGGGTGCGCGCACCGGGCACGAGCAGAAGCAGGGCAAGGCATTGCCTCNCCTGGGAAGCGCAAG

###########################################################################################################################

PEAR Paired End reAd meRger

pear_assembled.fastq

@M01598:61:000000000-A723A:1:1101:21164:1373 1:N:0:1 CTGGCTGAATGAGACTGGTGTCGACACTAGTGGTACTGATGAAGTTCAGAGGAGCCAAGATGGCCGAATAGGAACAGCTCCGGTCTACAGTTCCCAGCCCGAGCGACGCAGAAGACTGGTGATTTCTGCATTTCCATCTGAGGTACTGGGTTCATCTCACTAGGGAGTGCCAGACAGTGGGCGCAGGCCAGTGGGTGCGCGCACCGGGCACGAGCAGAAGCAGGGCAAGGCATTGCCTCACCTGGGAAGCGCAAG + -8A<cgeffefc<fgg9@ccff,8c;nnqwc&lt;,efl]h<code>P[ZZP^UPROcgJXQOQbW^PNRNLG\^\NghTQXQiXOPOPfVRKTY^dWbQ]Qe]]WP\iaCb_gfchOGPKCDA,PNPLKPLLKDNEMKJZRKJgQ[OYbOMgR8Gfk^fk^_fiXeQFQFPNVG8IHNE8?8\dj\kPeiGFMfhlle\ik[fdi\O[eP]OXf]]ZN^^djdaXdNLE[hNTeREPNRDB+ES^GPPPTIDNO8TL

ADD REPLYlink written 7.0 years ago by am20

this seems to be so simple that I dont know if I should write it, but is this what you are looking for?

s1 = "AGTGCACTAGTGGT"
s2 = "CACTAGTGGTATAA"
s3 = "AGTGCACTAGTGGTATAA"
overlap = len(s1)+len(s2)-len(s3)
print overlap
ADD REPLYlink written 7.0 years ago by User000440
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2102 users visited in the last hour
_