How to quantify the overlapping reads in paired-end DNA sequencing to check the sequencing efficiency
4
0
4.6 years ago
cjgunase ▴ 30

Hi All, I am a newbie to sequencing technologies.

Is there a way to quantify the overlap between paired-end reads? Completely overlapping read pairs would be a waste of sequencing resources. A bit of overlap can be useful when doing the alignment, but a small gap is optimal to maximize coverage.

Is there a way to check sequencing efficiency using the alignment files? If there is a lot of overlap, we want to improve this.

Any help is appreciated.

Thank you.

sequencing • 5.9k views
0

but a small gap is optimal to maximize coverage.

I don't agree: if the two reads overlap, it usually means that the sequenced fragment was too short.

In paired-end sequencing, you should rather consider the sequencing depth to "maximize coverage".

2
4.6 years ago
Renesh ★ 2.1k

Yes, you can do this with the alignment BAM/SAM file. You can extract the records for concordant alignments (tagged YT:Z:CP by Bowtie 2) from the SAM/BAM file. Once you have the concordant alignments, look at field 9 of each record. Field 9 (the 9th column, TLEN) gives the fragment length of the paired-end reads.

From here, you can get the fragment length distribution. Based on your sequencing protocol, you should know the expected insert size for the paired-end reads, and you can compare the fragment length with it. If the fragment length is less than the sum of the two read lengths, your paired reads overlap. You can also plot a histogram of the fragment length distribution.

Note: the concordant alignment records give you the alignments that fall within the given insert size range.
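
The extraction described in this answer could be sketched in Python roughly as follows. This is only a minimal sketch, assuming SAM text input from an aligner that writes the YT:Z:CP tag (for example Bowtie 2 output passed through `samtools view -h`); the function name `fragment_lengths` and the example records are hypothetical and made up for illustration.

```python
from collections import Counter

def fragment_lengths(sam_lines):
    """Yield one fragment length (TLEN, column 9) per concordant pair."""
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        # Optional tags start at column 12; YT:Z:CP marks a concordant pair
        # (assumption: the aligner writes this tag, as Bowtie 2 does).
        if "YT:Z:CP" not in fields[11:]:
            continue
        tlen = int(fields[8])
        # The two mates carry +TLEN and -TLEN; keep only the positive one
        # so that each pair is counted once.
        if tlen > 0:
            yield tlen

# Hypothetical SAM records (SEQ and QUAL set to "*" to keep them short).
example = [
    "@HD\tVN:1.6",
    "p1\t99\tchr1\t100\t42\t50M\t=\t180\t130\t*\t*\tYT:Z:CP",
    "p1\t147\tchr1\t180\t42\t50M\t=\t100\t-130\t*\t*\tYT:Z:CP",
    "p2\t99\tchr1\t500\t42\t50M\t=\t520\t70\t*\t*\tYT:Z:CP",
    "p2\t147\tchr1\t520\t42\t50M\t=\t500\t-70\t*\t*\tYT:Z:CP",
    "p3\t65\tchr1\t900\t42\t50M\t=\t5850\t5000\t*\t*\tYT:Z:DP",
]

# Fragment length -> number of pairs; this is the data behind the histogram.
histogram = Counter(fragment_lengths(example))
```

With real data, the same function could be fed the lines of a SAM file opened with `open(...)`; here pair p2 (fragment length 70, two 50-base reads) would be the overlapping one.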

0

Sorry, I am confused. You said to compare the fragment length with the insert size. Then how do I check whether it is less than the sum of the two reads? I am new to this, so sorry if this is a very simple thing that I am missing.

1
FRAGMENT ========================================


The BAM file contains the genomic position of each read as well as its sequence (whose length is the read length).

0

So, given the insert length (N) and the lengths of read 1 (r1) and read 2 (r2), (r1 + r2 - N) should give the number of overlapping bases, with positive values indicating overlap. We can then plot a histogram of the overlap lengths. For a high-quality library, this should be a right-skewed distribution.
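
The arithmetic in this comment can be written out as a small Python sketch. The function names `overlap_length` and `overlap_histogram` and the example numbers are hypothetical, chosen only to illustrate the formula r1 + r2 - N.

```python
from collections import Counter

def overlap_length(r1, r2, n):
    """Return r1 + r2 - N for read lengths r1, r2 and insert length N.

    A positive value is the number of overlapping bases; zero or a
    negative value means the mates do not overlap (gap of -value bases).
    """
    return r1 + r2 - n

def overlap_histogram(pairs):
    """Count positive overlap lengths for an iterable of (r1, r2, N)."""
    hist = Counter()
    for r1, r2, n in pairs:
        ov = overlap_length(r1, r2, n)
        if ov > 0:
            hist[ov] += 1
    return hist

# Made-up example: three pairs of 50-base reads with different inserts.
pairs = [(50, 50, 130), (50, 50, 70), (50, 50, 100)]
hist = overlap_histogram(pairs)

# Fraction of pairs whose mates overlap at all.
fraction_overlapping = sum(hist.values()) / len(pairs)
```

Here only the pair with N = 70 overlaps (by 30 bases); the pair with N = 130 has a 30-base gap and the pair with N = 100 just touches.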


0

So, based on the diagram above, I should consider the insert length (which comes from the BAM file), NOT the fragment length?

0

The term fragment length, as reported in the BAM file, corresponds to the total size covered by the paired-end reads, which may or may not be equal to the insert size.

2
4.6 years ago
h.mon 34k

In addition to the mapping statistics from BAM files suggested above, you can calculate a (biased) estimate of the overlap from the FASTQ files by simply merging R1 and R2, e.g., with BBMerge:

bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist.txt
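
The resulting insert-size histogram can then be turned into overlap lengths. The following Python sketch assumes that ihist.txt consists of header lines starting with `#` followed by tab-separated insert-size/count rows (check this against the output of your BBMerge version), and that both reads have the same length; `read_ihist`, `overlap_counts`, and the example rows are hypothetical.

```python
def read_ihist(lines):
    """Parse insert-size histogram lines into {insert_size: count}.

    Assumed layout: '#'-prefixed header lines, then 'size<TAB>count' rows.
    """
    hist = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, count = line.split("\t")[:2]
        hist[int(size)] = int(count)
    return hist

def overlap_counts(hist, read_len):
    """Convert insert sizes to overlap lengths (2 * read_len - insert size).

    Assumes both mates are read_len bases long; only positive overlaps
    with a non-zero count are kept.
    """
    result = {}
    for size, count in hist.items():
        overlap = 2 * read_len - size
        if overlap > 0 and count > 0:
            result[overlap] = count
    return result

# Made-up histogram rows for illustration only.
example_lines = ["#InsertSize\tCount", "100\t5", "150\t2", "190\t0"]
overlaps = overlap_counts(read_ihist(example_lines), read_len=100)
```

With a real run, `read_ihist(open("ihist.txt"))` would replace the made-up rows. Since only pairs that could be merged contribute to this histogram, the estimate is biased, as noted above.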

1
4.6 years ago
Renesh ★ 2.1k
0
4.6 years ago
igor 12k

You can calculate the overlap by comparing the read length to the fragment/insert size. There are a few ways to do that. See this previous discussion: Is It Possible To Get Fragment Length, Read Length And Number Of Fragments From A Bam/Sam File