Hamming distance of paired-end data from target amplicon sequencing
3
0
Entering edit mode
6.5 years ago

Hi all,

I am analyzing paired-end data of amplicon libraries from a target region of a viral gene. Briefly, I ordered a PCR product where I designed degenerated codons at specific positions (let's say at 4 different positions) that are flanked by conserved nt sequences. I am interested in looking at the dynamics of gene variants (haplotypes/motifs) in this region under different experimental conditions. I prepared multiple amplicon libraries from this target region and the sequencing results look good. After adapter removal, I further trimmed and filtered out reads with a defined length using the conserved sequences flaking my region of interest.

With paired-end data of amplicon sequencing, match read pairs (read1/read2 or forward/reverse) should be complementary in sequence. So far, I have been working with read1/read2 separately (two fastq files, read1.fastq and read2.fastq). Before proceeding with mapping and variant calling, I want to compare these two fastq files and output only the match read pairs that are fully complementary. Could anyone offer some advice on how to accomplish this? I do not have much experience in programming, but I have looked at some posts where they use hamming distance to compare two strings. Could it be applied to compare two fastq files? Is there a more straight forward approach?

Thanks in advance.

SNP amplicon sequencing sequencing • 2.2k views
ADD COMMENT
2
Entering edit mode
6.5 years ago
GenoMax 142k

bbmerge.sh from BBMap suite should also be in the running here. @Brian (author of BBMap) recommends that you merge the reads first and then trim. This can be done using bbduk.sh. You could then use tadpole.sh to create assemblies of this genes or align to reference with bbmap.sh. All within BBMap suite. You can use hdist= parameter with many of these to allow a certain number of differences between the sequences.

ADD COMMENT
0
Entering edit mode

Thanks. That's exactly what I was looking for.

ADD REPLY
0
Entering edit mode
6.5 years ago

You can check on this blog post there are multiple tools listed :

http://thegenomefactory.blogspot.be/2012/11/tools-to-merge-overlapping-paired-end.html

Here are the listed files from the blog post:

ADD COMMENT
0
Entering edit mode

Thanks for the suggestions. I picked @genomax response because BBMap was already installed on the university cluster, but I will give PEAR a try.

ADD REPLY
0
Entering edit mode
6.5 years ago
Joe 21k

This code that I wrote should get you most of the way there. It’s not quite finished and has some (many) errors, but the hamming distance function works.

EDIT

Now fully functioning code.

https://github.com/jrjhealey/bioinfo-tools/blob/master/Hamming.py

ADD COMMENT
0
Entering edit mode

I intended it to be I mplemented to work on MSAs but you should be able to spot the important bits easy enough. Just needs the 2 sequences passing as strings to the function.

ADD REPLY
0
Entering edit mode

Thanks. I will take a look at it.

ADD REPLY
1
Entering edit mode

Just a note to say that I actually finished the script properly this evening so should be runnable:

$ python Hamming.py -A ATGATG -B ATGATA # If comparing 2 strings directly

$ python Hamming.py -a alignment.msa -f format # if passing in a multiple sequence alignment
ADD REPLY

Login before adding your answer.

Traffic: 1348 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6