Hamming distance of paired-end data from target amplicon sequencing

0

6.5 years ago

jefferson.jss • 0

Hi all,

I am analyzing paired-end data of amplicon libraries from a target region of a viral gene. Briefly, I ordered a PCR product where I designed degenerate codons at specific positions (let's say at 4 different positions) that are flanked by conserved nt sequences. I am interested in looking at the dynamics of gene variants (haplotypes/motifs) in this region under different experimental conditions. I prepared multiple amplicon libraries from this target region and the sequencing results look good. After adapter removal, I further trimmed and filtered out reads with a defined length using the conserved sequences flanking my region of interest.

With paired-end data from amplicon sequencing, matched read pairs (read1/read2 or forward/reverse) should be complementary in sequence. So far, I have been working with read1/read2 separately (two fastq files, read1.fastq and read2.fastq). Before proceeding with mapping and variant calling, I want to compare these two fastq files and output only the matched read pairs that are fully complementary. Could anyone offer some advice on how to accomplish this? I do not have much experience in programming, but I have looked at some posts where they use Hamming distance to compare two strings. Could it be applied to compare two fastq files? Is there a more straightforward approach?

Thanks in advance.

SNP amplicon sequencing sequencing • 2.2k views

ADD COMMENT • link updated 6.5 years ago by Joe 21k • written 6.5 years ago by jefferson.jss • 0

0

6.5 years ago

Nicolas Rosewick 11k

You can check this blog post, which lists several tools:

http://thegenomefactory.blogspot.be/2012/11/tools-to-merge-overlapping-paired-end.html

Here are the tools listed in the blog post:

PEAR (Paired-End Read Merger) : http://sco.h-its.org/exelixis/web/software/pear/doc.html (* this is what I use)
COPE (Connecting Overlapping Paired End reads) : http://sourceforge.net/projects/coperead/
SeqPrep: https://github.com/jstjohn/SeqPrep
FLASH (Fast Length Adjustment of Short Reads to Improve Genome Assemblies): http://www.cbcb.umd.edu/software/flash
fastq-join (part of ea-utils): http://code.google.com/p/ea-utils/wiki/FastqJoin
PANDAseq: https://github.com/neufeld/pandaseq
stitch (now defunct, merged into PANDAseq): https://github.com/audy/stitch
mergePairs.py: http://code.google.com/p/standardized-velvet-assembly-report/source/browse/trunk/mergePairs.py

ADD COMMENT • link 6.5 years ago by Nicolas Rosewick 11k

0

Thanks for the suggestions. I picked @genomax's response because BBMap was already installed on the university cluster, but I will give PEAR a try.

ADD REPLY • link 6.5 years ago by jefferson.jss • 0

0

6.5 years ago

Joe 21k

This code that I wrote should get you most of the way there. It’s not quite finished and has some (many) errors, but the hamming distance function works.

EDIT

Now fully functioning code.

https://github.com/jrjhealey/bioinfo-tools/blob/master/Hamming.py

ADD COMMENT • link 6.5 years ago by Joe 21k

0

I intended it to be implemented to work on MSAs, but you should be able to spot the important bits easily enough. It just needs the 2 sequences passed to the function as strings.
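
For anyone who just wants the core idea without opening the script, here is a minimal sketch of a Hamming distance function that takes two strings. It is not the code from the linked Hamming.py; the name `hamming_distance` and the case-insensitive comparison are my own assumptions for illustration.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        # Hamming distance is only defined for sequences of the same length.
        raise ValueError("sequences must be the same length")
    # Compare position by position, ignoring upper/lower case.
    return sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))


print(hamming_distance("ATGATG", "ATGATA"))  # 1 (only the last base differs)
```

A distance of 0 means the two strings are identical, so "fully matching" reads are simply those with distance 0.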

ADD REPLY • link 6.5 years ago by Joe 21k

0

Thanks. I will take a look at it.

ADD REPLY • link 6.5 years ago by jefferson.jss • 0

1

Just a note to say that I actually finished the script properly this evening, so it should be runnable:

$ python Hamming.py -A ATGATG -B ATGATA # If comparing 2 strings directly

$ python Hamming.py -a alignment.msa -f format # if passing in a multiple sequence alignment
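
To apply the same idea to the two FASTQ files from the question, something along these lines might work. This is only a rough sketch under several assumptions: both files list the pairs in the same order, each record is exactly 4 lines, both reads have already been trimmed to the same region, and read 2 is the reverse complement of read 1. The function names and the `max_dist` parameter are hypothetical and are not part of Hamming.py.

```python
from itertools import islice

# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(path):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break  # end of file (or a truncated final record)
            header, seq, _, qual = (line.rstrip("\r\n") for line in record)
            yield header, seq, qual


def matching_pairs(r1_path, r2_path, max_dist=0):
    """Yield read 1 records whose mate differs at no more than max_dist positions."""
    for (h1, s1, q1), (_h2, s2, _q2) in zip(read_fastq(r1_path), read_fastq(r2_path)):
        s2_rc = reverse_complement(s2)
        if len(s1) != len(s2_rc):
            continue  # Hamming distance is undefined for unequal lengths
        dist = sum(a != b for a, b in zip(s1, s2_rc))
        if dist <= max_dist:
            yield h1, s1, q1
```

Looping over `matching_pairs("read1.fastq", "read2.fastq")` and writing out each record would keep only pairs that agree perfectly; raising `max_dist` tolerates a few sequencing errors. Note that this skips any pair whose trimmed reads differ in length, which is one reason a dedicated read merger may be the more robust option.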

ADD REPLY • link 6.5 years ago by Joe 21k

score 2 · Accepted Answer · 2017-11-17

2

6.5 years ago

GenoMax 142k

bbmerge.sh from the BBMap suite should also be in the running here. @Brian (author of BBMap) recommends that you merge the reads first and then trim. Trimming can be done using bbduk.sh. You could then use tadpole.sh to create assemblies of these genes or align to a reference with bbmap.sh, all within the BBMap suite. You can use the hdist= parameter with many of these tools to allow a certain number of differences between the sequences.