Question: Find pattern that is present twice and allow <=2 mismatches on each
5 weeks ago, nafizh wrote:

I have a FASTQ file of 400,000 reads, so speed is important. Each sequence should contain an integrated barcode twice. Given a barcode, I want to find the sequences that contain it twice, allowing <= 2 mismatches in each copy. For example, with the barcode 'ATTCGACCGATAGG', I would like to retrieve all of the following sequences:


Note that the first barcode in the fourth sequence is one character short. I have tried Biopython and regular expressions, but both are too slow given that I have 5,000 barcodes. Is there a fast solution available in Python, or with something like grep or awk? Thanks.

barcode grep awk fastq python • 123 views

Use cutadapt and control the error rate. Please read the cutadapt manual for an explanation of the parameters:

$ cutadapt --action=none --trimmed-only -g ATTCGACCGATAGG...ATTCGACCGATAGG input.fq

edit: edited for fastq, instead of fasta
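
To make the matching criterion concrete, here is a minimal pure-Python sketch of "barcode present twice with <= 2 mismatches each". The function names `mismatch_positions` and `has_barcode_twice` are hypothetical and are not part of cutadapt. The sketch counts substitutions only, so it would miss a barcode copy that is one character short, and it is far too slow for 400,000 reads against 5,000 barcodes. cutadapt instead uses an alignment that, by default, also counts insertions and deletions as errors.

```python
def mismatch_positions(read, barcode, max_mismatches=2):
    """Return start positions where `barcode` matches `read`
    with at most `max_mismatches` substitutions (no indels)."""
    hits = []
    k = len(barcode)
    for start in range(len(read) - k + 1):
        mismatches = 0
        for a, b in zip(read[start:start + k], barcode):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches at this position
        else:
            # Inner loop finished without break: approximate match found.
            hits.append(start)
    return hits


def has_barcode_twice(read, barcode, max_mismatches=2):
    """True if the read holds two non-overlapping approximate copies."""
    hits = mismatch_positions(read, barcode, max_mismatches)
    # If any two hits are non-overlapping, the first and last ones are.
    return bool(hits) and hits[-1] - hits[0] >= len(barcode)


barcode = "ATTCGACCGATAGG"
# Made-up read for illustration: an exact copy, then a copy with 2 substitutions.
two_copies = "GG" + "ATTCGACCGATAGG" + "TTTT" + "ATTCGTCCGATACG" + "AA"
one_copy = "GG" + "ATTCGACCGATAGG" + "TTTTAA"
print(has_barcode_twice(two_copies, barcode))
print(has_barcode_twice(one_copy, barcode))
```

The example reads are invented for this sketch and are not the sequences from the question.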

modified 5 weeks ago • written 5 weeks ago by cpad0112

Thanks for the reply. Does cutadapt allow for <=n mismatches on the barcodes?

written 5 weeks ago by nafizh

Cutadapt lets you set a maximum error rate, or a maximum number of mismatches (n), for each matched barcode sequence. Please read the section on error rate in the cutadapt manual.
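
As a rough illustration of how an error rate maps to a mismatch count, the sketch below assumes the rule described in the cutadapt manual: the number of allowed errors is the matched length multiplied by the maximum error rate, rounded down. The helper `max_errors` is a hypothetical name for this arithmetic, not a cutadapt function. Check the manual for the exact behaviour of your cutadapt version.

```python
import math


def max_errors(match_length, error_rate):
    """Allowed errors for a full-length match, assuming the
    floor(length * rate) rule from the cutadapt manual."""
    return math.floor(match_length * error_rate)


barcode = "ATTCGACCGATAGG"  # 14 nt barcode from the question
for rate in (0.1, 0.15):
    print(rate, max_errors(len(barcode), rate))
```

Under that assumption, the default rate of 0.1 permits only one error in a 14 nt barcode, while a rate a little above 2/14 (about 0.143), such as `-e 0.15`, permits two.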

written 5 weeks ago by cpad0112