Question: Find reads in which a barcode occurs twice, allowing <=2 mismatches per occurrence
5 weeks ago, nafizh wrote:

I have a FASTQ file of 400,000 reads, so speed matters. Each read should contain the same barcode twice. Given a barcode, I want to find the reads in which it occurs twice, with <= 2 mismatches in each occurrence. For example, for the barcode 'ATTCGACCGATAGG' I would like to retrieve all of the following sequences:

>TATCTTGTGGAAAGGACGAAACACCGAACACAAAGCATAGATGCGTTTAAGAGCTATGCTGGAAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA

Note that the first barcode in the fourth sequence is one character short (ATTCGACGATAGG, missing a C). I have tried Biopython and regex, but both are too slow given that I have 5k barcodes. Is there a fast solution in Python, or with something like grep or awk? Thanks.
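To make the matching criterion concrete, here is a minimal Python sketch of what "present twice with <= 2 errors" could mean. It is a plain reference implementation for spot-checking a handful of reads, not a fast solution for 400,000 reads and 5k barcodes. The function name `count_approx_hits` is hypothetical, and the sketch assumes that "mismatch" should be read as edit distance (substitutions, insertions and deletions), since the fourth example read has a barcode with a missing base.

```python
def count_approx_hits(read, barcode, max_errors=2):
    """Count non-overlapping approximate occurrences of barcode in read.

    Assumption: an occurrence is any substring of the read whose edit
    distance to the barcode is <= max_errors. Uses the classic
    semi-global dynamic-programming scan (a match may start anywhere).
    """
    m = len(barcode)
    fresh = list(range(m + 1))  # cost of aligning each barcode prefix to nothing
    col = fresh[:]
    hits = 0
    for base in read:
        prev_diag = col[0]  # value of the previous column one row up
        # col[0] stays 0, so a match is free to start at any read position
        for i in range(1, m + 1):
            above_prev = col[i]  # previous column, same row
            cost = 0 if barcode[i - 1] == base else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         above_prev + 1,    # extra base in the read
                         col[i - 1] + 1)    # base missing from the read
            prev_diag = above_prev
        if col[m] <= max_errors:
            hits += 1
            col = fresh[:]  # restart so the next hit cannot overlap this one
    return hits


# Pieces taken from the fourth example read: the first barcode copy lacks a C.
spacer = "GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG"
read = "GCTTTTTTT" + "ATTCGACGATAGG" + spacer + "ATTCGACCGATAGG" + "TGGCGTAACTAG"
print(count_approx_hits(read, "ATTCGACCGATAGG"))
```

A read would then be kept when the count is at least 2. The scan costs on the order of (read length x barcode length) steps per read and barcode, which is why a compiled tool is the more realistic choice at this scale.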


Use cutadapt and set the maximum error rate to control how many mismatches are allowed. See the cutadapt manual for an explanation of the parameters:

$ cutadapt --action=none --trimmed-only -g ATTCGACCGATAGG...ATTCGACCGATAGG input.fq

Edit: changed the command to take FASTQ input instead of FASTA.

written 5 weeks ago by cpad0112

Thanks for the reply. Does cutadapt allow for <=n mismatches on the barcodes?

written 5 weeks ago by nafizh

Cutadapt lets you set either a maximum error rate or a maximum number of mismatches (n) per matched index sequence. Please read the section on error rate in the cutadapt manual.
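As a rough illustration of how an error rate translates into a mismatch budget, the small Python sketch below assumes the rule described in the cutadapt manual: for a full-length match, the allowed number of errors is the adapter length times the maximum error rate, rounded down. The helper name `allowed_errors` is hypothetical and is not part of cutadapt. Please check the manual for the version you run, in particular whether it accepts an absolute error count and how it treats insertions and deletions.

```python
import math


def allowed_errors(adapter_length, max_error_rate):
    """Errors tolerated for a full-length adapter match.

    Assumption: floor(length * rate), as described in the cutadapt manual.
    """
    return math.floor(adapter_length * max_error_rate)


barcode = "ATTCGACCGATAGG"  # 14 bases
for rate in (0.1, 0.15):
    print(rate, allowed_errors(len(barcode), rate))
```

Under this assumption, an error rate of 0.1 gives a 14-base barcode a budget of one error, so allowing two errors needs a rate a little above 2/14 (about 0.143), for example 0.15.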

written 5 weeks ago by cpad0112