I have some FASTQ files containing sequences that I reconstructed by merging forward and reverse reads with BBMerge. The 5' and 3' ends of each merged read should each carry a 6 bp inline barcode for contamination control.
I'm looking for a package that extracts the 6 bp at the beginning and end of each sequence, matches them against my database of barcodes while allowing a set number of nucleotide mismatches (in case of amplification errors), and reports:
- If no recognizable barcodes are present in sequence.
- If barcodes are present and match. If there are mutations in the barcodes, the script must report the number of mismatches.
- If barcodes are present but do not match. Again, if there are mutations in the barcode, the script must report the number of mismatches.
- The size of the insert if barcodes are present.
I can write a Biopython script to perform this procedure, but before reinventing the wheel I'd like to know whether there are packages that already implement and standardize the analysis of inline barcodes and perform the tasks described above.
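
To make the request concrete, here is a rough sketch of the logic I have in mind, in plain Python. It is only an illustration under several assumptions: the barcode names and sequences in `BARCODES` are made-up placeholders, all function names are hypothetical, mismatches are counted as substitutions only (Hamming distance, no indels), and whether the 3' barcode has to be reverse-complemented before lookup depends on the library design, so I left it as an optional flag.

```python
from typing import Dict, Optional, Tuple

# Placeholder barcode database (name -> 6 bp sequence); not real barcodes.
BARCODES: Dict[str, str] = {"BC01": "ACGTAC", "BC02": "TGCATG", "BC03": "GATTCA"}
BARCODE_LEN = 6

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match(tag: str, barcodes: Dict[str, str],
               max_mismatches: int) -> Optional[Tuple[str, int]]:
    """Closest barcode as (name, mismatches), or None if none is close enough.

    Ties are resolved by dictionary order, which a real tool should
    probably report as ambiguous instead.
    """
    name, dist = min(((n, hamming(tag, s)) for n, s in barcodes.items()),
                     key=lambda item: item[1])
    return (name, dist) if dist <= max_mismatches else None


def classify_read(seq: str, barcodes: Dict[str, str] = BARCODES,
                  max_mismatches: int = 1,
                  revcomp_3prime: bool = False) -> dict:
    """Classify one merged read by its 5' and 3' inline barcodes."""
    result = {"status": "no_barcode", "barcode_5p": None, "barcode_3p": None,
              "mismatches_5p": None, "mismatches_3p": None, "insert_size": None}
    seq = seq.upper()
    if len(seq) < 2 * BARCODE_LEN:
        return result  # too short to carry both barcodes

    tag5 = seq[:BARCODE_LEN]
    tag3 = seq[-BARCODE_LEN:]
    if revcomp_3prime:  # assumption: depends on how the barcodes were added
        tag3 = revcomp(tag3)

    hit5 = best_match(tag5, barcodes, max_mismatches)
    hit3 = best_match(tag3, barcodes, max_mismatches)
    if hit5 is None or hit3 is None:
        return result  # at least one end has no recognizable barcode

    result.update(
        status="same_barcode" if hit5[0] == hit3[0] else "different_barcodes",
        barcode_5p=hit5[0], barcode_3p=hit3[0],
        mismatches_5p=hit5[1], mismatches_3p=hit3[1],
        insert_size=len(seq) - 2 * BARCODE_LEN,
    )
    return result
```

In practice I would feed this from the merged FASTQ, for example by looping over `SeqIO.parse(path, "fastq")` from Biopython and calling `classify_read(str(record.seq))` on each record. What I'm hoping for is an existing, tested tool that already does this properly.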