Disclaimer: I have approximately 5 weeks of python experience.
I want to pair my sequencing reads based on flow cell location, sort the sequencing reads into separate files based on 8 nucleotide barcodes, and then count mutations at a specific codon.
Using Biopython's SeqIO to parse the reads and then create a dictionary:
seqID= {}
for record in SeqIO.parse('Sample.fq', 'fastq'):
seqID[str(record.seq)]= record.id
Then paired the sample reads by creating a multi dictionary, where sequences found in the same location are assigned to the same value:
seqID_multi= {}
for k, v in seqID.items():
seqID_multi.setdefault(v, set()).add(k)
And then printing each item in fasta format, based on its specific barcode:
for k, v in seqID_multi.items():
for item in barcode_sequences:
# if the item is included in the value,
if item in str(v):
# append an output file with the key followed by the paired values
outputfile= open('%s.fasta'%barcode_names[barcode_sequences.index(item)], 'a')
# print the seqID to that file
print('>' + k, file=outputfile)
for v in seqID_multi[k]:
print(v, file=outputfile)
outputfile.close()
Finally, for just one of the barcodes, I tried to count the occurences of every possible codon at a specific site, based on a dictionary of codons, 'gencode':
list= []
for fasta in SeqIO.parse('K234_ns.fasta', 'fasta'):
location, seq= fasta.id, str(fasta.seq)
read= location + seq
for k, v in gencode.items():
if re.search('GGTGAT' + k + 'ACCGG(C|T)', read):
list.append(v)
print(Counter(list))
My output is as follows:
Counter({'K': 46826, 'R': 33260, 'G': 16681, 'V': 9701, 'L': 8478, 'S': 7530, 'A': 7303, 'T': 6098, 'C': 5076, 'E': 5063, 'W': 4792, 'P': 4117, 'N': 3798, 'I': 3512, 'M': 3475, 'D': 3240, 'Q': 3129, '_': 3031, 'Y': 2992, 'F': 2944, 'H': 2438})
The problem:
My naive libraries (like the one I just tallied up above) should be unbiased, with equal frequency of each AA. When my PI runs their program in Perl, they indeed get an unbiased distribution of AA frequency. What am I messing up here? Is it my code, or is my code fine and it must be my specific criteria?
Thanks.
Although it's a nice exercise, you should be able to use software dedicated to barcode splitting etc. cutadapt for instance can handle demultiplexing.