Question: Sorting and Counting Deep-Seq reads
gravatar for Allie J
20 months ago by
Allie J0
Allie J0 wrote:

Disclaimer: I have approximately 5 weeks of python experience.

I want to pair my sequencing reads based on flow cell location, sort the sequencing reads into separate files based on 8 nucleotide barcodes, and then count mutations at a specific codon.

Using Biopython's SeqIO to parse the reads and then create a dictionary:

seqID= {}
for record in SeqIO.parse('Sample.fq', 'fastq'):

Then paired the sample reads by creating a multi dictionary, where sequences found in the same location are assigned to the same value:

seqID_multi= {}
for k, v in seqID.items():
    seqID_multi.setdefault(v, set()).add(k)

And then printing each item in fasta format, based on its specific barcode:

for k, v in seqID_multi.items():
     for item in barcode_sequences:
        # if the item is included in the value,
        if item in str(v):
            # append an output file with the key followed by the paired values
            outputfile= open('%s.fasta'%barcode_names[barcode_sequences.index(item)], 'a')
            # print the seqID to that file
            print('>' + k, file=outputfile)
            for v in seqID_multi[k]:
                print(v, file=outputfile)

Finally, for just one of the barcodes, I tried to count the occurences of every possible codon at a specific site, based on a dictionary of codons, 'gencode':

list= []
for fasta in SeqIO.parse('K234_ns.fasta', 'fasta'):
    location, seq=, str(fasta.seq)
    read= location + seq
    for k, v in gencode.items():
        if'GGTGAT' + k + 'ACCGG(C|T)', read):

My output is as follows:

Counter({'K': 46826, 'R': 33260, 'G': 16681, 'V': 9701, 'L': 8478, 'S': 7530, 'A': 7303, 'T': 6098, 'C': 5076, 'E': 5063, 'W': 4792, 'P': 4117, 'N': 3798, 'I': 3512, 'M': 3475, 'D': 3240, 'Q': 3129, '_': 3031, 'Y': 2992, 'F': 2944, 'H': 2438})

The problem:

My naive libraries (like the one I just tallied up above) should be unbiased, with equal frequency of each AA. When my PI runs their program in Perl, they indeed get an unbiased distribution of AA frequency. What am I messing up here? Is it my code, or is my code fine and it must be my specific criteria?


ADD COMMENTlink modified 20 months ago • written 20 months ago by Allie J0

Although it's a nice exercise, you should be able to use software dedicated to barcode splitting etc. cutadapt for instance can handle demultiplexing.

ADD REPLYlink written 20 months ago by Asaf8.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1931 users visited in the last hour