Entering edit mode
                    4.9 years ago
        Apex92
        
    
        ▴
    
    320
    I have a big fasta file that I want to select reads if contain any substrings based on two dictionaries (in the structure below).
casp_fw1 ={'P_fw1':'CCAGGTAGTAGTACGTCTGTT','P_fw1_rc':'AACAGACGTACTACTACCTGG',
         'A_fw1':'CGGCCTCCTCCAATTAAAGAA','A_fw1_rc':'TTCTTTAATTGGAGGAGGCCG'}
casp_fw2 ={'P_fw2':'CGTGCTGCGAGAGTATTATCT','P_fw2_rc':'AGATAATACTCTCGCAGCACG',
            'A_fw2':'CGTGAACCGTTATTTGGGTAC','A_fw2_rc':'GTACCCAAATAACGGTTCACG'}
This is my code but I noticed that I miss some of the reads even though the have substrings from the two dictionaries. Could somebody please help to optimize the code? I afraid that 3 for loops cause the problem.
from Bio import SeqIO
count=0
with open('1_S1_L001_R1_001.fasta','r') as f:
    for i in SeqIO.parse(f,'fasta'):
        for k1,v1 in casp_fw1.items():
            for k2,v2 in casp_fw2.items():
                if v1 in i.seq and v2 in i.seq:
                    casp_fw1_pos = i.seq.find(v1) + 1
                    casp_fw2_pos = i.seq.find(v2) + 1
                    if casp_fw1_pos < casp_fw2_pos:
                        distance = casp_fw2_pos - casp_fw1_pos
                    if casp_fw1_pos > casp_fw2_pos:
                        distance = casp_fw1_pos - casp_fw2_pos
                    print("sample_1")
                    printi.id)
                    print(i.seq)
                    print(str(k1) + " : " + str(v1) +  " - The position of this fw 21nt is " + str(casp_fw1_pos))
                    print(str(k2) + " : " + str(v2) +  " - The position of this rv 21nt is " + str(casp_fw2_pos))
                    print("Distance is " + str(distance))
                    print('\n')
                    count+=1
    print("The total number of reads that have both 21nt protein-specific sequences is " + str(count))
                    
                
                
Curious if you were not able to get
bbduk.shto work as discussed in a prior thread you had posted.Hey genomax, I ran it but the output was not exactly as I was looking for - so did the mapping otherway around with making each fasta file as a separate reference and the custom reference file as the input - with this I was able to align each sample with the custom reference
Is understanding the position of the match necessary - it's not clear to me from your question? if you want to select sequences based on solely containing those subsequences you simply need to do
if subseq in sequence