Question: Select sequences if contain two specific substrings
0
gravatar for Apex92
8 weeks ago by
Apex9220
Apex9220 wrote:

I have a big fasta file that I want to select reads if contain any substrings based on two dictionaries (in the structure below).

casp_fw1 ={'P_fw1':'CCAGGTAGTAGTACGTCTGTT','P_fw1_rc':'AACAGACGTACTACTACCTGG',
         'A_fw1':'CGGCCTCCTCCAATTAAAGAA','A_fw1_rc':'TTCTTTAATTGGAGGAGGCCG'}

casp_fw2 ={'P_fw2':'CGTGCTGCGAGAGTATTATCT','P_fw2_rc':'AGATAATACTCTCGCAGCACG',
            'A_fw2':'CGTGAACCGTTATTTGGGTAC','A_fw2_rc':'GTACCCAAATAACGGTTCACG'}

This is my code but I noticed that I miss some of the reads even though the have substrings from the two dictionaries. Could somebody please help to optimize the code? I afraid that 3 for loops cause the problem.

from Bio import SeqIO

count=0
with open('1_S1_L001_R1_001.fasta','r') as f:
    for i in SeqIO.parse(f,'fasta'):
        for k1,v1 in casp_fw1.items():
            for k2,v2 in casp_fw2.items():
                if v1 in i.seq and v2 in i.seq:
                    casp_fw1_pos = i.seq.find(v1) + 1
                    casp_fw2_pos = i.seq.find(v2) + 1
                    if casp_fw1_pos < casp_fw2_pos:
                        distance = casp_fw2_pos - casp_fw1_pos
                    if casp_fw1_pos > casp_fw2_pos:
                        distance = casp_fw1_pos - casp_fw2_pos
                    print("sample_1")
                    printi.id)
                    print(i.seq)
                    print(str(k1) + " : " + str(v1) +  " - The position of this fw 21nt is " + str(casp_fw1_pos))
                    print(str(k2) + " : " + str(v2) +  " - The position of this rv 21nt is " + str(casp_fw2_pos))
                    print("Distance is " + str(distance))
                    print('\n')
                    count+=1
    print("The total number of reads that have both 21nt protein-specific sequences is " + str(count))
ADD COMMENTlink modified 23 days ago by Biostar ♦♦ 20 • written 8 weeks ago by Apex9220

Curious if you were not able to get bbduk.sh to work as discussed in a prior thread you had posted.

ADD REPLYlink written 8 weeks ago by GenoMax95k

Hey genomax, I ran it but the output was not exactly as I was looking for - so did the mapping otherway around with making each fasta file as a separate reference and the custom reference file as the input - with this I was able to align each sample with the custom reference

ADD REPLYlink written 8 weeks ago by Apex9220

Is understanding the position of the match necessary - it's not clear to me from your question? if you want to select sequences based on solely containing those subsequences you simply need to do if subseq in sequence

ADD REPLYlink written 8 weeks ago by Joe18k
0
gravatar for shenwei356
23 days ago by
shenwei3565.7k
China
shenwei3565.7k wrote:

@Biostar bot pushes this post to frontpage ~

It looks like a case of amplicon sequencing data, for retrieving amplicon from SE or merged PE reads. We did this a lot, so I wrote a tool seqkit amplicon (link for usage and examples) for these kind of operations.

  • It supports retrieving amplicon sequence (or specific region around it) via primer(s).
  • Mismatch is allowed when matching primers and sequences, but the mismatch position is not controled, e.g., 3' end mismatch is possible in PCR.
  • Output can be FASTA sequence or BED6 format.
  • Multiple pairs of primers are also supported.
ADD COMMENTlink modified 23 days ago • written 23 days ago by shenwei3565.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2546 users visited in the last hour
_