Question: How to find both primers on reads?
0
gravatar for xatabadich
9 months ago by
xatabadich0
xatabadich0 wrote:

Hello everyone!

I developed a little script to search for two primers in reads which are converted into .fa format. so the script is able to find both primers in one read using bbduk.

The problem is that the script is not as efficient as it could be. It takes a lot of time to process. According to the code below(Python) it searches for the first primer in reads that contain the primer, then deletes the initial file that contains all the reads, after that it searches for the second primer and gives and an empty files that do not contain both primers.

The question is: how can I improve the processing speed by changing the code or using a different approach. Or does something similar already exist as an example?

 #!/usr/bin/env python3
import csv
import os


#determine tool directories
bbduk = '/home/vladislav.shevtsov/miniconda3/envs/mlvapython3/bin/bbduk.sh'

#determine files directories
data_in_dir = '/home/vladislav.shevtsov/bbduk_search/reads'
data_out_dir = '/home/vladislav.shevtsov/bbduk_search/output/'

fastq_gz_row_list = sorted([os.path.join(data_in_dir, i) for i in os.listdir(data_in_dir) if i.endswith('.gz')])

def generate_bbduk_output_name(name, suffix = 'OUT'): #Generates the output name
    a, b = os.path.splitext(name)
    new_name = a + '_'+ suffix + b
    return new_name

doc='/home/vladislav.shevtsov/bbduk_search/Primers_Pseudo_7fix_run.txt' # Path to Primers.txt



bash_file = 'bbdu1.sh'
with open(bash_file, 'w') as f, open(doc,'r') as primer:
    file_reader=csv.reader(primer,delimiter='\t')
    f.write('#!/bin/bash \n')
    for lines in file_reader:

        for sample in fastq_gz_row_list:
            name = os.path.basename(sample)
            name, _ = os.path.splitext(name)
            out_sample = os.path.join(data_out_dir, name)
            out_sample = generate_bbduk_output_name(out_sample, lines[0])
            out_sample2 = generate_bbduk_output_name(out_sample)
            #print(out_sample)
            if not os.path.isfile(out_sample):                
              f.write(f'{bbduk} in={sample} outm={out_sample} literal={lines[1]} mm=f k=22 minlen=1 hdist=1 forbidn=t threads=18 \n') #searches primer 1 ... run_1
              f.write(f'{bbduk} in={out_sample} outm={out_sample2} literal={lines[2]} mm=f k=10 minlen=1 hdist=1 forbidn=t threads=18 \n') #searches primer 2 from previous run
              f.write(f'rm {out_sample} \n' ) #removes run_1 files
ADD COMMENTlink written 9 months ago by xatabadich0

Please a representative input an output example. People are typically relucant to go through multiple lines of your just tounderstand the problem. A good illustration always helps.

ADD REPLYlink written 9 months ago by ATpoint40k

Thank you for the advice. the input file for Primers.txt looks like this:

INPUT FILE 1

Ft-M1_3bp_83bp_3u TTAGCAGCCGCGATTACATCTATCAG TAATATTAAAATCATAGAATTTTAAAC

Ft-M2_6bp_90bp_4u TTTAATTCTTTTCATAATAAATTTTGT TTTCAATAACATTCATTTTAAAAAGA

.........

INPUT FILE 2

The script also needs reads.fastq.gz as an input in input folder.

OUTPUT FILES

The output files contain all the reads with both primes in a single read

SRR942045.1900911 1900911/1 AAATTCAGTAATTGATCTGATCTTACCAGTAACTTTATCACCTGGTTTGTAGTTTTTCTCAAACTCGCTCCAAGGATTAGGTCTGCATTGCTTGATACCAAGAGATATTCTGTGGTTATCAGCATCTAGTTCAAGTACGATAACTTCAACTTCTTGACCAATAGATACAGCTTTATGAGGGTTAACGTTTTTGTTTGTCCAATCCATTTCAGCTGTATGAACAAGACCTTCGACACCTTCTTTTAACTTAA + ?????BBB?BBBBBBBF;??CC>CFFHFCAEEGHHHHFFAEFFFHHFHEHHHGHHHFFF=??EFDEE@DDC=CE;,CFGGHHHHHFGFHGHBFGDHHEHHFFF?FDGHDFFHBGF7<dgccf?ccfh.@7ddfhdfdfee+=@?f,=ddee;b,5,=;be;?b,?;?;;cbc?ee?,;bb83,*0:8:a*))'00*:aceeceeaca??cc:**: *0*0="" :**="" ??*="" 0**).')*="" ******0:c*0="" @srr942045.1903979="" 1903979="" 1="" atcttgtggtgagcaacaaaagctacgcttgtgtaaaatatttaccaaagcttatgacttgattttgcttgatgaggcaacttcaaatatagatgcaaaatcagagaaaaaatctatcaactacttaaggataaaggtctagcttatatctcagttagtcataatgatagagttaaagcatatcatgataaagttattgagcttactagtgattgactagaagatgaacgataatactcgcaaactaccca="" +="" ?????bbbdbddddddffffffhfcfcefhhdgddfdgfgbghhhhfhhffhhhg@dhbbghhfhhhhhhhhhhhhh@&gt;cfhhhhhcf="/A7?GFG/C?FDGHDFFF=C-CEFFHHCF?DFFFHHFFF.BDFFFFFD==DD=DBBDEBDDBEFFC=B,5=ABEBCEBAEEEEEFABEE=5ACAAAEFFF:CEE?&lt;em">11::::A?AEA::CEEFEFEEAAE:::A?C888??1?::88'85000 @SRR942045.1908205 1908205/1 ATGTAACTACTTGACCGCCAGTATATGCTTGACCTTGACTCCATGCCTCACCTGAAACAAAAGGTTTCTTCCATGCGCCACCATCAGCCGGATTATTACCTTGTGTCCACCATTGAGCTTCATAAGTCACACCATTAACAATAACTTTATCACCCTTATTGTAGAC + ??A??BBBEEDDDDDDGGGGGGIIIIIIHIHIIIIIHIIIIIIIHIIIIIIHIHIIIIIIIIIHHHGGHIHIIIHIHHEEHHHGHFHGHHHCHHHHHHHHHHHHFHHHHHHHHGGGGGGGGFGGGGGGGGGGGGEGGGGGEGGGGGGGGGGGEGGGGEEGEEGGGG @SRR942045.1917232 1917232/1 TCTTCTAGTCAATCACTAGTAAGCTCAATAACTTTATCATGATATGCTTTAACTCTATCATTATGACTAACTGAGATATAAGCTAGACCTTTATCCTTAAGTAGTTGATAGATTTTTTCTCTGATTTTGCATCTATATTTGAAGTTGCCTCATCAAGCAAAATCAAGTCATAAGCTTTGGTAAATATTTTACACAAGCGTAGCT + AAAAABBBDDDEDEDDGGGGGGIIHHIIIIIIIIIIHIIIHIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIHHIIHIIIIIDGHIIIHIIHHHHFHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGEEGGGGEEGBEGGGEGGGGGGGGGCEEGCEEEAACACEGG8CC @SRR942045.1920897 1920897/1 GTTGCCATACCCTCGATAACATCACCATCACAAGGTATCAAAGTATTTGTCTCAACAATAACTGTATCACCTATTTTAAGGCTTTCAGCATCAACTTTTGTGACACTACCATCTTCTTCTATTCTTAAGGCAAAGAGTTTAGATTTAGCAGCTTTAAGAGTATCTGCTTGAGCCTTACCTCTTCCTTCAGCGATACCTTCAGCAAAGTTTGCAAATAAAATTGTAAACCATAACCATATAACTACCTG + ?????BBBDDDD9A@ACCEFFFHHFHFFHHIIIFH9AEGHHIBGFHHH@FFEGHFFHFFFGGHHHHIHHHHDEFFGHHIDGHHIFHGHHHBGCGHHDFHFFHIIIIIIHFHHHGHHHFHFHHBFHFHHHHHFHHHFFBBDFFFDDFFFEEEEDEEFFFFEDEEECEBEEEEFBEECBC,ACEEBB=CAACEEDFA;:C:A:AAEEE0:CEE:?:A?C:::?::?:?8::C??:C?CCE:?:A @SRR942045.1933571 1933571/1 GATCTTACCAGTAACTTTATCACCTGGTTTGTAGTTTTTCTCAAACTCGCTCCAAGGATTAGGTCTGCATTGCTTGATACCAAGAGATATTCTGTGGTTATCAGCATCTAGTTCAAGTACGATAACTTCAACTTCTTGACCAATAGATACAGCTTTATGAGGGTTAACGTTTTTGTTTGTCCAATCCATTTCAGATGTATGAACAAGACCTTCGATACCTTCTTTTAACTTAACAAAACAACCGTAGTCAG +

ADD REPLYlink modified 9 months ago • written 9 months ago by xatabadich0

Depending on how many of these sequence you have in your primer file and the size of your dataset this is how long it is going to take (assuming your python code is not adding any additional overhead, not sure why you are not running bbduk.sh directly, it will take a file of fasta format sequences as input, doing it the way you are it is running each sequence individually so the files have to be read over again and again). You could also try giving bbduk.sh some additional memory by -Xmx10g to see if that helps.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2090 users visited in the last hour