Question: look for sequences containing a specific motif
gravatar for Florian BERNARD
13 months ago by
Florian BERNARD20 wrote:

Hi everyone,

I am sure this is rather easy to do but I'm no bioinformatician and I've been struggling with this for the last few days so here I am finally asking for your help/advices.

I have generated RNA-seq data with the minion from nanopore and I'm now trying to analyze them. From the fastq file I obtained, i performed alignment of my reads with minimap2 onto my genome of reference and obtained BAM/SAM files. Now, I am trying to determine the number of reads aligned on the genome that contains a specific motif of 24bp (normally located at the 5' end of all of my RNAs). I was able to write a very basic python script to search for the EXACT sequence (or the reverse complement) but could only get something like 50k hits among 1.5millions reads (stored in a multi-fasta file).

    from Bio.Seq import Seq
from Bio import SeqIO

SEQ = Seq("XXXXXXXXXXXXXXXX") #Sequence to search for
SEQrev = SEQ.reverse_complement() 

reads = [] 

for rec in SeqIO.parse("reads.fasta","fasta"): 
    if rec.seq.count(SEQ) == 1:              
    if rec.seq.count(SEQrev) == 1:

print("there is " + str(len(reads)) + " sequences containing SEQ")

Knowing the minion is prone to errors I would like to find a way to search for my sequence allowing mismatches. I've been looking at others posts and trying things like running a local blast (my multi-fasta file containing all my reads being my database) or using ClustalW but none seem to have worked so far (or I failed to use them correctly).

Could you give me any advice on how to do that easily or guide me toward a script that could allow me to do it ? If you tell me to insist with blast or clustal then I will really dig into it in order to fix my mistakes or find what is not working but I'd rather have your advice before starting spending time on something that could end up not being worth my time !

I am all learning by myself trough trials and errors (and a lot of internet posts) so a little push in the right direction would be very welcomed ! Thanks in advance.

ADD COMMENTlink modified 13 months ago by Alex Reynolds27k • written 13 months ago by Florian BERNARD20
gravatar for harold.smith.tarheel
13 months ago by
United States
harold.smith.tarheel4.3k wrote:

An easy method would be to use BBMap's kmer counting functionality, plus hamming distance to allow for errors: in=YOUR.FASTQ outm=MATCHED.FASTQ \
ADD COMMENTlink modified 13 months ago • written 13 months ago by harold.smith.tarheel4.3k

Thanks for your answer, seems like it's the perfect tool for that. I have one small problem though. I tried running this : in=reads.fasta outm=matches3.fasta literal=CTCAAACTTGGGTAATTAAACC k=22 hdist=3

And I also tried running this, with the reverse complement sequence : in=reads.fasta outm=matches1.fasta literal=GGTTTAATTACCCAAGTTTGAG k=22 hdist=3

And I also tried specifying both sequences at the same time with literal=GGTTTAATTACCCAAGTTTGAG,CTCAAACTTGGGTAATTAAACC but the 3 scripts gave me back the exact same number of sequences .. which I find very odd. Any suggestions or ideas as to why is that ?

ADD REPLYlink written 13 months ago by Florian BERNARD20

From the user guide:

You can specify whether or not BBDuk looks for the reverse-complement of the reference sequences as well as the forward sequence with the flag “rcomp=t” or “rcomp=f”; by default it looks for both.

ADD REPLYlink written 13 months ago by harold.smith.tarheel4.3k

I eventually figured that's what it was doing but I couldn't get why ! Thanks a lot, I had miss that part. Great tool you provided me here, thanks again !

ADD REPLYlink written 13 months ago by Florian BERNARD20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1621 users visited in the last hour