I am sure this is rather easy to do but I'm no bioinformatician and I've been struggling with this for the last few days so here I am finally asking for your help/advices.
I have generated RNA-seq data with the minion from nanopore and I'm now trying to analyze them. From the fastq file I obtained, i performed alignment of my reads with minimap2 onto my genome of reference and obtained BAM/SAM files. Now, I am trying to determine the number of reads aligned on the genome that contains a specific motif of 24bp (normally located at the 5' end of all of my RNAs). I was able to write a very basic python script to search for the EXACT sequence (or the reverse complement) but could only get something like 50k hits among 1.5millions reads (stored in a multi-fasta file).
from Bio.Seq import Seq from Bio import SeqIO SEQ = Seq("XXXXXXXXXXXXXXXX") #Sequence to search for SEQrev = SEQ.reverse_complement() reads =  for rec in SeqIO.parse("reads.fasta","fasta"): if rec.seq.count(SEQ) == 1: reads.append(rec) if rec.seq.count(SEQrev) == 1: reads.append(rec) print("there is " + str(len(reads)) + " sequences containing SEQ")
Knowing the minion is prone to errors I would like to find a way to search for my sequence allowing mismatches. I've been looking at others posts and trying things like running a local blast (my multi-fasta file containing all my reads being my database) or using ClustalW but none seem to have worked so far (or I failed to use them correctly).
Could you give me any advice on how to do that easily or guide me toward a script that could allow me to do it ? If you tell me to insist with blast or clustal then I will really dig into it in order to fix my mistakes or find what is not working but I'd rather have your advice before starting spending time on something that could end up not being worth my time !
I am all learning by myself trough trials and errors (and a lot of internet posts) so a little push in the right direction would be very welcomed ! Thanks in advance.