Question: Mapping fasta/fastq reads against a long fasta sequence using grep
0
gravatar for Seq225
6 weeks ago by
Seq22590
Seq22590 wrote:

I have read file (~25Nt; fasta or fastq formate). I want to extract the reads that map against a long gene sequences (2Kb, fasta formate). I want to use grep. I can use bowtie or other tools. But, I want to use grep. Anybody can help me with the command, please? I want want to allow one or two mismatches per read.

Thanks!!

ADD COMMENTlink written 6 weeks ago by Seq22590
1

What have you tried so far then? Use reads to search since your reference is longer.

ADD REPLYlink written 6 weeks ago by genomax85k

I actually have not tried anything yet. My target fasta is a long sequence ( I actually have several sequences of different legths, I want to search them separately). The read file has thousands of reads (or tens of thousands) Thanks!

ADD REPLYlink written 6 weeks ago by Seq22590
3

Then you should consider using a proper aligner. bbmap.sh can use both fasta and fastq reads to do the search.

ADD REPLYlink written 6 weeks ago by genomax85k

Thanks. I will try bbmap.sh.

ADD REPLYlink written 6 weeks ago by Seq22590
1

If your target sequence is a Fasta file, you will have new-lines every 80-100 nt, that will break your grep regex, also allowing mismatches will be not good in grep (where do you put them?)

You can use the fasta aligner to properly search without indexing

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by JC10k

Thanks. If there is no line break in the fasta sequence? Suppose it is a long line. I can also split it into ~80 Nt and then want to find the reads that match.

ADD REPLYlink written 6 weeks ago by Seq22590
1

You can linearize the fasta easily : Linearize fasta files

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by genomax85k
1

even with the sequence in a single line, you will have the problem of the mismatches, grep is great when you want an exact hit, but if you want to allow 1-2 mismatches that becomes complex, for example, to search ACGT with up to2 mismatches, you need to search:

ACGT
.CGT
A.CGT
AC.T
ACG.
..GT
A..GT
A.C.
A..T
.C.T
AC..
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by JC10k

Thanks. I think then my option would be to do only perfect matches.

ADD REPLYlink written 6 weeks ago by Seq22590
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1896 users visited in the last hour