Question

Software To Search Nucleotide Sequence Data By Regular Expressions

12.6 years ago

User 4391 ▴ 100

Hi,

I want to find where a given nucleotide pattern occurs in a set of reference sequences, and I also want to be able to search with wildcards.

For example:

Sequence data     : AGCTAGCTAGGCTAGCGGCTTTGGCGCCTAGCCAGA
Search nucleotides: TAGC
Or with wildcards : TA*C, TA#C, TANC
Result            : the positions in the reference sequence where the search nucleotides or the wildcard pattern occur.

I only know of one program, "genome traveler". Does anyone know of other software for this? Thank you so much.

Updated 6.7 years ago by Vladimir Mikryukov ▴ 20 • written 12.6 years ago by User 4391 ▴ 100

Ram · Answer 1 · 2011-10-03

This is a basic regular expression search, and many programming languages have a regex implementation: Java, Perl, Python, R, awk... The regular expression for the wildcard pattern you describe is: TA.C

There are also ready-to-use tools in EMBOSS for this, which are probably better and more reliable than a self-made solution: dreg and fuzznuc (fuzzy search). See http://manuals.bioinformatics.ucr.edu/home/emboss#searching
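If you do want to try the self-made route first, here is a minimal Python sketch of the TA.C search on the sequence from the question. The function name `find_matches` is just something made up for this example, and wrapping the pattern in a lookahead is one possible way to also report overlapping hits. It only looks at the forward strand.

```python
import re

def find_matches(sequence, pattern):
    """Return (1-based start, matched text) for every hit, overlapping ones included."""
    # Wrapping the pattern in a lookahead makes the engine try every start
    # position, so overlapping hits are reported as well.
    lookahead = re.compile(f"(?=({pattern}))", re.IGNORECASE)
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]

SEQ = "AGCTAGCTAGGCTAGCGGCTTTGGCGCCTAGCCAGA"

for start, text in find_matches(SEQ, "TA.C"):
    print(start, text)
# 4 TAGC
# 13 TAGC
# 29 TAGC
```

The `.` matches any single character, which covers the TA*C / TA#C / TANC wildcards from the question as long as the wildcard stands for exactly one base.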

Ram · Answer 2 · 2011-10-03

12.6 years ago

Pierre Lindenbaum 161k

The EMBOSS package contains a tool named dreg:

This searches for matches of a regular expression to a nucleic acid sequence.

A regular expression is a way of specifying an ambiguous pattern to search for. Regular expressions are commonly used in some computer programming languages and may be more familiar to some users than to others.

Updated 4.4 years ago by Ram 43k • written 12.6 years ago by Pierre Lindenbaum 161k

score 1 · Answer 3 · 2011-10-03

12.6 years ago

Martin A Hansen 3.0k

Have a look at scan_for_matches - the best pattern scanner out there.

Do you have evidence that it is the best scanner out there, or is it just your opinion?

12.6 years ago by Michael 54k

SFM (scan_for_matches) is truly awesome IMHO. It is almost as fast as agrep and faster than nrgrep, but SFM is much more flexible. I have used it for many years.

12.6 years ago by Martin A Hansen 3.0k

score 0 · Answer 4 · 2017-08-16

If your data are in FASTA or FASTQ format (plain or gzipped), you can search nucleotide sequences by regular expression with seqkit.

Here is an example:

## Dummy data
cat > input.fasta <<'EOT'
>seq1
AAAAAAAAAAAAAAAA
>seq2
AAGCGAATCGTGTGTG
>seq3
AAGCGAATCGAATGTG
>seq4
AAGCGAATCCAATGTG
EOT

# regex
seqkit grep -s -r -p "(G|C)A?T*A" input.fasta

# IUPAC degenerate nucleotide codes
seqkit grep -s -d -i -p RYSAA input.fasta

Meaning of the flags:

-s: search for the pattern in sequences

-r: patterns are regular expressions

-d: pattern/motif contains degenerate bases

-i: ignore case

-p: search pattern (multiple values supported)
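To make the degenerate-base idea a bit more concrete, here is a rough Python sketch that expands IUPAC codes into regex character classes and runs the RYSAA motif against the dummy records above. It only illustrates the concept and is not how seqkit implements `-d`. The names `IUPAC`, `iupac_to_regex` and `matching_ids` are invented for this example, and only the forward strand is searched.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate a degenerate motif such as RYSAA into a plain regex string."""
    return "".join(IUPAC[base] for base in motif.upper())

def matching_ids(records, motif):
    """Return the IDs of records whose sequence contains the motif (case-insensitive)."""
    regex = re.compile(iupac_to_regex(motif), re.IGNORECASE)
    return [seq_id for seq_id, seq in records.items() if regex.search(seq)]

# Same dummy records as in input.fasta above.
RECORDS = {
    "seq1": "AAAAAAAAAAAAAAAA",
    "seq2": "AAGCGAATCGTGTGTG",
    "seq3": "AAGCGAATCGAATGTG",
    "seq4": "AAGCGAATCCAATGTG",
}

print(iupac_to_regex("RYSAA"))          # [AG][CT][CG]AA
print(matching_ids(RECORDS, "RYSAA"))   # ['seq2', 'seq3', 'seq4']
```

Here seq1 is skipped because it has no C or T to satisfy Y, while the other three all contain GCGAA.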