Question: Finding a sequence in a fasta file
1
gravatar for mbk0asis
5.3 years ago by
mbk0asis460
Korea, Republic Of
mbk0asis460 wrote:

Hi, all!

Yesterday, I downloaded a human lncRNA fasta file from Genecode to find out if my presumably-novel-lncRNA-sequence exists in the database. I tried to find the sequence using simple linux commands like 'grep', and it seemed like it didn't exist. I don't want to jump into the conclusion, yet.... SO what I'm really looking for is any kinds of tools (e.g. scripts, UNIX commands, websites, and etc.) that can confirm my previous results. 

FYI, I'm not capable to make my own scripts.

Thank you!

alignment sequence • 11k views
ADD COMMENTlink modified 5.3 years ago by PoGibas4.8k • written 5.3 years ago by mbk0asis460
3

Does the fasta file have linebreaks? If yes, that might be the reason for your grep not working. You could try

cat your.fasta | tr -d "\n" | grep "TAGACAT"

Also, don't forget to search the reverse-complement or alternatively just blast your seq against the fasta.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by 5heikki8.5k

This will not report all instances of match.

This is a file called dummy.fa:

>123
agggaggttccatt
gaccttaggctaat
>345
gggagattcgtcta
tagatcagatcacg

cat dummyfile.fa | tr -d "\n"

will give you this:

>123agggaggttccattgaccttaggctaat>345gggagattcgtctatagatcagatcacg

now when you grep a pattern it will report only one line, which means you will not know how many sequences really had this motif. 

A small modification will fix this:

cat dummy.fa | tr -d "\n" | tr ">" "\n" | grep <regex>

 

but even this will not report multiple matches in the same sequence.

@OP what is that you really want to find

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Bharat Iyengar270

If you're using grep for sequences you should add "grep -i " 

ADD REPLYlink written 5.3 years ago by PoGibas4.8k

or run blast if you want to allow mismatches.

ADD REPLYlink written 5.3 years ago by Bharat Iyengar270
2
gravatar for hpmcwill
5.3 years ago by
hpmcwill1.1k
United Kingdom
hpmcwill1.1k wrote:

I am guessing that you are trying to find a fixed sequence string in the set of sequences you have downloaded?

If that is the case you might find a couple of the pattern matching tools in EMBOSS helpful:

  • dreg: Regular expression search of nucleotide sequence(s)
  • fuzznuc: Search for patterns in nucleotide sequences

As with all the EMBOSS programs, these understand various sequence formats, including the fasta sequence format you downloaded the data in.

 

ADD COMMENTlink written 5.3 years ago by hpmcwill1.1k

dreg definitely took care of me. Thank you.

ADD REPLYlink written 2.3 years ago by a.aiezza30
2
gravatar for pld
5.3 years ago by
pld4.8k
United States
pld4.8k wrote:

Just use BLAST or even BLAT to search your sequence against the know lncRNAs. All of the gencode data is mapped into ensembl. Ensembl has BLAST and BLAT support, so you can just go to their webpage to do this.

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by pld4.8k
2
gravatar for dariober
5.3 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

I wrote a python program fastaRegexFinder.py to look for regular expressions in fasta files. The output is in bed format, so easy to parse with other tools (unix commands, bedtools etc.). See if it's any useful...

However, if your pattern (sequence) is long enough you might use an aligner like BLAST as joe.cornish826 suggested.

Example of fastaRegexFinder:

echo '>mychr' > /tmp/mychr.fa
echo 'GTAAACACNNNNNGTAAACAANNNNTTGTTTAC' >> /tmp/mychr.fa
   
fastaRegexFinder.py -q -f /tmp/mychr.fa -r '[G|A]TAAA[C|T]AA?'

Output

mychr    0    7    mychr_0_7_for    7    +    GTAAACA
mychr    13    21    mychr_13_21_for    8    +    GTAAACAA
mychr    25    33    mychr_25_33_rev    8    -    TTGTTTAC

 

Link to fastaRegexFinder.py here

ADD COMMENTlink modified 2.7 years ago • written 5.3 years ago by dariober10k

Your script was very useful. Thank you very much! But it takes a bit of time when dealing with larger genomes.

Edit:

Checking a 90bp sequence in both forward and reverse of 700 Mb genome, took 1 min 48 sec.

Only forward strand took 15 sec.

 

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Prakki Rama2.3k

Hi @dariober, is the fastaRegexFinder.py available to download for free? Thanks!

ADD REPLYlink written 2.7 years ago by varsha61980

Hi- Yes it's free and I have updated the download link in my answer.

ADD REPLYlink written 2.7 years ago by dariober10k
0
gravatar for swbarnes2
5.3 years ago by
swbarnes26.5k
United States
swbarnes26.5k wrote:

Another possibility, if you have the software lying around...short read aligners like bwa and Bowtie are excellent at quickly finding matches to a lot of short sequences in large genomes.  Though it may take a bit of work to get out the positions of every hit, if there are a lot of them.

ADD COMMENTlink written 5.3 years ago by swbarnes26.5k
0
gravatar for PoGibas
5.3 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

mbk0asis, as joe.cornish826 said, you have to use blast. The question you're asking is "if my presumably-novel-lncRNA-sequence exists in the database". I want to emphasize that you're looking for lncRNA, not just sequence.   

My suggestion would be:  use blast in NONCODE, that would be a good start. Have in mind that there are much more in the lncRNome than Gencode. See my previous answer about Retrieving The Sequences Of The Human Snorna, Lncrna, Etc....  

Also wanted to add that lncRNAs are not evolutionary conserved. Often it's hard to get signal using blast, so using grep would take you nowhere. 

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by PoGibas4.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2017 users visited in the last hour