Question

Pattren counter in fasta file

0

Entering edit mode

8.1 years ago

adityagt15 • 0

I am trying to get the count of the matching pattren in the fasta file. i am having the fasta files which contains 57k sequences. i want to pull out the count of my matching pattern sequence. And required the starting position of the pattren Input file:

chr1 ATTAGCAGATGTGACGTCGATGTCAGATTG

chr2 TGAGCTGCAGATCGTAGATGATTCTGCAGGAACCT

chr3 TCTTTCAGATGCCTCTGCAGATTC

Searching pattern "CAGAT"

Output Required:

chr Count P1 P2

chr1 -- 2 -- 6 -- 25

chr2 -- 1 -- 8

chr3 -- 2 -- 6 -- 19

Thanks in advance

counter perl awk Regular Expression fasta • 2.3k views

ADD COMMENT • link updated 8.1 years ago by cpad0112 21k • written 8.1 years ago by adityagt15 • 0

score 3 · Answer 1 · 2017-05-30

You can also try SeqKit.

$ seqkit locate -p CAGAT seqs.fa | column -t
seqID  patternName  pattern  strand  start  end  matched
chr1   CAGAT        CAGAT    +       6      10   CAGAT
chr1   CAGAT        CAGAT    +       24     28   CAGAT
chr2   CAGAT        CAGAT    +       8      12   CAGAT
chr3   CAGAT        CAGAT    +       6      10   CAGAT
chr3   CAGAT        CAGAT    +       18     22   CAGAT

$ seqkit locate -p CAGAT seqs.fa --bed
chr1    5       10      CAGAT   0       +
chr1    23      28      CAGAT   0       +
chr2    7       12      CAGAT   0       +
chr3    5       10      CAGAT   0       +
chr3    17      22      CAGAT   0       +

$ seqkit locate -p CAGAT seqs.fa --gtf
chr1    SeqKit  location        6       10      0       +       .       gene_id "CAGAT"; 
chr1    SeqKit  location        24      28      0       +       .       gene_id "CAGAT"; 
chr2    SeqKit  location        8       12      0       +       .       gene_id "CAGAT"; 
chr3    SeqKit  location        6       10      0       +       .       gene_id "CAGAT"; 
chr3    SeqKit  location        18      22      0       +       .       gene_id "CAGAT";

Count of every sequences with help of csvtk:

$ seqkit locate -p CAGAT seqs.fa | csvtk freq -t -f seqID -k
seqID   frequency
chr1    2
chr2    1
chr3    2

# multiple search patterns
$ seqkit locate -p CAGAT -p TTC seqs.fa | csvtk freq -t -f seqID,patternName -k | column -t
seqID  patternName  frequency
chr1   CAGAT        2
chr2   CAGAT        1
chr2   TTC          2
chr3   CAGAT        2
chr3   TTC          2

score 1 · Answer 2 · 2017-05-30

1

Entering edit mode

8.1 years ago

dariober 15k

If interested, I wrote a program for this kind of searches, it's here fastaRegexFinder

ADD COMMENT • link 8.1 years ago by dariober 15k

0

Entering edit mode

Thanks for the response, The link is not opening sir...

ADD REPLY • link 8.1 years ago by adityagt15 • 0

0

Entering edit mode

Ooops, sorry, try now

ADD REPLY • link 8.1 years ago by dariober 15k

0

Entering edit mode

is this the way i have to give input

python code.py -f singleline.fasta -m 'CTGCAG'

ADD REPLY • link 8.1 years ago by adityagt15 • 0

0

Entering edit mode

It should be

python /path/to/fastaRegexFinder.py -f seq.fasta -r 'CAGAT'

It should be fairly well documented how to use it.

ADD REPLY • link 8.1 years ago by dariober 15k

0

Entering edit mode

Thanks for your response , its working..

ADD REPLY • link 8.1 years ago by adityagt15 • 0

score 1 · Answer 3 · 2017-06-02

1

Entering edit mode

8.1 years ago

cpad0112 21k

for R coders:

my fasta file:

>chr1
ATTAGCAGATGTGACGTCGATGTCAGATTG
>chr2
TGAGCTGCAGATCGTAGATGATTCTGCAGGAACCT
>chr3
TCTTTCAGATGCCTCTGCAGATTC

R code

library(Biostrings)
data=readDNAStringSet("~/Desktop/test.fa", format = "fasta")
query=DNAString("CAGAT")
result=as.data.frame(vmatchPattern(query,data))

result

> result
  group group_name start end width
1     1       <NA>     6  10     5
2     1       <NA>    24  28     5
3     2       <NA>     8  12     5
4     3       <NA>     6  10     5
5     3       <NA>    18  22     5
>

ADD COMMENT • link 8.1 years ago by cpad0112 21k

0

Entering edit mode

How do we figure out what patterns are represented in that table? You are using CAGAT as the original question but it is not obvious in the results table in present form.

ADD REPLY • link 8.1 years ago by GenoMax 152k

1

Entering edit mode

library(Biostrings)
data=readDNAStringSet("~/Desktop/test.fa", format = "fasta")
query=DNAString("CAGAT")
result=cbind(pattern=as.character(query),as.data.frame(vmatchPattern(query,data)))

> result
  pattern group group_name start end width
1   CAGAT     1       <NA>     6  10     5
2   CAGAT     1       <NA>    24  28     5
3   CAGAT     2       <NA>     8  12     5
4   CAGAT     3       <NA>     6  10     5
5   CAGAT     3       <NA>    18  22     5

*Still doesn't work multiple patterns.

ADD REPLY • link 8.1 years ago by cpad0112 21k

0

Entering edit mode

Would be better to update the code in the original answer by editing that since that is what people will try to use. This looks better. Can you fix the group names so the actual sequences the pattern is coming from are identified?

ADD REPLY • link 8.1 years ago by GenoMax 152k