Question: Pattren counter in fasta file
0
gravatar for adityagt15
3.3 years ago by
adityagt150
adityagt150 wrote:

I am trying to get the count of the matching pattren in the fasta file. i am having the fasta files which contains 57k sequences. i want to pull out the count of my matching pattern sequence. And required the starting position of the pattren Input file:

chr1 ATTAGCAGATGTGACGTCGATGTCAGATTG

chr2 TGAGCTGCAGATCGTAGATGATTCTGCAGGAACCT

chr3 TCTTTCAGATGCCTCTGCAGATTC

Searching pattern "CAGAT"

Output Required:

chr Count P1 P2

chr1 -- 2 -- 6 -- 25

chr2 -- 1 -- 8

chr3 -- 2 -- 6 -- 19

Thanks in advance

ADD COMMENTlink modified 3.3 years ago by cpad011214k • written 3.3 years ago by adityagt150
3
gravatar for shenwei356
3.3 years ago by
shenwei3565.3k
China
shenwei3565.3k wrote:

You can also try SeqKit.

$ seqkit locate -p CAGAT seqs.fa | column -t
seqID  patternName  pattern  strand  start  end  matched
chr1   CAGAT        CAGAT    +       6      10   CAGAT
chr1   CAGAT        CAGAT    +       24     28   CAGAT
chr2   CAGAT        CAGAT    +       8      12   CAGAT
chr3   CAGAT        CAGAT    +       6      10   CAGAT
chr3   CAGAT        CAGAT    +       18     22   CAGAT

$ seqkit locate -p CAGAT seqs.fa --bed
chr1    5       10      CAGAT   0       +
chr1    23      28      CAGAT   0       +
chr2    7       12      CAGAT   0       +
chr3    5       10      CAGAT   0       +
chr3    17      22      CAGAT   0       +

$ seqkit locate -p CAGAT seqs.fa --gtf
chr1    SeqKit  location        6       10      0       +       .       gene_id "CAGAT"; 
chr1    SeqKit  location        24      28      0       +       .       gene_id "CAGAT"; 
chr2    SeqKit  location        8       12      0       +       .       gene_id "CAGAT"; 
chr3    SeqKit  location        6       10      0       +       .       gene_id "CAGAT"; 
chr3    SeqKit  location        18      22      0       +       .       gene_id "CAGAT";

Count of every sequences with help of csvtk:

$ seqkit locate -p CAGAT seqs.fa | csvtk freq -t -f seqID -k
seqID   frequency
chr1    2
chr2    1
chr3    2

# multiple search patterns
$ seqkit locate -p CAGAT -p TTC seqs.fa | csvtk freq -t -f seqID,patternName -k | column -t
seqID  patternName  frequency
chr1   CAGAT        2
chr2   CAGAT        1
chr2   TTC          2
chr3   CAGAT        2
chr3   TTC          2
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by shenwei3565.3k
1
gravatar for dariober
3.3 years ago by
dariober11k
WCIP | Glasgow | UK
dariober11k wrote:

If interested, I wrote a program for this kind of searches, it's here fastaRegexFinder

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by dariober11k

Thanks for the response, The link is not opening sir...

ADD REPLYlink written 3.3 years ago by adityagt150

Ooops, sorry, try now

ADD REPLYlink written 3.3 years ago by dariober11k

is this the way i have to give input

python code.py -f singleline.fasta -m 'CTGCAG'

ADD REPLYlink written 3.3 years ago by adityagt150

It should be

python /path/to/fastaRegexFinder.py -f seq.fasta -r 'CAGAT'

It should be fairly well documented how to use it.

ADD REPLYlink written 3.3 years ago by dariober11k

Thanks for your response , its working..

ADD REPLYlink written 3.3 years ago by adityagt150
1
gravatar for cpad0112
3.3 years ago by
cpad011214k
India
cpad011214k wrote:

for R coders:

my fasta file:

>chr1
ATTAGCAGATGTGACGTCGATGTCAGATTG
>chr2
TGAGCTGCAGATCGTAGATGATTCTGCAGGAACCT
>chr3
TCTTTCAGATGCCTCTGCAGATTC

R code

library(Biostrings)
data=readDNAStringSet("~/Desktop/test.fa", format = "fasta")
query=DNAString("CAGAT")
result=as.data.frame(vmatchPattern(query,data))

result

> result
  group group_name start end width
1     1       <NA>     6  10     5
2     1       <NA>    24  28     5
3     2       <NA>     8  12     5
4     3       <NA>     6  10     5
5     3       <NA>    18  22     5
>
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by cpad011214k

How do we figure out what patterns are represented in that table? You are using CAGAT as the original question but it is not obvious in the results table in present form.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by genomax89k
1
library(Biostrings)
data=readDNAStringSet("~/Desktop/test.fa", format = "fasta")
query=DNAString("CAGAT")
result=cbind(pattern=as.character(query),as.data.frame(vmatchPattern(query,data)))

> result
  pattern group group_name start end width
1   CAGAT     1       <NA>     6  10     5
2   CAGAT     1       <NA>    24  28     5
3   CAGAT     2       <NA>     8  12     5
4   CAGAT     3       <NA>     6  10     5
5   CAGAT     3       <NA>    18  22     5

*Still doesn't work multiple patterns.

ADD REPLYlink written 3.3 years ago by cpad011214k

Would be better to update the code in the original answer by editing that since that is what people will try to use. This looks better. Can you fix the group names so the actual sequences the pattern is coming from are identified?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by genomax89k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1598 users visited in the last hour