Question: Extracting the coordinates for a specific base from multifasta
0
gravatar for Louis Kok
11 months ago by
Louis Kok10
Singapore
Louis Kok10 wrote:

Hi, I would like to extract coordinates for a specific base, let say all "A" base from multi-fasta. The way I I used to extract from a single sequence (without header) is as below:

grep -aob "A"  input.seq | sed 's/:/ /g' | awk '{print $1+1}'

Is there an easier way to deal with large and multi fasta? Thanks.

sequence fasta • 364 views
ADD COMMENTlink modified 11 months ago by Joe14k • written 11 months ago by Louis Kok10
3
gravatar for Pierre Lindenbaum
11 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

linearize, read each pair of name/sequence; print the name ; echo the sequence, convert each base to a new line with grep -o ., get the line number with nl, grep the lines ending with 'A'.

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  |\
while IFS=$'\t' read A B; do echo $A && echo $B | grep -o . | nl | grep 'A$' ;done

>seq1
     5  A
    12  A
    14  A
    18  A
    23  A
    27  A
    30  A
    32  A
    44  A
>seq2
     7  A
     8  A
    10  A
    14  A
    18  A
    19  A
    23  A
    26  A
    28  A
    34  A
    40  A
    43  A
ADD COMMENTlink written 11 months ago by Pierre Lindenbaum124k
3
gravatar for cpad0112
11 months ago by
cpad011212k
India
cpad011212k wrote:

using seqkit:

$ seqkit locate -ip "a" test.fa 

seqID   patternName pattern strand  start   end matched
seq1    a   a   +   1   1   a
seq1    a   a   +   7   7   a
seq1    a   a   +   11  11  a
seq1    a   a   -   10  10  a
seq1    a   a   -   4   4   a
seq2    a   a   +   1   1   a
seq2    a   a   +   7   7   a
seq2    a   a   +   11  11  a
seq2    a   a   -   10  10  a
seq2    a   a   -   4   4   a

$ cat test.fa 
>seq1
agctggagctacc
>seq2
agctggagctacc
ADD COMMENTlink modified 11 months ago • written 11 months ago by cpad011212k
1
gravatar for Joe
11 months ago by
Joe14k
United Kingdom
Joe14k wrote:

A Python option:

# Define a test sequence
>>> string = 'ATAGCTACGTACGTACGTACGTACGTACGA'

# A function for finding all substrings
>>> find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]

# Gives the following indices for 'A' characters
>>> find_all(string, 'A')
[0, 2, 6, 10, 14, 18, 22, 26, 29]

# To apply it to all sequences in a multifasta:
>>> from Bio import SeqIO
>>> seqs = list(SeqIO.parse('seqs.fa','fasta'))
>>> [find_all(str(s.seq), 'A') for s in seqs]
[[80, 85, 92, 97, 98, 101, 103, 119, 121, 122, 125, 127, 129, 130, 148],    # First seq
 [86, 98, 103, 104, 107, 125, 127, 128, 131, 133, 135, 136, 154],           # Second seq, etc.
 [74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 121, 123, 124, 142], 
 [74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 120, 121, 123, 124, 142]]

(Input data used)

>mutant
GTTGGGAGGCTATGTGTTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>gorrila
GTTGGGAGGCTATGTGTGACTGGAAGGACATCCTGTCGGGTGGCGAGAAGCAGAGAATC
>chimpanze
GTTGGGAGGCTGTGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>human
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>olive
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAAAGAATC
ADD COMMENTlink modified 11 months ago • written 11 months ago by Joe14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1190 users visited in the last hour