Question: Extracting the coordinates for a specific base from multifasta
0
gravatar for Louis Kok
4 months ago by
Louis Kok10
Singapore
Louis Kok10 wrote:

Hi, I would like to extract coordinates for a specific base, let say all "A" base from multi-fasta. The way I I used to extract from a single sequence (without header) is as below:

grep -aob "A"  input.seq | sed 's/:/ /g' | awk '{print $1+1}'

Is there an easier way to deal with large and multi fasta? Thanks.

sequence fasta • 237 views
ADD COMMENTlink modified 4 months ago by jrj.healey12k • written 4 months ago by Louis Kok10
3
gravatar for Pierre Lindenbaum
4 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

linearize, read each pair of name/sequence; print the name ; echo the sequence, convert each base to a new line with grep -o ., get the line number with nl, grep the lines ending with 'A'.

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  |\
while IFS=$'\t' read A B; do echo $A && echo $B | grep -o . | nl | grep 'A$' ;done

>seq1
     5  A
    12  A
    14  A
    18  A
    23  A
    27  A
    30  A
    32  A
    44  A
>seq2
     7  A
     8  A
    10  A
    14  A
    18  A
    19  A
    23  A
    26  A
    28  A
    34  A
    40  A
    43  A
ADD COMMENTlink written 4 months ago by Pierre Lindenbaum119k
3
gravatar for cpad0112
4 months ago by
cpad011211k
India
cpad011211k wrote:

using seqkit:

$ seqkit locate -ip "a" test.fa 

seqID   patternName pattern strand  start   end matched
seq1    a   a   +   1   1   a
seq1    a   a   +   7   7   a
seq1    a   a   +   11  11  a
seq1    a   a   -   10  10  a
seq1    a   a   -   4   4   a
seq2    a   a   +   1   1   a
seq2    a   a   +   7   7   a
seq2    a   a   +   11  11  a
seq2    a   a   -   10  10  a
seq2    a   a   -   4   4   a

$ cat test.fa 
>seq1
agctggagctacc
>seq2
agctggagctacc
ADD COMMENTlink modified 4 months ago • written 4 months ago by cpad011211k
1
gravatar for jrj.healey
4 months ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

A Python option:

# Define a test sequence
>>> string = 'ATAGCTACGTACGTACGTACGTACGTACGA'

# A function for finding all substrings
>>> find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]

# Gives the following indices for 'A' characters
>>> find_all(string, 'A')
[0, 2, 6, 10, 14, 18, 22, 26, 29]

# To apply it to all sequences in a multifasta:
>>> from Bio import SeqIO
>>> seqs = list(SeqIO.parse('seqs.fa','fasta'))
>>> [find_all(str(s.seq), 'A') for s in seqs]
[[80, 85, 92, 97, 98, 101, 103, 119, 121, 122, 125, 127, 129, 130, 148],    # First seq
 [86, 98, 103, 104, 107, 125, 127, 128, 131, 133, 135, 136, 154],           # Second seq, etc.
 [74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 121, 123, 124, 142], 
 [74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 120, 121, 123, 124, 142]]

(Input data used)

>mutant
GTTGGGAGGCTATGTGTTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>gorrila
GTTGGGAGGCTATGTGTGACTGGAAGGACATCCTGTCGGGTGGCGAGAAGCAGAGAATC
>chimpanze
GTTGGGAGGCTGTGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>human
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>olive
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAAAGAATC
ADD COMMENTlink modified 4 months ago • written 4 months ago by jrj.healey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1050 users visited in the last hour