Question: Extracting the coordinates for a specific base from multifasta
0
4 months ago by
Louis Kok10
Singapore
Louis Kok10 wrote:

Hi, I would like to extract coordinates for a specific base, let say all "A" base from multi-fasta. The way I I used to extract from a single sequence (without header) is as below:

``````grep -aob "A"  input.seq | sed 's/:/ /g' | awk '{print \$1+1}'
``````

Is there an easier way to deal with large and multi fasta? Thanks.

sequence fasta • 237 views
modified 4 months ago by jrj.healey12k • written 4 months ago by Louis Kok10
3
4 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

linearize, read each pair of name/sequence; print the name ; echo the sequence, convert each base to a new line with `grep -o .`, get the line number with `nl`, grep the lines ending with 'A'.

`````` awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),\$0);N++;next;} {printf("%s",\$0);} END {printf("\n");}' input.fasta  |\
while IFS=\$'\t' read A B; do echo \$A && echo \$B | grep -o . | nl | grep 'A\$' ;done

>seq1
5  A
12  A
14  A
18  A
23  A
27  A
30  A
32  A
44  A
>seq2
7  A
8  A
10  A
14  A
18  A
19  A
23  A
26  A
28  A
34  A
40  A
43  A
``````
3
4 months ago by
India

using seqkit:

``````\$ seqkit locate -ip "a" test.fa

seqID   patternName pattern strand  start   end matched
seq1    a   a   +   1   1   a
seq1    a   a   +   7   7   a
seq1    a   a   +   11  11  a
seq1    a   a   -   10  10  a
seq1    a   a   -   4   4   a
seq2    a   a   +   1   1   a
seq2    a   a   +   7   7   a
seq2    a   a   +   11  11  a
seq2    a   a   -   10  10  a
seq2    a   a   -   4   4   a

\$ cat test.fa
>seq1
agctggagctacc
>seq2
agctggagctacc
``````
1
4 months ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

A Python option:

``````# Define a test sequence
>>> string = 'ATAGCTACGTACGTACGTACGTACGTACGA'

# A function for finding all substrings
>>> find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]

# Gives the following indices for 'A' characters
>>> find_all(string, 'A')
[0, 2, 6, 10, 14, 18, 22, 26, 29]

# To apply it to all sequences in a multifasta:
>>> from Bio import SeqIO
>>> seqs = list(SeqIO.parse('seqs.fa','fasta'))
>>> [find_all(str(s.seq), 'A') for s in seqs]
[[80, 85, 92, 97, 98, 101, 103, 119, 121, 122, 125, 127, 129, 130, 148],    # First seq
[86, 98, 103, 104, 107, 125, 127, 128, 131, 133, 135, 136, 154],           # Second seq, etc.
[74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 121, 123, 124, 142],
[74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 120, 121, 123, 124, 142]]
``````

(Input data used)

``````>mutant
GTTGGGAGGCTATGTGTTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>gorrila
GTTGGGAGGCTATGTGTGACTGGAAGGACATCCTGTCGGGTGGCGAGAAGCAGAGAATC
>chimpanze
GTTGGGAGGCTGTGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>human
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>olive
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAAAGAATC
``````