Question

How to extract a sequence from a multi-fasta file based on its position?

0

Entering edit mode

4.9 years ago

Kumar ▴ 120

Hi all,

I have 6 sequences in a fasta file as shown below,

>seq1
ATGAT
>gen1
TAGTA
>org2
TATAA
>seq7
ATGCA

I would like to extract a sequence based on its position, for example, I need to extract 3rd position fasta sequence,

 >org2
TATAA

I can use samtools faidx command for fasta header/id based sequence extraction. But, here I need to extract based on its position. Please help me to do the same.

Thanks in advance

fasta perl python shell-script • 3.0k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 4.9 years ago by Kumar ▴ 120

1

Entering edit mode

with sed and grep (assuming that fasta is linearized) with OP fasta:

$ sed -n '1~2p' test.fa | sed -n 3p | grep -A 1 -f - test.fa
>org2
TATAA

ADD REPLY • link 4.9 years ago by cpad0112 21k

score 2 · Answer 1 · 2019-06-02

2

Entering edit mode

4.9 years ago

AK ★ 2.2k

Hi Dineshkumar K,

Try:

samtools faidx test.fa "$(sed -n 3p <(cut -f1 test.fa.fai))"

Or both:

seqkit grep -p "$(sed -n 3p <(grep '^>' test.fa | sed 's/^>//g'))" test.fa
seqkit fx2tab test.fa | awk 'NR=="3"{print}' | seqkit tab2fx

ADD COMMENT • link 4.9 years ago by AK ★ 2.2k

0

Entering edit mode

Thank you SMK, It works fine.

ADD REPLY • link 4.9 years ago by Kumar ▴ 120

score 2 · Answer 2 · 2019-06-03

Pure awk solution

cat miRNA_corn.fa \
  | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} \
              {printf("%s",$0);} \
         END {printf("\n");}' \
  | awk '{if ( NR==3 ) print $0;}' \
  | awk '{print $1"\n"$2}'

Steps consist of

linearize fasta file, separating header and sequence by tabs
print specific line number
convert the sequences from tabular to fasta

score 1 · Answer 3 · 2019-06-02

Using a mix of bash and samtools:

function ExtractByPosition {

  FASTA=$1
  POSITION=$2

  if [[ ! -e ${FASTA}.fai ]]; then samtools faidx $FASTA; fi

  grep '>' $FASTA \
    | head -n $POSITION \
    | tail -n 1 \
    | awk '{gsub(">","");print}' \
    | xargs samtools faidx $FASTA

}

ExtractByPosition your.fasta 3

score 1 · Answer 4 · 2019-06-04

1

Entering edit mode

4.9 years ago

michau ▴ 60

python with Biopython

from Bio import AlignIO

N = 3           # position of seq you want to retreive
with open('yourALN.fa' 'r') as infile:
    aln = AlignIO.read(infile, 'fasta')
    print(aln[N-1])

ADD COMMENT • link 4.9 years ago by michau ▴ 60

1

Entering edit mode

Just a comment to point out that you don't need to use open() on files in biopython, that's handled internally by the IO methods. OP also isn't using an aligned format (or at least hasn't said they are, and there example data doesn't suggest this), so SeqIO is probably the right choice.

So, this code would probably be better as:

from Bio import SeqIO

n = 3
for i, rec in enumerate(SeqIO.parse('test.fa', 'fasta')):
    if i == n:
        print(">{}\n{}".format(rec.description, rec.seq))

ADD REPLY • link 4.9 years ago by Joe 21k

score 1 · Answer 5 · 2019-06-07

Just for ~~sadism~~ completeness, here's a handy one-line version for extracting any position from a fasta, based on my comment.

python3 -c 'import sys;from Bio import SeqIO; [print(f">{rec.description}\n{rec.seq}") for i, rec in enumerate(SeqIO.parse(sys.argv[1],"fasta")) if i == int(sys.argv[2])-1 ];' test.fa 1

Where n is the sequence you want.

Note, use of fstrings and print-in-list-comprehension will require Python >= 3.6