Question: How to extract a sequence from a multi-fasta file based on its position?
0
gravatar for Dineshkumar K
8 months ago by
Kasaragod, Kerala, India
Dineshkumar K30 wrote:

Hi all, I have 6 sequences in a fasta file as shown below,

>seq1
ATGAT
>gen1
TAGTA
>org2
TATAA
>seq7
ATGCA

I would like to extract a sequence based on its position, for example, I need to extract 3rd position fasta sequence,

 >org2
TATAA

I can use samtools faidx command for fasta header/id based sequence extraction. But, here I need to extract based on its position. Please help me to do the same.

Thanks in advance.

ADD COMMENTlink modified 8 months ago by Joe16k • written 8 months ago by Dineshkumar K30
1

with sed and grep (assuming that fasta is linearized) with OP fasta:

$ sed -n '1~2p' test.fa | sed -n 3p | grep -A 1 -f - test.fa
>org2
TATAA
ADD REPLYlink written 8 months ago by cpad011212k
2
gravatar for SMK
8 months ago by
SMK1.9k
SMK1.9k wrote:

Hi Dineshkumar K,

Try:

samtools faidx test.fa "$(sed -n 3p <(cut -f1 test.fa.fai))"

Or both:

seqkit grep -p "$(sed -n 3p <(grep '^>' test.fa | sed 's/^>//g'))" test.fa
seqkit fx2tab test.fa | awk 'NR=="3"{print}' | seqkit tab2fx
ADD COMMENTlink modified 8 months ago • written 8 months ago by SMK1.9k

Thank you SMK, It works fine.

ADD REPLYlink written 8 months ago by Dineshkumar K30
2
gravatar for h.mon
8 months ago by
h.mon29k
Brazil
h.mon29k wrote:

Pure awk solution

cat miRNA_corn.fa \
  | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} \
              {printf("%s",$0);} \
         END {printf("\n");}' \
  | awk '{if ( NR==3 ) print $0;}' \
  | awk '{print $1"\n"$2}'

Steps consist of

  1. linearize fasta file, separating header and sequence by tabs

  2. print specific line number

  3. convert the sequences from tabular to fasta

ADD COMMENTlink written 8 months ago by h.mon29k
1
gravatar for ATpoint
8 months ago by
ATpoint29k
Germany
ATpoint29k wrote:

Using a mix of bash and samtools:

function ExtractByPosition {

  FASTA=$1
  POSITION=$2

  if [[ ! -e ${FASTA}.fai ]]; then samtools faidx $FASTA; fi

  grep '>' $FASTA \
    | head -n $POSITION \
    | tail -n 1 \
    | awk '{gsub(">","");print}' \
    | xargs samtools faidx $FASTA

}

ExtractByPosition your.fasta 3
ADD COMMENTlink modified 8 months ago • written 8 months ago by ATpoint29k
1
gravatar for michau
8 months ago by
michau30
Germany/Berlin/MPIMG
michau30 wrote:

python with Biopython

from Bio import AlignIO

N = 3           # position of seq you want to retreive
with open('yourALN.fa' 'r') as infile:
    aln = AlignIO.read(infile, 'fasta')
    print(aln[N-1])
ADD COMMENTlink modified 8 months ago • written 8 months ago by michau30
1

Just a comment to point out that you don't need to use open() on files in biopython, that's handled internally by the IO methods. OP also isn't using an aligned format (or at least hasn't said they are, and there example data doesn't suggest this), so SeqIO is probably the right choice.

So, this code would probably be better as:

from Bio import SeqIO

n = 3
for i, rec in enumerate(SeqIO.parse('test.fa', 'fasta')):
    if i == n:
        print(">{}\n{}".format(rec.description, rec.seq))
ADD REPLYlink modified 8 months ago • written 8 months ago by Joe16k
1
gravatar for Joe
8 months ago by
Joe16k
United Kingdom
Joe16k wrote:

Just for sadism completeness, here's a handy one-line version for extracting any position from a fasta, based on my comment.

python3 -c 'import sys;from Bio import SeqIO; [print(f">{rec.description}\n{rec.seq}") for i, rec in enumerate(SeqIO.parse(sys.argv[1],"fasta")) if i == int(sys.argv[2])-1 ];' test.fa 1

Where n is the sequence you want.

Note, use of fstrings and print-in-list-comprehension will require Python >= 3.6

ADD COMMENTlink modified 8 months ago • written 8 months ago by Joe16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 747 users visited in the last hour