Question: How to extract a sequence from a multi-fasta file based on its position?
0
gravatar for Dineshkumar K
5 months ago by
Kasaragod, Kerala, India
Dineshkumar K30 wrote:

Hi all, I have 6 sequences in a fasta file as shown below,

>seq1
ATGAT
>gen1
TAGTA
>org2
TATAA
>seq7
ATGCA

I would like to extract a sequence based on its position, for example, I need to extract 3rd position fasta sequence,

 >org2
TATAA

I can use samtools faidx command for fasta header/id based sequence extraction. But, here I need to extract based on its position. Please help me to do the same.

Thanks in advance.

ADD COMMENTlink modified 5 months ago by Joe15k • written 5 months ago by Dineshkumar K30
1

with sed and grep (assuming that fasta is linearized) with OP fasta:

$ sed -n '1~2p' test.fa | sed -n 3p | grep -A 1 -f - test.fa
>org2
TATAA
ADD REPLYlink written 5 months ago by cpad011212k
2
gravatar for SMK
5 months ago by
SMK1.9k
SMK1.9k wrote:

Hi Dineshkumar K,

Try:

samtools faidx test.fa "$(sed -n 3p <(cut -f1 test.fa.fai))"

Or both:

seqkit grep -p "$(sed -n 3p <(grep '^>' test.fa | sed 's/^>//g'))" test.fa
seqkit fx2tab test.fa | awk 'NR=="3"{print}' | seqkit tab2fx
ADD COMMENTlink modified 5 months ago • written 5 months ago by SMK1.9k

Thank you SMK, It works fine.

ADD REPLYlink written 5 months ago by Dineshkumar K30
2
gravatar for h.mon
5 months ago by
h.mon28k
Brazil
h.mon28k wrote:

Pure awk solution

cat miRNA_corn.fa \
  | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} \
              {printf("%s",$0);} \
         END {printf("\n");}' \
  | awk '{if ( NR==3 ) print $0;}' \
  | awk '{print $1"\n"$2}'

Steps consist of

  1. linearize fasta file, separating header and sequence by tabs

  2. print specific line number

  3. convert the sequences from tabular to fasta

ADD COMMENTlink written 5 months ago by h.mon28k
1
gravatar for ATpoint
5 months ago by
ATpoint26k
Germany
ATpoint26k wrote:

Using a mix of bash and samtools:

function ExtractByPosition {

  FASTA=$1
  POSITION=$2

  if [[ ! -e ${FASTA}.fai ]]; then samtools faidx $FASTA; fi

  grep '>' $FASTA \
    | head -n $POSITION \
    | tail -n 1 \
    | awk '{gsub(">","");print}' \
    | xargs samtools faidx $FASTA

}

ExtractByPosition your.fasta 3
ADD COMMENTlink modified 5 months ago • written 5 months ago by ATpoint26k
1
gravatar for michau
5 months ago by
michau20
Poland/Gdansk
michau20 wrote:

python with Biopython

from Bio import AlignIO

N = 3           # position of seq you want to retreive
with open('yourALN.fa' 'r') as infile:
    aln = AlignIO.read(infile, 'fasta')
    print(aln[N-1])
ADD COMMENTlink modified 5 months ago • written 5 months ago by michau20
1

Just a comment to point out that you don't need to use open() on files in biopython, that's handled internally by the IO methods. OP also isn't using an aligned format (or at least hasn't said they are, and there example data doesn't suggest this), so SeqIO is probably the right choice.

So, this code would probably be better as:

from Bio import SeqIO

n = 3
for i, rec in enumerate(SeqIO.parse('test.fa', 'fasta')):
    if i == n:
        print(">{}\n{}".format(rec.description, rec.seq))
ADD REPLYlink modified 5 months ago • written 5 months ago by Joe15k
1
gravatar for Joe
5 months ago by
Joe15k
United Kingdom
Joe15k wrote:

Just for sadism completeness, here's a handy one-line version for extracting any position from a fasta, based on my comment.

python3 -c 'import sys;from Bio import SeqIO; [print(f">{rec.description}\n{rec.seq}") for i, rec in enumerate(SeqIO.parse(sys.argv[1],"fasta")) if i == int(sys.argv[2])-1 ];' test.fa 1

Where n is the sequence you want.

Note, use of fstrings and print-in-list-comprehension will require Python >= 3.6

ADD COMMENTlink modified 5 months ago • written 5 months ago by Joe15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1265 users visited in the last hour