Question: extracting sequences from a fasta file
0
gravatar for ashish
14 months ago by
ashish200
ashish200 wrote:

I am using this biopython script from this post, first answer written by Eric. The post was very old so I am adding a new post for it. the script take ids from a .txt file and extracts their corresponding sequences from another fasta file. But I've a problem here, the ids i am extracting lies in the description whereas the script searches just the first word after > sign. how do I change it so that it can look for the ids i am providing in the header description. I tried changing it myself after reading the comments but I think I am doing it wrong. my .txt id file look like this:

TRIAE_CS42_1AL_TGACv1_000062_AA0001

TRIAE_CS42_1AL_TGACv1_000089_AA0002

TRIAE_CS42_1AL_TGACv1_000099_AA0003

TRIAE_CS42_1AL_TGACv1_000110_AA0004

TRIAE_CS42_1AL_TGACv1_000140_AA0005

The header in the fasta file looks like this:

>TRIAE_CS42_U_TGACv1_641895_AA2106830.1 pep scaffold:TGACv1:TGACv1_scaffold_641895_U:99996:109837:1 gene:TRIAE_CS42_U_TGACv1_000110_AA0004 transcript:TRIAE_CS42_U_TGACv1_641895_AA2106830.1 gene_biotype:protein_coding transcript_biotype:protein_coding
biopython python • 782 views
ADD COMMENTlink modified 14 months ago by Buffo1.1k • written 14 months ago by ashish200

so, are the list gene names?

ADD REPLYlink written 14 months ago by shenwei3564.0k

The list are gene ids And the fasta file have protein sequences which have the gene id written in the header description

ADD REPLYlink written 14 months ago by ashish200

above solution should work.

ADD REPLYlink written 14 months ago by shenwei3564.0k

This worked very well. It was so easy. Can you explain what does this "gene:([^ ]+)" mean. In the tool help I found this line:
--id-regexp string regular expression for parsing ID (default "^([^\s]+)\s?") what does the symbols mean?

ADD REPLYlink modified 14 months ago • written 14 months ago by ashish200
1

Test using regular expression tester page.

ADD REPLYlink written 14 months ago by genomax51k

Thanks, this was really important for me to see.

ADD REPLYlink written 14 months ago by ashish200
1

it's a regular expression for matching "gene:xxxxx", 、[^ ]+ is for gene id consisting of non-space characters, and seqkit has to use () to capture the xxx as FASTA ID.

ADD REPLYlink written 14 months ago by shenwei3564.0k

ashish : I moved @shenwei356's comment to an answer. Since it worked for you, please accept the answer (use green check mark) to provide closure for this question.

ADD REPLYlink modified 14 months ago • written 14 months ago by genomax51k
2
gravatar for shenwei356
14 months ago by
shenwei3564.0k
China
shenwei3564.0k wrote:

An easy way provided by seqkit:

seqkit grep -f ids.txt --id-regexp  "gene:([^ ]+)"   seqs.fa

Cause the ID you want to search is in the FASTA description not the regular FASTA ID. SeqKit provides way to specify where the ID is by regular expression.

"gene:([^ ]+)" is a regular expression for matching "gene:xxxxx" which contains the gene-id for searching. [^ ]+ is for gene id consisting of non-space characters, and seqkit has to use () to capture the xxxxx as FASTA ID.

ADD COMMENTlink modified 14 months ago • written 14 months ago by shenwei3564.0k

Hello, I have installed Seqkit.I am trying to list out some sequences based on the Seq ids.provided in a text doc.I am using this code. my ids.txt file looks like this: bin1 wbah10_accessory_1487_length_224941 bin2 wbah10_accessory_1485_length_153623 bin4 wbah10_accessory_1593_length_85091 bin5 wbah10_accessory_0973_length_66623 bin6 wbah10_accessory_0972_length_51198 bin7 wbah10_accessory_1486_length_50757 bin8 wbah10_accessory_0969_length_49768

and the header in the query file looks like this:

bin1 wbah10_accessory_1487_length_224941 tccgccttcgctaaagcttccgccttcgccaaggcttcggcgcgacaagtccgcttcggcccgatttctcaccagaatttgcgattttttacggcgccggactcgcggagggtccccctcacccggaatccgcgcgtcgcgcggattccggcctctccccggcggggagaggcgaagggaagcggcccttatttcggcaggaattcctgcgcaacgcccataccga

seqkit grep -f ids.txt --id-regexp "gene:([^ ]+)" seqs.fa

but I am getting an error mentioned below:

[ERRO] fastx: stdin not detected

I am new to the command line approach.Any help would be appreciated.

Thanks in advance

Sohini

ADD REPLYlink modified 18 days ago • written 18 days ago by sohiniguha19850

For you

seqkit grep -n -f ids.txt seqs.fa -o result.fa

[ERRO] fastx: stdin not detected means no input file provided and stdin not detected.

ADD REPLYlink modified 16 days ago • written 16 days ago by shenwei3564.0k
1
gravatar for Buffo
14 months ago by
Buffo1.1k
Buffo1.1k wrote:

This is my python script to do that, save it as sequence_extractor.py and run it :)

#!/usr/bin/env python
#-*- coding: UTF-8 -*-

from __future__ import division
import sys


##########################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python sequence_extractor.py *list.txt fasta_file.fasta
">" has to be included in the names 
------------------------------------------------------------------------------------
'''
##########################################################################################

if len(sys.argv) != 3:
    print syntax
    sys.exit()

##########################################################################################

nombres = []
seq = ""


lista_seq = open(sys.argv[1], 'r')

for line in lista_seq:
    line = line.rstrip('\n') 
    nombres.append(line)
lista_seq.close()

fasta_seqs = open(sys.argv[2], 'r')

for line in fasta_seqs:
    line = line.rstrip('\n')
    if line.startswith('>'):
        if seq:
            if name in nombres:           
                print name + '\n' + seq
                seq = ""
        name = line                         #to exclude '>'; line.split()[0]

    else:
        seq = seq + line 


if name in nombres:
    print name + '\n' + seq
ADD COMMENTlink modified 14 months ago • written 14 months ago by Buffo1.1k

it gives an error SyntaxError: Missing parentheses in call to 'print' at line 15

ADD REPLYlink written 14 months ago by ashish200
1

this script needs python2, you ran using py3.

ADD REPLYlink written 14 months ago by shenwei3564.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1087 users visited in the last hour