Question: How to know that required sequence has been extracted
0
gravatar for rupashree.choudhury
15 months ago by
rupashree.choudhury10 wrote:

Hi

I am new to python. I have a multi-fasta file containing , 21 merged fasta file of Mycosphaerella graminicola chr1-21 ,whole genome shotgun sequence.I have a python code that takes Id as input(>ENA|CM001216|CM001216.1) and extract respective sequence.

My problem is, I am not sure how to make sure that right sequences is being extracted. On linux , console, the ouput prints something like below, the huge chunk of sequence without any track of starting with Ids

ACATATTGCAGAGCATGGAGTTCTCCCGGTCTACAGTTAAGACCAGATCCGCTACATCGC
ATAGGAAAGACTCCTCGCTAGCAGCACTAGTGCACCGGACGATGACCGACCGGTCCTATC
GCTCGAGCGGTCCTATCTAACACACCAGCAGCGGCCCTTTACAATAACGCGCCGGTAGGA
CATTGATGCCCTATTCCTACCACTAACCTCACTTGCAGCCTATCGGCATAGCATCTAGTT
CTGTTACCAGCTACGGTTTACAAAGACTCTTTCTAGCAACTGGCACACACCTATCGGCCG
GTTAGGCATCCGCCCAACGATATGCAAGCACTTCCGGCTCGCCCGTAGCTTTAGTGCTAC
TAGCTTCGATACCTACATCTTCTTCCCAAACCTACCTTTGCTATCTACAAAGCGGTAACG
AGCGGTCGACAAACAGCGCTACTGCTACCTTAACCGGATCTAGCAAGCGCATTTTATTAA
CGAAATCTTTATACCTACCCTCGAAGACACCTACTAAGACGACATTCTCCAGCACCATCC
TCGCACTTTCGAACAGGCGTATAACCGGTCCTACGCTCGCTAGCGGCAGTCGATTCGGTC
GACCAACCGGTCGTTTAGCGAGCTATTCTACGACCTTCCAAATACATGGCCGAGCGATCA
GGCCCGGCGTCCTAGGGTACCACTGCTCGCGCGTCTCTAGACCCGTATTTGTGAACGCCT
AGCAGACGACGAAGAGGACTAGCAGGAAGCGGAACTGATTACAGTACGCTACGGCTACAA
GCTCGAGCTCCGTAAGACGTCTGTCGCCAGCCTTCGGCAAGTGGTCGACCGGTCCCTTAA
ACGACAATTCGACAAGGAGTACATCGACTTCGACACTGCCTTCGTAGATCTTGCGTTTAA
GGACCTAGCCCACCAACCTACAAACCGGTCGAGGGTTTATCGAGAACCCCTCGTTCTACT
CCGCTGTACGAACTGCAACGAATACCAAGCCAGCTAGCTTAATATAAAGCCAGAGCTCTA
CTAGTGGAGTAGG
sehlly@ShellysPC:~/Documents/Mycosphaerella_sequence1$

My code is:

#!/usr/bin/python
import re
import sys

Found = False
InputID = raw_input("Input Id :")

with open ('Mgraminicola.fasta','r') as f:
              seqs = f.readlines()

for seq in seqs:
          if re.search (InputID,seq):
                  Found = True
                  sys.stdout.write(seq)
                    continue
if Found == True:
                    if re.search ('>ENA|CM00',seq):      
                              Found = False
                    if Found:
                               sys.stdout.write(seq)

Any help would be highly appreciated.

Thanks

sequence • 391 views
ADD COMMENTlink modified 15 months ago by Macspider2.8k • written 15 months ago by rupashree.choudhury10
1

Have a look at: http://biopython.org/wiki/SeqIO

Also, you don't need to separate each code line with an empty line!

ADD REPLYlink written 15 months ago by Macspider2.8k
1
gravatar for bioExplorer
15 months ago by
bioExplorer3.7k
bioExplorer3.7k wrote:

Bingo, I have my own code in Python


Usage

fetch_sequence_by_id.py -i your_fasta_file.fa -f file_containing_sequence_ids.txt -o output_fasta_file.fa

Program

from Bio import SeqIO
import getopt, sys, re
id_list=[]
sequence_list={}

def usage():
    print "Usage: fetch_sequence_by_id.py -i <input_fasta> -f <file_containing_sequence_ids> -o <output_fasta_file>"


try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:f:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-f'):
        id_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out_fasta=open(output_file, 'w')
out_id_file=open(id_file)

for ids in out_id_file:
    ids=ids.rstrip()
    id_list.append(ids)

for record in SeqIO.parse(input_file, "fasta"):
    sequence_list[record.id]=record.seq


for ids in id_list:
    for record in SeqIO.parse(input_file, "fasta"):
        if ids in strrecord.id):
            out_fasta.write(">" + ids +  "\n")
            out_fasta.write(str(record.seq) + "\n")

out_fasta.close()
ADD COMMENTlink modified 15 months ago • written 15 months ago by bioExplorer3.7k
1

Ultra fast solution using seqkit

cat Your_fasta.fa | seqkit grep -f your_ids.txt > required_fasta.fa
ADD REPLYlink written 15 months ago by bioExplorer3.7k
2

Ultra-fast solution has to be faSomeRecords from Jim Kent.

ADD REPLYlink written 15 months ago by genomax64k
1

Thanks for updating me. I dint know that

ADD REPLYlink written 15 months ago by bioExplorer3.7k
1
gravatar for Macspider
15 months ago by
Macspider2.8k
Vienna - BOKU
Macspider2.8k wrote:

Same here, clone if you want:

https://github.com/MatteoSchiavinato/Utilities/blob/master/select-sequences-from-filename.py

ADD COMMENTlink written 15 months ago by Macspider2.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1983 users visited in the last hour