Question: How to know that required sequence has been extracted
0
gravatar for rupashree.choudhury
23 months ago by
rupashree.choudhury10 wrote:

Hi

I am new to python. I have a multi-fasta file containing , 21 merged fasta file of Mycosphaerella graminicola chr1-21 ,whole genome shotgun sequence.I have a python code that takes Id as input(>ENA|CM001216|CM001216.1) and extract respective sequence.

My problem is, I am not sure how to make sure that right sequences is being extracted. On linux , console, the ouput prints something like below, the huge chunk of sequence without any track of starting with Ids

ACATATTGCAGAGCATGGAGTTCTCCCGGTCTACAGTTAAGACCAGATCCGCTACATCGC
ATAGGAAAGACTCCTCGCTAGCAGCACTAGTGCACCGGACGATGACCGACCGGTCCTATC
GCTCGAGCGGTCCTATCTAACACACCAGCAGCGGCCCTTTACAATAACGCGCCGGTAGGA
CATTGATGCCCTATTCCTACCACTAACCTCACTTGCAGCCTATCGGCATAGCATCTAGTT
CTGTTACCAGCTACGGTTTACAAAGACTCTTTCTAGCAACTGGCACACACCTATCGGCCG
GTTAGGCATCCGCCCAACGATATGCAAGCACTTCCGGCTCGCCCGTAGCTTTAGTGCTAC
TAGCTTCGATACCTACATCTTCTTCCCAAACCTACCTTTGCTATCTACAAAGCGGTAACG
AGCGGTCGACAAACAGCGCTACTGCTACCTTAACCGGATCTAGCAAGCGCATTTTATTAA
CGAAATCTTTATACCTACCCTCGAAGACACCTACTAAGACGACATTCTCCAGCACCATCC
TCGCACTTTCGAACAGGCGTATAACCGGTCCTACGCTCGCTAGCGGCAGTCGATTCGGTC
GACCAACCGGTCGTTTAGCGAGCTATTCTACGACCTTCCAAATACATGGCCGAGCGATCA
GGCCCGGCGTCCTAGGGTACCACTGCTCGCGCGTCTCTAGACCCGTATTTGTGAACGCCT
AGCAGACGACGAAGAGGACTAGCAGGAAGCGGAACTGATTACAGTACGCTACGGCTACAA
GCTCGAGCTCCGTAAGACGTCTGTCGCCAGCCTTCGGCAAGTGGTCGACCGGTCCCTTAA
ACGACAATTCGACAAGGAGTACATCGACTTCGACACTGCCTTCGTAGATCTTGCGTTTAA
GGACCTAGCCCACCAACCTACAAACCGGTCGAGGGTTTATCGAGAACCCCTCGTTCTACT
CCGCTGTACGAACTGCAACGAATACCAAGCCAGCTAGCTTAATATAAAGCCAGAGCTCTA
CTAGTGGAGTAGG
sehlly@ShellysPC:~/Documents/Mycosphaerella_sequence1$

My code is:

#!/usr/bin/python
import re
import sys

Found = False
InputID = raw_input("Input Id :")

with open ('Mgraminicola.fasta','r') as f:
              seqs = f.readlines()

for seq in seqs:
          if re.search (InputID,seq):
                  Found = True
                  sys.stdout.write(seq)
                    continue
if Found == True:
                    if re.search ('>ENA|CM00',seq):      
                              Found = False
                    if Found:
                               sys.stdout.write(seq)

Any help would be highly appreciated.

Thanks

sequence • 515 views
ADD COMMENTlink modified 23 months ago by Macspider3.0k • written 23 months ago by rupashree.choudhury10
1

Have a look at: http://biopython.org/wiki/SeqIO

Also, you don't need to separate each code line with an empty line!

ADD REPLYlink written 23 months ago by Macspider3.0k
1
gravatar for lakhujanivijay
23 months ago by
lakhujanivijay4.5k
India
lakhujanivijay4.5k wrote:

Bingo, I have my own code in Python


Usage

fetch_sequence_by_id.py -i your_fasta_file.fa -f file_containing_sequence_ids.txt -o output_fasta_file.fa

Program

from Bio import SeqIO
import getopt, sys, re
id_list=[]
sequence_list={}

def usage():
    print "Usage: fetch_sequence_by_id.py -i <input_fasta> -f <file_containing_sequence_ids> -o <output_fasta_file>"


try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:f:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-f'):
        id_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out_fasta=open(output_file, 'w')
out_id_file=open(id_file)

for ids in out_id_file:
    ids=ids.rstrip()
    id_list.append(ids)

for record in SeqIO.parse(input_file, "fasta"):
    sequence_list[record.id]=record.seq


for ids in id_list:
    for record in SeqIO.parse(input_file, "fasta"):
        if ids in strrecord.id):
            out_fasta.write(">" + ids +  "\n")
            out_fasta.write(str(record.seq) + "\n")

out_fasta.close()
ADD COMMENTlink modified 23 months ago • written 23 months ago by lakhujanivijay4.5k
1

Ultra fast solution using seqkit

cat Your_fasta.fa | seqkit grep -f your_ids.txt > required_fasta.fa
ADD REPLYlink written 23 months ago by lakhujanivijay4.5k
2

Ultra-fast solution has to be faSomeRecords from Jim Kent.

ADD REPLYlink written 23 months ago by genomax74k
1

Thanks for updating me. I dint know that

ADD REPLYlink written 23 months ago by lakhujanivijay4.5k
1
gravatar for Macspider
23 months ago by
Macspider3.0k
Vienna - BOKU
Macspider3.0k wrote:

Same here, clone if you want:

https://github.com/MatteoSchiavinato/Utilities/blob/master/select-sequences-from-filename.py

ADD COMMENTlink written 23 months ago by Macspider3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1259 users visited in the last hour