Question

How to know that required sequence has been extracted

0

Entering edit mode

6.4 years ago

rupashree.choudhury ▴ 10

Hi

I am new to python. I have a multi-fasta file containing , 21 merged fasta file of Mycosphaerella graminicola chr1-21 ,whole genome shotgun sequence.I have a python code that takes Id as input(>ENA|CM001216|CM001216.1) and extract respective sequence.

My problem is, I am not sure how to make sure that right sequences is being extracted. On linux , console, the ouput prints something like below, the huge chunk of sequence without any track of starting with Ids

ACATATTGCAGAGCATGGAGTTCTCCCGGTCTACAGTTAAGACCAGATCCGCTACATCGC
ATAGGAAAGACTCCTCGCTAGCAGCACTAGTGCACCGGACGATGACCGACCGGTCCTATC
GCTCGAGCGGTCCTATCTAACACACCAGCAGCGGCCCTTTACAATAACGCGCCGGTAGGA
CATTGATGCCCTATTCCTACCACTAACCTCACTTGCAGCCTATCGGCATAGCATCTAGTT
CTGTTACCAGCTACGGTTTACAAAGACTCTTTCTAGCAACTGGCACACACCTATCGGCCG
GTTAGGCATCCGCCCAACGATATGCAAGCACTTCCGGCTCGCCCGTAGCTTTAGTGCTAC
TAGCTTCGATACCTACATCTTCTTCCCAAACCTACCTTTGCTATCTACAAAGCGGTAACG
AGCGGTCGACAAACAGCGCTACTGCTACCTTAACCGGATCTAGCAAGCGCATTTTATTAA
CGAAATCTTTATACCTACCCTCGAAGACACCTACTAAGACGACATTCTCCAGCACCATCC
TCGCACTTTCGAACAGGCGTATAACCGGTCCTACGCTCGCTAGCGGCAGTCGATTCGGTC
GACCAACCGGTCGTTTAGCGAGCTATTCTACGACCTTCCAAATACATGGCCGAGCGATCA
GGCCCGGCGTCCTAGGGTACCACTGCTCGCGCGTCTCTAGACCCGTATTTGTGAACGCCT
AGCAGACGACGAAGAGGACTAGCAGGAAGCGGAACTGATTACAGTACGCTACGGCTACAA
GCTCGAGCTCCGTAAGACGTCTGTCGCCAGCCTTCGGCAAGTGGTCGACCGGTCCCTTAA
ACGACAATTCGACAAGGAGTACATCGACTTCGACACTGCCTTCGTAGATCTTGCGTTTAA
GGACCTAGCCCACCAACCTACAAACCGGTCGAGGGTTTATCGAGAACCCCTCGTTCTACT
CCGCTGTACGAACTGCAACGAATACCAAGCCAGCTAGCTTAATATAAAGCCAGAGCTCTA
CTAGTGGAGTAGG
sehlly@ShellysPC:~/Documents/Mycosphaerella_sequence1$

My code is:

#!/usr/bin/python
import re
import sys

Found = False
InputID = raw_input("Input Id :")

with open ('Mgraminicola.fasta','r') as f:
              seqs = f.readlines()

for seq in seqs:
          if re.search (InputID,seq):
                  Found = True
                  sys.stdout.write(seq)
                    continue
if Found == True:
                    if re.search ('>ENA|CM00',seq):      
                              Found = False
                    if Found:
                               sys.stdout.write(seq)

Any help would be highly appreciated.

Thanks

sequence • 1.3k views

ADD COMMENT • link updated 6.4 years ago by Matteo Schiavinato ★ 3.6k • written 6.4 years ago by rupashree.choudhury ▴ 10

1

Entering edit mode

Have a look at: http://biopython.org/wiki/SeqIO

Also, you don't need to separate each code line with an empty line!

ADD REPLY • link 6.4 years ago by Matteo Schiavinato ★ 3.6k

score 1 · Answer 1 · 2017-12-14

1

Entering edit mode

6.4 years ago

lakhujanivijay 5.8k

Bingo, I have my own code in Python

Usage

fetch_sequence_by_id.py -i your_fasta_file.fa -f file_containing_sequence_ids.txt -o output_fasta_file.fa

Program

from Bio import SeqIO
import getopt, sys, re
id_list=[]
sequence_list={}

def usage():
    print "Usage: fetch_sequence_by_id.py -i <input_fasta> -f <file_containing_sequence_ids> -o <output_fasta_file>"


try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:f:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-f'):
        id_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out_fasta=open(output_file, 'w')
out_id_file=open(id_file)

for ids in out_id_file:
    ids=ids.rstrip()
    id_list.append(ids)

for record in SeqIO.parse(input_file, "fasta"):
    sequence_list[record.id]=record.seq


for ids in id_list:
    for record in SeqIO.parse(input_file, "fasta"):
        if ids in strrecord.id):
            out_fasta.write(">" + ids +  "\n")
            out_fasta.write(str(record.seq) + "\n")

out_fasta.close()

ADD COMMENT • link 6.4 years ago by lakhujanivijay 5.8k

1

Entering edit mode

Ultra fast solution using seqkit

cat Your_fasta.fa | seqkit grep -f your_ids.txt > required_fasta.fa

ADD REPLY • link 6.4 years ago by lakhujanivijay 5.8k

2

Entering edit mode

Ultra-fast solution has to be faSomeRecords from Jim Kent.

ADD REPLY • link 6.4 years ago by GenoMax 142k

1

Entering edit mode

Thanks for updating me. I dint know that

ADD REPLY • link 6.4 years ago by lakhujanivijay 5.8k

score 1 · Answer 2 · 2017-12-14

1

Entering edit mode

6.4 years ago

Matteo Schiavinato ★ 3.6k

Same here, clone if you want:

https://github.com/MatteoSchiavinato/Utilities/blob/master/select-sequences-from-filename.py

ADD COMMENT • link 6.4 years ago by Matteo Schiavinato ★ 3.6k