Question: Extracting specific IDs + sequence from multifasta
0
gravatar for maximilian.mayerhofer
7 months ago by
maximilian.mayerhofer20 wrote:

Hey,

I got a multifasta file like this

>stuff1;[gene1];stuff,morestuff    
ATGGAGATAATAGATAGC
>stuff1;[gene2];stuff,morestuff    
ATGGAGATAATAGATAGC
>stuff2;[gene1];stuff,morestuff   
GTACTACATCGCTAGCACTACT
>stuff2;[gene2];stuff,morestuff    
GTAGTCATCAGCTACGACTACT

So between each ID and sequence is a new line. I want to extract e.q. all IDs and their sequences with [gene1], basically search the ID for a term and then extract ID and seq into a new fasta file with the filename of the extracted term. It is important that the complete ID is extracted, but the "search term" is just short ( in this case, [gene1])

I tried awk and grep

awk'/[gene1]/' RS='>' input.fasta > output.fasta

grep "[gene1]" input.fasta > output.fasta

But this just gave me all lines after [gene1] in both cases.

When searching for [gene1], i need a new multifasta like this:

>stuff1;[gene1];stuff,morestuff    
ATGGAGATAATAGATAGC
>stuff2;[gene1];stuff,morestuff   
GTACTACATCGCTAGCACTACT

Best Regards

multifasta • 620 views
ADD COMMENTlink modified 7 months ago by Buffo1.1k • written 7 months ago by maximilian.mayerhofer20
2

What do you mean by

So between each ID and sequence is a new line.

Do you mean empty new line? If the sequences do not contain new line character (i.e. are not folded) then you can try:

grep  -A1 "[gene1]" input.fasta > output.fasta
ADD REPLYlink modified 7 months ago • written 7 months ago by Sej Modha3.6k

I meant that the header and the following seq are on seperate lines. There is no empty line between. Would it be easier if they were on the same line?

ADD REPLYlink written 7 months ago by maximilian.mayerhofer20

Ok, my problem was: since I searched for my term in brackets "[gene1]" instead of "gene1" it searched for all possible matchings, like: Is there g, ge, gen,and so on.

grep -A1 "term" input.fasta > output.fasta

is working

ADD REPLYlink written 7 months ago by maximilian.mayerhofer20

You may be interested in using SEDA (http://www.sing-group.org/seda/), which has different functions for extracting and filtering sequence identifiers. Regards.

ADD REPLYlink written 7 months ago by Hugo130
grep "[gene1]"

or

awk /[gene1]/

are POSIX Bracket Expressions https://www.regular-expressions.info/posixbrackets.html .

$ echo 'e' | grep '[gene1]'
e

you want something like grep -A 1 --no-group-separator -F '[gene1]' if there is only one sequence after the header.

ADD REPLYlink modified 7 months ago • written 7 months ago by Pierre Lindenbaum112k
2
gravatar for grant.hovhannisyan
7 months ago by
grant.hovhannisyan1.1k wrote:

If you have long format of fasta file (the sequence is in one line), then you can use grep with -A1 option. Otherwise one clean solution would be with biopython SEQIO module.

ADD COMMENTlink modified 7 months ago • written 7 months ago by grant.hovhannisyan1.1k
1
gravatar for maximilian.mayerhofer
7 months ago by
maximilian.mayerhofer20 wrote:

Ok, my problem was: since I searched for my term in brackets "[gene1]" instead of "gene1" it searched for all possible matchings, like: Is there g, ge, gen,and so on.

grep -A1 "term" input.fasta > output.fasta

is working for me.

ADD COMMENTlink written 7 months ago by maximilian.mayerhofer20
0
gravatar for jrj.healey
7 months ago by
jrj.healey6.8k
United Kingdom
jrj.healey6.8k wrote:

You can use my general purpose fasta-fetching script below for this. Just give it a key, or a file full of keys that you want to retrieve. It's 'greedy' so it'll look for the key anywhere in the header, so just make sure your headers are all unique enough that you don't get any overlap (or tweak the code slightly e.g. if header == x rather than if x in header)

ADD COMMENTlink written 7 months ago by jrj.healey6.8k
0
gravatar for Buffo
7 months ago by
Buffo1.1k
Buffo1.1k wrote:

Save the code below as patt_to_fasta.py and run it:

patt_to_fasta.py file.fasta

#!/usr/bin/env python
#-*- coding: UTF-8 -*-

import sys
if len(sys.argv) != 2:
    print syntax
    sys.exit()
prefix = sys.argv[1].split('.')[0]
seq = ''
count = 0
#you can automate searching using: pattern = raw_input(Write pattern to search in fasta names: ):
with open (sys.argv[1], 'r') as infile, open(prefix + '_newfa.fasta', 'w') as outfile: #and also this line for :with open (sys.argv[1], 'r') as infile, open(prefix + '_' + str(pattern) + '.fasta', 'w') as outfile:
    for line in infile:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if seq:
                if '[gene1]' in name:                          #change this line if you want to do it for more than one pattern for: if str(pattern) in name:   
                    count += 1
                    outfile.write(name + '\n' + seq + '\n')
                seq  = ''
            name = line
        else:
            seq = seq + line
    if '[gene1]' in name:
        count += 1  
        outfile.write(name + '\n' + seq + '\n')
print '\n' + '\n' + 'File ' + prefix + '_newfa.fasta has been created..'
print 'This new fasta contains ' + str(count) + ' sequences...'

If you want to do it for more than one pattern change lines marked with '#'

ADD COMMENTlink written 7 months ago by Buffo1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 632 users visited in the last hour