Question: extract sequences from multi fasta using partial ID
0
gravatar for dago
4.3 years ago by
dago2.5k
Germany
dago2.5k wrote:

I have a file containing IDs:

YP_615060
YP_615061
YP_615062

and a multifasta file with IDs:

>gi|15604718|ref|NP_219502.1| hypothetical protein CT_875 [Chlamydia trachomatis D/UW-3/CX]

Now, I would like to extract all seq that contain the IDs in my IDs_file.

I tried;

    cat prot_id.txt | xargs -n 1 samtools faidx all_bact_prot.faa
    >YP_615060
    [fai_fetch] Warning - Reference YP_615060 not found in FASTA file, returning empty sequence
    xargs: samtools: terminated by signal 11

Of course it did not work because the IDs do not match. Any idea how I can extract sequences from multi fasta matching only a part of the ID?

Update

I tried this great python code

f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

 

that I found here, but I do not understand why I get an empty file as file 3

sequence python fasta • 4.5k views
ADD COMMENTlink modified 14 months ago by amandine.cornille0 • written 4.3 years ago by dago2.5k

Not a good practice, adding a question to an existing post. It's always better to create a new post with the new question, so we can focus on one issue at a time.

Check out my updated answer - it has the reason why the python code doesn't work.

ADD REPLYlink written 4.3 years ago by RamRS21k
4
gravatar for Matt Shirley
4.3 years ago by
Matt Shirley8.9k
Cambridge, MA
Matt Shirley8.9k wrote:

Direct drop-in for your current command. Install pyfaidx:

pip install --user pyfaidx
xargs faidx -d '|' all_bact_prot.faa < prot_id.txt
ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by Matt Shirley8.9k
0
gravatar for RamRS
4.3 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

bioawk.

bioawk -c fastx '$name ~ /pattern/ { print ">"$name"\n"$seq"\n"; }'

Your part is to find out how to loop this with the contents from the ID file.

EDIT: Let's use xargs to capture input

xargs -0 -I bioawk -c fastx '$name ~ /{}/ { print ">"$name"\n"$seq"\n"; }' <IDs.txt

Your added steps do not work because the assumption of the code you used is that your accession IDs file has full IDs separated by '|', whereas they do not. Your FASTA file is the one that has IDs with '|'

ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by RamRS21k

We should change the name of the site from biostars to bioawk.

ADD REPLYlink written 4.3 years ago by Matt Shirley8.9k

I just find bioawk really useful :) Apologies if I'm a little biased.

ADD REPLYlink written 4.3 years ago by RamRS21k
0
gravatar for amandine.cornille
14 months ago by
amandine.cornille0 wrote:

Hi,

I am trying to understand your explanation but I am a bit lost.

I have also an input fasta file with header such as

gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogene

gi|316996529|dbj|AB545981.1|:415480-416630 Pyrus pyrifolia genes for F-box proteins, S ribonuclease, complete cds, haplotype: S4

gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

and I want to retrieve only specific sequences, from an ID.txt: AB270792.1 "AB545982.1

At the end I would like a file like this

gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogenes gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

But I am not sure if I understand how you modify the python code above (message from Dago) with the command line using xargs faidx. Could you please show update the Python script?

I tried the xargs faidx -d '|' all_bact_prot.faa < prot_id.txt with my infile, it generated a .fai, and I therefore don't understand the link with the python code...

thanks a lot for your help!!! cheers Amandine

ADD COMMENTlink written 14 months ago by amandine.cornille0

This should be a comment on the relevant post, not an answer.

ADD REPLYlink written 14 months ago by RamRS21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 846 users visited in the last hour