Question: extract sequences from multi fasta using partial ID
0
gravatar for dago
5.0 years ago by
dago2.6k
Germany
dago2.6k wrote:

I have a file containing IDs:

YP_615060
YP_615061
YP_615062

and a multifasta file with IDs:

>gi|15604718|ref|NP_219502.1| hypothetical protein CT_875 [Chlamydia trachomatis D/UW-3/CX]

Now, I would like to extract all seq that contain the IDs in my IDs_file.

I tried:

cat prot_id.txt | xargs -n 1 samtools faidx all_bact_prot.faa
>YP_615060
[fai_fetch] Warning - Reference YP_615060 not found in FASTA file, returning empty sequence
xargs: samtools: terminated by signal 11

Of course it did not work because the IDs do not match. Any idea how I can extract sequences from multi fasta matching only a part of the ID?

Update

I tried this great python code

f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')</code>

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

that I found here, but I do not understand why I get an empty file as file 3

sequence python fasta • 5.2k views
ADD COMMENTlink modified 7 months ago by RamRS25k • written 5.0 years ago by dago2.6k

Adding a question to an existing post is not a good practice. It's always better to create a new post with the new question, so we can focus on one issue at a time.

Check out my updated answer - it has the reason why the python code doesn't work.

ADD REPLYlink modified 7 months ago • written 5.0 years ago by RamRS25k

Hi,

I am trying to understand RamRS explanation but I am a bit lost.

I have also an input fasta file with header such as

>gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogene
>gi|316996529|dbj|AB545981.1|:415480-416630 Pyrus pyrifolia genes for F-box proteins, S ribonuclease, complete cds, haplotype: S4
>gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

and I want to retrieve only specific sequences, from an ID.txt:

AB270792.1
AB545982.1

At the end I would like a file like this

>gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogenes
>gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

But I am not sure if I understand how you modify the python code above (message from Dago) with the command line using xargs faidx. Could you please show update the Python script?

I tried the xargs faidx -d '|' all_bact_prot.faa < prot_id.txt with my infile, it generated a .fai, and I therefore don't understand the link with the python code...

thanks a lot for your help!!!

cheers

Amandine

ADD REPLYlink modified 7 months ago by RamRS25k • written 23 months ago by amandine.cornille0

This should be a comment on the relevant post, not an answer.

ADD REPLYlink written 23 months ago by RamRS25k
4
gravatar for Matt Shirley
5.0 years ago by
Matt Shirley9.2k
Cambridge, MA
Matt Shirley9.2k wrote:

Direct drop-in for your current command. Install pyfaidx:

pip install --user pyfaidx
xargs faidx -d '|' all_bact_prot.faa < prot_id.txt
ADD COMMENTlink modified 3 months ago by RamRS25k • written 5.0 years ago by Matt Shirley9.2k
1
gravatar for RamRS
5.0 years ago by
RamRS25k
Houston, TX
RamRS25k wrote:

bioawk.

bioawk -c fastx '$name ~ /pattern/ { print ">"$name"\n"$seq"\n"; }'

Your part is to find out how to loop this with the contents from the ID file.

EDIT: Let's use xargs to capture input

xargs -0 -I bioawk -c fastx '$name ~ /{}/ { print ">"$name"\n"$seq"\n"; }' < IDs.txt

Your added steps do not work because the assumption of the code you used is that your accession IDs file has full IDs separated by |, whereas they do not. Your FASTA file is the one that has IDs with |

ADD COMMENTlink modified 7 months ago • written 5.0 years ago by RamRS25k

We should change the name of the site from biostars to bioawk.

ADD REPLYlink written 5.0 years ago by Matt Shirley9.2k

I just find bioawk really useful :) Apologies if I'm a little biased.

ADD REPLYlink written 5.0 years ago by RamRS25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 930 users visited in the last hour