Question: extract sequences from multi fasta using partial ID
0
gravatar for dago
6.0 years ago by
dago2.6k
Germany
dago2.6k wrote:

I have a file containing IDs:

YP_615060
YP_615061
YP_615062

and a multifasta file with IDs:

>gi|15604718|ref|NP_219502.1| hypothetical protein CT_875 [Chlamydia trachomatis D/UW-3/CX]

Now, I would like to extract all seq that contain the IDs in my IDs_file.

I tried:

cat prot_id.txt | xargs -n 1 samtools faidx all_bact_prot.faa
>YP_615060
[fai_fetch] Warning - Reference YP_615060 not found in FASTA file, returning empty sequence
xargs: samtools: terminated by signal 11

Of course it did not work because the IDs do not match. Any idea how I can extract sequences from multi fasta matching only a part of the ID?

Update

I tried this great python code

f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')</code>

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

that I found here, but I do not understand why I get an empty file as file 3

sequence python fasta • 6.2k views
ADD COMMENTlink modified 19 months ago by _r_am32k • written 6.0 years ago by dago2.6k

Adding a question to an existing post is not a good practice. It's always better to create a new post with the new question, so we can focus on one issue at a time.

Check out my updated answer - it has the reason why the python code doesn't work.

ADD REPLYlink modified 19 months ago • written 6.0 years ago by _r_am32k

Hi,

I am trying to understand RamRS explanation but I am a bit lost.

I have also an input fasta file with header such as

>gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogene
>gi|316996529|dbj|AB545981.1|:415480-416630 Pyrus pyrifolia genes for F-box proteins, S ribonuclease, complete cds, haplotype: S4
>gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

and I want to retrieve only specific sequences, from an ID.txt:

AB270792.1
AB545982.1

At the end I would like a file like this

>gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogenes
>gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

But I am not sure if I understand how you modify the python code above (message from Dago) with the command line using xargs faidx. Could you please show update the Python script?

I tried the xargs faidx -d '|' all_bact_prot.faa < prot_id.txt with my infile, it generated a .fai, and I therefore don't understand the link with the python code...

thanks a lot for your help!!!

cheers

Amandine

ADD REPLYlink modified 19 months ago by _r_am32k • written 3.0 years ago by amandine.cornille0

This should be a comment on the relevant post, not an answer.

ADD REPLYlink written 3.0 years ago by _r_am32k
4
gravatar for Matt Shirley
6.0 years ago by
Matt Shirley9.5k
Cambridge, MA
Matt Shirley9.5k wrote:

Direct drop-in for your current command. Install pyfaidx:

pip install --user pyfaidx
xargs faidx -d '|' all_bact_prot.faa < prot_id.txt
ADD COMMENTlink modified 15 months ago by _r_am32k • written 6.0 years ago by Matt Shirley9.5k
1
gravatar for _r_am
6.0 years ago by
_r_am32k
Baylor College of Medicine, Houston, TX
_r_am32k wrote:

bioawk.

bioawk -c fastx '$name ~ /pattern/ { print ">"$name"\n"$seq"\n"; }'

Your part is to find out how to loop this with the contents from the ID file.

EDIT: Let's use xargs to capture input

xargs -0 -I bioawk -c fastx '$name ~ /{}/ { print ">"$name"\n"$seq"\n"; }' < IDs.txt

Your added steps do not work because the assumption of the code you used is that your accession IDs file has full IDs separated by |, whereas they do not. Your FASTA file is the one that has IDs with |

ADD COMMENTlink modified 19 months ago • written 6.0 years ago by _r_am32k

We should change the name of the site from biostars to bioawk.

ADD REPLYlink written 6.0 years ago by Matt Shirley9.5k

I just find bioawk really useful :) Apologies if I'm a little biased.

ADD REPLYlink written 6.0 years ago by _r_am32k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1081 users visited in the last hour
_