Question

extract sequences from multi fasta using partial ID

0

Entering edit mode

10.5 years ago

dago ★ 2.8k

I have a file containing IDs:

YP_615060
YP_615061
YP_615062

and a multifasta file with IDs:

>gi|15604718|ref|NP_219502.1| hypothetical protein CT_875 [Chlamydia trachomatis D/UW-3/CX]

Now, I would like to extract all seq that contain the IDs in my IDs_file.

I tried:

cat prot_id.txt | xargs -n 1 samtools faidx all_bact_prot.faa
>YP_615060
[fai_fetch] Warning - Reference YP_615060 not found in FASTA file, returning empty sequence
xargs: samtools: terminated by signal 11

Of course it did not work because the IDs do not match. Any idea how I can extract sequences from multi fasta matching only a part of the ID?

Update

I tried this great python code

f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')</code>

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

that I found here, but I do not understand why I get an empty file as file 3

sequence fasta python • 9.4k views

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 10.5 years ago by dago ★ 2.8k

0

Entering edit mode

Adding a question to an existing post is not a good practice. It's always better to create a new post with the new question, so we can focus on one issue at a time.

Check out my updated answer - it has the reason why the python code doesn't work.

ADD REPLY • link 6.1 years ago by Ram 45k

0

Entering edit mode

Hi,

I am trying to understand RamRS explanation but I am a bit lost.

I have also an input fasta file with header such as

>gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogene
>gi|316996529|dbj|AB545981.1|:415480-416630 Pyrus pyrifolia genes for F-box proteins, S ribonuclease, complete cds, haplotype: S4
>gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

and I want to retrieve only specific sequences, from an ID.txt:

AB270792.1
AB545982.1

At the end I would like a file like this

>gi|125995253|dbj|AB270792.1|:125515-126741 Malus x domestica MdSFBB9-alpha, S9-RNase, MdSFBB9-beta genes, complete cds, Psi-SFBB9-alpha, Psi-SFBB9-beta pseudogenes
>gi|316996537|dbj|AB545982.1|:302712-303934 Pyrus pyrifolia genes for F-box proteins and S2-RNase, complete cds, haplotype: S2

But I am not sure if I understand how you modify the python code above (message from Dago) with the command line using xargs faidx. Could you please show update the Python script?

I tried the xargs faidx -d '|' all_bact_prot.faa < prot_id.txt with my infile, it generated a .fai, and I therefore don't understand the link with the python code...

thanks a lot for your help!!!

cheers

Amandine

ADD REPLY • link updated 6.1 years ago by Ram 45k • written 7.4 years ago by amandine.cornille • 0

0

Entering edit mode

This should be a comment on the relevant post, not an answer.

ADD REPLY • link 7.4 years ago by Ram 45k

1

Entering edit mode

10.5 years ago

Ram 45k

bioawk.

bioawk -c fastx '$name ~ /pattern/ { print ">"$name"\n"$seq"\n"; }'

Your part is to find out how to loop this with the contents from the ID file.

EDIT: Let's use xargs to capture input

xargs -0 -I bioawk -c fastx '$name ~ /{}/ { print ">"$name"\n"$seq"\n"; }' < IDs.txt

Your added steps do not work because the assumption of the code you used is that your accession IDs file has full IDs separated by |, whereas they do not. Your FASTA file is the one that has IDs with |

ADD COMMENT • link 6.1 years ago by Ram 45k

0

Entering edit mode

We should change the name of the site from biostars to bioawk.

ADD REPLY • link 10.5 years ago by Matt Shirley 10k

0

Entering edit mode

I just find bioawk really useful :) Apologies if I'm a little biased.

ADD REPLY • link 10.5 years ago by Ram 45k

Ram · Accepted Answer · 2015-01-09

4

Entering edit mode

10.5 years ago

Matt Shirley 10k

Direct drop-in for your current command. Install pyfaidx:

pip install --user pyfaidx

xargs faidx -d '|' all_bact_prot.faa < prot_id.txt

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 10.5 years ago by Matt Shirley 10k