Question

Pulling out gene names from a FASTA file with a list of IDs

1

Entering edit mode

6.6 years ago

jvire1 ▴ 10

Does anyone know of a simple scripting solution to take a list of accessions and pull out the gene name from the header of FASTA sequences?

For instance given the accession: XP_016469325.1

and given the FASTA entry:

XP_016469325.1 Nicotiana tabacum|C3H|C3H family protein MEEELLKRNTDCVYFLASPLTCKKGIECEYRHSEIARLNPRDCWYWLAESCLNPTCAFRH PPLESHAETSSESAPPQHKSAVPVNKTNVPCYFYFNGYCIKGERCSFLHGPDDGTTTWKS SKIASGVPDGPTAEKKTSVGSETGPASVEKPSNSSETGSKAAAHEYIKSQVDLISMTNDV GEQSASHETSGSPSEEATAVRLDSLVPAEGFTQGGSDLSPDWSSDEEVEDNVEREEWLES SPGFDVLVDDRIEGWSHKDDHSYLLQHDRECDERFAGYDFENNLEYDPAYPDMRIVSDEE LDDSYYSKVESHEVNEYAREIVIPAHGRQSIPHKRKFPREPGFCARGNVDLRDLLKKRRV IESDPPNYLSRRLDLSRFNAREQCRDRHRPQGSRWMPQSLASKLESNSSFSSGFVDATRL EGANQLKKLRQSHRSSYRQQHFKDRRRGRSQPFANETPRRMASRQRSTEVPKIFGGPKTL AQIREEKIKGREDGNSFERTVPSGGSEREDFSGPKPLSEILKDKRRLSSVVNFSN

I would like to have output the gene name "C3H" that is spanned by the "|"

This script I modified from a previous post can grab the gene names, however I'm not sure how to only get the gene names corresponding to a separate list of accessions (accessions.list).

with open('PlantTFDB_ALL_TF_pep.fas','r') as f:
    for line in f:
        if '>' in line:
            line = line.strip().split('|')
            print(line[1])

RNA-Seq • 7.5k views

ADD COMMENT • link updated 5.9 years ago by Biostar 20 • written 6.6 years ago by jvire1 ▴ 10

2

Entering edit mode

6.6 years ago

st.ph.n ★ 2.7k

Make another file with the acc #, and check if the header for the accesion number.

with open('acc_list.txt', 'r') as f:
    acc = [line.strip() for line in f]

with open('PlantTFDB_ALL_TF_pep.fas', 'r') as f:
    for line in f:
        if line.startswith(">"):
            if line.strip().split('>')[1].split(' ')[0] in acc:
                print line.strip().split('|')[1]

ADD COMMENT • link 6.6 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

After I looked at the output more closely I realized that it isn't doing quite what I want it to.

The beginning of the list of identifiers looks like this:

Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
KZV22762.1
KZV22762.1
KZV22762.1

However the beginning of the results look like this:

WRKY
B3
STAT
MYB
Dof
bHLH
NAC
HD-ZIP

I'm not very familiar with scripting yet, but if I understand the logic of the code:

with open('acc_list.txt', 'r') as f:
    acc = [line.strip() for line in f]

with open('PlantTFDB_ALL_TF_pep.fas', 'r') as f:
    for line in f:
        if line.startswith(">"):
            if line.strip().split('>')[1].split(' ')[0] in acc:
                print line.strip().split('|')[1]

It is reading through each FASTA entry of the 'PlantTFDB_ALL_TF_pep.fas' and checking if it exists in the 'acc_list.txt' and then printing the parsed out gene name?

Is there a way to instead search each identifier from the list against the FASTA file and then print each corresponding gene name (including duplicate instances) like so:

NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
CAMTA
CAMTA
CAMTA

ADD REPLY • link 6.6 years ago by jvire1 ▴ 10

0

Entering edit mode

Can you provide more of your input file? You've asked that the field delimited by | symbols be returned, based on the accession it's paired with in the fasta header as far as we can work out - correct?

If you aren't getting the results you expect, it's likely you fasta headers aren't all in exactly the same format, and the parsing of the headers is breaking down.

ADD REPLY • link 6.6 years ago by Joe 21k

0

Entering edit mode

Thank you for all your help. Here is three FASTA entries in the PlantTFDB FASTA file, with the first one corresponding to the first accession ID in the acc_list.txt:

>Zpz_sc04830.1.g00020.1.sm.mk Zoysia pacifica|NF-YA|NF-YA family protein
MEAKFFRFLKLVGVGFKARSESQGRELFLKLGYSHEVLFTAPPAVRVFCFKPNIICCTGI
DKDRVHNFAGAVRSCKPPEVYKGKGILYFDEVIKLKPGKKQKKQMNNCPRLPHLEGVQYP
GSTAEVPEGSATATPHMSVDNSGPSELIAGQRRALRLRQLPSLVPGRATIFVSAKQYEAI
LRRRTQRAKAGISRPSIKGGANTDISQATSSVPAEDFYGPFVATNEDEEHVGEDAFDSIL
NLNSPDFTTLLRLMMGDKYNTDISEDVTWILM
>Zpz_sc03704.1.g00020.1.am.mk Zoysia pacifica|MYB|MYB family protein
MGRSPCCEKAHTNKGTWTKEEDQRLTMGRSPCCEKAHTNKGTWTKEEDQRLVAYIKAHGE
GCWRSLPKAAGLQRCGKSCRLRWINYLRPDLKRGNFTDEEDDLIIKLHQVLGNKWSMIAG
RLPGRTDNEIKNYWNTHIKRKLLARGIDPKTHRPLSEAATAAASSSTHLQDHPARPSSSP
ETSCCAGQSGDEGSASGSLPETSTQRQLASCIIDLNLSIGPPQHQPSSSPPPPPTRPEAQ
AAGSSTTATAGTSSSSNSEKERICLPRNIQPSRAIFSPSCLMPASFSHTREAQSLPSSSC
RDLPAATPPSPAALRPATVPLAAAPPSVLLPCRSSPRALTRTAPCALTHDCTSLRS
>Zpz_sc00930.1.g00160.1.am.mk Zoysia pacifica|WRKY|WRKY family protein
LADEVAKQGAAAVVKPVVLPLASRMPPSPFLPSPRLLSDSADGPSMAAATRTGGWRIGPR
TTDPAAGTADRRQQRWIRRLGRVTGSRDAGPEVGTAARRTARGGEGRGKAEVRWASLPRL
AVVTTVLRLVRDQLLACSSCGGTAASARPSTATGPRGRRSPAGEQRRGGDGDAGDRGWGC
GLSRATGNNTSLTRIMAQISVHHGTKAPRPIPNLNAWINGGSFILPDGRWVKIRLSSGCC
CCPCDAAVSACSGKGTSAATNQRGGAGGDASSRTPAKKLRRHQIGSEARCETRGARALIK
TATRRTVGCRMHVALVQARATAAACLSSHPHAMEYRHPYSSGSASASANYGDASPAADFR
PAAFGAGRPAVRRAPLDVFDYLSDEGAQPAPAAVPAVPTFRSPPPRSMPAVQVVPDAAGY
GHAARPRGFMYLNIFFPRGAASVAVAAEEPTTMTDRIAFRMRSDKEVLDDGYRWRKYGKK
AVKNSPNPRGAASVAVAAEEPTTMTDRIAFRMRSDKEVLDDGYRWRKYGKKAVKNSPNPR
NYYRCSTEGCNVKKRVERDRDDPSYVVTMYEGVHNHASPGTVYYATQDAASGRFFVSGTH
HSSGPLKL

I have manually looked through the FASTA, which contains 320,370 protein entries and all I have observed have headers in the same format.

The script works great at pulling out the gene name from the headers, I think the issue might be with the:

if line.strip().split('>')[1].split(' ')[0] in acc:

I'm not sure if I understand this right, but it seems like the script is:

1. going through each FASTA entry one by one, 
2. checking if the header matches an entry in the acc_list.txt, 
3. printing the gene name if the FASTA header contains any accession ID in acc_list.txt.

What I'm aiming for is for each line in acc_list.txt to be searched for the matching accession in the FASTA and print out the gene name. The parsing out of gene names works fine, I'm just unsure of the logical argument that could take each line (including duplicates) one at a time in the acc_list.txt and search it against the FASTA file.

ADD REPLY • link 6.6 years ago by jvire1 ▴ 10

2

Entering edit mode

6.6 years ago

Alex Reynolds 35k

Via awk:

$ awk '($0~/^>/){ n=split($0,a,"|"); print a[2]; }' in.fa > out.txt

ADD COMMENT • link 6.6 years ago by Alex Reynolds 35k

1

Entering edit mode

6.6 years ago

Joe 21k

I realise you've already got an answer, but if you want a more 'fully fledged' alternative, this is my day-to-day general purpose fasta retrieval script:

You can look for a fasta by a keyword within it's header, and it'll pull out the sequences. You can pass in a list or a string as the search query. To get your sequences back with just the |CH3| bit, you can just tweak the lines at the bottom of the script to output them as you wish.

ADD COMMENT • link 6.6 years ago by Joe 21k

1

Entering edit mode

6.6 years ago

shenwei356 8.4k

I see. firstly, getting the mapping relationship between acc and gene name and saving them to acc2gene.tsv

$ grep '>' seqs.fa  | \
    perl -ne 'next unless />(\S+).+\|(.+?)\|/;   print "$1\t$2\n";'
XP_016469325.1  C3H

secondly, joining acc.txt and acc2gene.tsv using csvtk (http://bioinf.shenwei.me/csvtk/download/) or other tools：

$ csvtk join -H -t -k acc.txt acc2gene.tsv | csvtk cut -H -t -f 2

ADD COMMENT • link 6.6 years ago by shenwei356 8.4k

0

Entering edit mode

Thank you so much! This worked and I was impressed how fast it processed everything. I am still going to see if I can get the solution in python, because I will likely need to be able to do similar tasks and it would be nice to know how to do it myself.

Here is the output:

NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
CAMTA
CAMTA
CAMTA

ADD REPLY • link 6.6 years ago by jvire1 ▴ 10

score 2 · Accepted Answer · 2017-09-14

2

Entering edit mode

6.6 years ago

shenwei356 8.4k

File acc.txt contains the accession IDs:

$ cat acc.txt 
XP_016469325.1

Firstly, getting the mapping relationship between acc and gene name and saving them to acc2gene.tsv

$ grep '>' seqs.fa  | \
    perl -ne 'next unless />(\S+).+\|(.+?)\|/;   print "$1\t$2\n";'
XP_016469325.1  C3H

Secondly, joining acc.txt and acc2gene.tsv using csvtk (http://bioinf.shenwei.me/csvtk/download/) or other tools：

$ csvtk join -H -t -k acc.txt acc2gene.tsv | csvtk cut -H -t -f 2

Original answer

Searching header line using grep -f, and capturing accession ID and gene name using regular expression in Perl.

$ grep '>' seqs.fa  | \
    grep -f acc.txt | \
    perl -ne 'next unless />(\S+).+\|(.+?)\|/;   print "$1\t$2\n";'
XP_016469325.1  C3H

If you just want the gene names, that would be easier:

$ grep '>' seqs.fa | grep -f acc.txt | cut -d '|' -f 2
C3H

ADD COMMENT • link 6.6 years ago by shenwei356 8.4k

0

Entering edit mode

Thank you for this awk approach, unfortunately I am needing each line in the accession to be searched one at a time against the FASTA database, retaining duplicate instances. The idea here was to take the accessions and get back the gene name for each. For example an acc.txt file that looked like this:

Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
Zpz_sc04830.1.g00020.1.sm.mk
KZV22762.1
KZV22762.1
KZV22762.1

I would expect to see this:

NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
NF-YA
CAMTA
CAMTA
CAMTA

The output of this awk command however is this:

G2-like
GeBP
bZIP
ERF
B3
B3

which looks like it is just extracting the gene name for each FASTA entry regardless of the acc.txt:

loki@rnoyes-Precision-T3610:~/Desktop/PlantTFDB/apomict$ head -10 PlantTFDB_ALL_TF_pep.fas 
>KFK36254.1 Arabis alpina|G2-like|G2-like family protein
MYSAIRSLPVDGSMGEYSDGTNLPIEACLVLTTDPKPRLRWTTELHDRFVDAVTQLGGPD
KATPKTIMRTMGVKGLTLYHLKSHLQKFRLGRLPCKESTDNSKDVSSVAESQDTGSSTTS
SLRLAAQEQKESYQVTEALRAQMEVQRRLHEQLEVQQRLQLRIEAQGKYLQTILEKACKA
IEEQAVAFAGLEAAREELSELAIKVSNGCQSTFDTTKMTIPSLSELAVAIEHKNNCSAES
SLTSSTVGSPVSAAMMKKRHRGVFGNGDSVVVGHEAGWVMPSSSIG
>KFK25038.1 Arabis alpina|GeBP|GeBP family protein
MAPKHTDTIQNPPMPSSSSEEESASSGEDSDSSKKHNSASDSDPIKTKIVVTKPESSGSM
KTTTKSSVVVAEPESTTAKRPLKETEEDVEVKKKKMKTELVMKNQEPEEKVSGDETKKQL
MFQRLFSENDEIALLQGILDFTSTRGDPYEKMDVFCEYVKKLINFDANKSQLVTKIRRLK
loki@rnoyes-Precision-T3610:~/Desktop/PlantTFDB/apomict$

ADD REPLY • link 6.6 years ago by jvire1 ▴ 10