Searching and filtering uniclust databases
3.7 years ago
max_19 ▴ 170

Hi there,

Does anyone have experience with searching or filtering uniclust databases?

For example, how would I search for a particular organism, or filter for eukaryotes only (in the uniclust30 database)?

I tried doing this with the supplied mapping file (uniclust_uniprot_mapping.tsv.gz), which lists a UniProt accession and a uniclust ID for each protein. However, I'm not sure how to use that ID to search the actual uniclust database or to filter for particular organisms.

Thanks for your help and ideas!

uniclust protein databases • 1.3k views
3.7 years ago
AK ★ 2.1k

Hi max_19,

I think you can first get the list of UniProt accessions for the organisms you are interested in, and then search for those accessions in the headers of uniclust30_2018_08/uniclust30_2018_08_consensus.fasta. A header looks like this (the Members field contains the information you need):

uc30-1808-83688326|Representative=A0A0D6LSX8 n=28 Descriptions=[Uncharacterized protein|Twk-43 (Fragment)|TWiK family of potassium channels protein 9|Twk-9|Protein CBR-TWK-9|Ion channel] Members=A0A2G5TZA1,A0A2A6BWN3,E3N9Z5,H3EBY7,A0A061AD18,A0A182E8X8,A0A2A2JAF3,A0A0B2UVL7,A0A2A6CBY6,A0A016U7K0,A8Y2T1,A0A1I8AN73,A0A2A2LWD0,A0A0D6LSX8,A0A0C2GMZ3,A0A1I8AAQ9,E3N9Z7,A0A0C2CU15,A0A2P4W1B0,A0A016U896,A0A2P4W1B3,A0A0B1TTC8,A0A016U8H3,H3F3P7,Q23435,A0A2K6W7A5,A0A2H2IN74,A0A0R3S4C4
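
To make the header layout concrete, here is a minimal sketch that pulls the member accessions out of one header using only sed and tr. The sample header is a shortened copy of the one above (the member list and n are truncated for illustration), and the sketch assumes that Members= is always the last field on the line, which I have not checked for every entry.

```shell
# Shortened copy of the header above, used only as sample input.
header='uc30-1808-83688326|Representative=A0A0D6LSX8 n=3 Descriptions=[Uncharacterized protein|Ion channel] Members=A0A2G5TZA1,E3N9Z5,A0A0D6LSX8'

# Drop everything up to and including "Members=", then split the
# comma-separated accessions so that there is one per line.
members=$(printf '%s\n' "$header" | sed 's/.*Members=//' | tr ',' '\n')

printf '%s\n' "$members"
```

The part before the first space (here uc30-1808-83688326|Representative=A0A0D6LSX8) is what seqkit treats as the sequence ID by default.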

For instance:

# The "Taxon identifier" for Eukaryota is 2759
# Here we take the first 10 accessions as an example
curl -s "" \
  | grep -v '^Entry' \
  | head \
  > eukaryota_head10.txt

# Get the whole list of IDs from uniclust30
seqkit fx2tab --name uniclust30_2018_08/uniclust30_2018_08_consensus.fasta \
  > uniclust30_2018_08_consensus_IDs.txt

# Find headers that contain any of the desired accessions as a whole word (here the Eukaryota accessions) and keep only the sequence ID (the first space-separated field)
grep -w -f eukaryota_head10.txt uniclust30_2018_08_consensus_IDs.txt \
  | cut -d" " -f1 \
  | sort -u \
  > uniclust30_2018_08_consensus_IDs_eukaryota_head10.txt

# Subset uniclust30 using the list
seqkit grep --delete-matched -f uniclust30_2018_08_consensus_IDs_eukaryota_head10.txt uniclust30_2018_08/uniclust30_2018_08_consensus.fasta \
  > uniclust30_2018_08_consensus_eukaryota_head10.fasta
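
If you would rather match accessions only against the Members field instead of anywhere in the header, something along these lines might work. It is only a sketch on toy data: the file names (toy_headers.txt, toy_wanted.txt, toy_cluster_ids.txt), cluster IDs and accessions are all made up, and it assumes one header per line with Members= as the last field.

```shell
# Toy stand-ins for the real header list and accession list (hypothetical content).
printf '%s\n' \
  'uc30-0001|Representative=P11111 n=2 Descriptions=[a] Members=P11111,Q22222' \
  'uc30-0002|Representative=P33333 n=1 Descriptions=[b] Members=P33333' \
  > toy_headers.txt
printf '%s\n' 'Q22222' > toy_wanted.txt

# First file: load the wanted accessions into a lookup table.
# Second file: print the sequence ID (first whitespace-separated field)
# of every header whose Members list contains a wanted accession.
awk '
  NR == FNR { wanted[$1] = 1; next }
  {
    line = $0
    sub(/[ \t]+$/, "", line)      # strip trailing blanks or tabs
    sub(/.*Members=/, "", line)   # keep only the member list
    n = split(line, acc, ",")
    for (i = 1; i <= n; i++) {
      if (acc[i] in wanted) { print $1; break }
    }
  }
' toy_wanted.txt toy_headers.txt > toy_cluster_ids.txt

cat toy_cluster_ids.txt
```

The resulting ID list has the same shape as the one produced by the grep/cut step above, so it should be usable with seqkit grep -f in the same way. Note that the NR == FNR idiom misbehaves if the accession list is empty.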

Hope it helps.

Thanks for the helpful information! The uniclust download that I am using does not contain the uniclust30_2018_08_consensus.fasta file. I downloaded uniclust30_2018_08_hhsuite.tar.gz instead, because I eventually want to use the database with HH-suite.

Here are the files that I get when I extract the database:

uniclust30_2018_08_a3m_db         uniclust30_2018_08_cs219.ffdata   uniclust30_2018_08_hhm.ffdata
uniclust30_2018_08_a3m_db.index   uniclust30_2018_08_cs219.ffindex  uniclust30_2018_08_hhm.ffindex
uniclust30_2018_08_a3m.ffdata     uniclust30_2018_08.cs219.sizes    uniclust30_2018_08_md5sum
uniclust30_2018_08_a3m.ffindex    uniclust30_2018_08_hhm_db         
uniclust30_2018_08.cs219          uniclust30_2018_08_hhm_db.index

Do you know which of these files is the equivalent, or which file I can use for subsetting?



