How to get names and class information from a list of target IDs from the JASPAR database?
15 months ago
DNAngel ▴ 250

I have a list of target IDs (think KEGG IDs or taxids from BLAST) for TFs identified from our data using the JASPAR CORE databases (vertebrates and insects). I want to make some nice tables from this information, but I can find no easy way to extract the TF names and class information for these IDs from JASPAR.

For example I have a list of ids like:

MA0052.4
MA06602.1
MA0497.1

I want to get their associated TF name, class, and even family information which would be something like:

MA0052.4      MEF2A      MADS box factors     Regulators of differentiation
MA06602.1     Arid5a     ARID                 ARID-related
MA0497.1      MEF2C      MADS box factors     Regulators of differentiation

But there seems to be no way to extract this info from JASPAR itself by providing a list. I have to enter the IDs one by one, and I have hundreds of IDs to go through. Has anyone figured out a workaround for this?

15 months ago
ATpoint 81k

JASPAR provides a collection of all PFMs, one motif per file. This can be downloaded and then be queried with some simple bash:


# The targets
$ cat targets.txt 
MA0052.4
MA06602.1
MA0497.1

# Download from https://jaspar.genereg.net/downloads/ and unzip, that creates many *.jaspar files
wget https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_redundant_pfms_jaspar.zip
unzip JASPAR2022_CORE_redundant_pfms_jaspar.zip

# Query the files
# Each *.jaspar file starts with a header line: ">ID<TAB>name"
while read -r p
  do
    # Name is empty if no file exists for this ID
    name=$(head -n 1 "${p}.jaspar" 2> /dev/null | cut -f2)
    printf '%s\t%s\n' "$p" "$name"
  done < targets.txt

# That's the output
MA0052.4    MEF2A
MA06602.1   
MA0497.1    MEF2C

The 2nd one cannot be found.


The second one was a typo on my part, but this isn't giving the class or family information. The TRANSFAC files seem to have that info, but I worry they may be annotated differently somehow.


Then take this as a template and modify it to work with the transfac files rather than the jaspar files. It's a great bash-fu training exercise, an essential skill for any analyst.
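
A minimal sketch of that modification, under some assumptions that are not confirmed in this thread: that the per-motif TRANSFAC download unpacks to one `<ID>.transfac` file per motif, and that each file carries the name on an `ID` line and the annotation on `CC tf_class:` and `CC tf_family:` lines. The function name `jaspar_annot` is made up for illustration. Check one of the unpacked files before relying on it.

```shell
# Hypothetical helper: read motif IDs from stdin and print
# "ID <TAB> name <TAB> class <TAB> family" for each one.
# $1 is the directory holding the unpacked <ID>.transfac files.
# ASSUMED file layout (verify against a real file):
#   ID MEF2A
#   CC tf_family:Regulators of differentiation
#   CC tf_class:MADS box factors
jaspar_annot() {
  while read -r id; do
    # Skip blank lines in the ID list
    [ -n "$id" ] || continue
    if [ -f "$1/$id.transfac" ]; then
      awk -v id="$id" '
        $1 == "ID"       { name = $2 }
        /^CC tf_class:/  { sub(/^CC tf_class:/, "");  cls = $0 }
        /^CC tf_family:/ { sub(/^CC tf_family:/, ""); fam = $0 }
        END { printf "%s\t%s\t%s\t%s\n", id, name, cls, fam }
      ' "$1/$id.transfac"
    else
      # No file for this ID (e.g. a typo in the list): flag it with NA
      printf '%s\tNA\tNA\tNA\n' "$id"
    fi
  done
}
```

With the files unpacked into the current directory, it would be called as `jaspar_annot . < targets.txt`. Printing `NA` for missing IDs, rather than an empty field, makes typos such as the second ID easier to spot in the final table.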

15 months ago

Using an XSLT stylesheet:

for F in MA0052.4  MA0497.1  MA0662.1  ; do wget -O - -q "https://jaspar.genereg.net/matrix/${F}/" | xsltproc --html biostar9551867.xsl  - 2> /dev/null ; done 
MA0052.4    MEF2A   MADS box factors    Regulators of differentiation
MA0497.1    MEF2C   MADS box factors    Regulators of differentiation
MA0662.1    MIXL1   Homeo domain factors    Paired-related HD factors

with biostar9551867.xsl :


Oh this is helpful - but I am unsure how do you get this xsl output?? Also I tried this but it didn't work :/


"but I am unsure how do you get this xsl output??"

I don't understand.

"Also I tried this but it didn't work"

https://meta.stackexchange.com/questions/147616/


I don't understand where or what the xsl file is in the command "xsltproc --html biostar9551867.xsl". Or is this a form of output that is available somewhere? I'm just not following what this line of code is doing.


biostar9551867.xsl is an XSLT stylesheet. It's the file I provided in the link to gist.github.com. It takes as input the HTML page downloaded from JASPAR on standard input ('-'), and for each HTML input (line 4) it searches for the table containing the data you want (line 13). In that table it extracts the values for "matrix ID" (line 14), etc.
