Question: Redirecting pHMMER output to /dev/null
0
gravatar for mss
19 months ago by
mss30
mss30 wrote:

Hey guys,

Something that should be so simple has been so difficult for me to resolve the past few weeks. I am using pHMMER to search my ~12,000 fungal gene predictions against the MEROPS database. The issue is that for some of my gene queries, over a 1,000 hits will be returned, making it impossible to sort through all queries for their top hit. There are so many hits that not all of them can fit in one excel spreadsheet.

**Edit Here is a subset of what my table looks like. As you scroll down you can see that for just the first gene there are almost 1,000 hits.

Well after contacting the developer, he suggested redirecting the main output to /dev/null so that only the top hit of each query remains. He said the script should look like this

phmmer --tblout 1371E_merops5.tbl /work/Geomicrobiology/msobol/IODP_329_SPG/1371E14H2/maker/1371E_uni_snap.maker.output/1371E_uni_snap.renamed.maker.proteins.fasta /work/Geomicrobiology/msobol/databases/pepunit.lib > /dev/null; head -4 1371E_merops5.tbl

However, this still does not work, and the developer says it has to do with the command line, not the program. Does anyone here have experience with this???

Thanks in advance! Morgan

ADD COMMENTlink modified 18 months ago • written 19 months ago by mss30

What exactly are you left with? What does your *.tbl file look like before and after head -4?

Its not clear if you're outputting the result to the .tbl or to the STDOUT (or both)?

ADD REPLYlink written 19 months ago by Joe18k

I am just directing the output to .tbl only using --tblout

Before

After

ADD REPLYlink modified 19 months ago • written 19 months ago by mss30
2
gravatar for Jean-Karim Heriche
19 months ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche23k wrote:

You're only redirecting STDOUT, you also need to redirect STDERR. For POSIX-compatible shells, use:

phmmer ... >/dev/null 2>&1
ADD COMMENTlink modified 19 months ago • written 19 months ago by Jean-Karim Heriche23k

So would it look this? i.e do I have the 2>&1 in the correct position? Do I keep or remove the ; after null? (/dev/null; ....)

phmmer --tblout 1371E_merops5.tbl /work/Geomicrobiology/msobol/IODP_329_SPG/1371E14H2/maker/1371E_uni_snap.maker.output/1371E_uni_snap.renamed.maker.proteins.fasta /work/Geomicrobiology/msobol/databases/pepunit.lib > /dev/null 2>&1 head -4 1371E_merops5.tbl
ADD REPLYlink modified 19 months ago • written 19 months ago by mss30
2

Remove your head operation completely. That is independent of the output of phmmer in this case. It creates the file you need, so we're only interested in discarding the STDOUT.

It is typically placed at the end of the command, you can satisfy yourself about how STDERR and STDOUT etc work like so:

$ ( echo "I'm stderr" > /dev/stderr ) > /dev/null
I'm stderr
$ ( echo "I'm stderr" > /dev/stderr ) > /dev/null 2>&1

#  no output 
$ ( echo "I'm stdout" > /dev/stdout ) > /dev/null

# also no output

&1, and &2 are what is known as file descriptors, and are essentially shorthand for /dev/stdout and /dev/stderr. Since the redirection operator > by default only redirects the STDOUT, you need to tell it to also discard STDERR. The syntax 2>&1 basically says, "take the output thats coming from 2 and send it in to 1. Then when you redirect with >, the STDERR is there too.

Another way of illustrating this is to leave the STDOUT alone, and just redirect the STDERR, which you could do like so:

$ ( echo "I'm stderr" > /dev/stderr ) 2> /dev/null

# empty output

Notice the 2 before the > this time.

ADD REPLYlink modified 19 months ago • written 19 months ago by Joe18k

Thanks for your help, I understand now what you are saying about the differences in stdout and stderr but how does this script only give me the top hit for each of my queries? Wouldn't I need to add something like head -4, to tell it to throw everything out of the stderr that isn't the top hit?

When I used

/dev/null 2>&1

I still have over a 1,000 hits for many of my genes, picture of partial output Here is the exact script I used as well.

#!/bin/bash
#SBATCH -t 96:00:00 -o 1371E_phmmer6.out -J F1hmm
module load hmmer/3.1b2
phmmer -E 0.001 --tblout 1371E_merops6.tbl --cpu 20 /work/Geomicrobiology/msobol/IODP_329_SPG/1371E14H2/maker/1371E_uni_snap.maker.output/1371E_uni_snap.renamed.maker.proteins.fasta /work/Geomicrobiology/msobol/databases/pepunit.lib > /dev/null 2>&1
ADD REPLYlink modified 19 months ago • written 19 months ago by mss30
1

You're confusing the output of phmmer with your results file.

If I'm understanding correctly, you just want to keep the top hit from the .tbl file? In which case, you can choose to do whatever you like with the STDOUT from phmmer because it has no bearing on that file.

You just need to use head -n 1371E_merops6.tbl where n is how many lines/hits you want. You can do this as a separate command, or follow the phmmer call with the semicolon as you were doing before. The 2 commands are entirely independent of one another, you're just using the output file of one, as the input file to the other.

ADD REPLYlink written 19 months ago by Joe18k

Yes, I thought that the .tbl and the output were the same file since I was using --tblout .

That is correct, I wanted the results file to only include the top hit from each gene. When I was contacting the developer, what he said/what I understood was to discard the main output to /dev/null, save a tabular output file instead, and take the first hit from the tabular output (first 4 lines, including 3 header lines) by doing

% `(phmmer --tblout out.tbl my.query my.db > /dev/null; head -4 out.tbl)`

so then the results .tbl file would only include the top hit for each gene.

So what you are saying I need to do is

phmmer --tblout out.tbl my.query my.db > /dev/null 2&>1; head -4 out.tbl

?

ADD REPLYlink modified 19 months ago • written 19 months ago by mss30
1

Yes that should work (guessing entirely off the way UNIX tools typically work though, not recent experience with HMMer).

When you use the ; syntax, those 2 commands are executed one after another, as if you typed them seperately in to the commandlind: command1 -some arg ; command2 -some -other args. In this case command1 and command2 are entirely independent, they're just called on the same line.

Your tbl file should remain the same whether you discard the STDOUT to /dev/null or not (I would expect), so you only need to redirect it if you don't want to see screen output. The file will be created all the same either way.

ADD REPLYlink modified 19 months ago • written 19 months ago by Joe18k

Unfortunately, it did not work. My tbl output still looks the same. :/ I am kind of at a lost...

ADD REPLYlink written 18 months ago by mss30
1

Can you edit your question with what your output data looks like and what you want it to look like? I still don't really understand in what way the tbl file is incorrect.

ADD REPLYlink written 18 months ago by Joe18k

I edited my original question with what the tbl file looks like in Excel. I want one hit for each gene, but instead I am getting several 100 to 1000 hits for each query.

ADD REPLYlink written 18 months ago by mss30
1

Where did you get head -4 from? Your problem isn't with phmmer's output, you just need to filter the tbl file.

The /dev/null business has nothing to do with the .tbl file as I mentioned before, that purely affects what's printed to the screen.

If you want just the top hit from that file for instance, then you need only use head -n 2 | tail -n 1. If you want the top hit on a per-query basis, that requires some more heavy duty scripting.

ADD REPLYlink written 18 months ago by Joe18k

The developer of hmmer recommended that I use head -4 in conjunction with /dev/null. Unfortunately, what I need is just the top hit per-query basis and I don't have the skills to create a script to do this. I wonder why hmmer does not have the option to output just one hit per a query like blast allows??

Well I thank you for your help, and I think I will just switch to blast to accomplish this task. Thanks!

ADD REPLYlink modified 18 months ago • written 18 months ago by mss30
2

A script to pull out the data you need wouldn't be very difficult to write (and solutions you could use for this definitely exist on this and other foums) and would be a good learning exercise, I would encourage you to try it...

ADD REPLYlink written 18 months ago by Joe18k

I don't have much experience in using programming languages to create scripts, but I will see what I can find. Thanks for the advice!

ADD REPLYlink written 18 months ago by mss30

You just need to break it down in to its 'component' conceptual tasks.

The task boils down to:

  • Loop over each row of the file
  • Check if that hit/gene ID has been seen before
    • If yes: output that line
    • If no: keep going

It's pretty much that simple.

(Assuming its already sorted by ID, and then by hit score).

ADD REPLYlink modified 18 months ago • written 18 months ago by Joe18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1021 users visited in the last hour