Having trouble parsing blast tabular output with awk
0
0
Entering edit mode
3.6 years ago
pawhitesell ▴ 30

I am trying to parse a tabular text file generated by Blastp using awk. Previously I have used this somewhat ugly code, because it worked, to go through to the right columns and cull out values below what I wanted.

#!/bin/bash 
#$ -cwd
#$ -pe mpi 16

awk '$4 > 80.0' blastoutput.txt > StepOne.txt
awk '$5 > 70.0' StepOne.txt > Culled.txt

Using it on a new blast result however, the file sizes remain at 300k kb with only a slight decrease on step one, and none for two. My best guess is that it is only recognizing a single line from the whole blast output file, and therefore not removing more. I would think maybe it had something to do with Unix/Windows line ends not being recognized as I saw on other answers, but the thing is I haven't changed the way I've generated the blast results and it was working before, so I don't know why it would all of a sudden change the way tabular results are created.

I've also tried using some parsing options I saw in other answers like the following:

perl -lane 'print $_ if ($F[4] >80.0)' blastp_output_8_26.txt > StepOne.txt

but the results seem to be the same.

Does anyone know what I could do to the blastp output file to make it work with my code? I am convinced something is amiss there, but all my attempts to fix it so far have been for naught.

Thanks.

awk blast tabular • 1.6k views
ADD COMMENT
1
Entering edit mode

Does it work if you use $4 > 80 instead of $4 > 80.0? I know it makes no sense but just try it out once. Also, can you compare an old blast result and a new one side by side and make sure the columns line up?

ADD REPLY
0
Entering edit mode

I tried that ya (changing to just 80), and it didn't seem to make any difference.

Well I have only two so far to compare, but they were done with different output options for each column, but the same output format 6. To view them I have opened each text file in chrome, and besides having different columns they don't appear to be any different to me. They each have tabs that separate out the columns etc. Perhaps you are right though, and somehow by changing the output column contents it introduced some sort of issue with the line end?

ADD REPLY
1
Entering edit mode

Text files should be looked at in a text editor, not Chrome. What is the output to the following command for each of the two files:

head <file> | cut -f 1-10 | cat -te
ADD REPLY
0
Entering edit mode

The first file I don't have access to at the moment, but I will try it as soon as I can. For the file that doesn't work, here is the result:

VQLAPNITLNIPIISASMDTVTDSKMAISMARQGGLGVIHKNMSIAAQADEVRKVKRSESGVIIDPFFLTPTHLVADAEHLMSK^Igb|CAA27276.1|ARO:3002661|APH(7'')-Ia^Igb|CAA27276.1|ARO:3002661|APH(7'')-Ia [Streptomyces hygroscopicus] ^I35.714^I31$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|AKR18021.1|ARO:3004344|PDC-81^Igb|AKR18021.1|ARO:3004344|PDC-81 [Pseudomonas aeruginosa] ^I48.000^I93$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|AKR18013.1|ARO:3004336|PDC-73^Igb|AKR18013.1|ARO:3004336|PDC-73 [Pseudomonas aeruginosa] ^I48.000^I93$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|BAA15934.1|ARO:3000829|baeS^Igb|BAA15934.1|ARO:3000829|baeS [Escherichia coli str. K-12 substr. W3110] ^I37.037^I100$
MIETFVFSSESIFLREEDQMKVHQLLDYLKSRK^Igb|AHV80711.1|ARO:3002863|dfrA8^Igb|AHV80711.1|ARO:3002863|dfrA8 [Salmonella enterica subsp. enterica serovar Typhimurium] ^I32.258^I94$
MIETFVFSSESIFLREEDQMKVHQLLDYLKSRK^Igb|BAA15221.2|ARO:3000263|marA^Igb|BAA15221.2|ARO:3000263|marA [Escherichia coli str. K-12 substr. W3110] ^I33.333^I45$
NIKEKDERLDKIIKQILNIKDDNISFSKDAVSKKKIHGRMANVFTSVLTYKIDCCPLCGFKNIIR^Igb|ADP36409.1|ARO:3003730|Bifidobacteria^Igb|ADP36409.1|ARO:3003730|Bifidobacteria intrinsic ileS conferring resistance to mupirocin [Bifidobacterium bifidum PRL2010] ^I45.833^I37$
NIKEKDERLDKIIKQILNIKDDNISFSKDAVSKKKIHGRMANVFTSVLTYKIDCCPLCGFKNIIR^Igb|AAD08227.1|ARO:3003964|hp1181^Igb|AAD08227.1|ARO:3003964|hp1181 [Helicobacter pylori 26695] ^I35.135^I52$
IGPQKLVLMGLLLLSITQLMYVFMNELSPIWYYVLATAIMGMGNALFQSPNNTIVMSSVDK^Igb|ACF70723.1|ARO:3004103|QepA2^Igb|ACF70723.1|ARO:3004103|QepA2 [Escherichia coli] Partial^I37.931^I48$
IGPQKLVLMGLLLLSITQLMYVFMNELSPIWYYVLATAIMGMGNALFQSPNNTIVMSSVDK^Igb|AQX36338.1|ARO:3004379|QepA4^Igb|AQX36338.1|ARO:3004379|QepA4 [Escherichia coli] ^I37.931^I48$
ADD REPLY
1
Entering edit mode

This is properly tab separated, all right. It should work.

EDIT: Works for me:

echo "VQLAPNITLNIPIISASMDTVTDSKMAISMARQGGLGVIHKNMSIAAQADEVRKVKRSESGVIIDPFFLTPTHLVADAEHLMSK^Igb|CAA27276.1|ARO:3002661|APH(7'')-Ia^Igb|CAA27276.1|ARO:3002661|APH(7'')-Ia [Streptomyces hygroscopicus] ^I35.714^I31$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|AKR18021.1|ARO:3004344|PDC-81^Igb|AKR18021.1|ARO:3004344|PDC-81 [Pseudomonas aeruginosa] ^I48.000^I93$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|AKR18013.1|ARO:3004336|PDC-73^Igb|AKR18013.1|ARO:3004336|PDC-73 [Pseudomonas aeruginosa] ^I48.000^I93$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|BAA15934.1|ARO:3000829|baeS^Igb|BAA15934.1|ARO:3000829|baeS [Escherichia coli str. K-12 substr. W3110] ^I37.037^I100$
MIETFVFSSESIFLREEDQMKVHQLLDYLKSRK^Igb|AHV80711.1|ARO:3002863|dfrA8^Igb|AHV80711.1|ARO:3002863|dfrA8 [Salmonella enterica subsp. enterica serovar Typhimurium] ^I32.258^I94$
MIETFVFSSESIFLREEDQMKVHQLLDYLKSRK^Igb|BAA15221.2|ARO:3000263|marA^Igb|BAA15221.2|ARO:3000263|marA [Escherichia coli str. K-12 substr. W3110] ^I33.333^I45$
NIKEKDERLDKIIKQILNIKDDNISFSKDAVSKKKIHGRMANVFTSVLTYKIDCCPLCGFKNIIR^Igb|ADP36409.1|ARO:3003730|Bifidobacteria^Igb|ADP36409.1|ARO:3003730|Bifidobacteria intrinsic ileS conferring resistance to mupirocin [Bifidobacterium bifidum PRL2010] ^I45.833^I37$
NIKEKDERLDKIIKQILNIKDDNISFSKDAVSKKKIHGRMANVFTSVLTYKIDCCPLCGFKNIIR^Igb|AAD08227.1|ARO:3003964|hp1181^Igb|AAD08227.1|ARO:3003964|hp1181 [Helicobacter pylori 26695] ^I35.135^I52$
IGPQKLVLMGLLLLSITQLMYVFMNELSPIWYYVLATAIMGMGNALFQSPNNTIVMSSVDK^Igb|ACF70723.1|ARO:3004103|QepA2^Igb|ACF70723.1|ARO:3004103|QepA2 [Escherichia coli] Partial^I37.931^I48$
IGPQKLVLMGLLLLSITQLMYVFMNELSPIWYYVLATAIMGMGNALFQSPNNTIVMSSVDK^Igb|AQX36338.1|ARO:3004379|QepA4^Igb|AQX36338.1|ARO:3004379|QepA4 [Escherichia coli] ^I37.931^I48$" | \
sed 's/\^I/\t/g;s/\$//' | \
awk -F $'\t' '$4>80.0'

(Nothing over 80)

echo "VQLAPNITLNIPIISASMDTVTDSKMAISMARQGGLGVIHKNMSIAAQADEVRKVKRSESGVIIDPFFLTPTHLVADAEHLMSK^Igb|CAA27276.1|ARO:3002661|APH(7'')-Ia^Igb|CAA27276.1|ARO:3002661|APH(7'')-Ia [Streptomyces hygroscopicus] ^I35.714^I31$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|AKR18021.1|ARO:3004344|PDC-81^Igb|AKR18021.1|ARO:3004344|PDC-81 [Pseudomonas aeruginosa] ^I48.000^I93$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|AKR18013.1|ARO:3004336|PDC-73^Igb|AKR18013.1|ARO:3004336|PDC-73 [Pseudomonas aeruginosa] ^I48.000^I93$
RLLNNLIIELFKYTAENERKQIRERQR^Igb|BAA15934.1|ARO:3000829|baeS^Igb|BAA15934.1|ARO:3000829|baeS [Escherichia coli str. K-12 substr. W3110] ^I37.037^I100$
MIETFVFSSESIFLREEDQMKVHQLLDYLKSRK^Igb|AHV80711.1|ARO:3002863|dfrA8^Igb|AHV80711.1|ARO:3002863|dfrA8 [Salmonella enterica subsp. enterica serovar Typhimurium] ^I32.258^I94$
MIETFVFSSESIFLREEDQMKVHQLLDYLKSRK^Igb|BAA15221.2|ARO:3000263|marA^Igb|BAA15221.2|ARO:3000263|marA [Escherichia coli str. K-12 substr. W3110] ^I33.333^I45$
NIKEKDERLDKIIKQILNIKDDNISFSKDAVSKKKIHGRMANVFTSVLTYKIDCCPLCGFKNIIR^Igb|ADP36409.1|ARO:3003730|Bifidobacteria^Igb|ADP36409.1|ARO:3003730|Bifidobacteria intrinsic ileS conferring resistance to mupirocin [Bifidobacterium bifidum PRL2010] ^I45.833^I37$
NIKEKDERLDKIIKQILNIKDDNISFSKDAVSKKKIHGRMANVFTSVLTYKIDCCPLCGFKNIIR^Igb|AAD08227.1|ARO:3003964|hp1181^Igb|AAD08227.1|ARO:3003964|hp1181 [Helicobacter pylori 26695] ^I35.135^I52$
IGPQKLVLMGLLLLSITQLMYVFMNELSPIWYYVLATAIMGMGNALFQSPNNTIVMSSVDK^Igb|ACF70723.1|ARO:3004103|QepA2^Igb|ACF70723.1|ARO:3004103|QepA2 [Escherichia coli] Partial^I37.931^I48$
IGPQKLVLMGLLLLSITQLMYVFMNELSPIWYYVLATAIMGMGNALFQSPNNTIVMSSVDK^Igb|AQX36338.1|ARO:3004379|QepA4^Igb|AQX36338.1|ARO:3004379|QepA4 [Escherichia coli] ^I37.931^I48$" | \
sed 's/\^I/\t/g;s/\$//' | \
awk -F $'\t' '$4>40.0'

RLLNNLIIELFKYTAENERKQIRERQR     gb|AKR18021.1|ARO:3004344|PDC-81        gb|AKR18021.1|ARO:3004344|PDC-81 [Pseudomonas aeruginosa]       48.000  93
RLLNNLIIELFKYTAENERKQIRERQR     gb|AKR18013.1|ARO:3004336|PDC-73        gb|AKR18013.1|ARO:3004336|PDC-73 [Pseudomonas aeruginosa]       48.000  93
NIKEKDERLDKIIKQILNIKDDNISFSKDAVSKKKIHGRMANVFTSVLTYKIDCCPLCGFKNIIR       gb|ADP36409.1|ARO:3003730|Bifidobacteria        gb|ADP36409.1|ARO:3003730|Bifidobacteria intrinsic ileS conferring resistance to mupirocin [Bifidobacterium bifidum PRL2010]    45.833  37

(3 rows over 40)

ADD REPLY
0
Entering edit mode

Your script indeed seems to work, (I still need to test on the full file), but I'm not 100% sure what your code does. It looks to me like you are replacing ASCII tabs with linux tabs? So I suppose if I were to run this together, I only need to run the sed command once, since the tabs will be proper now.

ADD REPLY
1
Entering edit mode

Don't worry about the sed - it exists to remove the symbols introduced by cat -te. It's the awk that's relevant.

ADD REPLY
1
Entering edit mode

First thing is the format in default blast outfmt is [qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore], I am not sure what are you filtering but awk counts starting index in 1, if your file is a starndard output, you are filtering by length then by mismatch.

$cat  file.tsv
q1      h1      90.0    1002    10      4       10      1012    1       1100    0.0001  987
q2      h2      50.2    330     2       1       200     534     1       335     0.01    97
q3      h3      99.0    100     0       0       1       100     1       100     0.00001 100
$ awk '$3 > 60' file.tsv
q1      h1      90.0    1002    10      4       10      1012    1       1100    0.0001  987
q3      h3      99.0    100     0       0       1       100     1       100     0.00001 100

you can combine filtering too:

$ awk '$3 > 60 && $4 > 200' file.tsv
q1      h1      90.0    1002    10      4       10      1012    1       1100    0.0001  987
ADD REPLY
0
Entering edit mode

My output format is this:

-outfmt "6 qseqid sseqid stitle pident qcovs"
ADD REPLY
1
Entering edit mode

then you need to check if your output file is corrupted or have strange char (\r). Your filtering has nothing wrong.

ADD REPLY

Login before adding your answer.

Traffic: 1902 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6