Process tabular BLAST output
9.8 years ago
mrjuggle • 0

Dear all,

I am struggling with processing tabular BLAST output. I would like to do the following:

Take a Blast output table (simplified):

Sequence ID | e-value | GI
Seq1    0.001   12345
Seq1    0.001   34567
Seq1    0.001   478910
Seq2    ...

and convert it into something like this:

Sequence ID | GIs
Seq1    12345   34567   478910
Seq2    ...

Any ideas are welcome. I am sure I could figure it out myself eventually, but right now I am having a hard time coming up with a solution.

BLAST Python • 2.8k views
9.8 years ago
5heikki 11k
cat testfile

Seq1    0.001    12345
Seq1    0.001    34567
Seq1    0.001    478910
Seq2    0.001    242424
Seq2    0.001    232322
Seq3    0.001    111111

for u in $(cut -f1 testfile | sort -u); do GIs=$(grep -w $u testfile | cut -f3 | tr "\n" "\t"); echo "$u     $GIs"; done

Seq1    12345    34567    478910    
Seq2    242424    232322    
Seq3    111111

Edit: I added -w to the grep so it doesn't confuse e.g. Seq1, Seq11 and Seq111. The tab in the echo is a literal one (Ctrl+V, Tab). If the file has hundreds of thousands of lines or more, it will take a long time to finish.


This is genius! It does run through the file multiple times, though, but the approach is cool!

9.8 years ago
Ram 43k

Well, here's a rough outline of the logic I'd use (a small sketch follows after the list):

  1. Read file, skip header
  2. For each non-header line of input_file
    • add col1 as key to a dict (if it doesn't exist already)
    • as value, add col3 (if key was just added) or "\tcol3" (if key already existed)
  3. Print custom header
  4. Print dict key \t dict value for all dict entries
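
A minimal Python sketch of that outline, assuming a tab-separated BLAST table with a header line; the file name blast_tab.txt and the column positions are placeholders, so adjust them to your table:

# Sketch of the dict-based outline above, assuming tab-separated input with a header.
from collections import OrderedDict

gis_by_seq = OrderedDict()  # key: sequence ID, value: list of GIs

with open("blast_tab.txt") as handle:
    next(handle)                          # 1. skip the header line
    for line in handle:                   # 2. walk the remaining lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        seq_id, gi = fields[0], fields[2]
        gis_by_seq.setdefault(seq_id, []).append(gi)

print("Sequence ID\tGIs")                 # 3. print the custom header
for seq_id, gis in gis_by_seq.items():    # 4. one line per sequence ID
    print(seq_id + "\t" + "\t".join(gis))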

I would go with a similar workflow, except that I would not store the results in a dictionary if the BLAST file is huge. Just output the key and its values whenever a new key is encountered.


I agree, but I was assuming that something like Seq1 ... Seq2 ... Seq1 might happen. File read/write cursors/streams cannot move backward, unfortunately. Maybe running it through a sort pipeline first and then applying your workflow would ensure both memory efficiency and accuracy.

9.8 years ago

Since the python tag is on the original question, here's my solution using (mostly) Python:

echo "Sequence ID|GI" > output.txt
tail -n+2 test.txt \
| sort -k1,1  \
| python -c "
import sys

GRP_BY= 0  # Group by this column
CAT_IDX= 2 # Concatenate this column
SEP= ','   # Concatenate using this string

current= None
group= []
for line in sys.stdin:
    line= line.strip().split('\t')
    if (line[GRP_BY] == current) or current is None:
        group.append(line[CAT_IDX])
    else:
        print(current + '\t' + SEP.join(group))
        group= [line[CAT_IDX]]
    current= line[GRP_BY]
print(current + '\t' + SEP.join(group))
" >> output.txt

It assumes that the first line is a header (it gets skipped) and that the columns are tab-separated.

No need to pass through the input file more than once. The sorting can be skipped if the file is already sorted by first column.


Thanks for all the suggestions!

@5heikki: that's super clever and comes in handy so often, I will definitely add that to my repertoire.

I am especially interested in Python-based solutions, as I am currently learning the language. I asked this question because I am working on a little pipeline for NGS data processing. However, with the code below I only end up with an empty output file.

current = None
finalin = open(processed, 'rb')
reader = csv.reader(finalin, delimiter=',')
finalout = open(modified, 'wb')
writer = csv.writer(finalout)

for row in reader:
    if len(row) < 3:
        continue
    if (row[0] == current) or current is None:
        concat.append(row[1])
    else:
        concat = [row[1]]
    writer.writerow(row[0] + (SEP.join(concat)))

You might want to initialize concat as an empty list first, so:

concat = []
concat.append(row[1])

Also, I don't see why the () are needed in the if statement.

And why are we using a CSV reader with a comma delimiter again? I don't see how your input file is comma-separated; tabular BLAST output is tab-delimited.
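
For reference, here is a corrected sketch of that snippet along the lines above. It assumes Python 3 and tab-separated input sorted by the first column (like the answer above); the file paths and the GI column index are placeholders for whatever your pipeline actually uses:

# Corrected sketch of the snippet above (assumptions: Python 3, tab-separated,
# sorted by column 1). concat is initialized, the reader uses tabs instead of
# commas, and a group is only written out when the sequence ID changes.
import csv

processed = "blast_processed.tsv"  # placeholder for your input path
modified = "blast_grouped.tsv"     # placeholder for your output path
GI_COL = 2                         # index of the GI column in your table

current = None
concat = []

with open(processed, newline="") as finalin, open(modified, "w", newline="") as finalout:
    reader = csv.reader(finalin, delimiter="\t")
    writer = csv.writer(finalout, delimiter="\t")
    for row in reader:
        if len(row) < 3:
            continue
        if row[0] == current or current is None:
            concat.append(row[GI_COL])
        else:
            writer.writerow([current] + concat)   # write out the finished group
            concat = [row[GI_COL]]
        current = row[0]
    if current is not None:
        writer.writerow([current] + concat)       # flush the last group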

9.8 years ago
edrezen ▴ 730

Hi,

You can use gawk (or awk):

cat blast.out | gawk 'BEGIN{s=0} { if (s!=$1) { s=$1; printf("\n%s ",s); } printf("%s ", $3); } END{printf("\n");}'

Nice! A bit convoluted, but nice.

Explanation (because I love explaining commands):

In the beginning, we set a variable s to 0. Then, for each line, we check whether the value of s differs from the value in column 1 ($1). If they differ, we assign the value of column 1 to s and print it at the start of a new line. In either case, we then print the value of the third column ($3) on the current output line. Once done with the entire file, we print a final newline to wrap things up.

I'd have used a default value of empty string for s, but 0 works too, I guess.
