Process tabular BLAST output
4
0
Entering edit mode
8.7 years ago
mrjuggle • 0

Dear all,

I am struggling with processing tabular BLAST output. I would like to do the following:

Take a Blast output table (simplified):

Sequence ID | e-value | GI
Seq1    0.001   12345
Seq1    0.001   34567
Seq1    0.001   478910
Seq2    ...


and convert it into something like that:

Sequence ID | GIs
Seq1    12345   34567   478910
Seq2    ...


Any ideas are welcome, I am sure I can figure it out myself but right now I have a hard time to figure out a solution.

BLAST Python • 2.4k views
2
Entering edit mode
8.7 years ago
5heikki 11k
cat testfile

Seq1    0.001    12345
Seq1    0.001    34567
Seq1    0.001    478910
Seq2    0.001    242424
Seq2    0.001    232322
Seq3    0.001    111111

for u in $(cut -f1 testfile | sort -u); do GIs=$(grep -w $u testfile | cut -f3 | tr "\n" "\t"); echo "$u     $GIs"; done Seq1 12345 34567 478910 Seq2 242424 232322 Seq3 111111  Edit. I added -w to the grep so it doesn't confuse e.q. Seq1, Seq11 and Seq111. The tab in echo is a literal one (ctrl+v+tab). If the file has 100s of thousands of lines or more, it will take a long time to finish. ADD COMMENT 0 Entering edit mode This is genius! Runs thru the file multiple times though, but the approach is cool! ADD REPLY 1 Entering edit mode 8.7 years ago Ram 38k Well, here's a rough outline of the logic I'd use: 1. Read file, skip header 2. For each non-header line of input_file • add col1 as key to a dict (if it doesn't exist already) • as value, add col3 (if key was just added) or "\tcol3" (if key already existed) 3. Print custom header 4. Print dict key \t dict value for all dict entries ADD COMMENT 0 Entering edit mode I would go with a similar workflow, except that I would not store the result in a dictionary if the blast file is huge. Just output the key and value when a new key is encountered. ADD REPLY 0 Entering edit mode I agree, but I was just assuming that something like Seq 1 ... Seq 2... Seq 1 might happen. File rw cursors/streams cannot move backward, unfortunately. Maybe running it thru a sort pipeline and then your workflow would ensure both memory efficiency and accuracy. ADD REPLY 1 Entering edit mode 8.7 years ago Since there is the python keyword in the original question, here's my solution using (mostly) python: echo "Sequence ID|GI" > output.txt tail -n+2 test.txt \ | sort -k1,1 \ | python -c " import sys GRP_BY= 0 # Group by this column CAT_IDX= 2 # Concatenate this column SEP= ',' # Concatenate using this string current= None group= [] for line in sys.stdin: line= line.strip().split('\t') if (line[GRP_BY] == current) or current is None: group.append(line[CAT_IDX]) else: print(current + '\t' + SEP.join(group)) group= [line[CAT_IDX]] current= line[GRP_BY] print(current + '\t' + SEP.join(group)) " >> output.txt  It assumes: First line is the header and it's skipped; columns are tab separated. No need to pass through the input file more than once. The sorting can be skipped if the file is already sorted by first column. ADD COMMENT 0 Entering edit mode Thanks for all the suggestions! @heikki5: that's super clever and comes in handy so often, I will definitely add that to my repertoire I am especially interested in Python based solutions as I am currently progressing in learning it. I have asked this question since iam currently working on a little pipeline for NGS data processing. However, I am only ending up with an empty output file. current = None finalin = open(processed, 'rb')</code> reader = csv.reader(finalin, delimiter=',')</code> finalout = open(modified, 'wb')</code> writer = csv.writer(finalout)</code> for row in reader: if len(row) < 3: continue if (row[0] == current) or current is None: concat.append(row[1]) else: concat = [row[1]] writer.writerow(row[0] + (SEP.join(concat)))  ADD REPLY 0 Entering edit mode Might wanna initialize concat to an empty array so: concat = [] concat.append(row[1])  Also, I don't see why the () are needed in the if statement. And, why are we using a CSV reader again? I don't see how your input file is CSV. ADD REPLY 0 Entering edit mode 8.7 years ago edrezen ▴ 730 Hi, You can use gawk (or awk) cat blast.out | gawk 'BEGIN{s=0} { if (s!=$1) { s=$1; printf("\n%s ",s); } printf("%s ",$3); } END{printf("\n");}'

0
Entering edit mode

Nice! A bit convoluted, but nice.

Explanation (because I love explaining commands):

In the beginning, we set a variable s to 0. Then, for each line, we check if the value of s differs from the column1 value ($1). If they are different, we assign the value of column1 to s and print it in a new line. If column1 and s have the same value, we print the value of the third column ($3) on the current output line. Once done with the entire file, we print a new line to wrap stuff up.

I'd have used a default value of empty string for s, but 0 works too, I guess.