Pretty formatting of pairwise alignment: XML/JSON to a multiline string
1
1
Entering edit mode
4.1 years ago

Hey all, I'm getting blastp results in a JSON, and want to turn them into a readable format to display on a website

to formulate this as a general question about pairwise alignment, I'm looking for a function (don't matter which programming language) that takes in the following arguments: Positions of the alignment on sequence_a (QUERY_FROM, QUERY_TO) Positions of the alignment on sequence_b (HIT_FROM, HIT_TO) The alignment length (ALIGN_LEN), sequence_a, sequence_b (QSEQ,HSEQ) Plus the middle line (MIDLINE)

and returns a long string with line breaks that is a readable representation of the alignment with positions on the sequences.

Sample input:

{
"query_from": 4,
"query_to": 96,
"hit_from": 1,
"hit_to": 99,
"align_len": 100,
"qseq": "MSDYSTMSSGYCSLEVELEDCFFTAK----RNLQSKQPTKNLCKAVEETWHPPTIQEIKQKIDSY---EKFCLGMKLSEDGYYTGFIKVVGLKLRRPVTV",
"hseq": "MTVDSSMSSGYCSLDEELEDCFFTAKTTFFRNLQSKQPSKNVCKAVEETQHPPTIQEIKQKIDSYNSREKHCLGMKLSEDGTYTGFIK-VHLKLRRPVTV",
"midline": "M+  S+MSSGYCSL+ ELEDCFFTAK    RNLQSKQP+KN+CKAVEET HPPTIQEIKQKIDSY   EK CLGMKLSEDG YTGFIK V LKLRRPVTV"
}

Requested output:

Query  4   MSDYSTMSSGYCSLEVELEDCFFTAK----RNLQSKQPTKNLCKAVEETWHPPTIQEIKQ  59
           M+  S+MSSGYCSL+ ELEDCFFTAK    RNLQSKQP+KN+CKAVEET HPPTIQEIKQ
Sbjct  1   MTVDSSMSSGYCSLDEELEDCFFTAKTTFFRNLQSKQPSKNVCKAVEETQHPPTIQEIKQ  60

Query  60  KIDSY---EKFCLGMKLSEDGYYTGFIKVVGLKLRRPVTV  96
           KIDSY   EK CLGMKLSEDG YTGFIK V LKLRRPVTV
Sbjct  61  KIDSYNSREKHCLGMKLSEDGTYTGFIK-VHLKLRRPVTV  99

Thanks!

Formatting alignment Blast • 793 views
ADD COMMENT
0
Entering edit mode

You can use blast_formatter program included in blast+ package, if you save the output in -outfmt 11 first.

$ blast_formatter -h
USAGE
  blast_formatter [-h] [-help] [-rid BLAST_RID] [-archive ArchiveFile]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-line_length line_length] [-html]
    [-sorthits sort_hits] [-sorthsps sort_hsps]
    [-max_target_seqs num_sequences] [-out output_file] [-parse_deflines]
    [-version]

Take a look at biopython blast parser as well.

ADD REPLY
0
Entering edit mode

Thank you for the quick reply. I need a solution for the input I've written, as I can only use the JSON data I'm getting. I'm looking for some actual source code I can rewrite into my application

ADD REPLY
0
Entering edit mode
4.1 years ago

I ended up just writing it in PL/SQL, if you ever need to translate to another programming language, here it is:

function pretty_pairwise_alignment(
    ALIGN_LEN integer,
    query_from INTEGER,
    query_to INTEGER,
    hit_from INTEGER,
    hit_to INTEGER,
    qseq CLOB,
    hseq CLOB,
    midline  CLOB
) return clob is
    pretty_output clob;
    line_break varchar2(6) := chr(13)||chr(10);
    padding_constant integer := 4;
    alignment_consumed integer := 0;
    amount integer := 60;
    query_index integer := query_from;
    query_end integer;
    hit_index integer := hit_from;
    hit_end integer;
    sub_query varchar2(60);
    sub_hit varchar2(60);
    sub_midline varchar2(60);
begin
    dbms_lob.createtemporary(pretty_output,false);
    -- determin the amount of charachters in the stringified positions
    padding_constant := FLOOR(LOG(10,greatest(query_to,hit_to))) + 3;


    while (ALIGN_LEN > alignment_consumed) loop
        -- if this is the last line, fix the amount of charachter in this line
        if (ALIGN_LEN - alignment_consumed <= 60) then amount := ALIGN_LEN - alignment_consumed; end if;

        sub_query := DBMS_LOB.SUBSTR(qseq, amount, alignment_consumed+1);
        sub_midline := DBMS_LOB.SUBSTR(midline, amount, alignment_consumed+1);
        sub_hit := DBMS_LOB.SUBSTR(hseq, amount, alignment_consumed+1);
        query_end := query_index + amount - REGEXP_COUNT(sub_query,'-') - 1;
        hit_end := hit_index + amount - REGEXP_COUNT(sub_hit,'-') - 1;

        DBMS_LOB.APPEND(pretty_output,
            'Query  '||RPAD(TO_CHAR(query_index,'fm999999'),padding_constant)||
            sub_query||'  '||query_end||line_break||
            RPAD(' ',7+padding_constant)||sub_midline||line_break||
            'Sbjct  '||RPAD(TO_CHAR(hit_index,'fm999999'),padding_constant)||
            sub_hit||'  '||hit_end||line_break||line_break
        );

        query_index := query_end + 1;
        hit_index := hit_end + 1;
        alignment_consumed := alignment_consumed + amount;
    end loop;

    return pretty_output;
end pretty_pairwise_alignment;
ADD COMMENT

Login before adding your answer.

Traffic: 2544 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6