Question

Blast Multi queries output

0

Entering edit mode

6.7 years ago

malj ▴ 20

Hello,

Following this post: A: how to parse blast output using biopython

I'm trying to get also the features of the alignment (the one is written in the picture) but I can't figure out how to do it.

My file is a xml that corresponds to several blast sequences, so I also have the question how to distinguish between queries. For each query I would have to get the one with highest score from genomics sequences (no transcript) and the feature corresponding to the name of the transcript (name of the gene that codify for protein, in this example Apoptosis regulator BCL2).

Thanks!,

Example BCL2

biopython blast • 2.7k views

ADD COMMENT • link 6.7 years ago by malj ▴ 20

0

Entering edit mode

see Aligning Two Proteins With Their Domains/Annotations

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I forgot to mention that the region I want to blast is intronic and quite short (20 nt), so when I follow your advice it doesn't detect anything. The only way I think it could be possible is taking the position and search for it like here is described with biopython here with exons ( I have to check if it is possible with introns too).

https://biopython.org/wiki/Coordinate_mapping

ADD REPLY • link 6.7 years ago by malj ▴ 20

0

Entering edit mode

well, in that case, you can do a blast from the command line using blast-short task, and format the output like a BED file using the text column based output.

After that you can intersect your alignments with an annotation file using bedtools. That way each alignment will get an overlapping feature.

ADD REPLY • link 6.7 years ago by Istvan Albert 102k

1

Entering edit mode

6.7 years ago

Istvan Albert 102k

I think the OPs question is how to produce the content called "Features" of a blast search when doing it at the command line. It is not quite explained in the BLAST web manual of how that field is filled out, moreover that field does not appear when doing a command line blast.

Note how the search is against chromosome 18 but the hit is BLC2 transcript. I am guessing that the feature overlaps with the alignment.

The strategy to do the same at the command line is not to parse the BLAST outputs, it is too messy and error-prone.

Instead, I would blast directly against the database of interest, in this case, proteins or transcripts, you'll get better and more reliable results, that hold across multiple genome releases, whereas when blasting against chromosomes you are tying the results to a given genome build.

ADD COMMENT • link 6.7 years ago by Istvan Albert 102k

score 3 · Accepted Answer · 2018-10-23

Finally I took the position like this:

from Bio.Blast import NCBIXML
result=open(folder+"Blast-Alignment.xml","r")
records= NCBIXML.parse(result)
item=next(records)
for alignment in item.alignments:
    for hsp in alignment.hsps:
            if hsp.expect < 0.01:
                    print('****Alignment****')
                    print('sequence:', alignment.title)
                    print('length:', alignment.length)
                    print("start:", hsp.sbjct_start)
                    print("end:", hsp.sbjct_end)
                    print('score:', hsp.score)
                    print('gaps:', hsp.gaps)
                    print(hsp.query[0:90] + '...')
                    print(hsp.match[0:90] + '...')
                    print(hsp.sbjct[0:90] + '...')

Output:

****Alignment****
sequence: gi|568815580|ref|NC_000018.10| Homo sapiens chromosome 18, GRCh38.p12 Primary Assembly
length: 80373285
start: 63307591
end: 63307572
score: 20.0
gaps: 0
CTGCTTTGTAACAGCTTGTG...
||||||||||||||||||||...
CTGCTTTGTAACAGCTTGTG...

And with pyensembl I donwloaded manually the reference genome and look for the concrete position giving back the gene located there.

import pyensembl
ensembl = pyensembl.EnsemblRelease()

data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    gtf_path_or_url='C:\\Users\\User\\AppData\\Local\\pyensembl\\Homo_sapiens.GRCh38.93.gtf')
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=18, position=63307591)
print(gene_names)

Output:

['BCL2']