Question: Blast Multi queries output
0
gravatar for malj
25 days ago by
malj10
malj10 wrote:

Hello,

Following this post: A: how to parse blast output using biopython

I'm trying to get also the features of the alignment (the one is written in the picture) but I can't figure out how to do it.

My file is a xml that corresponds to several blast sequences, so I also have the question how to distinguish between queries. For each query I would have to get the one with highest score from genomics sequences (no transcript) and the feature corresponding to the name of the transcript (name of the gene that codify for protein, in this example Apoptosis regulator BCL2).

Thanks!,

Example BCL2

blast biopython • 176 views
ADD COMMENTlink modified 24 days ago • written 25 days ago by malj10

see Aligning Two Proteins With Their Domains/Annotations

ADD REPLYlink written 25 days ago by Pierre Lindenbaum114k

I forgot to mention that the region I want to blast is intronic and quite short (20 nt), so when I follow your advice it doesn't detect anything. The only way I think it could be possible is taking the position and search for it like here is described with biopython here with exons ( I have to check if it is possible with introns too).

https://biopython.org/wiki/Coordinate_mapping

ADD REPLYlink modified 25 days ago • written 25 days ago by malj10

well, in that case, you can do a blast from the command line using blast-short task, and format the output like a BED file using the text column based output.

After that you can intersect your alignments with an annotation file using bedtools. That way each alignment will get an overlapping feature.

ADD REPLYlink written 24 days ago by Istvan Albert ♦♦ 78k
2
gravatar for malj
24 days ago by
malj10
malj10 wrote:

Finally I took the position like this:

from Bio.Blast import NCBIXML
result=open(folder+"Blast-Alignment.xml","r")
records= NCBIXML.parse(result)
item=next(records)
for alignment in item.alignments:
    for hsp in alignment.hsps:
            if hsp.expect < 0.01:
                    print('****Alignment****')
                    print('sequence:', alignment.title)
                    print('length:', alignment.length)
                    print("start:", hsp.sbjct_start)
                    print("end:", hsp.sbjct_end)
                    print('score:', hsp.score)
                    print('gaps:', hsp.gaps)
                    print(hsp.query[0:90] + '...')
                    print(hsp.match[0:90] + '...')
                    print(hsp.sbjct[0:90] + '...')

Output:

****Alignment****
sequence: gi|568815580|ref|NC_000018.10| Homo sapiens chromosome 18, GRCh38.p12 Primary Assembly
length: 80373285
start: 63307591
end: 63307572
score: 20.0
gaps: 0
CTGCTTTGTAACAGCTTGTG...
||||||||||||||||||||...
CTGCTTTGTAACAGCTTGTG...

And with pyensembl I donwloaded manually the reference genome and look for the concrete position giving back the gene located there.

import pyensembl
ensembl = pyensembl.EnsemblRelease()

data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    gtf_path_or_url='C:\\Users\\User\\AppData\\Local\\pyensembl\\Homo_sapiens.GRCh38.93.gtf')
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=18, position=63307591)
print(gene_names)

Output:

['BCL2']
ADD COMMENTlink written 24 days ago by malj10

nice job tracking it down and posting the answer

ADD REPLYlink written 24 days ago by Istvan Albert ♦♦ 78k
1
gravatar for Istvan Albert
25 days ago by
Istvan Albert ♦♦ 78k
University Park, USA
Istvan Albert ♦♦ 78k wrote:

I think the OPs question is how to produce the content called "Features" of a blast search when doing it at the command line. It is not quite explained in the BLAST web manual of how that field is filled out, moreover that field does not appear when doing a command line blast.

Note how the search is against chromosome 18 but the hit is BLC2 transcript. I am guessing that the feature overlaps with the alignment.

The strategy to do the same at the command line is not to parse the BLAST outputs, it is too messy and error-prone.

Instead, I would blast directly against the database of interest, in this case, proteins or transcripts, you'll get better and more reliable results, that hold across multiple genome releases, whereas when blasting against chromosomes you are tying the results to a given genome build.

ADD COMMENTlink modified 25 days ago • written 25 days ago by Istvan Albert ♦♦ 78k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 736 users visited in the last hour