Hello, I'm very new to programming for bioinformatics but have a task I need to be able to automate. I need to run a blastn search for a specific nucleotide sequence (X) against the human genome. Then, wherever it matches, I need to retrieve the 100 bp sequences immediately upstream and downstream of the match.
I've already learned how to run a BLAST search from Python, and that Seq objects can behave like strings. But I don't know how BLAST output is structured or what kind of information is included in an XML results file.
I think I could achieve this task if I had three things from BLAST output:
1. The complete sequence for every region X is found in (let's say it's called "full_seq")
2. The location of X in the genome (a-b)
3. The location of the surrounding region in the genome (c-d)
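To make this concrete, here is a rough sketch of how I imagine the match locations could be read from BLAST XML (the outfmt 5 layout) using only the standard library. The XML below is a hand-written toy example, not real search output, and the function name hit_locations is just something I made up. As far as I can tell, the XML only carries the aligned part of each hit (plus the hit's accession and total length), so full_seq itself would probably have to be fetched separately.

```python
import xml.etree.ElementTree as ET

# Hand-written toy XML in the shape of BLAST outfmt 5; NOT real search output.
TOY_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_accession>TOY_CHR1</Hit_accession>
          <Hit_len>5000</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_hit-from>1201</Hsp_hit-from>
              <Hsp_hit-to>1250</Hsp_hit-to>
              <Hsp_hit-frame>1</Hsp_hit-frame>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def hit_locations(xml_text):
    """Return (accession, start, end, hit_length) for every HSP, 1-based inclusive."""
    root = ET.fromstring(xml_text)
    locations = []
    for hit in root.iter("Hit"):
        accession = hit.findtext("Hit_accession")
        hit_len = int(hit.findtext("Hit_len"))
        for hsp in hit.iter("Hsp"):
            hit_from = int(hsp.findtext("Hsp_hit-from"))
            hit_to = int(hsp.findtext("Hsp_hit-to"))
            # Order the pair so start <= end even if the two values come reversed.
            start, end = min(hit_from, hit_to), max(hit_from, hit_to)
            locations.append((accession, start, end, hit_len))
    return locations


print(hit_locations(TOY_XML))
```

If that is roughly right, (start, end) would be my a-b, and the accession would tell me which sequence to download to get the surrounding region.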
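Then, assuming a, b and c are 1-based inclusive coordinates and full_seq starts at position c, the 100 bp flanking sequences would be something like full_seq[a-c-100:a-c] (upstream) and full_seq[b-c+1:b-c+101] (downstream) with Python's 0-based, end-exclusive slicing, right?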
Can these three pieces of information be retrieved from an XML file, or from any other format of BLAST output, and if so how? Also, does Python have any way of restricting a BLAST search to a specific organism? I've only seen the option to change the database (e.g. non-redundant, or human genomic + transcript). I've seen an answer to this question say that you can enter an Entrez query, but the Entrez query and the organism are two separate boxes in the BLAST search menu, so I'm not sure that's the same thing.
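For reference, this is roughly what I understand the Entrez query suggestion to mean. Biopython's NCBIWWW.qblast does take an entrez_query argument, but whether "Homo sapiens[Organism]" there restricts the search in the same way as the organism box on the web form is my assumption, not something I have confirmed. The helper names organism_blast_kwargs and run_remote_blast are made up, and the second one needs Biopython and network access, so it is defined here but not called.

```python
def organism_blast_kwargs(sequence, organism="Homo sapiens", database="nt"):
    """Build keyword arguments for a blastn search limited to one organism."""
    return {
        "program": "blastn",
        "database": database,
        "sequence": sequence,
        # Entrez-style filter; assumed to act like the organism box on the web form.
        "entrez_query": f"{organism}[Organism]",
    }


def run_remote_blast(sequence):
    """Submit the search to NCBI (needs Biopython and network access)."""
    from Bio.Blast import NCBIWWW  # imported lazily so the helper above works without it
    return NCBIWWW.qblast(**organism_blast_kwargs(sequence))


print(organism_blast_kwargs("ACGTACGTACGT")["entrez_query"])
```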
I've also heard that UCSC BLAT already returns the flanking sequences, but I've never seen that in practice.