Is there a way to extract the gene product name from the CDS of a BLAST hit?
2
0
Entering edit mode
13 months ago
didiDeee • 0

You may skip to the question part if it seems clear enough without the context.

I'm trying to automate the process of extracting specific info from BLAST searches.

When doing it manually, I would blastn some sequences against the nt database, and record the following info of the top hit: the query coverage, the percent identity, start of the aligned range (hit_start), and the gene product of the first CDS.

The first two would be on the result page, and the last two would require clicking on the hit to see the range first, and then open the GenBank info to find the CDS info.

Since I need to go through a large amount of sequences, I'd like to automate this process. I managed to get everything with Biopython except for the product name, I guess it's because the other attributes could be extracted from the blast results, but the genbank info is really independent from the blast results.

So my question now is: if I obtained the accession id or gi number from the blast results, as well as the range of alignment, i.e. the location within the gene, is there a way to somehow connect to genbank or other databases to extract the first gene product name within or overlapping this range?

I hope I explained myself clearly... I've been trying to rearrange keywords to look for a solution or part of a solution to this problem but haven't been successful.

Edit: More specifically, my immediate question would be: is there a way to extract GenBank info on a specific region instead of pulling the entire file? For example, on the BLAST result page, if I click on the GenBank info of a hit, it would specify that this is a partial Genbank file for: e.g. ACCESSION CP009256 REGION: 3404381..3405501. When I do it in python, is there an option for retrieving a partial file like this, and then look for the product name in CDS?

alignment genbank blast gene biopython • 693 views
0
Entering edit mode

If you have the genbank format file then use code provided by @cpad0112 here : C: How to extract certain CDS from a GenBank file in linux terminal

Otherwise provide an example accession number and CDS you want to see.

0
Entering edit mode

Thank you. I guess I'll first try to figure out how to get the genbank files, and then try to filter through the CDS's based on their position to get the one I want.

0
Entering edit mode

Also, please see my edited note: If I would like to retrieve a partial GenBank file for a specific region, e.g. ACCESSION CP009256 REGION: 3404381..3405501, is this accomplishable? I also have the gi number if that's more feasible.

2
Entering edit mode
13 months ago
GenoMax 107k

Using EntrezDirect. Sequence truncated for space reasons.

\$ efetch -db nuccore -id "CP009256" -chr_start 3404381 -chr_stop 3405501 -format fasta
>CP009256.1:3404382-3405502 Acinetobacter baumannii strain AB031, complete genome
TTGACGTACTGTGTCATGGCTTGTAGTATCACTTATTTCCAACTTTAACCTTATAAATCATATACTTATA
CAGCGTTAAACAGTGTCAGAAATTTGAATTTGGCGGAAGCGGTGAGATTCGAACTCACGGAGGACTCACA
CCCTCGTCGGTTTTCAAGACCGGTGCATTAAACCGCTCTGCCACGCTTCCAACGAGCGCTATAATATAAA

0
Entering edit mode

Thank you! This seems very helpful! If I use -format gb and then parse through the result for CDS product, it might solve my problem!

0
Entering edit mode

Be sure to add -style withparts if you are going to cherry pick regions. That should bring CDS info.

0
Entering edit mode

Since I'm doing this as part of my script, I used Biopython's Entrez.efetch, specified the seq_start and seq_end, and then used SeqIO to parse the result and find the CDS product name. It worked! Both the link you provided on finding CDS and the efetch parameters you suggested helped a lot, saving me tons of effort sifting through the Biopython/Entrez manuals. Thanks a lot.

0
Entering edit mode

You can accept my answer (green check mark) to provide closure to this thread.

0
Entering edit mode
13 months ago
JC 12k

Use NCBI Entrez utilities https://www.ncbi.nlm.nih.gov/books/NBK25501/

0
Entering edit mode

Thank you. I guess I need to comb through the Entrez utilities' commands to see if they have parameters that could suit my purpose better.