Question: Retrieve genomic coordinates (locus start, locus end) from blastn hits?
gravatar for sombrajo92
13 months ago by
sombrajo920 wrote:

Hi everyone, I'm trying to carry out some sort of reciprocal best hits analysis with blastn (so two-step process) in order to find orthologs (I know it is probably not the best way) of novel human miRNAs predicted by a software called miRDeep2. Working on linux and using blast 2.2.30+ version. I'm basically using blastn against the pre-formatted ncbi database of partially non-redundant nucleotides (the huge one, I downloaded from here*), but limiting the search by using -seqidlist with a file containing all accession numbers for all mammalian-human genes, contigs, chunks or DNA or whatever is in the databases. This accession list for mammals I got it by doing this:

"esearch -db nucleotide -query "Mammalia[organism]" | efetch -db nucleotide -format acc > mammalia.acc"

Then I did the same for homo sapiens, and I also subtracted both in order to have an accession list with mammals excluding humans;

"esearch -db nucleotide -query "Homo sapiens[organism]" | efetch -db nucleotide -format acc > hsa.acc"

The command line for running blastn looks something like this (the miRNA fasta files only contain one sequence plus an identifier per file, and the sequences are short, between 50 and 90 nt, the mammalian acc file has around 35M lines):

blastn -db nt -seqidlist mammalian_minus_hsa.acc -query mymiRNA.fa -out mymiRNA_blast.out -evalue 1e-10 -word_size 20 -outfmt "6 qacc sseqid sseq sstart send sscinames evalue"

In the end I get some hits against each of my novel miRNAs, and these hits of course have an ID, and it also tells you the start and end position of the match both in your query and in the subject sequence. But this is not what I need, now I would like to have the specific genomic locations of the matches for each subject sequences (locus start and locus end of the hit) in order to know where these sequences are located in mammals. I was thinking maybe I could get the genomic coordinates for the accession number of each hit, and then I would just need to add or subtract these positions to the start and end positions of the matches in my sequences...but I still haven't found a way of getting these genomic coordinates by using an ID, I would appreciate some suggestions...

I also need these locations because ultimately what I want to do is to take each of my novel human miRNAs, and for each best mammal hit I get (best one per organism), blast it back against the human genome, and check if the best hit I get now is located in the same position where I know my novel human miRNA is located, so that I can consider them orthologs, and for each of these novel miRNAs, have a fasta file with all its orthologs for further analysis.

I hope it is more or less clear what I intend to do.

Thank you in advance,


ADD COMMENTlink modified 11 months ago by Biostar ♦♦ 20 • written 13 months ago by sombrajo920

Hi Javier - Is there a specific reason to use the nt database and apply the massive filter? Or why don't you restrict your search to the human genome? My impression is all you need is human chromosomal coordinates so this might simplify your life tremendously (no bookkeeping of accessions in combination with nt database versions, etc...)

ADD REPLYlink written 11 months ago by Carambakaracho1.0k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1500 users visited in the last hour