BLAST Multiple Subject sequences?
2
0
Entering edit mode
8.1 years ago
zgayk ▴ 90

I am trying to locate the orthologs of genes identified in a genome assembly in five other bird genomes (red-throated loon, adelie penguin, northern fulmar, chicken, pigeon).

To do this I made a blast database of each of the five genomes: makeblastdb -in '"Columba_livia.cds" "Fulmarus_glacialis.cds" "Gallus_gallus.cds" "Gavia_stellata.cds"' -out Expanded.txt -parse_seqids -dbtype nucl #new db

However, I am so far at a loss about how to search for each gene in my assembly and then return the best hit ortholog in all five genomes using local blast. I am new to running command-line blast and my best guess didn't work: blastn -query COLOgenes.fasta -db Expanded.txt -task blastn -dust no -outfmt "6 qseqid sseqid pident length qstart qend sstart send sallgi evalue qseq sseq" -evalue 1e-6 -num_alignments 1 -max_hsps 1 -out outputExpanded1.blast.txt

The most important thing is to get the actual sequence for each ortholog (should get five match sequences; one from each genome), but blast seems to only have options for returning one subject sequence.

Any ideas would be appreciated! Thanks, Zach

blast • 4.9k views
ADD COMMENT
1
Entering edit mode
8.1 years ago

It looks like you are putting all your subject sequences (5 birds) into one blast database. And then you are querying against that database with the parameter of outputting just a single alignment.

You have two options here:

1) If you want to learn some basic scripting skills, you can have blastn output all the possible subject hits and parse the resulting blast file for the top hits for each of the five species.

2) You can create individual blast databases for each of the five bird species instead of one big database. Then blast your query against the 5 databases separately with the parameter you have now to only output a single alignment.

ADD COMMENT
0
Entering edit mode

Thanks very much for the ideas, and sorry about the late response. I have done what you suggested under 2) and now have five blast databases, one for each bird species. But now the question I am struggling with is how to parse all five databases, each with a single alignment to the query, so that I can group potential orthologous sequences from each blast file together with the query.

For example for the gene CELSR3, I have the query species CELSR3, MATCH1_CELSR3, MATCH2_CELSR3, MATCH3_CELSR3, MATCH4_CELSR3, and MATCH5_CELSR3, with each ortholog of CELSR3 aligned to the query in a separate file, but I need to group the results from each file together for each possible ortholog.

Are there any tricks for doing this?

ADD REPLY
1
Entering edit mode
8.1 years ago
pld 5.1k

One approach is to use Best Reciprocal Hits: http://www.ncbi.nlm.nih.gov/pubmed/18042555/

Basically you blast your transcriptome against references, take the best hit and blast them against a database made out of your transcriptome. Orthologs would be instances where you have reciprocal best hits. The best hit of a is b, and the best hit of b is a.

You do not need the results to return the query sequence, just get the ID and you can retrieve the sequence from the blast database using blastdbcmd.

Something like:

  1. BLAST your transcriptome against your reference database. This is the "forward" blast run
  2. Collect the ID of the best hit for each query and use them to pull hit sequences from the reference database using blastdbcmd (query sequences for the reverse run)
  3. BLAST the reverse queries against a database made out of your transcriptome. This is the "reverse" blast run
  4. Find cases of best reciprocal hits, these are your potential orthologs.

With the approach you're using now, BLAST sees all of the input files as a single database. That is why you're not seeing five hits, blast only sees a single database. If you want orthologs for each reference species, you should make separate databases and run this process for each species.

This is typically run on peptide sequences, so I'd generate predicted peptides for your transcriptome and use blastp against reference peptides.

ADD COMMENT
0
Entering edit mode

Thank you very much for your help. The reciprocal best hits approach sounds like something to use in the future. RIght now I am just trying to get a cursory feel for what may be an orthologous gene using basic blastn, and later using a script to remove those that look like parlalogs. But, I will keep this idea in mind if it turns out I have data for this.

ADD REPLY

Login before adding your answer.

Traffic: 1713 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6