Searching for gene in multiple genomes
5 weeks ago
Miya • 0

Hello all,

I have a gene sequence (approximately 1300 bp) and want to check whether it is present in a large number of genomes. These are strains of the same bacterium, and I'm fairly sure most of them should contain it. What is the best way to do this?

Should I try a multiple sequence alignment?

I tried using seqkit (seqkit grep -c -s -f gene.fasta multiple_genes.fasta -C), but I'm not sure this is the right approach: even with up to 50 mismatches allowed, I found the gene in only a quarter of the genomes, which can hardly be true.

gene genome alignment sequence • 519 views
5 weeks ago
Dave Carlson ★ 1.7k

Assuming you have (or can generate) a protein fasta file for your gene, I would concatenate the genome assemblies into a single multi-fasta file and then use miniprot to search for your protein. Examine the contig headers in your .paf or .gff output file to determine which strains likely have the gene.
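
As a rough illustration of the "examine the contig headers" step, here is a minimal Python sketch that collapses PAF hits to strain names. It relies only on the standard leading PAF columns (query name, query length, query start/end, strand, target name). Everything else is an assumption for the example: that each contig header in the concatenated multi-fasta was prefixed with its strain as `strain|contig`, the 0.8 query-coverage cutoff, and the made-up example lines and strain names.

```python
def strains_with_hit(paf_lines, min_query_cov=0.8, sep="|"):
    """Return {strain: best query coverage} for PAF hits passing the cutoff.

    Assumes contig names look like 'strain|contig' (a hypothetical naming
    convention applied when the assemblies were concatenated).
    """
    best = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        strain = fields[5].split(sep, 1)[0]
        # Fraction of the query protein covered by this alignment
        cov = (qend - qstart) / qlen
        if cov >= min_query_cov and cov > best.get(strain, 0.0):
            best[strain] = cov
    return best


# Made-up PAF lines for illustration only (not real results)
example_paf = [
    "geneX\t433\t0\t433\t+\tstrainA|contig1\t50000\t1200\t2499\t1250\t1299\t60\n",
    "geneX\t433\t0\t120\t-\tstrainB|contig7\t42000\t300\t660\t300\t360\t10\n",
]
all_strains = {"strainA", "strainB", "strainC"}  # hypothetical strain list

hits = strains_with_hit(example_paf)
missing = sorted(all_strains - set(hits))
print("present:", sorted(hits))
print("no confident hit:", missing)
```

Strains that end up in the "no confident hit" list are worth checking by hand, since a partial hit at a contig edge may just mean the gene is split across contigs in that assembly.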


Thanks, I will try it!

5 weeks ago
BioinfGuru ★ 1.7k

Download the NCBI BLAST+ package and blast your sequence against the large set of genome sequences locally, on your own machine.

Quick Overview - but you will want to run blastn (for DNA) not blastp (for proteins)

  1. Download and install
  2. Make a database of your set of genomic sequences
  3. Blast your query sequence against the database you just created
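
Steps 2 and 3 are typically something like `makeblastdb -in genomes.fasta -dbtype nucl -out genomes_db` followed by `blastn -query gene.fasta -db genomes_db -outfmt 6 -out hits.tsv` (the file names here are placeholders). The tabular output can then be filtered with a few lines of Python. The sketch below assumes the default 12-column `-outfmt 6` layout; the identity and coverage cutoffs, the contig names, and the example rows are illustrative assumptions, not recommendations from this thread.

```python
def best_hits(blast_lines, query_len, min_pident=90.0, min_cov=0.9):
    """Keep the best-scoring hit per subject sequence that passes both cutoffs.

    Expects default BLAST tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Returns {sseqid: (pident, query_coverage, bitscore)}.
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.rstrip("\n").split("\t")
        sseqid, pident = f[1], float(f[2])
        qstart, qend, bitscore = int(f[6]), int(f[7]), float(f[11])
        # Fraction of the query gene covered by this single alignment
        cov = (qend - qstart + 1) / query_len
        if pident >= min_pident and cov >= min_cov:
            if sseqid not in best or bitscore > best[sseqid][2]:
                best[sseqid] = (pident, cov, bitscore)
    return best


# Made-up rows for illustration only (not real results)
example_rows = [
    "gene\tstrainA_contig1\t99.231\t1300\t10\t0\t1\t1300\t5001\t6300\t0.0\t2340\n",
    "gene\tstrainB_contig3\t85.000\t400\t60\t0\t1\t400\t100\t499\t1e-90\t350\n",
]

result = best_hits(example_rows, query_len=1300)
for contig, (pident, cov, score) in result.items():
    print(f"{contig}\tidentity={pident:.1f}%\tcoverage={cov:.2f}\tbitscore={score:.0f}")
```

Loosening `min_pident` or `min_cov` is one way to pick up diverged or truncated copies; what counts as "present" is a judgment call for your strains.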

Useful links:

NCBI Blast command line package installers

Paper

User Guide

Manual

Voila!


Been going deep into the Biostrings package in R today. It is actually ideal for this.

vmatchPattern() matches 1 string to many strings, or vice versa

5 weeks ago
theclubstyle ▴ 40

If you're not expecting a huge number of hits, a simple web-based blast could also work. Just limit the search to the parent taxon under 'Organism' (i.e. species, probably in this case) and under 'Algorithm parameters' opt for the maximum number of reportable hits (5,000) under 'Max target sequences'. If you're expecting significantly more than 5,000 hits, you're better off running it locally on a server as BioinfGuru has suggested.
