Sort genomes for gene presence
1
0
Entering edit mode
5.1 years ago

I have a hard time solving a relatively simple problem (I am a novice)

I want to screen a collection of >8000 Prokaryot genomes for at specific gene, in order to make an prescence/absence table for a GWAS study. However, I am not sure how to go about this. I have the sequence of the gene in FASTA format, as well as all the genomes. Could I perhaps use Blast+ and create a local database of the genomes? I have the genomes as fna files directly from GenBank but I also have annotated versions from running Prokka.

Thank you in advance!

genome gene sequence • 746 views
ADD COMMENT
0
Entering edit mode

You can use searching gene against genome strategy using blast+ (blastn - searches nucleotide gene sequence against genome database) utility. But searching a gene sequence against predicted genes is a good strategy and would be more precise than searching against genome.

You can use .fasta file generated by prokka, which will contain predicted gene sequence for an individual organism.

NOTE: While searching your gene of interest against predicted genes, you may face a problem of unique identifier. Before making a gene database you have to change gene headers so that at the end you can track the source organism for the hit.

ADD REPLY
0
Entering edit mode
5.1 years ago

Ok so I partially solved this. The genomes i want to screen are saved as fna files and the genes i want to screen for are saved in txt files. Then I run this nested loop in Ubuntu that performs a Blastn of the genes against all genomes and creates an xls file for each gebe containing the results for the blastn of each genome. The word size is set to 28bp to mimic megablast for intraspecies screening.

mkdir results_folder
for txt_fil in *txt
do
echo $txt_fil
for fna_fil in *fna
do
echo $fna_fil
blastn -task blastn -subject ${fna_fil} -query $txt_fil -word_size 28 -outfmt "6 sseqid pident length gapopen qstart qend sstart send evalue bitscore" -out "$fna_fil".csv
echo Done with $fna_fil
done
cat *csv >> "$txt_fil".xls
rm *csv
mv *xls results_folder
echo Done with $txt_fil
done;
ADD COMMENT

Login before adding your answer.

Traffic: 1983 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6