Question

How to filter microbial genomes based on quality score

0

3.9 years ago

Bioinfonext ▴ 460

I downloaded the microbial genomes using repophlan_get_microbes.py and got four folders:

faa  
ffn 
fna  
frn

Within the fna folder, the files look like this:

G001284865.fna.bz2  G002910165.fna.bz2  G009390615.fna.bz2
G001284885.fna.bz2  G002910195.fna.bz2  G009390655.fna.bz2..

...........

Could you please help me with how I can filter these genome files based on the quality score, as shown in this publication (https://www.nature.com/articles/s41586-020-2095-1)?

Also, how can I make a single nucleotide database to run blastn?

Many thanks for your help and time.

metagenomics repophlan • 1.1k views

ADD COMMENT • link 3.9 years ago by Bioinfonext ▴ 460

0

Could you please help me how I can filter these genome files based on the quality score?

How so? .fna files should be FASTA-format sequence files without any associated quality information or scores.

further how I can make single nucleotide database to do blastn?

Since these are regular FASTA files, it should be straightforward to make BLAST databases using makeblastdb. Not sure what you mean by "single nucleotide".

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

Thanks, GenoMax, for the quick help!

In the publication linked above, they mention that "A total of 71,782 microbial genomes were downloaded using RepoPhlan (https://bitbucket.org/nsegata/repophlan) on 14 June 2016, of which 5,503 were viral and 66,279 were bacterial or archaeal. On the basis of prior literature, bacterial and archaeal genomes were filtered for quality scores of 0.8 or better (ref. 58), which left 54,471 of them for subsequent analysis, or a total of 59,974 microbial genomes".

But they did not mention how to obtain the quality score or how they filtered on it.

There is also a script in RepoPhlan, screen.py (https://bitbucket.org/nsegata/repophlan/src/default/), but I am not sure what it is for or how I should use it.

Many thanks, bioinfonext

ADD REPLY • link 3.9 years ago by Bioinfonext ▴ 460

0

I have downloaded the microbial genomes using the repophlan_get_microbes.py script (https://bitbucket.org/nsegata/repophlan/src/default/). Now I am trying to get the quality scores for the downloaded genomes using screen.py, but I am getting an error. Do I need to download any other tool?

Could you please help me resolve this? I have already downloaded the pfam directory from the above link.

$ python screen.py --in_summary repophlan_microbes.txt --out_summary repophlan_microbes_wscores.txt --hmm pfam/102.hmm
usage: screen.py [-h] [--nproc NPROC] --in_summary IN_SUMMARY --out_summary
OUT_SUMMARY
screen.py: error: unrecognized arguments: --hmm pfam/102.hmm

Many thanks nabiyogesh

ADD REPLY • link 3.8 years ago by Bioinfonext ▴ 460

1

As you can see from the usage statement, this program does not seem to have a --hmm option. I think you just need to run the following (from the run.sh file):

python screen.py --nproc 10 --in_summary out/repophlan_microbes_${t}.txt --out_summary out/repophlan_microbes_${t}_wscores.txt

ADD REPLY • link 3.8 years ago by GenoMax 141k
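
Once screen.py has written the *_wscores.txt summary, the 0.8 cutoff quoted from the paper could be applied to that file. The sketch below is only an illustration: it assumes the summary is a tab-delimited table with a header row and that one column holds the quality score. The helper filter_by_score is hypothetical and not part of RepoPhlan, and the column name must be taken from the header of your own file.

```python
import csv


def filter_by_score(in_summary, out_summary, score_column, min_score=0.8):
    """Copy rows whose score is >= min_score; return the number of rows kept.

    Assumes a tab-delimited file with a header row (an assumption about the
    screen.py output; inspect your file before relying on it).
    """
    kept = 0
    with open(in_summary, newline="") as src, \
            open(out_summary, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        if score_column not in (reader.fieldnames or []):
            raise KeyError(f"column {score_column!r} not found in header")
        writer = csv.DictWriter(
            dst,
            fieldnames=reader.fieldnames,
            delimiter="\t",
            lineterminator="\n",
            extrasaction="ignore",
        )
        writer.writeheader()
        for row in reader:
            try:
                score = float(row[score_column])
            except (TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric score
            if score >= min_score:
                writer.writerow(row)
                kept += 1
    return kept
```

For example, filter_by_score("repophlan_microbes_wscores.txt", "repophlan_microbes_filtered.txt", "score") would keep rows scoring 0.8 or better, where "score" and the output file name are placeholders for whatever your summary actually uses.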

0

Thanks @genomax, it is now working well without the --hmm flag.

ADD REPLY • link 3.8 years ago by Bioinfonext ▴ 460

0

Hi @genomax,

After quality filtering I have multiple bacterial genomes (around 54,000), and each of these genome FASTA files is in an individual folder. Could you please suggest how I can make a single database for blastn?

Earlier, I used makeblastdb to make a database with the command below:

makeblastdb -dbtype nucl -in GUS.fasta -out GUS_db

The genome FASTA files are located in the fna folder:

fna/

Within the fna folder, the files look like this:

G001284865.fna.bz2  G002910165.fna.bz2  G009390615.fna.bz2
G001284885.fna.bz2  G002910195.fna.bz2  G009390655.fna.bz2

Many thanks

ADD REPLY • link 3.7 years ago by Bioinfonext ▴ 460
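
One possible way to get a single blastn database from the compressed files above is to decompress them into one combined FASTA file and then run makeblastdb on that file with the same flags as the GUS.fasta command. The following is a minimal sketch, assuming all genomes sit directly in one folder as *.fna.bz2 and that the part of the file name before the first dot is the genome ID. The function names, the keep_ids parameter, and the output names are hypothetical.

```python
import bz2
import subprocess
from pathlib import Path


def concatenate_genomes(fna_dir, combined_fasta, keep_ids=None):
    """Decompress *.fna.bz2 files from fna_dir into one FASTA file.

    keep_ids: optional set of genome IDs (file name before the first dot)
    to include, e.g. the genomes that passed quality filtering.
    Returns the number of genome files written.
    """
    count = 0
    with open(combined_fasta, "wb") as out:
        for path in sorted(Path(fna_dir).glob("*.fna.bz2")):
            if keep_ids is not None and path.name.split(".")[0] not in keep_ids:
                continue
            last = b"\n"
            with bz2.open(path, "rb") as src:
                # Stream in 1 MiB chunks so large genomes are not held in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")  # keep the next header on its own line
            count += 1
    return count


def build_blast_db(combined_fasta, db_name):
    """Run makeblastdb (must be on PATH) with the flags used in the thread."""
    subprocess.run(
        ["makeblastdb", "-dbtype", "nucl",
         "-in", str(combined_fasta), "-out", str(db_name)],
        check=True,
    )
```

A call such as concatenate_genomes("fna", "all_genomes.fna") followed by build_blast_db("all_genomes.fna", "microbes_db") would then give one database to pass to blastn with -db microbes_db. The combined file for around 54,000 genomes will be large, so check the available disk space first.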