Question: Command line to download fasta sequences from PATRIC DB
8 months ago
I want to download every available protein sequence for a series of organisms and all of their strains. PATRIC offers more strains than NCBI lists (as far as I can tell; feel free to correct me, and please indicate how I can find the corresponding sequences on NCBI), but I can't see how to download them other than manually. Each of the 65 complete genome sequences of Bacillus anthracis has >5000 protein families, so downloading them by hand and then copying and pasting isn't a scalable solution.

NCBI has Entrez. Does anyone know of a similar capability in PATRIC?

Link to PATRIC database for set of protein families belonging to one strain:

That table is downloadable: select all rows via the check mark at the top left corner, then click "Download" (right corner) to save the table as text/CSV.

That downloads the table. I want the FASTA sequences. It is also not a command-line option, so it would not scale to an arbitrary number of organisms and their strains.

Some of the data that you see in the web front end may be derived from primary data, and there may be no way to download it automatically (you could try writing to the site owners to see if they can export some of it from the backend for you). All primary sequence data appears to be available via FTP at link below. There are tens of thousands of genomes, and you may have to be patient while downloading, since that FTP site does not appear to be very fast.

8 months ago
Data is available via FTP from PATRIC. The faa directory is the one to look in if you need protein sequences.
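
For many genomes, the FTP download can be scripted. Below is a minimal sketch using Python's standard ftplib. The host name, the directory path, and the `<genome_id>.faa` file naming are assumptions that are not confirmed in this thread; browse the FTP site first and adjust them to what you actually see there.

```python
from ftplib import FTP
from pathlib import Path

# ASSUMPTIONS (not confirmed in this thread): check against the real FTP listing.
FTP_HOST = "ftp.patricbrc.org"  # assumed host name
FAA_DIR = "faa"                 # hypothetical path to the protein FASTA directory


def remote_faa_name(genome_id, suffix=".faa"):
    """Build the assumed remote file name for one genome, e.g. '1392.82.faa'."""
    return f"{genome_id}{suffix}"


def download_faa(ftp, genome_ids, out_dir="."):
    """Fetch one protein FASTA file per genome ID over an already open FTP connection."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for gid in genome_ids:
        name = remote_faa_name(gid)
        target = out / name
        with open(target, "wb") as handle:
            # RETR streams the remote file; each chunk is written to disk as it arrives.
            ftp.retrbinary(f"RETR {name}", handle.write)
        saved.append(target)
    return saved


def fetch_all(genome_ids, out_dir="faa_files"):
    """Open an anonymous FTP session, change to the assumed directory, and download."""
    with FTP(FTP_HOST) as ftp:
        ftp.login()       # anonymous login
        ftp.cwd(FAA_DIR)
        return download_faa(ftp, genome_ids, out_dir)
```

Calling `fetch_all(["1392.82"])` would then try to save `1392.82.faa` under `faa_files/`. Reusing one connection for all genomes avoids a new login per file, which matters on a slow server.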

These files seem to have fewer proteins than the protein families page indicates for a given strain. That may be fine, but can you explain why? For instance, genome 1392.82 has a protein families page that would have about 5400 sequences, while the file from the FTP site stops at around 4500.
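
One way to check the count in a downloaded file independently of the web page is to count the FASTA header lines. A minimal sketch; the file name in the usage note is hypothetical.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `count_fasta_records("1392.82.faa")` returns the number of protein records in that file, which can then be compared with the number shown on the protein families page.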

You replied to the original post about primary data - could that be the explanation?

PATRIC appears to define additional coding sequences beyond those in the RefSeq entries (see the patric_cds and refseq_cds columns below). There are sequence files for both in the genome directories.

genome_id  genome_name               taxon_id  genome_length  genome_status  chromosomes  plasmids  contigs  patric_cds  refseq_cds
1392.82    Bacillus anthracis A0157  1392      5322244        Complete       1            1         1        5624        5349
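
If a summary like the one above is available as a tab-separated file, it can be used both to pick genome IDs for a species and to see how many more coding sequences PATRIC calls than RefSeq. A minimal sketch; the tab-separated layout is an assumption, and the sample text is the row from this thread re-typed with tabs.

```python
import csv
import io

# The row from this thread, re-typed as tab-separated text (assumed file format).
SUMMARY = (
    "genome_id\tgenome_name\ttaxon_id\tgenome_length\tgenome_status\t"
    "chromosomes\tplasmids\tcontigs\tpatric_cds\trefseq_cds\n"
    "1392.82\tBacillus anthracis A0157\t1392\t5322244\tComplete\t"
    "1\t1\t1\t5624\t5349\n"
)


def read_rows(tsv_text):
    """Parse tab-separated summary text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def select_genome_ids(rows, name_prefix, status="Complete"):
    """Return genome IDs whose name starts with name_prefix and whose status matches."""
    return [
        row["genome_id"]
        for row in rows
        if row["genome_name"].startswith(name_prefix)
        and row["genome_status"] == status
    ]


def extra_patric_cds(rows):
    """Map each genome ID to patric_cds minus refseq_cds."""
    return {
        row["genome_id"]: int(row["patric_cds"]) - int(row["refseq_cds"])
        for row in rows
    }


rows = read_rows(SUMMARY)
print(select_genome_ids(rows, "Bacillus anthracis"))  # ['1392.82']
print(extra_patric_cds(rows))                         # {'1392.82': 275}
```

The list of IDs returned by `select_genome_ids` is what a download script would loop over, so adding strains only means adding rows to the summary.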
