Question: Command line to download fasta sequences from PATRIC DB
3.0 years ago, yarmda wrote:

I want to download every available protein sequence for a series of organisms and all of their strains. PATRIC lists more strains than NCBI does (as far as I can tell; feel free to correct me, and please indicate how I can find the corresponding sequences on NCBI), but I can't see how to download them other than manually. Each of the 65 complete Bacillus anthracis genomes has more than 5000 protein families, so downloading them by hand and then copying and pasting isn't a scalable solution.

NCBI has Entrez. Does anyone know of a similar capability for PATRIC?

Link to PATRIC database for set of protein families belonging to one strain:


That table is downloadable: select all rows via the check mark at the top left corner, then click "Download" (right corner) to save the table as text/CSV.

Reply written 3.0 years ago by genomax

That downloads the table, but I want the FASTA sequences. It is also not a command-line option that would scale to an arbitrary number of organisms and their strains.

Reply written 3.0 years ago by yarmda

Some of the data that you see in the web front end may be derived from primary data, and there may be no way to download it automatically (you could try writing to the site owners to see if they can export some of it from the backend for you). All primary sequence data appears to be available via FTP at the link below. There are tens of thousands of genomes, so you may have to be patient while downloading: that FTP site does not appear to be very fast.

Reply written 3.0 years ago by genomax (modified 3.0 years ago)
3.0 years ago, genomax wrote:

Data is available via FTP from PATRIC. If you need protein sequences, faa is the directory to look in.
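Since the question asks for something scriptable, here is a minimal Python sketch of looping over genome IDs and fetching one protein FASTA file per genome over FTP. The host name, the one-directory-per-genome layout, and the `<genome_id>.PATRIC.faa` file name are assumptions for illustration and are not confirmed in this thread; check the actual FTP listing and adjust `BASE_URL` and `faa_url` to match. The function names are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path

# ASSUMPTION: host and path are illustrative; verify against the real FTP site.
BASE_URL = "ftp://ftp.patricbrc.org/genomes"


def faa_url(genome_id, base_url=BASE_URL):
    """Build the assumed URL of the protein FASTA file for one genome.

    ASSUMPTION: one directory per genome ID, holding <genome_id>.PATRIC.faa.
    """
    return f"{base_url.rstrip('/')}/{genome_id}/{genome_id}.PATRIC.faa"


def download_faa(genome_ids, out_dir="faa"):
    """Download the protein FASTA for each genome ID; return the saved paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for genome_id in genome_ids:
        target = out / f"{genome_id}.PATRIC.faa"
        # urllib understands ftp:// URLs; stream to disk instead of into memory.
        with urllib.request.urlopen(faa_url(genome_id)) as resp, open(target, "wb") as fh:
            shutil.copyfileobj(resp, fh)
        saved.append(target)
    return saved


# Example call (needs network access), using the genome ID from this thread:
# download_faa(["1392.82"])
```

The list of genome IDs could come from the genome table that the web interface exports as text/CSV, so the same loop would cover any number of organisms and strains.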


These files seem to have fewer proteins than the protein families page indicates for a given strain. That may be fine, but can you explain why? For instance, genome 1392.82 has a protein families page that lists about 5400 sequences, while the file from the FTP site stops at around 4500.

You replied to the original post about primary data; could that be the explanation?

Reply written 3.0 years ago by yarmda

PATRIC appears to define additional coding sequences beyond those in the RefSeq entries. Sequence files for both annotations are in the genome directories. The genome summary row for this strain shows both counts:

genome_id  genome_name               taxon_id  genome_length  genome_status  chromosomes  plasmids  contigs  patric_cds  refseq_cds
1392.82    Bacillus anthracis A0157  1392      5322244        Complete       1            1         1        5624        5349
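
One way to check these numbers against the downloaded files is to count the FASTA header lines in each file. The sketch below assumes plain-text (uncompressed) FASTA; the function names and the file names in the example comment are hypothetical and should be replaced with whatever the genome directory actually contains.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith(">"))


def compare_counts(patric_faa, refseq_faa):
    """Return the record counts of two FASTA files and their difference."""
    patric = count_fasta_records(patric_faa)
    refseq = count_fasta_records(refseq_faa)
    return {"patric": patric, "refseq": refseq, "difference": patric - refseq}


# Example call with hypothetical file names:
# compare_counts("1392.82.PATRIC.faa", "1392.82.RefSeq.faa")
```

If the counts differ from the patric_cds and refseq_cds columns, the gap may come from features that have no protein translation, but that is a guess to verify rather than something stated in this thread.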

Reply written 3.0 years ago by genomax (modified 3.0 years ago)