Question: Command line to download fasta sequences from PATRIC DB
8 months ago by
yarmda0 wrote:

I want to download every available protein sequence for a set of organisms and all of their strains. PATRIC appears to offer more strains than NCBI lists (feel free to correct me, and please point out how to find the corresponding sequences on NCBI), but I can't see how to download them other than manually. Each of the 65 complete Bacillus anthracis genomes has more than 5000 protein families, so downloading them by hand and copying and pasting isn't a scalable solution.

NCBI has Entrez. Does anyone know of a similar capability in PATRIC?

Link to PATRIC database for set of protein families belonging to one strain:

download patric command • 445 views
modified 8 months ago • written 8 months ago by yarmda0

That table is downloadable: select all rows with the check mark at the top left corner, then choose "Download" (right corner) to save the table as text/CSV.

written 8 months ago by genomax47k

That downloads the table, but I want the FASTA sequences. It is also not a command-line option that would scale to an arbitrary number of organisms and their strains.

written 8 months ago by yarmda0

Some of the data that you see in the web front end may be derived from primary data, and there may be no way to download it automatically (you could try writing to the site owners to see if they can export some of the data from the back end for you). All primary sequence data appears to be available via FTP at the link below. There are tens of thousands of genomes, so you may have to be patient while downloading: that FTP site does not appear to be very fast.

modified 8 months ago • written 8 months ago by genomax47k
8 months ago by
United States
genomax47k wrote:

The data is available via FTP from PATRIC. The faa directory is the one to look into if you need protein sequences.
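A minimal sketch of how the FTP download could be scripted for a list of genome IDs. The host name `ftp.patricbrc.org`, the `genomes/<genome_id>/<genome_id>.PATRIC.faa` layout, and the function names `faa_url` and `download_faa` are assumptions for illustration and are not given in this thread; check them against the actual FTP listing before relying on them.

```python
import urllib.request
from pathlib import Path

# Assumption: host and directory layout are illustrative, not confirmed by the thread.
FTP_BASE = "ftp://ftp.patricbrc.org/genomes"


def faa_url(genome_id, base=FTP_BASE):
    """Build the assumed URL of the protein FASTA file for one genome."""
    return f"{base}/{genome_id}/{genome_id}.PATRIC.faa"


def download_faa(genome_ids, out_dir="faa"):
    """Download one protein FASTA file per genome ID, skipping files already on disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for genome_id in genome_ids:
        target = out / f"{genome_id}.PATRIC.faa"
        if not target.exists():
            # urlretrieve handles ftp:// URLs as well as http(s)://
            urllib.request.urlretrieve(faa_url(genome_id), target)
        saved.append(target)
    return saved
```

Calling `download_faa(["1392.82"])` would then attempt to fetch the file for the genome discussed in this thread; a longer list of IDs works the same way, one request per genome.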

written 8 months ago by genomax47k

These files seem to have fewer proteins than the protein families page indicates for a given strain. That may be fine, but can you explain why? For instance, the protein families page for genome 1392.82 suggests about 5400 sequences, while the file from the FTP site stops at around 4500.

You replied to the original post about primary data - could that be the explanation?
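A count like the one above can be checked without opening the file by hand. This is a minimal sketch that counts FASTA header lines (those starting with `>`); the function names and the file name in the usage note are hypothetical.

```python
def count_fasta_records(lines):
    """Count sequences by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path):
    """Count the sequences in a FASTA file on disk."""
    with open(path) as handle:
        return count_fasta_records(handle)
```

For example, `count_fasta_file("1392.82.PATRIC.faa")` would report how many protein records a downloaded file contains, which can then be compared with the number shown on the web page.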

written 8 months ago by yarmda0

PATRIC appears to define additional coding sequences beyond those in the RefSeq entries. Sequence files for both annotations are available in the genome directories.

genome_id  genome_name               taxon_id  genome_length  genome_status  chromosomes  plasmids  contigs  patric_cds  refseq_cds
1392.82    Bacillus anthracis A0157  1392      5322244        Complete       1            1         1        5624        5349
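If a summary table with these columns is available as a tab-separated file, it could be used to collect the genome IDs of every strain of an organism. This is a sketch under that assumption; the tab delimiter, the function name `select_genome_ids`, and the `SAMPLE` text (built from the row above) are illustrative only.

```python
import csv
import io

# Sample text rebuilt from the row quoted above; assumed to be tab-separated.
SAMPLE = (
    "genome_id\tgenome_name\ttaxon_id\tgenome_length\tgenome_status\t"
    "chromosomes\tplasmids\tcontigs\tpatric_cds\trefseq_cds\n"
    "1392.82\tBacillus anthracis A0157\t1392\t5322244\tComplete\t1\t1\t1\t5624\t5349\n"
)


def select_genome_ids(table_text, name_prefix, status="Complete"):
    """Return genome IDs whose name starts with name_prefix and whose status matches."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        row["genome_id"]
        for row in reader
        if row["genome_name"].startswith(name_prefix)
        and row["genome_status"] == status
    ]


def cds_difference(table_text, genome_id):
    """Return patric_cds minus refseq_cds for one genome, or None if it is not listed."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        if row["genome_id"] == genome_id:
            return int(row["patric_cds"]) - int(row["refseq_cds"])
    return None
```

With the sample row, `select_genome_ids(SAMPLE, "Bacillus anthracis")` yields the single ID `1392.82`, and `cds_difference(SAMPLE, "1392.82")` gives the number of coding sequences PATRIC calls beyond RefSeq for that genome.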
modified 8 months ago • written 8 months ago by genomax47k