Question: Command line to download fasta sequences from PATRIC DB
yarmda wrote (11 months ago):

I want to download every available protein sequence for a set of organisms and all of their strains. PATRIC appears to offer more strains than NCBI lists (please correct me if that is wrong, and tell me how to find the corresponding sequences on NCBI), but I can't see how to download them other than manually. Each of the 65 complete Bacillus anthracis genomes has more than 5000 protein families, so downloading them by hand and then copying and pasting isn't a scalable solution.

NCBI has Entrez. Does anyone know of a similar facility for PATRIC?

Link to PATRIC database for set of protein families belonging to one strain:

https://www.patricbrc.org/view/Genome/743835.4#view_tab=proteinFamilies


That table is downloadable: select all rows via the check mark at the top left corner, then select "Download" (right corner) to save the table as text/CSV.

written 11 months ago by genomax

That downloads the table, but I want the FASTA sequences. It is also not a command-line option that would scale to an arbitrary number of organisms and their strains.

written 11 months ago by yarmda

Some of the data that you see in the web front end may be derived from primary data, and there may be no way to download it automatically (you could try writing to the site owners to see if they can export some of the data from the backend for you). All primary sequence data appears to be available via FTP at the link below. There are tens of thousands of genomes, and you may have to be patient while downloading, since that FTP site does not appear to be very fast.

modified 11 months ago • written 11 months ago by genomax

genomax wrote:

Data is available via FTP from PATRIC. The faa directory is the one to look in if you need protein sequences.

written 11 months ago by genomax
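
A minimal sketch of how the FTP route could be scripted for a list of genome IDs. The host name `ftp.patricbrc.org`, the `genomes/<genome_id>/` directory layout, and the `<genome_id>.PATRIC.faa` file name are assumptions that are not stated in this thread, so check them against the actual FTP listing before relying on this. The function names are illustrative.

```python
import shutil
import urllib.request
from pathlib import Path

# ASSUMPTION: host and path layout are not given in the thread; verify on the FTP site.
FTP_BASE = "ftp://ftp.patricbrc.org/genomes"


def faa_url(genome_id, base=FTP_BASE):
    """Build the assumed URL of the PATRIC protein FASTA file for one genome."""
    return f"{base}/{genome_id}/{genome_id}.PATRIC.faa"


def download_faa(genome_ids, out_dir="faa"):
    """Download one .faa file per genome ID, skipping files already on disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for gid in genome_ids:
        target = out / f"{gid}.PATRIC.faa"
        if target.exists():
            continue  # already fetched on an earlier run
        # urllib handles ftp:// URLs; stream the response straight to disk.
        with urllib.request.urlopen(faa_url(gid)) as resp, open(target, "wb") as fh:
            shutil.copyfileobj(resp, fh)
```

Calling `download_faa(["743835.4", "1392.82"])` would attempt to fetch the two genomes mentioned in this thread. Because the server is reported to be slow, skipping files that already exist lets an interrupted run be restarted, although a file left incomplete by a failed transfer would need to be deleted by hand first.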

These files seem to have fewer proteins than the protein families page indicates for a given strain. That may be fine, but can you explain why? For instance, the protein families page for genome 1392.82 suggests 5400 or so sequences, while the file from the FTP site stops at around 4500.

You replied to the original post about primary data - could that be the explanation?

written 11 months ago by yarmda
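
One way to check a count like this is to count the header lines in a downloaded FASTA file, since each record starts with a line beginning with `>`. This is a minimal sketch; the function names are illustrative and not part of any PATRIC tooling.

```python
def count_fasta_records(lines):
    """Count FASTA records in an iterable of lines (headers start with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path):
    """Count FASTA records in the file at `path`."""
    with open(path) as fh:
        return count_fasta_records(fh)
```

Running `count_fasta_file` on each downloaded `.faa` file gives a number that can be compared directly with the CDS counts reported for that genome.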

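PATRIC appears to define additional coding sequences beyond those in the RefSeq entries. Sequence files for both annotations are in the genome directories. For genome 1392.82, the summary below lists 5624 PATRIC CDS versus 5349 RefSeq CDS.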

genome_id  genome_name               taxon_id  genome_length  genome_status  chromosomes  plasmids  contigs  patric_cds  refseq_cds
1392.82    Bacillus anthracis A0157  1392      5322244        Complete       1            1         1        5624        5349
modified 11 months ago • written 11 months ago by genomax
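
A summary like the one quoted above could also be used to pick out every strain of an organism without clicking through the web interface. The sketch below assumes the summary is available as tab-separated text with the column names shown above; the thread does not say which file it comes from, and `select_genomes` is an illustrative name.

```python
import csv
import io


def select_genomes(tsv_text, name_prefix, status="Complete"):
    """Return rows whose genome_name starts with name_prefix and whose status matches."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row
        for row in reader
        if row["genome_name"].startswith(name_prefix)
        and row["genome_status"] == status
    ]


# Header and the single row quoted in this thread, assumed to be tab-separated.
SUMMARY = (
    "genome_id\tgenome_name\ttaxon_id\tgenome_length\tgenome_status\t"
    "chromosomes\tplasmids\tcontigs\tpatric_cds\trefseq_cds\n"
    "1392.82\tBacillus anthracis A0157\t1392\t5322244\tComplete\t"
    "1\t1\t1\t5624\t5349\n"
)

for row in select_genomes(SUMMARY, "Bacillus anthracis"):
    # Difference between PATRIC and RefSeq CDS counts for this genome.
    extra = int(row["patric_cds"]) - int(row["refseq_cds"])
    print(row["genome_id"], extra)  # prints: 1392.82 275
```

The `genome_id` values returned this way are the identifiers needed to locate the per-genome sequence files.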