Hello!
I would like to know where I can obtain CDS data for archaeal organisms. I have tried several databases, such as NCBI and Ensembl, but could not find anything. Does anyone know where I can get them? I could only find complete genomes, which are not useful to me because I need the individual genes.
Thank you very much!
This is very helpful, thank you!
However, the "gene" link only lists individual CDS for each organism (which is great!), but what I need is a single file with all the CDS of an organism together. Do you have any idea where I could obtain that?
I mean, I could do it by hand... but it would take ages to finish!
Thank you anyway for your response!
So, on a general note, CDS != gene, but in your case the difference might be negligible. Could you please add to the specification of what you need: do you want to download (1) all archaeal sequences, (2) all sequences below a certain taxon, (3) a specific set of organisms, or (4) a single organism? Options 1-3 might be easiest to accomplish programmatically, e.g. using NCBI E-utilities. Do you also need to restrict the sequence set to protein-coding sequences?
You can export that full table. In the table you will notice columns for Accession, Start and Stop. Those values can be parsed, and you can then get the sequences using EntrezDirect. I have truncated the sequences to save space. You will need to keep a watch for complement and use the -strand directive there.
Note 1: It may be possible to do this all in EntrezDirect, but an immediate solution eludes me.
Note 2: If you want the protein sequence then replace fasta_cds_na with fasta_cds_aa.
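For anyone who would rather script this than use the command line, here is a minimal Python sketch of the same idea. It only builds NCBI E-utilities efetch URLs and does not download anything. Assumptions on my part: the exported table is tab-delimited with columns named exactly Accession, Start and Stop; a hypothetical Strand column contains the word "complement" for minus-strand entries (adjust this to whatever your export actually contains); and NC_000000.1 is a placeholder accession, not a real record.

```python
import csv
import io
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def region_url(accession, start, stop, on_complement=False):
    """Build an efetch URL for one sub-sequence of a nucleotide record.

    strand=2 asks efetch for the reverse complement, strand=1 for the plus strand.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        "strand": 2 if on_complement else 1,
    }
    return f"{EFETCH}?{urlencode(params)}"


def all_cds_url(accession, protein=False):
    """Build an efetch URL returning every annotated CDS of a record in one FASTA file."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta_cds_aa" if protein else "fasta_cds_na",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def urls_from_table(tsv_text):
    """Turn each row of the exported table into a region URL.

    Column names are assumptions: Accession, Start, Stop, and a
    hypothetical Strand column marking complement-strand features.
    """
    urls = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Make sure start <= stop regardless of how the table orders them.
        start, stop = sorted((int(row["Start"]), int(row["Stop"])))
        on_complement = "complement" in (row.get("Strand") or "").lower()
        urls.append(region_url(row["Accession"], start, stop, on_complement))
    return urls


if __name__ == "__main__":
    # Placeholder data, not a real record.
    table = "Accession\tStart\tStop\tStrand\nNC_000000.1\t10\t50\tcomplement\n"
    for url in urls_from_table(table):
        print(url)
    print(all_cds_url("NC_000000.1"))
```

If all you need is every CDS of one organism in a single file, all_cds_url alone is probably enough: open the URL it returns (for example with urllib.request.urlopen) and save the response. If you loop over many accessions, keep the request rate low, since NCBI limits clients without an API key to about three requests per second.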