Question

Where to obtain CDS from Archaea genomes?

4.3 years ago

Lettucehelen • 0

Hello!

I would like to know where I can obtain the CDS of archaeal organisms. I have tried several databases, such as NCBI and Ensembl, but I could not find anything. Does anyone know where I can obtain them? I could only find complete genomes, which are not useful to me because I need the individual genes.

Thank you very much!

genome sequence Assembly assembly • 743 views

ADD COMMENT • link updated 4.3 years ago by Michael 54k • written 4.3 years ago by Lettucehelen • 0

score 0 · Answer 1 · 2020-01-07

4.3 years ago

Michael 54k

How about the taxonomy browser?

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=2157&lvl=3&p=gene&lin=f&keep=1&srchmode=1&unlock

Then select the "gene" link on the right side. For organisms without splicing, assuming that the CDS equals the gene sequence is usually fine (with some exceptions).

ADD COMMENT • link 4.3 years ago by Michael 54k

This is very helpful, thank you!

However, the "gene" link only lists the individual CDS for each organism (which is great!), whereas what I need is a single file with all the CDS of an organism together. Do you have any idea where I could obtain that?

I mean, I could do it by hand... but it would take ages to finish!

Thank you anyway for your response!

ADD REPLY • link 4.3 years ago by Lettucehelen • 0

So, on a general note, CDS != gene, but in your case the difference might be negligible. Could you please specify what you need: do you want to download (1) all archaeal sequences, (2) all sequences below a certain taxon, (3) a specific set of organisms, or (4) a single organism? Options 1-3 might be easiest to accomplish programmatically, e.g. using NCBI E-utilities. Do you also need to restrict the sequence set to protein-coding sequences?

ADD REPLY • link 4.3 years ago by Michael 54k

You can export that full table. The table has columns for Accession, Start and Stop; those values can be parsed and used to fetch each sequence with EntrezDirect. Watch for genes on the complement strand and add the -strand minus option for them. I have truncated the sequences below to save space.

$ efetch -db nuccore -id NC_002754 -seq_start 2219975 -seq_stop 2221033 -strand minus -format fasta_cds_na
>lcl|NC_002754.1_cds_WP_009993137.1_1 [locus_tag=SSO_RS11880] [db_xref=GeneID:1453915] [protein=DNA polymerase IV] [protein_id=WP_009993137.1] [location=complement(2219975..2221033)] [gbkey=CDS]
ATGATTGTTCTTTTCGTTGATTTTGACTACTTTTACGCTCAAGTTGAAGAAGTTTTAAATCCGTCTTTGA
AAGGAAAACCAGTTGTTGTTTGTGTATTTTCAGGGAGATTTGAGGATAGCGGTGCTGTGGCTACTGCAAA
CTATGAAGCTAGAAAATTTGGAGTAAAAGCTGGAATACCAATCGTTGAGGCTAAGAAAATATTACCTAAT
GCAGTTTACTTACCCATGAGAAAGGAAGTATATCAGCAAGTTTCCAGTAGAATAATGAACTTACTAAGAG

$ efetch -db nuccore -id NC_002607 -seq_start 1089122 -seq_stop 1089910 -format fasta_cds_na
>lcl|NC_002607.1_cds_WP_010903069.1_1 [locus_tag=VNG_RS05715] [db_xref=GeneID:1448071] [protein=bacteriorhodopsin] [protein_id=WP_010903069.1] [location=1089122..1089910] [gbkey=CDS]
ATGTTGGAGTTATTGCCAACAGCAGTGGAGGGGGTATCGCAGGCCCAGATCACCGGACGTCCGGAGTGGA
TCTGGCTAGCGCTCGGTACGGCGCTAATGGGACTCGGGACGCTCTATTTCCTCGTGAAAGGGATGGGCGT
CTCGGACCCAGATGCAAAGAAATTCTACGCCATCACGACGCTCGTCCCAGCCATCGCGTTCACGATGTAC

Note 1: It may be possible to do this all in EntrezDirect but an immediate solution eludes me.
Note 2: If you want the protein sequence, replace fasta_cds_na with fasta_cds_aa.
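
The parsing step could be sketched roughly as follows. This is only an illustration: the function name build_efetch_cmds and the four-column, tab-separated layout (accession, start, stop, orientation as "plus" or "minus") are assumptions, not the actual layout of the exported table, so the column numbers will need adjusting. The function only prints the efetch commands; it does not run them.

```shell
#!/bin/sh
# Print one efetch command per row of a tab-separated table.
# ASSUMED (hypothetical) column layout: accession, start, stop, orientation,
# where orientation is "plus" or "minus". Adjust the column numbers to match
# the table that was actually exported.
build_efetch_cmds() {
    awk -F '\t' '
        # Skip a header row if the second column is not a number.
        NR == 1 && $2 !~ /^[0-9]+$/ { next }
        # Skip rows that have no accession.
        $1 == "" { next }
        {
            cmd = sprintf("efetch -db nuccore -id %s -seq_start %s -seq_stop %s", $1, $2, $3)
            # Genes on the complement strand need the -strand minus option.
            if ($4 == "minus") cmd = cmd " -strand minus"
            print cmd " -format fasta_cds_na"
        }
    ' "$@"
}
```

With a table saved as genes.tsv (a placeholder file name), something like `build_efetch_cmds genes.tsv > fetch_cds.sh` followed by `sh fetch_cds.sh > all_cds.fasta` would collect the sequences into one file. If every CDS on a record is wanted, running efetch with -format fasta_cds_na and without -seq_start/-seq_stop should, as far as I know, return all annotated CDS of that record in one go.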

ADD REPLY • link 4.3 years ago by GenoMax 141k