Where to obtain CDS from Archaea genomes?
4.3 years ago

Hello!

I would like to know where I can obtain the CDS of archaeal organisms. I have tried several databases, such as NCBI and Ensembl, but could not find anything. Does anyone know where I can obtain them? I could only find complete genomes, which are not useful to me because I need the individual genes.

Thank you very much!

4.3 years ago
Michael 54k

How about the taxonomy browser?

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=2157&lvl=3&p=gene&lin=f&keep=1&srchmode=1&unlock

Then select the "gene" link on the right side. For organisms without splicing, treating the CDS as equal to the gene sequence is a reasonable approximation (with some exceptions).


This is very helpful, thank you!

However, the "gene" link only lists the individual CDS for each organism (which is great!), whereas what I need is a single file containing all the CDS of an organism. Do you have any idea where I could obtain that?

I mean, I could do it by hand... but it would take ages to finish!

Thank you anyway for your response!


As a general note, CDS != gene, but in your case the difference might be negligible. Could you please specify what you need: do you want to download (1) all archaeal sequences, (2) all sequences below a certain taxon, (3) the sequences of a specific set of organisms, or (4) those of a single organism? Cases 1-3 are probably easiest to accomplish programmatically, e.g. using the NCBI E-utilities. Do you also need to restrict the sequence set to protein-coding sequences?


You can export that full table. It has columns for Accession, Start and Stop. Those values can be parsed and then used to fetch each sequence with Entrez Direct (EDirect), as in the two examples below, where the sequences are truncated to save space. Watch for genes on the complement strand and add the -strand minus directive for those.

$ efetch -db nuccore -id NC_002754 -seq_start 2219975 -seq_stop 2221033 -strand minus -format fasta_cds_na
>lcl|NC_002754.1_cds_WP_009993137.1_1 [locus_tag=SSO_RS11880] [db_xref=GeneID:1453915] [protein=DNA polymerase IV] [protein_id=WP_009993137.1] [location=complement(2219975..2221033)] [gbkey=CDS]
ATGATTGTTCTTTTCGTTGATTTTGACTACTTTTACGCTCAAGTTGAAGAAGTTTTAAATCCGTCTTTGA
AAGGAAAACCAGTTGTTGTTTGTGTATTTTCAGGGAGATTTGAGGATAGCGGTGCTGTGGCTACTGCAAA
CTATGAAGCTAGAAAATTTGGAGTAAAAGCTGGAATACCAATCGTTGAGGCTAAGAAAATATTACCTAAT
GCAGTTTACTTACCCATGAGAAAGGAAGTATATCAGCAAGTTTCCAGTAGAATAATGAACTTACTAAGAG

$ efetch -db nuccore -id NC_002607 -seq_start 1089122 -seq_stop 1089910 -format fasta_cds_na
>lcl|NC_002607.1_cds_WP_010903069.1_1 [locus_tag=VNG_RS05715] [db_xref=GeneID:1448071] [protein=bacteriorhodopsin] [protein_id=WP_010903069.1] [location=1089122..1089910] [gbkey=CDS]
ATGTTGGAGTTATTGCCAACAGCAGTGGAGGGGGTATCGCAGGCCCAGATCACCGGACGTCCGGAGTGGA
TCTGGCTAGCGCTCGGTACGGCGCTAATGGGACTCGGGACGCTCTATTTCCTCGTGAAAGGGATGGGCGT
CTCGGACCCAGATGCAAAGAAATTCTACGCCATCACGACGCTCGTCCCAGCCATCGCGTTCACGATGTAC
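
The table-driven approach described above could be sketched roughly as follows. This is only a minimal sketch: the function name build_efetch_cmds and the four-column, tab-separated layout (accession, start, stop, orientation) are assumptions made for illustration, so the columns of the table you actually export will probably need to be cut and reordered first. The function only prints efetch commands and does not contact NCBI itself.

```shell
#!/bin/sh
# Build one efetch command per row of a tab-separated table read from stdin.
# Assumed column order (hypothetical; adjust to the exported table):
#   accession <TAB> start <TAB> stop <TAB> orientation (plus|minus)
build_efetch_cmds() {
    while IFS="$(printf '\t')" read -r acc start stop orient; do
        # Skip blank lines and a header row, if present.
        case "$acc" in ''|Accession|accession) continue ;; esac
        cmd="efetch -db nuccore -id $acc -seq_start $start -seq_stop $stop"
        # Genes on the complement strand need -strand minus.
        if [ "$orient" = "minus" ]; then
            cmd="$cmd -strand minus"
        fi
        printf '%s -format fasta_cds_na\n' "$cmd"
    done
}

# Example with the two genes shown above; this only prints the commands.
printf 'NC_002754\t2219975\t2221033\tminus\nNC_002607\t1089122\t1089910\tplus\n' |
    build_efetch_cmds
```

Once the printed commands look right, they can be piped to sh to run them, with the combined output redirected to a single FASTA file.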

Note 1: It may be possible to do all of this in Entrez Direct, but an immediate solution eludes me.
Note 2: If you want the protein sequences instead, replace fasta_cds_na with fasta_cds_aa.
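
Regarding Note 1, one option that may be worth trying: when -seq_start and -seq_stop are left out, the fasta_cds_na format applies to the whole record, so a single efetch call should return every CDS annotated on that accession. The sketch below only prints such a command. The function name all_cds_cmd and the output file name are hypothetical, and a genome split across several records (chromosomes, plasmids) would need one call per accession.

```shell
#!/bin/sh
# Print an efetch command that retrieves every CDS annotated on one record.
# Without -seq_start/-seq_stop, fasta_cds_na covers the whole record.
all_cds_cmd() {
    acc=$1
    fmt=${2:-fasta_cds_na}   # pass fasta_cds_aa for protein translations
    # The output file name <accession>.cds.fasta is an arbitrary choice.
    printf 'efetch -db nuccore -id %s -format %s > %s.cds.fasta\n' \
        "$acc" "$fmt" "$acc"
}

# Prints the command for the first accession used above; pipe to sh to run it.
all_cds_cmd NC_002754
```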
