Batch download from NCBI mitochondrial resources
5.1 years ago
medamato • 0

Dear group, I am relatively new to these resources. I would like to download a batch of sequences from the mitochondrial reference sequences at NCBI. For instance, I would like to get all 16S rRNA sequences from Felidae (https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles/Felidae) in FASTA format. How do I extract and download this information? Thanks, Eugenia.

sequence gene alignment • 1.9k views

Thank you, genomax! It worked. Slowly, I got there. :)


Please upvote and accept the answer if it was helpful. Thanks!


Dear genomax, I am asking for help again. I tried myself, but my understanding of the script is not very good and I could not find a solution in the Entrez programming manual.

I have to keep analyzing different taxa and different genes. Which terms of the code do I need to modify for this? (Sorry, I don't want to keep bothering you.)

E.g. I have now tried 16S and Carnivora (encompassing Felidae, dogs, ferrets, seals, etc.), proceeding as above and modifying the output name, but it does not work. How do I change the taxon and gene of interest when extracting from full mitochondrial genomes? Thanks a million, Eugenia.


There are 198 results for 16S/Carnivora, and 66 of those seem to have 16S annotations, which appear in two separate fields. The following should get you those entries.

awk -F '\t' '{if ($12 ~/NC/ && ($8 ~/16S/ || $6 ~/16S/)) print $12,$13,$14}' info.txt
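
To see what this filter keeps, here is a minimal sketch that runs it on a tiny made-up table instead of a real download. It assumes, as the awk command implies, that the gene symbol is in column 6, the description in column 8, and the accession, start and stop in columns 12 to 14. The `mkrow` and `filter_16s` helpers are hypothetical names, and the gene symbols, descriptions and the third row's coordinates are invented for illustration.

```shell
# Made-up stand-in for the tab-separated Gene export; NOT real NCBI output.
# mkrow SYMBOL DESCRIPTION ACCESSION START STOP -> one 14-column row,
# with "NA" in every column the filter does not look at.
mkrow() {
  printf 'NA\tNA\tNA\tNA\tNA\t%s\tNA\t%s\tNA\tNA\tNA\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5"
}

sample=$(mktemp)
{
  mkrow 'RNR2'     '16S ribosomal RNA' 'NC_011124.1' 1103 2677   # 16S in description ($8)
  mkrow '16S rRNA' 'l-rRNA'            'NC_035814.1' 1102 2669   # 16S in symbol ($6)
  mkrow 'RNR1'     '12S ribosomal RNA' 'NC_011124.1' 70   1030   # different gene
  mkrow 'RNR2'     '16S ribosomal RNA' ''            ''   ''     # no reference accession
} > "$sample"

# Same test as in the reply: NC accession in $12, and 16S in $8 or $6.
filter_16s() {
  awk -F '\t' '{if ($12 ~/NC/ && ($8 ~/16S/ || $6 ~/16S/)) print $12,$13,$14}' "$1"
}

filter_16s "$sample"
```

Only the first two rows pass: the third is a different gene and the fourth has no NC accession, so each printed line is an "accession start stop" triple ready to hand to efetch.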

Dear genomax, thanks for your reply. I ran the script and only get a list of this sort on the screen:

NC_011124.1 1103 2677
NC_035814.1 1102 2669
NC_008417.1 1101 2677
NC_008420.1 1102 2679
... etc.

But I need a FASTA file like the one that worked for Felidae.

I appreciate your help. Best regards, Eugenia.


You need to replace the awk part before the first | in the command line from my answer with this code. Try this:

awk -F '\t' '{if ($12 ~/NC/ && ($8 ~/16S/ || $6 ~/16S/)) print $12,$13,$14}' info.txt | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > carnivora.fa
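
Since the question was which parts of the command change between runs, the sketch below pulls them out as arguments: the saved search file, the gene pattern and the output name. The taxon is not in the command at all; it comes from the search that produced the file. The function names `coords` and `fetch_fasta` are hypothetical, and the column positions are the ones used in the thread.

```shell
# coords FILE PATTERN
# Print "accession start stop" for rows whose symbol ($6) or description ($8)
# matches PATTERN and whose accession ($12) contains NC.
coords() {
  awk -F '\t' -v pat="$2" '$12 ~ /NC/ && ($8 ~ pat || $6 ~ pat) {print $12, $13, $14}' "$1"
}

# fetch_fasta FILE PATTERN OUTPUT
# Feed each coordinate triple to efetch (EDirect), as in the one-liner above.
fetch_fasta() {
  coords "$1" "$2" |
    xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' > "$3"
}

# Example calls (they need EDirect and network access, so they are commented out):
# fetch_fasta info.txt 16S carnivora.fa
# fetch_fasta info18Aves.txt 18S Aves18.fa
```

Running `coords` on its own first is a cheap way to check that the pattern selects the rows you expect before any sequence is downloaded.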

I have a related query: I am now trying to extract nuclear genes.

E.g. in Gene I searched using the terms: (18S ribosomal RNA) AND Aves

This returns a list of 71 entries. I use code similar to the above:

awk -F '\t' '{if ($12 ~/NC/ && $8 ~/(18S ribosomal RNA)/) print $12,$13,$14}' info18Aves.txt | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > Aves18.fa

But I can recover only the sequences with a chromosomal location, although all the others are annotated too. How can I recover all 71 sequences? Thanks, Eugenia.


Only 23 of the entries that have 18S have chromosomal locations assigned. You can get those entries with:

awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt
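
One way to see why the NC test returns fewer rows than the search did is to tally what the accession column actually contains. This is a rough sketch under the assumption that the accession sits in column 12 and that the file starts with a header line; `mkrow` and `prefix_tally` are hypothetical helpers, and the sample rows (including the NW_ accession) are invented.

```shell
# mkrow SYMBOL DESCRIPTION ACCESSION START STOP -> one made-up 14-column row.
mkrow() {
  printf 'NA\tNA\tNA\tNA\tNA\t%s\tNA\t%s\tNA\tNA\tNA\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5"
}

# prefix_tally FILE
# Count rows by the first three characters of column 12 (e.g. "NC_"),
# skipping the header line; rows with an empty column 12 count as "none".
prefix_tally() {
  awk -F '\t' 'NR > 1 { p = ($12 == "" ? "none" : substr($12, 1, 3)); n[p]++ }
               END { for (p in n) print p, n[p] }' "$1" | LC_ALL=C sort
}

sample=$(mktemp)
{
  mkrow 'Symbol' 'description'       'accession'      'start' 'stop'   # header line
  mkrow 'LOC1'   '18S ribosomal RNA' 'NC_042584.1'    3287564 3293084  # placed on a chromosome
  mkrow 'LOC2'   '18S ribosomal RNA' 'NW_000000000.1' 100     1900     # invented non-NC accession
  mkrow 'LOC3'   '18S ribosomal RNA' ''               ''      ''       # no accession at all
} > "$sample"

prefix_tally "$sample"
```

On a real export, the counts for prefixes other than NC_ show how many entries the filter is leaving out.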

Thanks, genomax. I tried your code, followed by the part that writes the FASTA file:

awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt | xargs -n 3 sh -c 'efetch -db genome -id $0 -seq_start $1 -seq_stop $2 -format fasta' > Aves18b.fa

I get a very long message starting with

400 Bad Request No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=genome&id=NC_044276.1&rettype=fasta&retmode=text&seq_start=23750946&seq_stop=23752773&edirect_os=MSWin32&edirect=13.6&tool=edirect&email=NBDLN587A3839+Admin@NBDLN587A3839.ad.uwc.ac.za'

... the message is much longer and continues further below.

I appreciate your help.


You need to use your own file name in place of gene_result.txt. That is the name under which I had saved my search for 18S rRNA and Aves.


Thanks, genomax. I did so, and it did not work. To rule out typing mistakes, I saved the file under your name and tried again. I still get the same error messages.


Not sure what the problem is on your end. Works for me. I am only showing some fasta headers.

$ awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' | grep ">"

>NC_042584.1:3287564-3293084 Lonchura striata domestica isolate Mets1 chromosome 19, lonStrDom2, whole genome shotgun sequence
>NC_031787.1:2219507-2225070 Parus major isolate Abel chromosome 19, Parus_major1.1, whole genome shotgun sequence
>NC_045495.1:7909251-7927337 Corvus moneduloides isolate bCorMon1 chromosome 20, bCorMon1.pri, whole genome shotgun sequence
>NC_034426.1:173861-179320 Numida meleagris isolate 19003 breed g44 Domestic line chromosome 18, NumMel1.0, whole genome shotgun sequence
>NC_029534.1:71731-76431 Coturnix japonica isolate 7356 chromosome 19, Coturnix japonica 2.1, whole genome shotgun sequence
>NC_044264.1:3317928-3323125 Calypte anna isolate BGI_N300 chromosome 19, bCalAnn1_v1.p, whole genome shotgun sequence
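
A quick sanity check after a run is to compare the number of rows the filter selects with the number of FASTA headers that were written. The sketch below does that with two hypothetical helpers, `count_expected` and `count_fetched`; the demo FASTA file is made up. One visible difference between the commands in this thread that may be worth ruling out: the command shown here queries `-db nuccore`, while the attempt that returned 400 Bad Request used `-db genome`.

```shell
# count_expected FILE PATTERN
# Number of rows the whole-line awk filter would hand to efetch.
count_expected() {
  awk -F '\t' -v pat="$2" '$0 ~ /NC/ && $0 ~ pat {n++} END {print n + 0}' "$1"
}

# count_fetched FASTA
# Number of sequences actually written (one ">" header per sequence).
count_fetched() {
  grep -c '^>' "$1"
}

# Demo on a made-up two-record FASTA file.
demo=$(mktemp)
printf '>NC_042584.1:3287564-3293084 example header\nACGT\n' > "$demo"
printf '>NC_031787.1:2219507-2225070 example header\nACGT\n' >> "$demo"
count_fetched "$demo"

# On real files (names taken from the thread):
# count_expected gene_result.txt 18S
# count_fetched Aves18b.fa
```

If the two numbers differ, some requests returned nothing, and the missing accessions can be found by comparing the headers with the coordinate list.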

In the error line that I transcribed there is something very strange:

*+Admin@NBDLN587A3839.ad.uwc.ac.za'*

uwc.ac.za is the end of my university address; however, I logged in to NCBI with my personal Gmail account.

Is it possible that this is some kind of issue generated by the university server? Thanks, Eugenia.


My command lines work on Linux. Are you now using Windows? Windows Subsystem for Linux on Windows? Was there some change to the local firewall? Is it preventing your downloads?


I am using Cygwin. I tried to install Ubuntu, but something prevented me from doing so; Windows updates only when we are on campus. We are still in lockdown, working from home. Maybe something is missing from the Windows updates? I will consult with the university. Thanks a lot for all your help. I will let you know.


Dear genomax, I have no idea what happened: I restarted the computer and the code is working now, but it only recovers 6 sequences, not 23 (out of the 71 Aves 18S entries).

awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > Aves18cc.fa
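
The thread does not establish why only 6 sequences came back. One way to narrow it down, sketched here as an assumption rather than a diagnosis, is to fetch the triples one at a time and report every triple for which no FASTA record is returned. `fetch_one_by_one` is a hypothetical helper; it expects the awk output saved to a file, and the pause between requests is an optional precaution.

```shell
# fetch_one_by_one COORDS OUT [PAUSE_SECONDS]
# COORDS holds one "accession start stop" triple per line.
fetch_one_by_one() {
  : > "$2"                                   # start with an empty output file
  while read -r acc start stop; do
    # </dev/null keeps efetch from consuming the rest of the loop's input
    seq=$(efetch -db nuccore -id "$acc" -seq_start "$start" -seq_stop "$stop" \
                 -format fasta < /dev/null)
    case $seq in
      '>'*) printf '%s\n' "$seq" >> "$2" ;;                              # looks like FASTA
      *)    echo "no sequence returned for: $acc $start $stop" >&2 ;;    # report the gap
    esac
    sleep "${3:-1}"                          # default pause: 1 second
  done < "$1"
}

# Example (needs EDirect and network access, so it is commented out):
# awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt > coords.txt
# fetch_one_by_one coords.txt Aves18cc.fa
```

The messages on standard error then list exactly which accessions are missing from the output file.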

Thanks, Eugenia.

5.1 years ago
GenoMax 152k
  1. Search with the term 16S rRNA AND felidae at NCBI.
  2. Click on the Gene results (there are 57 at this time).
  3. Use the tabular format, show all results on the page, and save the results in a file (info.txt).
  4. Use the following code to extract the accession of each mitochondrial genome reference and EntrezDirect to get the sequence in FASTA format. There should be 27 sequences that are annotated as 16S rRNA.

Code

awk -F '\t' '{if ($12 ~/NC/ && $8 ~/16S/) print $12,$13,$14}' info.txt  | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > felidae_fasta.fa
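
Because the command depends on fixed column numbers ($8, $12, $13, $14), it can help to confirm them against the header of the saved file before running it. The sketch below numbers the header fields; `list_columns` is a hypothetical helper, and the header names in the demo are an assumption about the export format, not something stated in this thread, so check them against your own info.txt.

```shell
# list_columns FILE
# Print each field of the header line with its column number.
list_columns() {
  head -n 1 "$1" | tr '\t' '\n' | awk '{print NR, $0}'
}

# Demo on an assumed 14-column header (tab-separated, one line).
demo_header=$(mktemp)
printf '%s\t' tax_id Org_name GeneID CurrentID Status Symbol Aliases description \
  other_designations map_location chromosome genomic_nucleotide_accession.version \
  start_position_on_the_genomic_accession > "$demo_header"
printf '%s\n' end_position_on_the_genomic_accession >> "$demo_header"

list_columns "$demo_header"

# On the real file:
# list_columns info.txt
```

If the accession or coordinate fields land on different numbers in your file, adjust the `$N` references in the awk command accordingly.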