Question: Batch download from NCBI mitochondrial resources
medamato wrote:

Dear group, I am relatively new to these resources. I would like to download a batch of sequences from the mitochondrial reference sequences at NCBI. For instance, I would like to get all 16S rRNA sequences from Felidae (https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles/Felidae) in FASTA format. How do I extract and download this information? Thanks, Eugenia.

alignment sequence gene • 209 views
modified 5 weeks ago • written 6 weeks ago by medamato

Thank you, genomax! It worked. Slowly, but I got there. :)

written 6 weeks ago by medamato

Please upvote and accept the answer if it was helpful. Thanks!

written 6 weeks ago by ATpoint

Dear genomax, I am asking for help again. I tried myself, but I do not understand the script well and could not find a solution in the Entrez programming manual.

I have to keep analyzing different taxa and different genes. Which terms of the code do I need to modify for this? (Sorry, I don't want to keep bothering you.)

E.g., I have now tried 16S and Carnivora (encompassing Felidae, dogs, ferrets, seals, etc.), proceeding as above and modifying the output name, but it does not work. How do I change the taxa and gene of interest when extracting from full mitochondrial genomes? Many thanks, Eugenia.

written 6 weeks ago by medamato

There are 198 results for 16S/Carnivora, and 66 of those seem to have 16S annotations, which appear in one of 2 separate fields (columns 6 and 8 in the awk command below). The following should get you those entries.

awk -F '\t' '{if ($12 ~/NC/ && ($8 ~/16S/ || $6 ~/16S/)) print $12,$13,$14}' info.txt
modified 6 weeks ago • written 6 weeks ago by genomax

Dear genomax, thanks for your reply. I ran the script, and it only prints a list like this on the screen:

NC_011124.1 1103 2677
NC_035814.1 1102 2669
NC_008417.1 1101 2677
NC_008420.1 1102 2679
... etc.

But I need a FASTA file like the one that worked for Felidae.

I appreciate your help. Best regards, Eugenia.

written 5 weeks ago by medamato

In the command line from my answer, replace the awk part before the first | with this code. Try this:

awk -F '\t' '{if ($12 ~/NC/ && ($8 ~/16S/ || $6 ~/16S/)) print $12,$13,$14}' info.txt | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > carnivora.fa
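To switch taxa or genes, two things change: the search you save from the NCBI website (which determines the input file) and the pattern that awk matches. Here is a minimal sketch that wraps the command above in shell functions. The names `gene_coords` and `fetch_gene_regions` and their argument order are hypothetical, introduced only for illustration; the sketch assumes the same column layout as the command above and that EDirect's efetch is on your PATH.

```shell
# Hypothetical helper: print "accession start stop" for rows whose
# symbol (column 6) or description (column 8) matches a pattern.
# Usage: gene_coords PATTERN TABLE
gene_coords() {
  awk -F '\t' -v pat="$1" \
    '$12 ~ /NC/ && ($8 ~ pat || $6 ~ pat) { print $12, $13, $14 }' "$2"
}

# Hypothetical helper: fetch each region as FASTA with EDirect's efetch.
# Usage: fetch_gene_regions PATTERN TABLE OUTPUT
fetch_gene_regions() {
  gene_coords "$1" "$2" |
    xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > "$3"
}

# Example call (not run here): fetch_gene_regions 16S info.txt carnivora.fa
```

Running `gene_coords` on its own first shows which regions would be requested, without contacting NCBI.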
modified 5 weeks ago • written 5 weeks ago by genomax

I have a related query: I am now trying to extract nuclear genes.

E.g., in Gene I searched using the terms: (18S ribosomal RNA) AND Aves

I recovered a list of 71 entries. I used code similar to the above:

awk -F '\t' '{if ($12 ~/NC/ && $8 ~/(18S ribosomal RNA)/) print $12,$13,$14}' info18Aves.txt | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > Aves18.fa

But I can recover only the sequences with a chromosomal location, although all the others are annotated too. How can I recover all 71 sequences? Thanks, Eugenia.

written 5 weeks ago by medamato

Only 23 of the entries that have 18S have chromosomal locations assigned. You can get those entries with:

awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt
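Entries dropped by the NC test are not necessarily unannotated: RefSeq uses the NC_ prefix for complete genomic molecules such as chromosomes, while scaffold and contig records carry prefixes such as NW_ or NT_. A small sketch for tallying what column 12 holds in a saved table follows. `count_prefixes` is a hypothetical helper name, and the sketch assumes the first line of the file is a header.

```shell
# Hypothetical helper: count rows per accession prefix in column 12
# (e.g. NC, NW), skipping the header line. Rows with an empty
# column 12 are reported as "none".
# Usage: count_prefixes TABLE
count_prefixes() {
  awk -F '\t' 'NR > 1 {
      i = index($12, "_")
      p = ($12 == "") ? "none" : (i ? substr($12, 1, i - 1) : $12)
      n[p]++
    }
    END { for (p in n) print p, n[p] }' "$1" | LC_ALL=C sort
}

# Example call (not run here): count_prefixes gene_result.txt
```

If scaffold records turn up with coordinates, loosening the NC test in the filter may retrieve them as well, though that is an untested assumption for this particular search.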
modified 5 weeks ago • written 5 weeks ago by genomax

Thanks, genomax. I tried your code, followed by the part that writes the FASTA file:

awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt | xargs -n 3 sh -c 'efetch -db genome -id $0 -seq_start $1 -seq_stop $2 -format fasta' > Aves18b.fa

I get a very long message starting with:

400 Bad Request No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=genome&id=NC_044276.1&rettype =fasta&retmode=text&seq_start=23750946&seq_stop=23752773&edirect_os=MSWin32&edirect=13.6&tool=edirect&email=NBDLN587A3839+A dmin@NBDLN587A3839.ad.uwc.ac.za'

... it is much longer and continues further below.

I appreciate your help.

written 5 weeks ago by medamato

You need to use your own file name in place of gene_result.txt. That is the name under which I saved my search for 18S rRNA and Aves.

written 5 weeks ago by genomax

Thanks, genomax. I did so, and it did not work. To rule out typing mistakes, I saved the file under your file name and tried again. I still get the same error messages.

written 5 weeks ago by medamato

Not sure what the problem is on your end. It works for me. I am only showing some of the FASTA headers.

$ awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' | grep ">"

>NC_042584.1:3287564-3293084 Lonchura striata domestica isolate Mets1 chromosome 19, lonStrDom2, whole genome shotgun sequence
>NC_031787.1:2219507-2225070 Parus major isolate Abel chromosome 19, Parus_major1.1, whole genome shotgun sequence
>NC_045495.1:7909251-7927337 Corvus moneduloides isolate bCorMon1 chromosome 20, bCorMon1.pri, whole genome shotgun sequence
>NC_034426.1:173861-179320 Numida meleagris isolate 19003 breed g44 Domestic line chromosome 18, NumMel1.0, whole genome shotgun sequence
>NC_029534.1:71731-76431 Coturnix japonica isolate 7356 chromosome 19, Coturnix japonica 2.1, whole genome shotgun sequence
>NC_044264.1:3317928-3323125 Calypte anna isolate BGI_N300 chromosome 19, bCalAnn1_v1.p, whole genome shotgun sequence
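When a run fails with an error like the 400 Bad Request quoted earlier in the thread, it can help to print the efetch commands instead of executing them and to check each argument. Note that the failing request URL contains db=genome, whereas the command above uses -db nuccore. The sketch below swaps efetch for echo; `dry_run` is a hypothetical name introduced for illustration.

```shell
# Hypothetical helper: show the efetch command line that would be run
# for each "accession start stop" triple, without contacting NCBI.
# Usage: dry_run TABLE
dry_run() {
  awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' "$1" |
    xargs -n 3 sh -c 'echo efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'
}

# Example call (not run here): dry_run gene_result.txt
```

Each printed line can then be compared against a command known to work before any request is sent.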
modified 5 weeks ago • written 5 weeks ago by genomax

In the error line that I transcribed there is something very strange:

*+A dmin@NBDLN587A3839.ad.uwc.ac.za'*

uwc.ac.za is the end of my university address; however, I logged in to NCBI with my personal Gmail account.

Is it possible that this is some kind of issue caused by the university server? Thanks, Eugenia.

written 5 weeks ago by medamato

My command lines work on Linux. Are you now using Windows? Windows Subsystem for Linux on Windows? Has something changed with your local firewall? Is it blocking your downloads?

modified 5 weeks ago • written 5 weeks ago by genomax

I am using Cygwin. I tried to install Ubuntu, but something prevented me from doing so; Windows updates only when we are on campus. We are still in lockdown, working from home. Maybe something is missing from the Windows updates? I will consult the university. Thanks a lot for all your help. I will let you know.

written 5 weeks ago by medamato

Dear genomax, I have no idea what happened: I restarted the computer and the code is working now, but it only recovers 6 sequences, not 23 (out of the 71 Aves 18S entries).

awk -F '\t' '{if ($0 ~/NC/ && $0 ~/18S/) print $12,$13,$14}' gene_result.txt | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > Aves18cc.fa
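One way to narrow down a shortfall like this is to compare the number of regions the filter selects with the number of FASTA headers that actually arrived. Below is a sketch assuming the same gene_result.txt layout as above; `expected_count` and `fetched_count` are hypothetical helper names.

```shell
# Hypothetical helper: number of rows the awk filter selects.
# Usage: expected_count TABLE
expected_count() {
  awk -F '\t' '$0 ~ /NC/ && $0 ~ /18S/ { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: number of FASTA records in a file
# (lines starting with ">").
# Usage: fetched_count FASTA
fetched_count() {
  grep -c '^>' "$1"
}

# Example (not run here):
#   echo "expected: $(expected_count gene_result.txt), fetched: $(fetched_count Aves18cc.fa)"
```

If the two numbers differ, comparing the accessions in the FASTA headers against the filter output shows which requests did not return a sequence.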

Thanks, Eugenia.

written 5 weeks ago by medamato
genomax wrote:
  1. Search with the term 16S rRNA AND felidae at NCBI.
  2. Click on the Gene results (there are 57 at this time).
  3. Use the tabular format, show all results on the page, and save the results to a file (info.txt).
  4. Use the following code to extract the accession and coordinates from each mitochondrial genome reference, and EntrezDirect to get the sequence in FASTA format. There should be 27 sequences that are annotated as 16S rRNA.

Code

awk -F '\t' '{if ($12 ~/NC/ && $8 ~/16S/) print $12,$13,$14}' info.txt  | xargs -n 3 sh -c 'efetch -db nuccore -id $0 -seq_start $1 -seq_stop $2 -format fasta' > felidae_fasta.fa
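To see what the awk filter selects before fetching anything, it can help to run it on a tiny mock file. The sketch below assumes what the command above relies on: the gene description is in column 8, and the genomic accession, start, and stop are in columns 12 to 14 of the tabular Gene download. The two rows, the accession NC_999999.1, and the coordinates are made up for illustration.

```shell
# Build a two-row mock of info.txt (tab-separated, 14 columns).
# All values are hypothetical placeholders, not real NCBI records.
mock=$(mktemp)
{
  printf '1\t2\t3\t4\t5\trnr2\t7\t16S ribosomal RNA\t9\t10\tMT\tNC_999999.1\t100\t200\n'
  printf '1\t2\t3\t4\t5\tcox1\t7\tcytochrome c oxidase I\t9\t10\tMT\tNC_999999.1\t300\t400\n'
} > "$mock"

# Same filter as in the answer: keep rows whose column 12 contains NC
# and whose column 8 mentions 16S, then print accession, start, stop.
coords=$(awk -F '\t' '{if ($12 ~/NC/ && $8 ~/16S/) print $12,$13,$14}' "$mock")
echo "$coords"
rm -f "$mock"
```

Only the first mock row passes the filter; xargs -n 3 then hands each such triple to efetch as $0, $1, and $2.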
modified 6 weeks ago • written 6 weeks ago by genomax