How to download large-scale cDNA sequences from NCBI efficiently?
5 weeks ago
Sony ▴ 10

Hello everyone,

I would like to download the cDNA dataset from this published paper. In the paper, the cDNA dataset was submitted to the EMBL database. I was not able to download these cDNA sequences in FASTA format from EMBL, but it worked at NCBI. Here are the accession numbers of the submitted data, which can be accessed at NCBI: Oryza sativa ssp. indica cv. Guangluai 4 full-length cDNAs (10,096)

CT827960-CT834770, CT836522-CT836598, CT837477-CT837976, CT834771-CT836521, CT827880-CT827943, CT836599-CT837476

I tried downloading them manually from the NCBI website. I want to use a script or the command line to download the 10,096 cDNA sequences above automatically, because downloading them one by one is not efficient. I wrote a script to download this dataset, but it does not work as I expected: the downloaded files are log files, not FASTA files.

Here is my script:

#!/bin/bash

start_accession=827960
end_accession=834770

for (( i=$start_accession; i<=$end_accession; i++ )); do
    genbank_number="$i.1"
    download_url="https://www.ncbi.nlm.nih.gov/search/api/download-sequence/?db=nuccore&id=$genbank_number"
    echo "Downloading genbank number $genbank_number..."
    wget "$download_url"
    echo "Download of genbank number $genbank_number completed."
done

Does anyone have any guidance for me in this case? Thank you everyone for any support.

5 weeks ago

I think you're mixing up accession numbers and NCBI GIs (deprecated, I think).

Your first 'i', start_accession=827960, looks like a GI.

Then you add a kind of version: genbank_number="$i.1" (it looks like an accession number now, but without the CT prefix).

I think you want something like:

echo CT827960-CT834770,CT836522-CT836598,CT837477-CT837976,CT834771-CT836521,CT827880-CT827943,CT836599-CT837476 |\
 tr ",-" "\n" |\
cut -c3- |\
paste - - |\
while read A B ; do  for i in $(seq $A 1 $B ); do echo $i; done ; done |\
sed 's/^/CT/' |\
xargs -L 100 echo |\
tr " " "," |\
awk '{ print "https://www.ncbi.nlm.nih.gov/search/api/download-sequence/?db=nuccore&id="$0}' |\
xargs -L 1 wget -O -

>CT827960.1 Oryza sativa (indica cultivar-group) cDNA clone:OSIGCPI031I06, full insert sequence
CTTTGATGATTCTTCTTCTTCCTCCTCATTCAGATATGGATCCTCTGCAACGACTGCGGCGCGACGTCGA
ACGTGAACTTCCACGTCCTGGCGCAGAAGTGCCCCGGATGCAGCTCCTACAACACCCGGGAGACCAGAGG
CTGCGGCCGCCCTGCAGCCGCGCGCTCCACGGTTTGATTTCAGCAGCAGCAGCAGAGACGAGTTGGCATC
CATCTCACAAACTAAGGATGAAATCGAGAGCGACAAACAAGATGCAGAGACGGCTTCCTCTGAACTTAGC
CGTCGAGCAAAAGCTGCAAAATCGATCGGCGTCGAGTTTGGTAGACACTTTGCGCCAAGAGGAGTATGGT
GATTTTGGCGCAGTATGCAGCGAGTTGAATAGCCCATATATGTTGTGTTTTCTCTCTCTCTCTTTTTTGT
GAGGATATATGTTATGTTTTGAAACTCCAACTATTATTATTACTAAATGATACTCCTAATAAAAGAGAGA
CATCTTCTCAAG

>CT827961.1 Oryza sativa (indica cultivar-group) cDNA clone:OSIGCRA205K13, full insert sequence
AAATGGCTGGAGCAGCAGATGGTGAGGGTCTGAGGCCGTTGCCATCCCGCCGGAGCTTACCACCCTCTTC
GTCAGACCTGGTGCCTGCAACAACAGTGAGCTGCTTCTGGTTCCCGGCATCGACGTCTCCCACAGCTCGT
TCTTCAACCGCGCCGACGCCCCAGGCGCCCACGGTGCCCCTGCCGGCTTCCTCGACACTTTCGACGTCGC
CATCAACGGCGCGCTCCGCGCCGCTCCTGCCGCGGCTTCACCGGCTTTCCTCCTGCCGAACCTCAACGAT
GACGCGACTGCGACTCTCCACGCCCAGGCCGTCGCCGTTCTGAACCACGACCGCAGATTCGGGTGGCGCT
GTCATGGAAGGTGGAGAAGAACCGGGGGTGGAGCCGTCGGCATCGGCGACTACAGAGGAACCGACAGCGA
GGTGGAGCAGAAGTCAGCGAGGACGGAGGGAGTGGGTGGTTAAGCCGTGGTGCACTTCCGACGGGGGCGC
CGAGGCAACGGGTGGTTGGTGCATGTGTCCCAGCCTCCGGTGGCAGCAGCGGCCTCAACCCTCAGATCTG
GGGCGGCACAGAGGAAGTTACATGCCAGAAGCCGGCGGCGACAAGAAGCAGCATGGCGGAGATGGAGGCT
GGTGATAAAAAGAACAGCGAATGGTGGAGGCCGTGGCAATGTGAAACAATGTGGACAGCAAAATCTGGAG
ATGGAATCAACAAATGGACCATTACCTCGCTAGCTCCCTTGAGGCTTCTTCCATTGTCAAGATCCAAATC
AACAACAGCCCCCCCCAACCTTTCAATTGAGCCATTGAGGAGTGAAACGACTGCCGACATGCAACGAAAC
GGAGCGGCCAAGAAACACTACGAACTCCACCACGGCCGGCTAGCTTCTGATCTGGCCACTGCCACGACGC
CACACCAGCACACGCTCTCGTGAAACCCTAGGCCCCTCTCCCGCAGGTGTGGCCTTCCCCGACACCGAGC
CTTCACGCGACGGCGGAAGAGATGACCGGCGGCCTGCGGCCACTACTTATTAGGGGAGGATGTGCATCAT
CCGGTGTCATTGCTCTGTGCGCGGGAATAACGCGGCAAGCGGTGGCGGGGATAGCGCTCGAGGCGGTGTT
ACTGCGAAAGGGGTTCTCGCGCTCGCCCGGTGGGATGAGGAATCGGTGATTTCACCGCTACACAGAGGGT
GGGGACATGGGGATTGATTTCACCTGAAGGATTGCTAGTAAGAACTCGTCCATTTTATATTAGTATAGAT
ATAGATAGACCATGAAACCTGTGATTATTTGCCATGCTTCATAATATATGTGGTTTTGTGCAAATTTAAC
CTACCCTCCTGGGCCTGGAAACTACAATAGTTATGGGCCATCACAGGGCCCAAATTATGGACAACCTCAG
TATCCGCAGTCTGCACCTCCACAGAACTATGGGCCTGGTTATGGTGATCCTAGATACAATGCTCCAGCAC
CAAACCAGCAGTACTATGGACAGCCTCCAGCGGGTCCACAGCAAGGCTACCCTCCACAGCAAGATCCCTA
CGCTAGGCCTTATGGTGGACCTGGGACATGGGCACCCAGAGGTGCACCAGCCGGAGATGGCACTTACCAG
GCGCCACCACCTACATCTTATGGCCCACCATCTCAGCAGCCTCCTGCTTATGGTCAGACATATGGGCCAA
CGACTGGACCTTGATGGGGATTTTCAGCAAAAGTTCCCCCCAGCAAAGTGCCCAAGCGCCAACAACAATA
TGGTCAGAGTGCCCCACCAGGGCCAGGGTATGTTCAACAAGGCGCACAGCAAGGGGGTTATGCACAGTAT
CTTCAATCCCAACCAGCATATGGTGATCAAGCAGCTCAAAACAATGCAAACTACGGCTACCAGGGTGCTC
CAGCAGATCCCAACTATGGAAATGCCTACCCACAGGCAGGATACGGTTCTACTCCGGCTAGTGGCCAGGC
TGGATATGCTGCTGCACCGGCTGCTGGCCAGCCAGGGTACGGTCAGCCAGGATACACTCAGCCACCTACA
AATCCACCAGCTTATGATCAGTCTGCCCAGCCACCAGCTCAGAGTGGCTATGCTGCACCTCCTGCAAACC
CACAGCCTGCTGTTGCAAAGGGGGTGTCACCGCAGCCTGCTGGATATGGTGGACAATGGACCGCTTGAGG
TTTGTCCCTCATTATTGACAGCAATGATCTAGTTGAAGACTATGTTTTGCCTCATGATGCTGCCGCTTAT
ATGAAGTAGGCGGTTGAATCCCCTTGGGATGTTCATTCAGTAAGCGGTAGACTTTTGATATGCCTATAAG
GGATGTAACCCCTTGCCTCTCCAGTTGTTATACCGGATCTCTGTAGTAGTTAGTAGTTTGTTAAGATGAC
ATAAAACCTCCTGTTTAGTTTAAAAGTGAACCGAATTATGTGTTATTCTGCAGCATGTCGACTGATGTTT
GGATGCTTAGTCCTAAAAAAAAAAG

>CT827962.1 Oryza sativa (indica cultivar-group) cDNA clone:OSIGCSA036O11, full insert sequence
TAATCTGGAGGCCGTATTTCATGAATACGAGAGGGAGAGTTACAATAAGCTGATTGCCGACATCGAAGCA
CATCCGAACAAAGCAGTTCAGAATGTATTGAAATCCTTCCTGCACAAGATCTACAAGAGGCAGAAGTAGA
GCTAAGCTCATGGAGAAGCTGTTTCATGTTTGCTTGGTAACTAGAGTCGTGGGGACAAATAACTGGTAAC
TAGAGTCATGGGGACAAATAACTGTTCCCTGATGTTGTGTGTATTATGGTTATGTTTGTACCGTGTAGTA
CAGCGTGCTACTCCGTAAAATAATGAAGCATGGTGCTATTTATGCGTGCGTGAACTGCTTGTGTCATT

>CT827963.1 Oryza sativa (indica cultivar-group) cDNA clone:OSIGCSN016D10, full insert sequence
ATCCATATTCCGTCCGTCAGCTACTGCTAGTGGTAGGCTAGATCATCGATGAACGCCATGGATGAGGAGG
AGGAGCAAGAGCAGCCTCCCCAGCGCTACTGGTTCCCGTACTGGACCAGCCCTCCACCGCCTCCGCCCTC
CAGCTCCAGGTACAGGCCGCCGTCACCTCCCTCATCGCGCCATCCCCACCCAACCATCCCAGCTGCCCGC
GCCGCACCACCGCTCGGGCCAACCAACCGCCGCTTGCATCAGCAGCCGCCGCCACCAGCAAGCAGAGATG
GTCGTCACGAGCCTCCTCCCAAGCCCAAGGACGTCGTCGTCATCCCCACCGACACCGTACTGCATCACAA
ACAACCACCACCCACGCATCATCATCAGCACAAGGTGAAAGATCAGGAGGAGAAGAAGGGCGACCTGCGC
AAGGACCTCAAGGCGGGGCTCGCCGGCATGCTCAGCGCCGCCTCCCACGGCCAGCAAGGGACAAGCATCA
TCACGCTGGCCGGCGACAACAAGGGCGCATCCATGAAGATATCCTCCCCCGCCCCAGGCAGCAAGGGCGC
CGGCGACGACAAGAGAAGCAAGGGGGGCGTGAAGGCGATGATCAACAGCAACGTGCAGTCCATCAACAAC
TCGCTGCTTCTCCACAGCTCCTGCAGCGGCGGCGACCTCGGGGTGCACCTCAAGCTCAAGCTCTCCTCAA
ACTCCAAGTCCAAGTCCAAGACCAAGAGCAAGGAGAAGCAGCAGCATAATGTCGTCGCCGATACCAGCAA
CAAGGAGAAGAAGCCCGATAGCAGCCAGGAGAAGAAGGAGGCTGGTGCCAGCGCCGCCAAACCCAACAAG
CCATCCGCCGCTGCCAAAGGCAACAAGCCCGCCGGTGCAGCTAACAAGTGATTCTGCAGACATACTAATG
TATGTATGTGCTTTGTACTGATCTGATTGCCTTCGCCTCATCATAATGATAATCGAATTAAATTTGCGGT
GT
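
For anyone adapting the one-liner above: the range expansion step can also be written as a small bash function that keeps the two-letter prefix and the zero padding of each accession. This is only a sketch; `expand_range` and `batch_ids` are hypothetical helper names, and it assumes both ends of a range share the same two-letter prefix and the same number of digits.

```shell
#!/bin/bash

# Expand one range such as CT827960-CT827963 into individual accessions.
# Assumption: both ends share the same two-letter prefix and digit width.
expand_range() {
    local start=${1%-*} end=${1#*-}
    local prefix=${start:0:2}
    local width=$(( ${#start} - 2 ))
    local i
    # 10# forces base 10 so that leading zeros are not read as octal
    for (( i = 10#${start:2}; i <= 10#${end:2}; i++ )); do
        printf '%s%0*d\n' "$prefix" "$width" "$i"
    done
}

# Join accessions read from stdin into comma-separated batches of $1 IDs per line.
batch_ids() {
    xargs -n "$1" | tr ' ' ','
}
```

For example, `expand_range CT827960-CT834770 | batch_ids 100` prints one line per batch of up to 100 accessions, which can then be appended to the download URL as in the pipeline above.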

Thank you so much for your valuable guidance. I tried and it worked.

I have attached the adjusted script in case someone needs it.

#!/bin/bash

# Define the list of accession number ranges
accession_ranges="CT841557-CT841684 CT841686-CT841707 CT841710-CT841954 CT841956-CT842008 CU405560-CU405627 CU405629-CU405654 CU405656-CU405706 CU405708-CU4>

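# Function to convert accession range to individual accession numbers
convert_accession_range_to_list() {
    # Reuse the two-letter prefix (CT or CU) of the range itself;
    # chaining sed 's/^/CT/' | sed 's/^/CU/' would yield IDs like CUCT841557
    local prefix=${1:0:2}
    echo "$1" | tr ",-" "\n" | cut -c3- | paste - - | while read A B; do
        seq "$A" 1 "$B"
    done | sed "s/^/$prefix/"
}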

# Loop through each accession range and download the sequences
for range in $accession_ranges; do
    # Create a directory to store fasta files for each accession range
    mkdir -p $range
    # Convert accession range to individual accession numbers
    convert_accession_range_to_list $range | while read accession; do
        # Download the sequence
        wget -O "$range/$accession.fasta" "https://www.ncbi.nlm.nih.gov/search/api/download-sequence/?db=nuccore&id=$accession"
    done
done
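
Since the original problem was that the downloaded files were not FASTA, it may be worth checking each file after the download. A minimal sketch, with `is_fasta` and `count_records` as hypothetical helper names; it only tests that the file is non-empty and starts with a `>` header, not that the sequence itself is valid.

```shell
#!/bin/bash

# Succeed if the file is non-empty and its first character is '>',
# i.e. it at least starts like a FASTA record.
is_fasta() {
    [ -s "$1" ] && [ "$(head -c 1 "$1")" = ">" ]
}

# Count FASTA records by counting header lines.
count_records() {
    grep -c '^>' "$1"
}
```

Something like `for f in */*.fasta; do is_fasta "$f" || echo "not FASTA: $f"; done` would then list any downloads that need to be retried, and the total from `count_records` can be compared with the expected number of accessions.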

Don't forget to follow up on your threads. If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work. If an answer was not really helpful or did not work, provide detailed feedback so others know not to use that answer.


Pierre Lindenbaum - I recently got some clarity on GIs from the ESummary docs page, which states:

NCBI is no longer assigning GI numbers to a growing number of new sequence records. As such, these records are not indexed in Entrez, and so cannot be retrieved using ESearch or ESummary, and have no Entrez links accessible by ELink. EFetch can retrieve these records by including their accession.version identifier in the id parameter.

This had been a point of confusion for me, so I wanted to share!

VL
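
Following on from the EFetch note above, here is a rough sketch of fetching FASTA by accession.version through the E-utilities EFetch endpoint instead of the website download URL. `efetch_url` and `fetch_fasta` are hypothetical helper names; the sketch assumes `curl` is installed, and the one-second pause is just a conservative choice to stay within NCBI's request rate limits.

```shell
#!/bin/bash

EFETCH_BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Build the EFetch URL for a comma-separated list of accession.version IDs.
efetch_url() {
    printf '%s?db=nuccore&id=%s&rettype=fasta&retmode=text\n' "$EFETCH_BASE" "$1"
}

# Read one comma-separated batch of IDs per line on stdin and
# append the returned FASTA records to the file named in $1.
fetch_fasta() {
    local out=$1 ids
    while read -r ids; do
        curl -sSf "$(efetch_url "$ids")" >> "$out" || echo "failed: $ids" >&2
        sleep 1  # pause between requests
    done
}
```

Usage would look like `printf 'CT827960.1,CT827961.1\n' | fetch_fasta cdna.fasta`, where `cdna.fasta` is just an example output name.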
