Question

Retrieve the FASTA transcript sequences of a list of NCBI genes

9.8 years ago

Sibs • 0

Hello everyone,

I want to design a microarray probe set for Coccomyxa subellipsoidea C-169, which has been fully sequenced. I need to make a target sequence data file in FASTA format, or a TDT file of GenBank accessions. I have a list of 10091 potential genes for the microarray design (from NCBI-->Gene). How can I make this file? Do you know of any step-by-step guide that I can use? Thanks

gene Probe-set RNA Microarray • 4.6k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Sibs • 0

Do you need to pull the genes themselves using identifiers, or just get the transcript sequences from your FASTA file?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 9.6 years ago by pawlowac ▴ 80

Ram · Answer 1 · 2015-04-06

Say you are able to extract the identifiers, one identifier per line (perhaps in a file). Here I am just using echo to generate the list:

$ echo -e "A0QTF8\nA0QTG1\nA0QU63"
A0QTF8
A0QTG1
A0QU63

Then you can use the following one-liner to fetch each UniProt entry and print its sequence in FASTA format:

$ echo -e "A0QTF8\nA0QTG1\nA0QU63" | while read -r l; do echo ">$l"; curl -sL "https://www.uniprot.org/uniprot/$l.txt" | awk '/^SQ/,/^\/\//{if ($0!~/^\/\// && $0!~/^SQ/) {gsub(" ","",$0); printf "%s", $0}} END {print ""}'; done
>A0QTF8
MDLINGMGTSPGYWRTPREPGNDHRRARLDVMAQRIVITGAGGMVGRVLADQAAAKGHTVLALTSSQCDITDEDAVRRFVANGDVVINCAAYTQVDKAEDEPERAHAVNAVGPGNLAKACAAVDAGLIHISTDYVFGAVDRDTPYEVDDETGPVNIYGRTKLAGEQAVLAAKPDAYVVRTAWVYRGGDGSDFVATMRRLAAGDGAIDVVADQVGSPTYTGDLVGALLQIVDGGVEPGILHAANAGVASRFDQARATFEAVGADPERVRPCGSDRHPRPAPRPSYTVLSSQRSAQAGLTPLRDWREALQDAVAAVVGATTDGPLPSTP
>A0QTG1
MSAAANAEHGAADRVEILPVPGLPEFRPGDDLVGSLAEAAPWLRDGDVLVVTSKVVSKCEGRIVAAPSDPEERDTLRRKLIDDEAVRVLARKGRTLITENAIGLVQAAAGVDGSNVGSTELALLPVDPDRSAATLREGLRERLGVTVGVVITDTMGRAWRTGQTDFAIGASGLTVLQGYAGSRDRHGNELVVTEVAVADEIAAAADLVKGKLTAIPVAVVRGLRLPDDGSTAHRLVRAGEDDLFWLGTAEAIELGRRQAQLLRRSVRRFSAEPVPHDAIEAAVGEALTAPAPHHTRPVRFVWVQDSETRTRLLDRMKEQWRADLTADGLDADAVDRRVARGQILYDAPELVIPFLVPDGAHSYPDDARTAAEHTMFTVAVGAAVQGLLVALAVRDIGSCWIGSTIFAADLVRAELELPDDWEPLGAIAIGYPEQTPQPLGPRDPVPTDELLVRK
>A0QU63
MTKKSASSNNKVVATNRKARHNYTILDTYEAGIVLMGTEVKSLREGQASLADAFATVDDGEIWLRNVHIAEYHHGTWTNHAPRRNRKLLLHRKQIDNLIGKIRDGNLTLVPLSIYFTDGKVKVELALARGKQAHDKRQDLARRDAQREVIRELGRRAKGKI
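
If you prefer Python to awk, the same extraction of the SQ block can be sketched roughly as below. This is only an illustration: extract_sequence and to_fasta are hypothetical helper names, and the sample record is an abbreviated, made-up entry in the flat-file layout, not real UniProt output.

```python
def extract_sequence(flatfile_text):
    """Return the sequence from a UniProt flat-file entry as one string.

    The residues sit between the line starting with 'SQ' and the
    terminating '//' line, written as space-separated blocks.
    """
    residues = []
    in_sequence = False
    for line in flatfile_text.splitlines():
        if line.startswith("SQ"):
            in_sequence = True  # residues start on the next line
            continue
        if line.startswith("//"):
            break  # end of the entry
        if in_sequence:
            residues.append(line.replace(" ", ""))
    return "".join(residues)


def to_fasta(identifier, sequence):
    """Format one record as FASTA (header line plus sequence line)."""
    return f">{identifier}\n{sequence}\n"


# Made-up, abbreviated entry used only to show the expected layout.
sample = """ID   EXAMPLE_ID   Reviewed;   12 AA.
AC   X00001;
SQ   SEQUENCE   12 AA;  1300 MW;  0000000000000000 CRC64;
     MDLINGMGTS PG
//
"""

print(to_fasta("X00001", extract_sequence(sample)), end="")
```

The text passed to extract_sequence would be whatever the curl call above returns for a single identifier.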

Best Wishes,
Umer

Ram · Answer 2 · 2015-04-05

9.6 years ago

Dan D 7.4k

Step 1: Get a track from the UCSC Table Browser that contains the following fields for the hg19 genome:

chromosome
txStart (transcription start)
txEnd (transcription end)
gene name

Step 2: With that data in hand, and with a local copy of the hg19 reference genome, use samtools faidx along with the chromosome, start, and end positions to get the FASTA sequence for each gene. By launching a separate samtools process for each gene and capturing the output, you can programmatically build a FASTA file containing all of the gene sequences.
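
A rough Python sketch of Step 2 is below, assuming the track from Step 1 was saved as a tab-separated file with the four columns in the order listed. The function names and file paths are hypothetical. Two things to keep in mind: UCSC tables use 0-based, half-open coordinates while samtools faidx regions are 1-based and inclusive, so the start is shifted by one; and this returns the plus-strand genomic span from txStart to txEnd (introns included), so minus-strand genes would still need reverse-complementing.

```python
import subprocess


def ucsc_to_region(chrom, tx_start, tx_end):
    """Build a samtools region string from UCSC table coordinates.

    UCSC tables store 0-based, half-open intervals, while samtools faidx
    regions are 1-based and inclusive, so only the start is shifted by one.
    """
    return f"{chrom}:{tx_start + 1}-{tx_end}"


def parse_track_line(line):
    """Parse one tab-separated line: chromosome, txStart, txEnd, gene name."""
    chrom, tx_start, tx_end, name = line.rstrip("\n").split("\t")[:4]
    return chrom, int(tx_start), int(tx_end), name


def write_gene_fasta(track_path, genome_fasta, out_path):
    """Run samtools faidx once per gene and collect the output in one FASTA."""
    with open(track_path) as track, open(out_path, "w") as out:
        for line in track:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and the '#' header row
            chrom, tx_start, tx_end, name = parse_track_line(line)
            region = ucsc_to_region(chrom, tx_start, tx_end)
            result = subprocess.run(
                ["samtools", "faidx", genome_fasta, region],
                capture_output=True, text=True, check=True,
            )
            # samtools names the record after the region; put the gene name first.
            header, _, sequence = result.stdout.partition("\n")
            out.write(f">{name} {header[1:]}\n{sequence}")
```

You would call it as something like write_gene_fasta("genes.tsv", "hg19.fa", "genes.fasta"), with samtools on your PATH and the genome FASTA readable by samtools faidx (it builds the .fai index on first use if the directory is writable).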

If you need clarification or coding help for any of these steps, just let me know what specific questions you have.

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.6 years ago by Dan D 7.4k

You can directly tell the UCSC Genome Browser to give you the sequences instead of the genomic loci.

Just select 'sequence' as 'output format'.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 9.6 years ago by David Langenberger 11k

Ram · Answer 3 · 2015-04-06

This Python snippet, using Biopython, pulls the sequence records in GenBank format from NCBI for an input file containing a list of accession numbers (one per line), then writes them out as FASTA.

import sys
from Bio import Entrez
from Bio import SeqIO

# NCBI asks for a contact address with E-utilities requests
Entrez.email = "you@example.com"  # put your email address here

input_list = sys.argv[1]

# Read one accession per line, skipping blank lines
with open(input_list, "r") as f:
    idlist = [line.strip() for line in f if line.strip()]

# Fetch all records in GenBank format
handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=idlist)

# Parse the GenBank records, then write them back out as FASTA
seq_rec_list = list(SeqIO.parse(handle, "gb"))
handle.close()

with open("output.fasta", "w") as out_handle:
    SeqIO.write(seq_rec_list, out_handle, "fasta")
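
With a list as long as the one in the question (about 10,000 entries), it may be gentler on the NCBI servers to fetch in batches and ask for FASTA directly rather than converting from GenBank. The sketch below is one way to do that, not a tested pipeline: batches and fetch_fasta are hypothetical helper names, and the batch size of 200 is an assumption you can tune.

```python
def batches(items, size=200):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta(id_list, out_path, email, batch_size=200):
    """Fetch nucleotide records as FASTA, one batch at a time."""
    # Imported here so the batching helper can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email
    with open(out_path, "w") as out:
        for chunk in batches(id_list, batch_size):
            handle = Entrez.efetch(db="nucleotide", id=",".join(chunk),
                                   rettype="fasta", retmode="text")
            out.write(handle.read())  # already FASTA text, no parsing needed
            handle.close()
```

Usage would be along the lines of fetch_fasta(idlist, "output.fasta", "you@example.com"), where idlist is the list of accessions read from your input file.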