I want to design a microarray probe set for Coccomyxa subellipsoidea C-169 which has been fully sequenced. I need to make a target sequence data file in FASTA format or a TDT file of GenBank accessions. I got a list of 10091 potential genes to design the microarray (NCBI-->Gene). How can I make this file? Do you know any step by step guideline that I can use? Thanks
Step 1: Gene a track from the UCSC table browser, which contains the following for the hg19 genome:
chromosome
txStart (transcription start)
txEnd (transcription end)
gene name
Step 2:
With that data in hand, and with a local copy of the hg19 reference genome, use samtools faidx along with the chromosome, start, and end positions to get the FASTA sequence for each gene. By simply launching a separate samtools process for each gene and capturing the output, you can programmatically build a FASTA containing all of the gene sequences
If you need clarification or coding help for any of these steps, just let me know what specific questions you have.
This python snippet, using BioPython, will pull the sequence record in genbank format from NCBI using an input file containing a list of accession numbers.
import sys
from Bio import Entrez
from Bio import SeqIO
Entrez.email = #put your email address here
input_list = sys.argv[1]
with open(input_list, "r") as f:
idlist = [line.strip() for line in f]
handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text",id=idlist, email=Entrez.email)
seq_rec_list = []
for seq_record in SeqIO.parse(handle, "gb"):
seq_rec_list.append(seq_record)
out_handle = open("output.fasta", "w")
SeqIO.write(seq_rec_list, out_handle, "fasta")
out_handle.close()
Do you need to pull the genes themselves using identifiers or just get the transcript sequence from your fasta file?