Making RefSeq in Windows
Dear All;

I'm wondering if there is any software for making a reference genome on the Windows platform? Your guidance would help me a lot.

Thanks

What do you mean by "making a reference genome"? Are you referring to creating indexes so you can use an aligner to align against the genome, or are you referring to annotating a genome sequence to convert it into GenBank format so it can be uploaded to sequence databases?

Maybe it's better to start from the beginning. I'm working on intergenic regions using CLC, and to import my own track I need a reference genome, as CLC requires. But the strains I'm working on do not have a reference genome, so I need to make one myself.

What did you use for the work you have already done in CLC? What kind of data are you working with? If you have enough sequence data you can try to assemble a reference using the assembly program in CLC.

Let me explain step by step. The CLC TRANSFAC TFBS plugin provides two different protocols for extracting putative TFBSs: genomic and classic. For my work the first one is better because I don't need the subsequent intergenic surfing. The genomic option needs two inputs: the genome, and a track for the feature we are looking for. CLC does not have an intergenic region track by default, so to import a new track into its track list I need two files: a GTF/GFF/GVF file that contains the intergenic positions, and the reference genome.
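For readers wondering what a file of intergenic positions might look like, here is a minimal Python sketch. It assumes the gene coordinates have already been parsed into 1-based, inclusive (start, end) pairs and that the replicon is treated as linear (a gap spanning the origin of a circular chromosome is reported as two pieces). The function names, the "custom" source label and the ID scheme are made up for illustration; only the nine-column GFF3 layout and the Sequence Ontology term intergenic_region are standard.

```python
def intergenic_regions(genes, seq_length):
    """Return 1-based inclusive (start, end) gaps not covered by any gene."""
    gaps = []
    cursor = 1  # first position not yet covered by a gene
    for start, end in sorted(genes):
        if start > cursor:
            gaps.append((cursor, start - 1))
        # Overlapping or nested genes must not move the cursor backwards.
        cursor = max(cursor, end + 1)
    if cursor <= seq_length:
        gaps.append((cursor, seq_length))
    return gaps


def to_gff3(seqid, regions, source="custom", feature_type="intergenic_region"):
    """Format regions as GFF3 text; seqid should match the reference sequence name."""
    lines = ["##gff-version 3"]
    for i, (start, end) in enumerate(regions, 1):
        columns = [seqid, source, feature_type, str(start), str(end),
                   ".", ".", ".", f"ID=intergenic_{i}"]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy coordinates, not taken from any real annotation.
    toy_genes = [(5, 10), (8, 20), (30, 40)]
    print(to_gff3("my_genome", intergenic_regions(toy_genes, 50)), end="")
```

Sorting the genes first and tracking a single cursor keeps the logic correct when genes overlap, which is common in bacterial annotations.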

Depending on the question you are trying to answer, a reference genome from a closely related strain/species might be applicable.

I'm working on Escherichia coli ST131, which is a multidrug-resistant bacterium, and the available reference genome is for E. coli K-12, which is a commensal (non-pathogenic) strain. Besides that, the former has between 1 and 5 plasmids and the latter has none. I think it's more reliable to make my own reference genome.

E. coli ST131 genome is available here (click on fasta link to get the sequence). Large and small plasmids are also available with these accessions: HG941719 and HG941720.

Thanks. To be honest, I have EC958, JJ1886, JJ1887 and lots of other strains, but please tell me how I can use one of them as a reference genome and add my own track in CLC. I'm a newbie to reference genome handling and manipulation.

You should be able to import the fasta files using the import function in CLC. Use GFF to get the annotations.

Here is the page for EC958. You can find the fasta sequence and the GFF files. Look for other strains at the Ensembl Bacteria page.

Thank you for your efforts, I really appreciate them, but I suggest you try this task step by step yourself, as I have done. CLC does not accept a typical fasta sequence to make a track list. It must be a reference genome. Unfortunately, as I have not worked with refseqs yet, I don't know why. The steps are as follows: 1) Go to the Import button. 2) Select Tracks. 3) Here is where the problem is: you need a reference genome, and CLC does not accept a simple fasta file. By the way, EC958 is one of the 35 strains I work on. I have all of their data, including fasta and GTF/GFF files. The problem is the reference genome.

As a last comment, the icons for an imported fasta genome, a downloaded genome and a reference genome are different in CLC.

CLC is a commercial program and you are already paying for support. I suggest that you contact CLC tech support for additional help.

You now know where you can find (some of the) genomes and the annotations that go with them.

Have you tried with a genbank file? This is the link to the genbank entry of E. coli ST131: http://www.ncbi.nlm.nih.gov/nuccore/HG941718.1. You can download the genbank file and convert it to tracks in CLC.

Thanks for your answer. All of them come with a predefined set of features in the track list, including Genes, CDS, Transcripts and so on. I want to make my own custom track, so I MUST use the import/tracks option to do so. As you can see, we are back in the middle of the problem again.

I really don't understand why you need to make a custom track. Have you seen the manual on tracks? http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Retrieving_reference_data_tracks.html

I should mention that the integrated tool for downloading reference genomes offers a particular strain of E. coli that is really different from the one we are working on. The second option in your link is the one I expected, and it is the option that needs a reference genome. Our goal is to surf the intergenic regions, so we need a custom track to extract these pieces in a structured way. Because of the comparison tools provided in CLC, it's more convenient to apply this custom track in CLC. Otherwise, we have some useful and of course powerful modules in Python to work with.

In that case (option 2 in my link), you can simply convert the fasta/genbank file of the genome of interest (e.g. E. coli ST131) to a track, which will be called "something (genome)", and then use the resulting track as the reference genome track.

OK, if that's the case, how can I do it? I want to see my custom track in the track list in CLC, and I hope what you describe is what we are looking for.

You mean how to convert a fasta file to a track? Here is the manual: http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Convert_tracks.html. So, for example, if you already have the fasta file of the E. coli ST131 genome imported into CLC, right click on it, Toolbox -> Track Tools -> Convert to Tracks. In the window that pops up, just tick "Create sequence track" to create a reference sequence track. Then click Next, and don't forget to save it somewhere, and that's it. After that, you can use option 2 in my previous link by importing the GTF/GFF file and using the reference sequence track that you just made as the reference.

OK, we're getting to my problem. To use "Convert to Tracks", after selecting the genome of interest, in the next window you have to select the type of feature (track) you're interested in. There are options like CDS, Gene, Transcripts and so on. Please pay attention to my problem: I don't want any of them. I want to see my CUSTOM TRACK in this list. How can I insert my custom track here? I see that I haven't explained my problem clearly enough so far. Thank you very much for your help, but please think about my problem some more.

My previous comment was to tell you how to convert a fasta file of the genome to a reference track. Your question from the beginning was how to have this reference genome, right? As I've mentioned, just tick "Create sequence track" and ignore (un-tick) CDS, gene, transcripts, etc. The result is a reference genome track. After that, you can create your custom track from GTF/GFF file and the reference genome track that you have just created. Your GTF/GFF file includes the intergenic positions, doesn't it? Later, you can use your custom track of intergenic regions and a reference track for transfac analysis using the genomic mode. I'm sorry if I misunderstood your question.

You're the best. Believe me. Thank you very much for your help. But we are only halfway there. Following your instructions, I made a genome track that import/tracks recognizes as a reference genome, but I still cannot find it in the track list. The output of "Convert to Tracks" is a file with a yellow-arrow icon in CLC. To make this track we have a list including CDS, Gene and so on to select from. Now the question is: how can I add my track to this list?

I don't understand your question. What do you mean by adding "your" track to "this" list? Which track? What list?

And by the way, a reference genome track should have a similar icon to a sequence (fasta, genbank) file, that is a red horizontal double helix, but with 3 small blue vertical stripes under it. Do you have this already?

If you do have it, import your GTF/GFF file by doing this: http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_tracks.html#sec:importtracksfromfile and use your reference genome track as the reference track. The result is a custom track of the intergenic region of the genome (assuming that your GTF/GFF file contains the intergenic region). I suppose this custom track will have a yellow arrow icon (with 3 blue stripes since it's a track).
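One thing that may be worth checking before the import: annotation importers generally pair features with the reference by sequence name, so the first column of the GTF/GFF file should use the same names as the fasta headers the reference track was built from. That this applies to CLC's track import is an assumption here, not something stated in this thread. A minimal Python sketch of such a check follows; the function names are hypothetical and it only looks at the first word of each fasta header.

```python
def fasta_ids(fasta_text):
    """Collect the first word of every fasta header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff_seqids(gff_text):
    """Collect column 1 (seqid) of every feature line in GFF/GTF text."""
    ids = set()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # GFF3 may embed sequences after this directive
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def unmatched_seqids(fasta_text, gff_text):
    """Return seqids used in the annotation but absent from the fasta headers."""
    return sorted(gff_seqids(gff_text) - fasta_ids(fasta_text))


if __name__ == "__main__":
    # Toy input; in practice read both files from disk with open(...).read().
    fasta = ">HG941718.1 example header\nACGTACGT\n"
    gff = ("##gff-version 3\n"
           "HG941718.1\tcustom\tintergenic_region\t1\t4\t.\t.\t.\tID=a\n"
           "chromosome\tcustom\tintergenic_region\t5\t8\t.\t.\t.\tID=b\n")
    print(unmatched_seqids(fasta, gff))
```

An empty list means every annotated sequence name has a matching fasta record; any name that is printed would need renaming in one of the two files.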

Only after that, you can use the transfac tfbs plugin with the "Genomic" mode of analysis, using the custom track as "regions" under "Genomic regions and reference selection" and the same reference track as before.

