Question: editing genbank reference sequence
0
gravatar for mk.applebee
4.4 years ago by
mk.applebee20
United States
mk.applebee20 wrote:

I am doing an RNAseq study with isogenic strains of Bacillus subtilis.  The background strain I am using is very closely related to a well-annotated genome (strain 168), but has 3 important genes from strain 3610 inserted into the genome. Among other things, I am interested in transcription changes to these insertions and in their vicinity.

I am trying out both CLCworkbench(mapping)/DESEQ2(dif exp analysis) and Rockhopper to look for differential gene expression.  However, I have not yet been able to find a straightforward way to edit the genbank annotated reference sequence for B. subtilis available on NCBI (NC_000964).  What tools exist to facilitate editing of these files? Are there any that do not require knowledge of python or perl (but I am familiar with R).  This does not appear to be a trivial process.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by mk.applebee20

I don't know anything about Rockhopper, but DESEQ2 is not a mapping method. It takes as input raw counts of reads mapped to transcripts or genes, but you have to perform the mapping and counting apart and before feeding your data to DESEQ2.

I suppose you want to edit the reference to include the three additional genes. How are you mapping the reads? What is the file format of the reference? Do you know the positions the three genes are inserted?

One quick and dirty solution is to add three "contigs" with the three genes to your reference genome and perform the mapping and counting on this "extended" reference.

ADD REPLYlink written 4.4 years ago by h.mon28k

Yes, you are right, I forgot to mention that I was able to use CLC workbench to perform the mapping, and then exported the data to run using DESEQ2.  Rockhopper has its own integrated mapping.  Both use imported genbank files for references.

Actually, when I did the CLC genomics mapping, I did something very similar to what you said and added the extra sequences as though they were unconnected "contigs" or "chromosomes".  But I worry that I am losing some reads and some interesting information about expression at the integration junctions, and this tactic just seems inelegant if there is a straightforward way I am not aware of to simply generate the corrected genbank file.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by mk.applebee20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2235 users visited in the last hour