Question

Editing a GenBank reference sequence


8.8 years ago

mk.applebee ▴ 20

I am doing an RNA-seq study with isogenic strains of Bacillus subtilis. The background strain I am using is very closely related to a well-annotated genome (strain 168), but has 3 important genes from strain 3610 inserted into the genome. Among other things, I am interested in transcriptional changes within these insertions and in their vicinity.

I am trying out both CLC Workbench (mapping) followed by DESeq2 (differential expression analysis) and Rockhopper to look for differential gene expression. However, I have not yet been able to find a straightforward way to edit the annotated GenBank reference sequence for B. subtilis available on NCBI (NC_000964). What tools exist to facilitate editing of these files? Are there any that do not require knowledge of Python or Perl? (I am familiar with R.) This does not appear to be a trivial process.

annotation genbank alignment sequencing RNA-Seq • 2.1k views

updated 18 months ago by Ram 43k • written 8.8 years ago by mk.applebee ▴ 20


I don't know anything about Rockhopper, but DESeq2 is not a mapping method. It takes as input raw counts of reads mapped to transcripts or genes, so you have to perform the mapping and counting separately, before feeding your data to DESeq2.

I suppose you want to edit the reference to include the three additional genes. How are you mapping the reads? What is the file format of the reference? Do you know the positions where the three genes are inserted?

One quick and dirty solution is to add three "contigs" with the three genes to your reference genome and perform the mapping and counting on this "extended" reference.
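A minimal sketch of the "extended reference" idea, assuming the reference and the three gene sequences are available as FASTA text. The sequence names (`chr`, `geneX`) and the helper functions are hypothetical and only for illustration; the toy sequences stand in for the real genome and inserts. It uses only the Python standard library.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")

def extend_reference(reference, extras):
    """Return the reference records followed by the extra 'contigs'."""
    records = list(reference) + list(extras)
    names = [header.split()[0] for header, _ in records]
    if len(names) != len(set(names)):
        raise ValueError("duplicate sequence names in extended reference")
    return records

# Toy data standing in for the real genome and one inserted gene.
reference_text = ">chr toy reference\nACGTACGTAC\nGTAC\n"
extras_text = ">geneX hypothetical insert\nTTTTGGGG\n"

extended = extend_reference(
    read_fasta(io.StringIO(reference_text)),
    read_fasta(io.StringIO(extras_text)),
)
out = io.StringIO()
write_fasta(extended, out)
print(out.getvalue())
```

Reads that span an integration junction will not align cleanly to a reference built this way, since the inserted genes are not placed in their genomic context.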

updated 18 months ago by Ram 43k • written 8.8 years ago by h.mon 35k


Yes, you are right. I forgot to mention that I used CLC Workbench to perform the mapping, and then exported the data to run through DESeq2. Rockhopper has its own integrated mapping. Both use imported GenBank files as references.

Actually, when I did the CLC mapping, I did something very similar to what you suggest and added the extra sequences as though they were unconnected "contigs" or "chromosomes". But I worry that I am losing some reads and some interesting information about expression at the integration junctions. This tactic also seems inelegant if there is a straightforward way, which I am not aware of, to simply generate the corrected GenBank file.
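To illustrate why generating a corrected annotated file is not trivial, here is a minimal sketch of the coordinate bookkeeping involved: inserting a sequence at a known position and shifting every downstream feature by the insert length. It assumes the insertion position is known and uses 0-based, end-exclusive coordinates (GenBank files themselves use 1-based, inclusive coordinates). The function name, feature names, and toy sequences are hypothetical, and a real workflow would also need to parse and rewrite the GenBank file itself.

```python
def insert_sequence(genome, features, position, insert, insert_features=()):
    """Insert `insert` into `genome` before 0-based index `position`.

    `features` and `insert_features` are lists of (name, start, end)
    tuples with 0-based, end-exclusive coordinates. `insert_features`
    are given relative to the start of the insert.
    """
    if not 0 <= position <= len(genome):
        raise ValueError("insertion position outside the genome")
    new_genome = genome[:position] + insert + genome[position:]
    shift = len(insert)
    new_features = []
    for name, start, end in features:
        if end <= position:
            # Entirely upstream of the insertion: unchanged.
            new_features.append((name, start, end))
        elif start >= position:
            # Entirely downstream: shift by the insert length.
            new_features.append((name, start + shift, end + shift))
        else:
            # Interrupted by the insertion: flag for manual review.
            new_features.append((name + "_disrupted", start, end + shift))
    for name, start, end in insert_features:
        # Features carried by the insert move to genome coordinates.
        new_features.append((name, start + position, end + position))
    new_features.sort(key=lambda feature: feature[1])
    return new_genome, new_features

# Toy genome with two annotated genes; the insert carries one gene.
genome = "AAAACCCCGGGGTTTT"
features = [("geneA", 0, 4), ("geneB", 8, 12)]
new_genome, new_features = insert_sequence(
    genome, features, position=6, insert="NNN",
    insert_features=[("ins1", 0, 3)],
)
for name, start, end in new_features:
    print(name, start, end, new_genome[start:end])
```

With a reference edited this way, reads spanning the integration junctions have a contiguous sequence to align to, which the separate-contig approach does not provide.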

updated 18 months ago by Ram 43k • written 8.8 years ago by mk.applebee ▴ 20