Question: Best Practices For Naming De Novo Transcriptome Sequences?
5.4 years ago by
Burlington, VT
What are best practices for naming fasta sequences from a de novo transcriptome assembly?

Specifically, I'm thinking about

  • sequence names that are intuitive and logical
  • forward compatibility with future genome sequencing or additional RNA-seq data
  • names that are useful to other researchers doing database mining and similar work

I realize this may just end up being project-specific, but I'm hoping to avoid the problem of unstructured text in biological databases down the road.

5.4 years ago by
Damian Kao
As long as you delimit your headers consistently, any future manipulation to conform to another format should be easy. I would make sure to:

  • Choose a sensible delimiter, one that will never appear in your metadata. Tabs and pipes are commonly used.
  • Have the same number of delimited metadata fields in every header.
  • If a metadata field is not applicable or not available, put in a placeholder such as "NA" rather than leaving it out.
  • If you have incrementing numbers, pad them with leading zeros so that all numbers have the same string length. For example: 00001, 00002, 01234, 12345
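The points above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `make_header`, the `Vc` prefix, the pipe delimiter, the five-digit padding and the three metadata fields are all made up for the example and are not a standard.

```python
def make_header(prefix, index, fields, n_fields=3, delimiter="|", width=5):
    """Build a FASTA header with a fixed number of delimited fields.

    All names and defaults here are hypothetical choices for illustration.
    """
    if len(fields) > n_fields:
        raise ValueError(f"expected at most {n_fields} fields, got {len(fields)}")

    # Pad missing trailing fields so every header has the same field count.
    padded = list(fields) + [None] * (n_fields - len(fields))

    # Replace missing or empty values with an explicit "NA" placeholder.
    cleaned = ["NA" if value in (None, "") else str(value) for value in padded]

    # The delimiter must never occur inside a metadata value.
    for value in cleaned:
        if delimiter in value:
            raise ValueError(f"delimiter {delimiter!r} found in field {value!r}")

    # Zero-pad the running number so all IDs have the same string length.
    seq_id = f"{prefix}{index:0{width}d}"
    return ">" + delimiter.join([seq_id] + cleaned)


print(make_header("Vc", 42, ["leaf", None]))
```

With zero-padded IDs a plain string sort gives the numeric order, and splitting any header on the delimiter always returns the same number of fields.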
Padding with zeros is certainly important!

5.4 years ago by
Concord NC USA
My advice: Create names that can be easily parsed. For example, if your de novo assembly generates multiple transcript variants per locus, then use ".N" suffixes to indicate alternative transcripts coming from the same gene. And if you intend to make the sequences available as part of a searchable Web site, use names that are likely to be unique to your species. For example, for Vaccinium corymbosum (blueberry) you might do something like:

Vc1.1 for gene Vc1, transcript 1.
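To show what "easily parsed" buys you, here is a rough Python sketch that splits names of this form back into gene and transcript number and groups transcripts by gene. The regular expression and the function names `split_name` and `group_by_gene` are assumptions for the example; they presume names of exactly the form letters, gene number, dot, transcript number.

```python
import re
from collections import defaultdict

# Assumed name layout: species prefix + gene number, ".", transcript number.
NAME_PATTERN = re.compile(r"^(?P<gene>[A-Za-z]+\d+)\.(?P<transcript>\d+)$")


def split_name(name):
    """Return (gene_id, transcript_number) for a name such as 'Vc1.1'."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected name format: {name!r}")
    return match.group("gene"), int(match.group("transcript"))


def group_by_gene(names):
    """Map each gene ID to the sorted list of its transcript numbers."""
    genes = defaultdict(list)
    for name in names:
        gene, transcript = split_name(name)
        genes[gene].append(transcript)
    return {gene: sorted(numbers) for gene, numbers in genes.items()}


print(group_by_gene(["Vc1.2", "Vc1.1", "Vc2.1"]))
```

Names that do not fit the pattern raise an error instead of being silently misparsed, which makes inconsistencies show up early.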

Do a quick Google search to see what your proposed names bring up.

Sensible approach to dealing with transcripts, though in the absence of a genome this is one of the aspects of de novo transcriptome assembly I'm least comfortable with.

