How to cite genome assemblies?
9.1 years ago
tyler.weirick ▴ 120

Part of a paper I am writing involves comparing different human genome assemblies. I would like to have some kind of citation for the assemblies hg18, hg19, and hg38. Many other papers do not seem to cite them, for example http://nar.oxfordjournals.org/content/early/2010/10/18/nar.gkq963.full. However, I noticed some conflicting information in various database entries for the genomes and would like to know which to use. For example, the release date for hg19 differs between the NCBI Assembly database and the Genome Reference Consortium page: February 27, 2009 versus March 3, 2009.

http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/index.shtml (click GRCh37)

9.1 years ago
keith ▴ 130

Citing versions of bioinformatics/genomics resources can get tricky because there is often no formal publication for every release of a given dataset. Further complicating the situation, you will often come across different dates (and even names) for the same resource. E.g. the latest cow genome assembly generated by the University of Maryland is known as 'UMD 3.1.1'. However, the UCSC Genome Browser uses its own internal IDs for all cow genome assemblies and refers to this one as 'bosTau8'. Someone new to the field might see the UCSC version and not know about the original UMD name.

Sometimes you can use the dates of files on FTP sites to approximately date sequence files, but these dates can change (for example, files occasionally get removed by accident and restored from backups, which can alter their timestamps).

The key thing to aim for is to provide suitable information so that someone can reproduce your work. In my mind, this requires 2-3 pieces of information:

  1. The name or release number of the dataset you are downloading (provide alternate names when known)
  2. The specific URL for the website or FTP site that you used to download the data
  3. The date on which you downloaded the data

E.g. The UMD 3.1.1 version of the cow genome assembly (also known as bosTau8) was downloaded from the UCSC Genome FTP site (ftp://hgdownload.cse.ucsc.edu//apache/htdocs/goldenPath/bosTau8/bigZips/bosTau8.fa.gz).

It is very unhelpful when sequence resources do not provide version numbers, because they can and will change. When no version number is available, I always refer to the date on which I downloaded the data instead.
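
If you want to keep these three pieces of information together with the downloaded file, something like the following minimal Python sketch could work. The function names, the example URL, and the date in the usage lines are hypothetical placeholders, and the SHA-256 checksum is my own addition rather than something required above; it is only there because file dates on FTP sites are not always reliable.

```python
import hashlib
from datetime import date


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a local file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_sentence(name, aliases, url, downloaded_on):
    """Build a methods-style sentence from name, alternate names, URL and date."""
    alias_text = f" (also known as {', '.join(aliases)})" if aliases else ""
    return (
        f"The {name} assembly{alias_text} was downloaded from {url} "
        f"on {downloaded_on.isoformat()}."
    )


# Hypothetical usage: the URL and the date below are placeholders, not real records.
print(
    provenance_sentence(
        name="UMD 3.1.1",
        aliases=["bosTau8"],
        url="ftp://example.org/bosTau8.fa.gz",
        downloaded_on=date(2000, 1, 2),
    )
)
```

Calling file_sha256 on the downloaded file and storing the digest next to this sentence would let someone later check whether the file they fetch from the same URL is still the same one.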
