Question: Create genome annotation files for a personal genome
gravatar for hyanwong
5.4 years ago by
United Kingdom
hyanwong70 wrote:

I'm pretty new to genome analysis, so apologies if this is a silly question. I have a 52x coverage personal genome in BAM format which I have converted to FASTA using the hs37d5 reference sequence using SAMtools etc. as detailed here. I can also download annotations mapped against the reference sequence such as gene tracks (e.g. GENCODE GTF files).

Say that I now want to use my personal genome as a reference. I assume that the locations on my new FASTA file won't correspond exactly to locations on the reference sequence: some introns will be of different lengths to the reference, etc. So what command-line magic to I need to do to convert the GTF (or other annotation) file so that it maps to my personal genome, rather than the reference? I assume this will involve throwing away some annotations (where the personal genome may have deletions), and there will probably be areas where the personal genome has extra duplicated genes, but I guess it will be difficult to detect these.

As it happens, I'd also like to display Alu repeats on my personal genome: these will presumably differ quite a lot from the reference. So are there any tools I can use to create an annotation track for repeated sequences against a personal genome sequence? Can I BLAST along my personal FASTA file and create an annotation track out of that?



map personal-genome fasta • 2.5k views
ADD COMMENTlink modified 5.2 years ago by Biostar ♦♦ 20 • written 5.4 years ago by hyanwong70

I think you posted an incorrect link. You linked to a seqanswers post on BAMseek, which is a BAM viewer and doesn't mention using samtools to create a reference sequence.

Assuming that you used the common samtools mpileup route to create the fasta, then the question is just whether that incorporates InDels. The answer to that is, "no, it skips indels", so the gencode annotation would still be correct (or at least correct enough).

ADD REPLYlink written 5.4 years ago by Devon Ryan95k

Thanks, and apologies for the user error (I've corrected it in my post). You are correct that I used mpileup

'samtools mpileup -uf hs37d5.fa.gz personal_genome.bam | bcftools view -cg - | vcf2fq > new.fq'

I didn't realise that it skipped indels. So (a) Is it possible to display the extra copies of genes / regions / tandem repeats / SINEs that are present in the personal genome but not in the reference? And (b) if so, do I need to change the annotation files to allow for this?


ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by hyanwong70

(A) Yes, but not with vcfutils (at least the last time I checked). Maybe the GATK version of this introduces InDels, I haven't checked. If not, you'd need to write something yourself.

(B) Yeah, any InDels will mess up the annotation files. The simplest method to take care of this might be to create a liftOver file and then try to liftOver the annotation file (I don't actually know if it can accept a GTF file). If that doesn't work, you're stuck writing something. This isn't actually that difficult and has the benefit that you can easily track all of the InDels, but that's something you'd have to write yourself.

BTW, unless you're a fairly experienced bioinformatician, tracking things like extra gene copies or inversions is going to be a nightmare. Unless you have a very good reason, make your life simpler and don't try to go past simple indels. Even those are non-trivial, since a deletion that causes a frame-shift in a gene near the beginning should really result in that gene being removed from the annotation.

BTW, as a general principle, it's typically easier to work in reference genome coordinates and just display how your personal genome is different in a track. That won't capture everything, of course, but it's a much simpler route.

ADD REPLYlink written 5.4 years ago by Devon Ryan95k

Thanks for the tips, and especially for the warning/advice. By the way, the reason for doing this is that when trying to show and explain personal genomes to others, I feel it takes an extra mental leap to explain that these are differences from a canonical sequence. Much better to be able to say "this is the exact sequence present in genome X, to the best of our knowledge", and mark it up where possible. I presume that as tools become more sophisticated and processing / disk space cheaper, people may start moving away from using variation from a single reference sequence, since that's not entirely appropriate in some cases (e.g. all the stuff missing from the ref seq that needed to be put in the decoy genome, plus things like inversions and chromosomal transpositions). But that requires a way to map markups on a known genome to a newly sequenced genome.

For example, I'm keen to look at the LW/MW opsin gene, which can exist in a tandemly repeated array of up to 1-14 copies. I thought that I could just do a search in my newly produced FASTA file for the first 50 bp of the gene, and look at the number of nearby hits, but by the sound of it, I can't do that easily. That can't be an unusual use-case.

Thanks again, Yan

ADD REPLYlink written 5.4 years ago by hyanwong70

Well, part of the issue is that you really need to pool multiple different analyses of the personal genome to be able to do this. For example, samtools and GATK won't give you information about CNVs (i.e., they won't tell you about opsin repeats). There are other programs that can do that (but they don't call variants), but that's an obvious limitation to your current route. Similarly, looking at inversions and such requires yet another program...

ADD REPLYlink written 5.4 years ago by Devon Ryan95k

Cheers. So if I can't do this easily with an arbitrary personal genome, is there any way to display e.g. extra opsin repeats, or possible locations of Epstein-Barr (which as I understand isn't in the reference seq) in well-known genomes such as James Watson, or Venter's HuRef?

ADD REPLYlink written 5.4 years ago by hyanwong70

It depends on whether someone has done the analysis for those yet. If so, then yes it's possible. BTW, regarding opsin repeats, you're more likely to just get a total number of repeats rather than their exact location with sequence (since getting this would be quite difficult).

ADD REPLYlink written 5.4 years ago by Devon Ryan95k


Maybe you should do a simple liftover of your annotation on your new genome using this kind of tool :

ADD REPLYlink written 5.4 years ago by Juke344.1k

Interesting, thanks.

ADD REPLYlink written 5.4 years ago by hyanwong70
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1360 users visited in the last hour