Question: Using ref.fa (with random contig) on bam file aligned without random contigs
gravatar for umn_bist
4.8 years ago by
umn_bist380 wrote:

So I was given aligned reads of liver cancer samples (without matching normals). After viewing the header I noticed that it had chr1-22, chrX, chrY, chrM (note LN: 16571, not LN: 16569). No random or unlocalized contigs.

My worry is that these contigs were aligned elsewhere. Another issue is, I don't know if I can sort these with a generic ucsc.hg19.fasta (that includes random and unlocalized contigs) so that I can  call (all) variants using samtools, bcftools.

Should I build my own reference, removing all the random/unlocalized contigs? Also I am only interested in calling variants in very specific regions (tp53, piwil1-4, setdb1). Is there a way of reducing the time to sort and is there a way I can use a 'shortened' reference assembly specific to my regions of interest? Thank you for your time and help.

EDIT: I should note that although I am working with cancer samples, I am not trying to call somatic variants. I just need a pileup of all (high quality) variants.


rna-seq bcftools samtools vcf • 1.4k views
ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by umn_bist380

If your going to call variants - arguably the computationally hardest thing to do in sequencing genomics - you really don't want to leave anything to chance. Ideally, in my opinion, you would convert the BAM to uBAM and follow the GATK pipeline for re-mapping the reads to the ideal reference genome.

I mean, there is nothing stopping you from doing it the way you suggested - it's just... meh. Calling reliable SNPs is hard enough when you do everything perfectly.

ADD REPLYlink written 4.8 years ago by John12k

Thank you for your reply. Ideally I would like to do this, but I have limited time. Most likely I can revisit these samples to follow GATK workflow. For sorting and mpileup, can I use a reference that's reduced to specific regions of interest or do I need the entire assembly?

ADD REPLYlink modified 10 months ago by RamRS30k • written 4.8 years ago by umn_bist380

Honestly, I don't know because I have never done that :( If anything you want to use the full reference but only have the region of your BAM that covers the areas you are interested in. Maybe just try it and see if you're short on time?
Actually i'm a little confused what you mean by sorting too. To sort a BAM file you dont need anything other than the BAM - and chances are it's already sorted. Perhaps its something specific to bcftools that I know nothing about. Regardless, your best option is to try it and see. You will either get an error instantly, or it will just skip over the reference bits it doesn't need.

The more important question is if the genome you are using to call SNPs is using the same co-ordinate system for all chrXYZ regions as what is present in the BAM file. Same major version with all the same patches.

Honestly, I would invest 1 week mapping this properly and having all the variables under your control than jumping into it now and finding out in a month that everything you have done was built on sand.

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by John12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1383 users visited in the last hour