Question

Illumina Bam Analysis

0

Entering edit mode

11.5 years ago

win ▴ 970

Hi all, I downloaded a bam file (NA18507) from a Illumina site [ http://www.illumina.com/truseq/tru_resources/datasets.ilmn ]. Also downloaded the reference file from NCBI website as well as the 1K Genomes Reference Fasta file.

I used the samtools pipeline as follows:

samtools sort NA18507.bam NA18507.sorted
samtools mpileup -uf h:\RefGenome\human_g1k_v37.fasta NA18507.sorted.bam | bcftools view -bvcg - > NA18507.sorted.bam.raw.bcf
bcftools view NA18507.sorted.bam.raw.bcf | perl c:\users\a\desktop\samtools\vcfutils.pl varFilter -D100 > NA18507.VCF.vcf

Tried to use both the reference genome files.

The problem is that i cannot get any variants to show up in the VCF file. Can someone please help with this? Is there something obvious that I am missing.

Thanks in advance.

illumina • 3.8k views

ADD COMMENT • link updated 11.5 years ago by Matt Shirley 10k • written 11.5 years ago by win ▴ 970

score 1 · Answer 1 · 2012-10-19

Yes, there is something obvious that you're missing.

The chromosome names in the reference don't match the names in the BAM files. It's a bit weird (and unfriendly by Illumina); the chr19 and chr21 files use one format:

Chr19:

samtools view -H *chr19.sorted.bam
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:c1.faLN:249250621
@SQSN:c2.faLN:243199373
@SQSN:c3.faLN:198022430

Chr21:

samtools view -H *chr21.sorted.bam
[bam_header_read] EOF marker is absent.
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:c1.faLN:249250621
@SQSN:c2.faLN:243199373
@SQSN:c3.faLN:198022430

Chr4:

samtools view -H *chr4.sorted.bam
[bam_header_read] EOF marker is absent.
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:chr1LN:249250621
@SQSN:chr2LN:243199373
@SQSN:chr3LN:198022430

The 1000 Genomes format uses "1", "2", "3", etc. I'm not sure about NCBI, but in any case it won't match their chr19 and 21 formats.

You need to change the chromosome names in your reference files to match their naming scheme.

score 0 · Answer 2 · 2012-10-20

0

Entering edit mode

11.5 years ago

win ▴ 970

Any way that can be fixed?

ADD COMMENT • link 11.5 years ago by win ▴ 970

2

Entering edit mode

As a rule, you should put followup questions to answers as comments to the answer (like I'm doing right now).

ADD REPLY • link 11.5 years ago by Matt Shirley 10k

0

Entering edit mode

ok, will keep that in mind.

ADD REPLY • link 11.5 years ago by win ▴ 970

score 0 · Answer 3 · 2012-10-20

As [matted] pointed out, you are using the GRCh37 reference from NCBI, which uses the numerical + X,Y, and MT annotations for chromosomes. You are using an aligned set of reads from Illumina, and they use UCSC hg19. You can download it from UCSC and use this utility to convert the 2bit file to fasta. After you have the fasta, you'll want to index it with faidx hg19.fasta and then use that as your reference genome for samtools and bcftools operations. You'll also want to make a few adjustments to the commands you're using:

samtools sort NA18507.bam NA18507.sorted.bam
samtools mpileup -uf hg19.fasta NA18507.sorted.bam | bcftools view -v - > NA18507.sorted.bam.vcf

Then take a look at the resulting VCF. It should contain only variant sites. If you want to filter this file using vcftools, you can invoke vcftools directly on the VCF, and there is really no need to work entirely in the bcf format unless you are dealing with many different samples, or outputting information about invariant sites.