Illumina Bam Analysis
3
0
Entering edit mode
11.5 years ago
win ▴ 970

Hi all, I downloaded a bam file (NA18507) from a Illumina site [ http://www.illumina.com/truseq/tru_resources/datasets.ilmn ]. Also downloaded the reference file from NCBI website as well as the 1K Genomes Reference Fasta file.

I used the samtools pipeline as follows:

samtools sort NA18507.bam NA18507.sorted
samtools mpileup -uf h:\RefGenome\human_g1k_v37.fasta NA18507.sorted.bam | bcftools view -bvcg - > NA18507.sorted.bam.raw.bcf
bcftools view NA18507.sorted.bam.raw.bcf | perl c:\users\a\desktop\samtools\vcfutils.pl varFilter -D100 > NA18507.VCF.vcf

Tried to use both the reference genome files.

The problem is that i cannot get any variants to show up in the VCF file. Can someone please help with this? Is there something obvious that I am missing.

Thanks in advance.

illumina • 3.8k views
ADD COMMENT
1
Entering edit mode
11.5 years ago
matted 7.8k

Yes, there is something obvious that you're missing.

The chromosome names in the reference don't match the names in the BAM files. It's a bit weird (and unfriendly by Illumina); the chr19 and chr21 files use one format:

Chr19:

samtools view -H *chr19.sorted.bam
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:c1.faLN:249250621
@SQSN:c2.faLN:243199373
@SQSN:c3.faLN:198022430

Chr21:

samtools view -H *chr21.sorted.bam
[bam_header_read] EOF marker is absent.
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:c1.faLN:249250621
@SQSN:c2.faLN:243199373
@SQSN:c3.faLN:198022430

Chr4:

samtools view -H *chr4.sorted.bam
[bam_header_read] EOF marker is absent.
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:chr1LN:249250621
@SQSN:chr2LN:243199373
@SQSN:chr3LN:198022430

The 1000 Genomes format uses "1", "2", "3", etc. I'm not sure about NCBI, but in any case it won't match their chr19 and 21 formats.

You need to change the chromosome names in your reference files to match their naming scheme.

ADD COMMENT
0
Entering edit mode
11.5 years ago
win ▴ 970

Any way that can be fixed?

ADD COMMENT
2
Entering edit mode

As a rule, you should put followup questions to answers as comments to the answer (like I'm doing right now).

ADD REPLY
0
Entering edit mode

ok, will keep that in mind.

ADD REPLY
0
Entering edit mode
11.5 years ago

As [matted] pointed out, you are using the GRCh37 reference from NCBI, which uses the numerical + X,Y, and MT annotations for chromosomes. You are using an aligned set of reads from Illumina, and they use UCSC hg19. You can download it from UCSC and use this utility to convert the 2bit file to fasta. After you have the fasta, you'll want to index it with faidx hg19.fasta and then use that as your reference genome for samtools and bcftools operations. You'll also want to make a few adjustments to the commands you're using:

samtools sort NA18507.bam NA18507.sorted.bam
samtools mpileup -uf hg19.fasta NA18507.sorted.bam | bcftools view -v - > NA18507.sorted.bam.vcf

Then take a look at the resulting VCF. It should contain only variant sites. If you want to filter this file using vcftools, you can invoke vcftools directly on the VCF, and there is really no need to work entirely in the bcf format unless you are dealing with many different samples, or outputting information about invariant sites.

ADD COMMENT
0
Entering edit mode

Thank you, this did it.

ADD REPLY

Login before adding your answer.

Traffic: 1503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6