Question: Illumina Bam Analysis
0
gravatar for win
6.6 years ago by
win810
India
win810 wrote:

Hi all, I downloaded a bam file (NA18507) from a Illumina site [ http://www.illumina.com/truseq/tru_resources/datasets.ilmn ]. Also downloaded the reference file from NCBI website as well as the 1K Genomes Reference Fasta file.

I used the samtools pipeline as follows:

samtools sort NA18507.bam NA18507.sorted
samtools mpileup -uf h:\RefGenome\human_g1k_v37.fasta NA18507.sorted.bam | bcftools view -bvcg - > NA18507.sorted.bam.raw.bcf
bcftools view NA18507.sorted.bam.raw.bcf | perl c:\users\a\desktop\samtools\vcfutils.pl varFilter -D100 > NA18507.VCF.vcf

Tried to use both the reference genome files.

The problem is that i cannot get any variants to show up in the VCF file. Can someone please help with this? Is there something obvious that I am missing.

Thanks in advance.

illumina • 2.6k views
ADD COMMENTlink modified 6.6 years ago by Matt Shirley9.0k • written 6.6 years ago by win810
1
gravatar for matted
6.6 years ago by
matted7.0k
Boston, United States
matted7.0k wrote:

Yes, there is something obvious that you're missing.

The chromosome names in the reference don't match the names in the BAM files. It's a bit weird (and unfriendly by Illumina); the chr19 and chr21 files use one format:

Chr19:

samtools view -H *chr19.sorted.bam
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:c1.faLN:249250621
@SQSN:c2.faLN:243199373
@SQSN:c3.faLN:198022430

Chr21:

samtools view -H *chr21.sorted.bam
[bam_header_read] EOF marker is absent.
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:c1.faLN:249250621
@SQSN:c2.faLN:243199373
@SQSN:c3.faLN:198022430

Chr4:

samtools view -H *chr4.sorted.bam
[bam_header_read] EOF marker is absent.
@HDVN:1.0SO:coordinate
@PGID:CASAVAVN:CASAVA-1.8.0a2CL:/illumina/development/casava/CASAVA-1.8.0a2/bin/run.pl --projectDir /illumina/scratch/parsimony/builds/101117_P15_0775_AFC90070_build_1.8alpha2/95G --refSequences /lustre/isilon/Mondas_software/Genomes_STS/HumanNCBI37Fasta --sgeAuto --numberOfProcesses=100 --variantsPrintUsedAlleleCounts --bamChangeChromLabels UCSC --bamSkipRef --sortKeepAllReads --variantsWriteRealigned --targets all bam --runId=Run1 --exportDir=/illumina/scratch/parsimony/alignments/101117_P15_0775_AFC90070/main1.8alpha2/GERALD_2011-01-05_ibancarz/95G/ --lanes=2,3,4 --postRunCmd=/illumina/development/validation/bin/postRunScripts1.8/LaunchPostBuildFramework.sh -s NA18507_NCBI37 -e ibancarz@illumina.com -c /illumina/development/validation/config/config_hg19_longHyraxHead.xml
@SQSN:chr1LN:249250621
@SQSN:chr2LN:243199373
@SQSN:chr3LN:198022430

The 1000 Genomes format uses "1", "2", "3", etc. I'm not sure about NCBI, but in any case it won't match their chr19 and 21 formats.

You need to change the chromosome names in your reference files to match their naming scheme.

ADD COMMENTlink written 6.6 years ago by matted7.0k
0
gravatar for win
6.6 years ago by
win810
India
win810 wrote:

Any way that can be fixed?

ADD COMMENTlink written 6.6 years ago by win810
2

As a rule, you should put followup questions to answers as comments to the answer (like I'm doing right now).

ADD REPLYlink written 6.6 years ago by Matt Shirley9.0k

ok, will keep that in mind.

ADD REPLYlink written 6.6 years ago by win810
0
gravatar for Matt Shirley
6.6 years ago by
Matt Shirley9.0k
Cambridge, MA
Matt Shirley9.0k wrote:

As [matted] pointed out, you are using the GRCh37 reference from NCBI, which uses the numerical + X,Y, and MT annotations for chromosomes. You are using an aligned set of reads from Illumina, and they use UCSC hg19. You can download it from UCSC and use this utility to convert the 2bit file to fasta. After you have the fasta, you'll want to index it with faidx hg19.fasta and then use that as your reference genome for samtools and bcftools operations. You'll also want to make a few adjustments to the commands you're using:

samtools sort NA18507.bam NA18507.sorted.bam
samtools mpileup -uf hg19.fasta NA18507.sorted.bam | bcftools view -v - > NA18507.sorted.bam.vcf

Then take a look at the resulting VCF. It should contain only variant sites. If you want to filter this file using vcftools, you can invoke vcftools directly on the VCF, and there is really no need to work entirely in the bcf format unless you are dealing with many different samples, or outputting information about invariant sites.

ADD COMMENTlink written 6.6 years ago by Matt Shirley9.0k

Thank you, this did it.

ADD REPLYlink written 6.6 years ago by win810
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1542 users visited in the last hour