Question: Finding out assembly reference from bam file
gravatar for cmcouto.silva
3 months ago by
cmcouto.silva30 wrote:

I am entirely newbie on WGS. At this point, I have worked just with data from SNP-array (plink format, .vcf, .haps/.sample), and I have to verify the allele frequency of a single SNP in WGS data. Therefore, I have to perform SNP calling, alright?

The thing is, I did download the data in .bam format from ENA. However, I do not know from which reference genome it was generated, and I do need a fasta reference to make the SNP calling via HaplotypeCaller from GATK. Anyone knows how do I get the reference genome (build version) in order the download the right fasta file?

I saw a pretty similar question at this post, where @matted wrote:

In the worst case, you can infer the reference from the chromosome names (and number of chromosomes) and the assembly version by the sizes. I think they differ by a few bases e.g. from hg17 to hg18 to hg19. If for some reason they don't, you can look at reads around inter-reference variant sites and see which allele is called as matching the reference.

However, as I said I am a newbie. How can I infer the reference by chromosome names and the assembly version by the sizes? My purpose is really straightforward: just to see the allele frequency of a SNP in these data.

Any help will be very appreciated!

ADD COMMENTlink modified 3 months ago by biobiu90 • written 3 months ago by cmcouto.silva30

Can you provide a link for the data you downloaded? Sometimes data is supplied as unaligned BAM files instead of plain fastq.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k


ADD REPLYlink written 3 months ago by cmcouto.silva30

It has a paper linked, is the assembly/annotation mentioned there? (I cannot access it, due to paywall)

ADD REPLYlink written 3 months ago by michael.ante3.2k

Methods say this. So it was aligned to GRCh37:

Sequencing reads were trimmed for Illumina adaptors by using AdapterRemoval (60) and mapped to the human reference genome build 37 by using BWA v.0.6.2-r126 (61) with disabled seeding (−l parameter) (62). Reads with mapping quality lower than 30 were discarded, polymerase chain reaction duplicates were identified by using MarkDuplicates (63), and local realignment was carried out by using GATK


ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k

Yes, it did! My bad, sorry! I'm probably a little bit anxious. But all your tips were from great value!

ADD REPLYlink written 3 months ago by cmcouto.silva30

You may manually check few reads. Align them to both hg19 and hg38 and check the coordinates....

ADD REPLYlink written 3 months ago by biobiu90
gravatar for WouterDeCoster
3 months ago by
WouterDeCoster38k wrote:

Those things should be part of your bam header, which you can get using

samtools view -H youralignment.bam
ADD COMMENTlink written 3 months ago by WouterDeCoster38k

Yeah, I used this command, but there is no info regarding the assembly or reference used.

ADD REPLYlink written 3 months ago by cmcouto.silva30

But you get the names and sizes of the chromosomes.

ADD REPLYlink written 3 months ago by michael.ante3.2k

Wow, I just got it! So, I must look if the chromosome size from my file matches the one from the reference assembly (by chromosome). I did it by looking at Thanks!!

ADD REPLYlink written 3 months ago by cmcouto.silva30

That may be a bit risky but if you have no choice... You can always check the paper that accompanied this submission for that infor as well.

You can see the @PG lines from one of your BAM files. Many aligners will include the full command line used for alignment in one of these lines but looks like this one does not do that.

@PG     ID:MarkDuplicates       VN:1.128(c8e12338d226532b30e9ecdbf33180a073c3ffc7_1421081159)   CL:picard.sam.markduplicates.MarkDuplicates INPUT=[T/Lovelock2_mapped_filt_sorted.bam] OUTPUT=T/Lovelock2_mapped_filt_sorted_dedup.bam METRICS_FILE=T/Lovelock2_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=[temp] VALIDATION_STRINGENCY=LENIENT    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false PN:MarkDuplicates
@PG     ID:bwa  VN:0.6.1-r104   PN:bwa
@PG     ID:GATK IndelRealigner  CL:knownAlleles=[] targetIntervals=T/Lovelock2.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null

If you want to be absolutely certain download the fastq files and align the data yourself.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax65k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 857 users visited in the last hour