Question: Finding out assembly reference from bam file
gravatar for cmcouto.silva
20 months ago by
São Paulo, Brazil.
cmcouto.silva30 wrote:

I am entirely newbie on WGS. At this point, I have worked just with data from SNP-array (plink format, .vcf, .haps/.sample), and I have to verify the allele frequency of a single SNP in WGS data. Therefore, I have to perform SNP calling, alright?

The thing is, I did download the data in .bam format from ENA. However, I do not know from which reference genome it was generated, and I do need a fasta reference to make the SNP calling via HaplotypeCaller from GATK. Anyone knows how do I get the reference genome (build version) in order the download the right fasta file?

I saw a pretty similar question at this post, where @matted wrote:

In the worst case, you can infer the reference from the chromosome names (and number of chromosomes) and the assembly version by the sizes. I think they differ by a few bases e.g. from hg17 to hg18 to hg19. If for some reason they don't, you can look at reads around inter-reference variant sites and see which allele is called as matching the reference.

However, as I said I am a newbie. How can I infer the reference by chromosome names and the assembly version by the sizes? My purpose is really straightforward: just to see the allele frequency of a SNP in these data.

Any help will be very appreciated!

ADD COMMENTlink modified 20 months ago by biobiu120 • written 20 months ago by cmcouto.silva30

Can you provide a link for the data you downloaded? Sometimes data is supplied as unaligned BAM files instead of plain fastq.

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax90k


ADD REPLYlink written 20 months ago by cmcouto.silva30

It has a paper linked, is the assembly/annotation mentioned there? (I cannot access it, due to paywall)

ADD REPLYlink written 20 months ago by michael.ante3.6k

Methods say this. So it was aligned to GRCh37:

Sequencing reads were trimmed for Illumina adaptors by using AdapterRemoval (60) and mapped to the human reference genome build 37 by using BWA v.0.6.2-r126 (61) with disabled seeding (−l parameter) (62). Reads with mapping quality lower than 30 were discarded, polymerase chain reaction duplicates were identified by using MarkDuplicates (63), and local realignment was carried out by using GATK


ADD REPLYlink modified 20 months ago • written 20 months ago by genomax90k

Yes, it did! My bad, sorry! I'm probably a little bit anxious. But all your tips were from great value!

ADD REPLYlink written 20 months ago by cmcouto.silva30

You may manually check few reads. Align them to both hg19 and hg38 and check the coordinates....

ADD REPLYlink written 20 months ago by biobiu120
gravatar for WouterDeCoster
20 months ago by
WouterDeCoster44k wrote:

Those things should be part of your bam header, which you can get using

samtools view -H youralignment.bam
ADD COMMENTlink written 20 months ago by WouterDeCoster44k

Yeah, I used this command, but there is no info regarding the assembly or reference used.

ADD REPLYlink written 20 months ago by cmcouto.silva30

But you get the names and sizes of the chromosomes.

ADD REPLYlink written 20 months ago by michael.ante3.6k

Wow, I just got it! So, I must look if the chromosome size from my file matches the one from the reference assembly (by chromosome). I did it by looking at Thanks!!

ADD REPLYlink written 20 months ago by cmcouto.silva30

That may be a bit risky but if you have no choice... You can always check the paper that accompanied this submission for that infor as well.

You can see the @PG lines from one of your BAM files. Many aligners will include the full command line used for alignment in one of these lines but looks like this one does not do that.

@PG     ID:MarkDuplicates       VN:1.128(c8e12338d226532b30e9ecdbf33180a073c3ffc7_1421081159)   CL:picard.sam.markduplicates.MarkDuplicates INPUT=[T/Lovelock2_mapped_filt_sorted.bam] OUTPUT=T/Lovelock2_mapped_filt_sorted_dedup.bam METRICS_FILE=T/Lovelock2_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=[temp] VALIDATION_STRINGENCY=LENIENT    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false PN:MarkDuplicates
@PG     ID:bwa  VN:0.6.1-r104   PN:bwa
@PG     ID:GATK IndelRealigner  CL:knownAlleles=[] targetIntervals=T/Lovelock2.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null

If you want to be absolutely certain download the fastq files and align the data yourself.

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax90k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 838 users visited in the last hour