Finding out assembly reference from bam file
1
0
Entering edit mode
3.3 years ago

I am entirely newbie on WGS. At this point, I have worked just with data from SNP-array (plink format, .vcf, .haps/.sample), and I have to verify the allele frequency of a single SNP in WGS data. Therefore, I have to perform SNP calling, alright?

The thing is, I did download the data in .bam format from ENA. However, I do not know from which reference genome it was generated, and I do need a fasta reference to make the SNP calling via HaplotypeCaller from GATK. Anyone knows how do I get the reference genome (build version) in order the download the right fasta file?

I saw a pretty similar question at this post, where @matted wrote:

In the worst case, you can infer the reference from the chromosome names (and number of chromosomes) and the assembly version by the sizes. I think they differ by a few bases e.g. from hg17 to hg18 to hg19. If for some reason they don't, you can look at reads around inter-reference variant sites and see which allele is called as matching the reference.

However, as I said I am a newbie. How can I infer the reference by chromosome names and the assembly version by the sizes? My purpose is really straightforward: just to see the allele frequency of a SNP in these data.

Any help will be very appreciated!

bam reference genome snp calling • 2.0k views
2
Entering edit mode

0
Entering edit mode
1
Entering edit mode

It has a paper linked, is the assembly/annotation mentioned there? (I cannot access it, due to paywall)

1
Entering edit mode

Methods say this. So it was aligned to GRCh37:

Sequencing reads were trimmed for Illumina adaptors by using AdapterRemoval (60) and mapped to the human reference genome build 37 by using BWA v.0.6.2-r126 (61) with disabled seeding (−l parameter) (62). Reads with mapping quality lower than 30 were discarded, polymerase chain reaction duplicates were identified by using MarkDuplicates (63), and local realignment was carried out by using GATK

.

0
Entering edit mode

Yes, it did! My bad, sorry! I'm probably a little bit anxious. But all your tips were from great value!

0
Entering edit mode

You may manually check few reads. Align them to both hg19 and hg38 and check the coordinates....

4
Entering edit mode
3.3 years ago

Those things should be part of your bam header, which you can get using

samtools view -H youralignment.bam

0
Entering edit mode

Yeah, I used this command, but there is no info regarding the assembly or reference used.

3
Entering edit mode

But you get the names and sizes of the chromosomes.

0
Entering edit mode

Wow, I just got it! So, I must look if the chromosome size from my file matches the one from the reference assembly (by chromosome). I did it by looking at https://www.ncbi.nlm.nih.gov/grc/human/data?asm=GRCh37. Thanks!!

3
Entering edit mode

That may be a bit risky but if you have no choice... You can always check the paper that accompanied this submission for that infor as well.

You can see the @PG lines from one of your BAM files. Many aligners will include the full command line used for alignment in one of these lines but looks like this one does not do that.

@PG     ID:MarkDuplicates       VN:1.128(c8e12338d226532b30e9ecdbf33180a073c3ffc7_1421081159)   CL:picard.sam.markduplicates.MarkDuplicates INPUT=[T/Lovelock2_mapped_filt_sorted.bam] OUTPUT=T/Lovelock2_mapped_filt_sorted_dedup.bam METRICS_FILE=T/Lovelock2_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR=[temp] VALIDATION_STRINGENCY=LENIENT    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false PN:MarkDuplicates
@PG     ID:bwa  VN:0.6.1-r104   PN:bwa
@PG     ID:GATK IndelRealigner  CL:knownAlleles=[] targetIntervals=T/Lovelock2.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null


If you want to be absolutely certain download the fastq files and align the data yourself.