I have a list of SNPs identified using RADseq and mapped to the 6th Rattus norvegicus genome build (rn6). I would like to extract these SNPs from 12 whole genomes which were mapped to the 5th genome build (rn5). I was planning on: first using mpileup in samtools to convert the whole genome BAM to a VCF using the rn5 genome
samtools1.2 mpileup -v -o rat.vcf -f ./Rnor_5/rn5.fa rat.bam
Second, use LiftoverVCF in picard to convert the coordinates
java -jar /usr/local/picard-tools-2.1.0/picard.jar LiftoverVcf I=grat.vcf O=rat_lift6.vcf CHAIN=./Rnor_5/rn5ToRn6.over.chain.gz REJECT=rat_reject.vcf R=./Rnor_6/rn6.dict
Then use mpileup again with a BED list (-l) of the SNPs I want to extract after the files are in the correct genome annotation.
If there are better ways to solve this problem, I am open to suggestions.
However, the trouble I think I am running in to is that the headers for the rn5, rn6, and over.chain files are all different. For example chromosome 1 is alternately called:
User$ head rn5.dict
@HD VN:1.5 SO:unsorted
@SQ SN:chr1 LN:290094216 M5:9397441152f18c800456aa682b0935c6 UR:file:/Volumes/urbanevol/Rnor_5/rn5.fa
User$ head rn6.dict
@HD VN:1.5 SO:unsorted
@SQ SN:CM000072.5 LN:282763074 M5:67cb6bbed9d7e4434bd6d277e3f883c2 UR:file:/Volumes/urbanevol/Rnor_6/GCA_000001895.4_Rnor_6.0_genomic.fna
In the over.chain file chr1 for rn6 is called chr1 and not CM000072.5; example:
chain 23677882575 chr1 290094216 + 0 290094216 chr1 282763074 + 0 282761901 1
Finally, in the BAM file for the downloaded genomes, chr 1 is called 1:
User$ samtools1.2 view -H rat.bam | head
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:290094216 UR:file:/nfs/srpipe_references/references/Rattus_norvegicus/Rnor_5.0/all/fasta/Rattus_norvegicus.Rnor_5.0.71.fasta AS:Rnor 5.0 M5:9397441152f18c800456aa682b0935c6 SP:Rattus norvegicus
My question is: how can I reconcile these different header names coming from different sources? samtools repeats errors that chromosomes from the input BAM cannot be identified, although it outputs a VCF. When I run the Picard command above, I recieve this error:
Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
Kind thanks for any advice.
I don't have a full answer but it seems that the problem is that you have 3 different naming schemes not just the typical two.
Many tools can handle the
1
tochr1
conversion automatically but the1
toCM000072.5
is more arbitrary. I would start with either editing thern6.dict
file and replacing the chromosome names with those that match the chain file or recomputing it with picard from the correct genome. Hopefully that fixes it. If not then the next step is to move therat.bam
file from1
tochr1
as well.(also this is a really nice writeup of the problem, an example of how questions should be asked)
Thank you. Looking at this again, you are correct in that I had mapped to the incorrect genome to produce the BAM file. I have fixed this and now do not have errors (with either headers or all Ns as the REF allele) in the production of a VCF.
The main problem was that the Rattus norvegicus rn6 genome has different chromosome names depending on if you download the genome from NCBI or UCSC. Only the rn6 downloaded from UCSC works with the overchain file from rn5 to rn6 provided by UCSC. I hope that is helpful to others.