Question

Extracting SNPs from a different genome version

0

Entering edit mode

8.1 years ago

emilyepuckett • 0

I have a list of SNPs identified using RADseq and mapped to the 6th Rattus norvegicus genome build (rn6). I would like to extract these SNPs from 12 whole genomes which were mapped to the 5th genome build (rn5). I was planning on: first using mpileup in samtools to convert the whole genome BAM to a VCF using the rn5 genome

samtools1.2 mpileup -v -o rat.vcf -f ./Rnor_5/rn5.fa rat.bam

Second, use LiftoverVCF in picard to convert the coordinates

java -jar /usr/local/picard-tools-2.1.0/picard.jar LiftoverVcf I=grat.vcf O=rat_lift6.vcf CHAIN=./Rnor_5/rn5ToRn6.over.chain.gz REJECT=rat_reject.vcf R=./Rnor_6/rn6.dict

Then use mpileup again with a BED list (-l) of the SNPs I want to extract after the files are in the correct genome annotation.

If there are better ways to solve this problem, I am open to suggestions.

However, the trouble I think I am running in to is that the headers for the rn5, rn6, and over.chain files are all different. For example chromosome 1 is alternately called:

User$ head rn5.dict
@HD VN:1.5  SO:unsorted
@SQ SN:chr1 LN:290094216    M5:9397441152f18c800456aa682b0935c6 UR:file:/Volumes/urbanevol/Rnor_5/rn5.fa

User$ head rn6.dict
@HD VN:1.5  SO:unsorted
@SQ SN:CM000072.5   LN:282763074    M5:67cb6bbed9d7e4434bd6d277e3f883c2 UR:file:/Volumes/urbanevol/Rnor_6/GCA_000001895.4_Rnor_6.0_genomic.fna

In the over.chain file chr1 for rn6 is called chr1 and not CM000072.5; example:

chain 23677882575 chr1 290094216 + 0 290094216 chr1 282763074 + 0 282761901 1

Finally, in the BAM file for the downloaded genomes, chr 1 is called 1:

User$ samtools1.2 view -H rat.bam | head
@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:290094216    UR:file:/nfs/srpipe_references/references/Rattus_norvegicus/Rnor_5.0/all/fasta/Rattus_norvegicus.Rnor_5.0.71.fasta  AS:Rnor 5.0 M5:9397441152f18c800456aa682b0935c6 SP:Rattus norvegicus

My question is: how can I reconcile these different header names coming from different sources? samtools repeats errors that chromosomes from the input BAM cannot be identified, although it outputs a VCF. When I run the Picard command above, I recieve this error:

Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

Kind thanks for any advice.

samtools picard liftover • 2.3k views

ADD COMMENT • link 8.1 years ago by emilyepuckett • 0

1

Entering edit mode

I don't have a full answer but it seems that the problem is that you have 3 different naming schemes not just the typical two.

Many tools can handle the 1 to chr1 conversion automatically but the 1 to CM000072.5 is more arbitrary. I would start with either editing the rn6.dict file and replacing the chromosome names with those that match the chain file or recomputing it with picard from the correct genome. Hopefully that fixes it. If not then the next step is to move the rat.bam file from 1 to chr1 as well.

(also this is a really nice writeup of the problem, an example of how questions should be asked)

ADD REPLY • link 8.1 years ago by Istvan Albert 100k

0

Entering edit mode

Thank you. Looking at this again, you are correct in that I had mapped to the incorrect genome to produce the BAM file. I have fixed this and now do not have errors (with either headers or all Ns as the REF allele) in the production of a VCF.

The main problem was that the Rattus norvegicus rn6 genome has different chromosome names depending on if you download the genome from NCBI or UCSC. Only the rn6 downloaded from UCSC works with the overchain file from rn5 to rn6 provided by UCSC. I hope that is helpful to others.

ADD REPLY • link 8.1 years ago by emilyepuckett • 0