Question: Mouse strains and Reference genome choice
tiago211287 (1.1k), USA, wrote 4.2 years ago:

I have RNA-seq data from the BALB/c mouse strain.

Looking for the reference genome on Ensembl, I found that the most recent version, GRCm38, was built using the C57BL/6J strain.

I suppose that the PATCH files also contain haplotypes and variation from other strains, such as BALB/c.

Is this information in the primary assembly file?

If I want to, how can I use the patch files for building the index? Just concatenate the files?

Or use the toplevel file for indexing?

Thank you.

mouse reference strains genome • 3.7k views
modified 4.2 years ago by Ashutosh Pandey • written 4.2 years ago by tiago211287

Ashutosh Pandey (11k), Philadelphia, wrote 4.2 years ago:

If your goal is to reduce allele bias in RNA-seq read mapping, then using the PATCH files won't be of much help. You will have to either generate a strain-specific reference genome to align your reads against or use a more sensitive aligner. BALB/c has been sequenced as part of the Mouse Genomes Project, and the variant calls (VCF) can be downloaded from the following page: http://www.sanger.ac.uk/resources/mouse/genomes/. There are around 4 million SNPs and around 0.8 million indels between C57BL/6J and BALB/c. Although most of these variants fall in intergenic regions, it would be good practice to align reads in a haplotype-sensitive manner. To do that, you will have to create a customized reference genome by substituting SNPs and small indels, and then perform the alignment. Sanger provides one big VCF file covering all the strains, so you will have to 1) extract the variants for the BALB/c strain and 2) substitute them into the reference genome. There are many relevant posts on Biostars that discuss both of these steps in detail.
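Step 1 above (pulling one strain's variants out of a multi-sample VCF) can be sketched in plain Python. This is only a minimal illustration, not the method used by the Sanger pipeline or by any particular tool: the function name `extract_sample_variants` and the sample column names `A_J` and `BALB_cJ` are assumptions made for the demo, so check the `#CHROM` header line of the file you actually download for the real column name.

```python
def extract_sample_variants(vcf_lines, sample):
    """Yield (chrom, pos, ref, alt) for records where `sample` carries a non-reference allele."""
    sample_col = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample columns come after the 9 fixed VCF columns.
            sample_col = fields.index(sample, 9)
            continue
        if sample_col is None:
            raise ValueError("#CHROM header line not found before the first record")
        chrom, pos, _id, ref, alt = fields[:5]
        format_keys = fields[8].split(":")
        sample_values = fields[sample_col].split(":")
        genotype = sample_values[format_keys.index("GT")]
        alt_alleles = alt.split(",")
        # Allele index 0 is the reference, "." is a missing call; keep the rest.
        called = {int(a) for a in genotype.replace("|", "/").split("/")
                  if a not in (".", "0")}
        for allele_index in sorted(called):
            yield chrom, int(pos), ref, alt_alleles[allele_index - 1]


# Tiny made-up VCF with two hypothetical sample columns.
demo_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA_J\tBALB_cJ",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t1/1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\t0/0",
    "1\t300\t.\tGA\tG\t.\tPASS\t.\tGT:GQ\t0/0\t1/1:99",
]

balb_variants = list(extract_sample_variants(demo_vcf, "BALB_cJ"))
print(balb_variants)  # [('1', 100, 'A', 'G'), ('1', 300, 'GA', 'G')]
```

For a real genome-wide file, a dedicated VCF tool will be much faster; the sketch is only meant to show what "extract variants for one strain" means at the record level.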

modified 4.2 years ago • written 4.2 years ago by Ashutosh Pandey

Thank you. Can you link the relevant posts on Biostars that discuss these steps?

I don't know how to start substituting the variants into the reference.

 

written 4.2 years ago by tiago211287 (1.1k)

FastaAlternateReferenceMaker
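For anyone unfamiliar with the tool, here is a rough sketch of how the call might be assembled from Python. It assumes a GATK 4-style `gatk` launcher on the PATH with the `-R`, `-V`, and `-O` arguments; older GATK 3 releases select the tool with `-T` instead, so adjust to your version. The helper name and all file names are hypothetical placeholders. The command is only printed here, not executed.

```python
import shlex


def fasta_alternate_reference_command(reference, variants, output):
    """Build the argument list for a GATK 4-style FastaAlternateReferenceMaker run."""
    return [
        "gatk", "FastaAlternateReferenceMaker",
        "-R", reference,  # reference FASTA (GATK also expects its .fai and .dict)
        "-V", variants,   # strain-specific VCF
        "-O", output,     # FASTA with the variants substituted in
    ]


# Hypothetical file names, for illustration only.
cmd = fasta_alternate_reference_command(
    "GRCm38.primary_assembly.fa",
    "BALB_cJ.snps.vcf.gz",
    "BALB_cJ.GRCm38.fa",
)
print(shlex.join(cmd))  # shlex.join needs Python 3.8+
# To actually run it: subprocess.run(cmd, check=True)
```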

written 4.2 years ago by geek_y (9.9k)

Great tool. I did not know about it until now. Thank you.

written 4.2 years ago by tiago211287 (1.1k)

You can use vcf-subset to extract the variants for a particular strain from the big MGP VCF file, or you can read this post: A: Where To Download Mouse Mm10 Dbsnp Database With Vcf Format. Once you have the VCF file, you can use FastaAlternateReferenceMaker as Goutham suggested. Be aware that FastaAlternateReferenceMaker will not create a modified GTF file with new coordinates, which matters if you are also substituting indels into the reference genome. You can use Personal Genome Constructor (http://alleleseq.gersteinlab.org/tools.html) from the Gerstein Lab, which also outputs a modified GTF file. However, if this is your first time dealing with all this, you may want to substitute only SNPs into the reference genome. That way you can keep using the original GTF file, because substituting SNPs won't change the positions of transcripts.
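To see why SNP-only substitution keeps the original GTF valid, here is a toy sketch on a 10-base sequence. It is not how FastaAlternateReferenceMaker or Personal Genome Constructor work internally; `apply_snps` and the example variants are made up for illustration. Single-base substitutions leave the sequence length, and therefore every downstream coordinate, untouched, whereas applying the skipped deletion would shift everything after it by one base.

```python
def apply_snps(sequence, variants):
    """Apply single-base substitutions to `sequence`; indels are skipped.

    `variants` holds (pos, ref, alt) tuples with 1-based positions, as in VCF.
    Returns the modified sequence and the list of skipped (non-SNP) variants.
    """
    bases = list(sequence)
    skipped = []
    for pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            skipped.append((pos, ref, alt))  # indel: would change coordinates
            continue
        if bases[pos - 1].upper() != ref.upper():
            raise ValueError(f"REF allele does not match the sequence at position {pos}")
        bases[pos - 1] = alt
    return "".join(bases), skipped


reference = "ACGTACGTAC"
variants = [(2, "C", "T"), (5, "AC", "A"), (9, "A", "G")]

strain_sequence, skipped_indels = apply_snps(reference, variants)
print(strain_sequence)                         # ATGTACGTGC
print(len(strain_sequence) == len(reference))  # True: coordinates are preserved
print(skipped_indels)                          # [(5, 'AC', 'A')]
```

The REF check is worth keeping in any real version: a mismatch usually means the VCF and the FASTA come from different assemblies or use different contig names.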

written 4.2 years ago by Ashutosh Pandey (11k)

I followed your advice but got stuck on an error with GATK. I tried to solve it and ran into another problem.

I posted the problem here:

Error when trying to fix the contigs order in the reference and vcf for FastaAlternateReferenceMaker

 

written 4.2 years ago by tiago211287 (1.1k)