Question: How to build a custom reference genome (from a BED file)
gravatar for umn_bist
4.7 years ago by
umn_bist380 wrote:

Up to now, I've been aligning my RNAseq PE samples to a reference whole genome build from Ensembl/gencode. My work involves calling somatic variants at specific regions on chr1, 8, 11, 12, 17, 22.

I have a BED file of my genes of interest, but is it preferred to build a custom reference genome based on the BED file or to concatenate chr1, 8, 11, 12, 17, 22? If using BED is preferred, would I use fastafrombed function in bedtools?

I understand that my custom reference genome will not have unplaced contigs included but I'm curious how bad this will be if I'm only concerned with calling variants on known regions in the first place.

Curious to hear your thoughts and input.

rna-seq • 2.5k views
ADD COMMENTlink modified 4.7 years ago by mbio.kyle360 • written 4.7 years ago by umn_bist380

Why are you doing this? What advantage are you going to get (besides saving some time on alignments)?
But to answer your question the way you have described would be one way of doing it.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by GenoMax92k

It's mostly time. I use STAR aligner which is very fast compared to TopHat, but it's still taking about 20 minutes to map one paired-end sample. I have currently 200+ RNAseq files and any amount of time saving will help.

Would you say that it's safer to use a custom ref genome of entire chr 1, 8, 11, 12, 17, 22 then one based on highly specific BED file of my genes of interest? I'm particularly concerned that some reads are bound to misalign slightly which may not get pulled if I used a very specific custom ref genome (i.e. built from BED file).

ADD REPLYlink written 4.7 years ago by umn_bist380

Some reads are likely going to mis-align as soon as you stop including the entire genome. Before you dive in all the way you may want to compare the results for a couple of samples (alignment against the whole genome and the reduced representation as chromosomes/BED regions) to check on the differences.

ADD REPLYlink written 4.7 years ago by GenoMax92k

^This. Aligning against partial references is a bad idea, unless you're damn sure that all your input sequence comes from a a restricted region.

ADD REPLYlink written 4.7 years ago by Chris Miller21k

If time constraints is the issue you can also align everything in parallel modes at the expense of memory utilization to speed up the tasks. In this thread I see someone mentioning STAR to align 60 million reads at 4 mins. You can also take a look at fast methods transcripts assembly like Salmon if you are not restricted only to alignment based methods. In any case for SNP calling from RNA-Seq which is at times a bit tricky it would be interesting to use the region of interest bed file as used in GATK for tagret interest intervals for exome data analysis. If this can be modeled for your RNA-Seq in a way to find the read read intensities across your target regions and then make SNP calls only to those regions. I do not know if anyone did it but worth taking a try if that is possible.

ADD REPLYlink written 4.7 years ago by ivivek_ngs5.0k
gravatar for mbio.kyle
4.7 years ago by
United States
mbio.kyle360 wrote:

You could look into using RSEM (, it supports using a reference of just transcripts, and while it recommends using bowtie as its under the hood aligner, you can have it use STAR.

You may need to do some data transformation on your bed file, or just get the ensembl gtf file and pull out the gene records you want.

The build command would be like:

rsem-prepare-reference --star --gtf ensembl_subset.gtf hg19.fasta hg19-subset-star

Note that I have "hg19.fasta" as an input. With the gtf flag RSEM will read your gtf records, and get only the required sequence from the fasta. So you could instead do:

rsem-prepare-reference --star --gtf ensembl_subset.gtf chr1.fasta, chr8.fasta, chr11.fasta, chr12.fasta, chr17.fasta, chr22.fasta hg19-subset-star

Depending on how you have the sequence data

ADD COMMENTlink written 4.7 years ago by mbio.kyle360

Aligning against just transcripts feels like a bad idea too. We know that in most RNAseq prep, there is significant presence of non-exonic reads (introns not spliced out, intergenic read throughs, yadda yadda). If the algorithm is trying very hard to place those reads onto exons, I'd have to imagine that a non-trivial amount of them get wrongly mapped. Maybe this is something they address in their paper (full disclosure - I haven't read it!) but if not, I'd be concerned.

ADD REPLYlink written 4.7 years ago by Chris Miller21k

RSEM works over top of a vanilla aligner (by default bowtie). It does tweak some parameters of the aligner, mostly with a focus on having bowtie return multiple alignments per read (which are fed into the EM model). The main reason for this is assigning reads to different isoforms. So it would not force reads into exons any more then bowtie would.

ADD REPLYlink written 4.7 years ago by mbio.kyle360
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2148 users visited in the last hour