Question

Preprocessing a genome FASTA file prior to mapping?

7.9 years ago

lstbl ▴ 40

Hi Everyone,

Sorry if this is a dup, but I can't seem to find a satisfactory answer on this site or others.

I'm wondering what, if any, pre-processing I should perform on a reference genome FASTA/GFF file prior to mapping with BWA or Bowtie. For example, if I wanted to map something to the orangutan genome, should I remove entries that are labeled as "unplaced/unlocalized genomic scaffold" from the GFF and FASTA files, i.e. only map to the canonical chromosomes?

I notice that even these scaffolds have "BestRefSeq" categories in the gff file for genes, indicating that they still have useful information on them.

The reason I ask is that I was told by someone who no doubt knows much more than I do about this stuff that I SHOULD remove these scaffolds. I'm wondering, however, if this person is wrong.

Thanks!

next-gen bwa sequencing bowtie • 1.6k views

score 3 · Accepted Answer · 2016-05-23

That is your choice.
If the sequence is known to belong to the genome (but is currently unplaced), it should remain in the reference. You can ignore reads that align there if you are not interested in that region. Omitting the region from the reference may force an aligner to place those reads elsewhere, which you probably do not want.
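
To illustrate the "map to everything, ignore afterwards" approach, here is a minimal Python sketch that filters SAM text after alignment, keeping only records whose reference name (column 3, RNAME) is in a set you choose. The function name `keep_placed_alignments` and the sequence names `chr1` and `chrUn_scaffold1` are made up for the example; your assembly will use its own names. It only looks at RNAME, so mates that land on a scaffold are not handled specially.

```python
def keep_placed_alignments(sam_lines, wanted_refs):
    """Yield SAM header lines plus alignments whose RNAME is in wanted_refs.

    sam_lines   : iterable of SAM text lines (e.g. an open file)
    wanted_refs : set of reference names to keep (hypothetical names below)
    """
    for line in sam_lines:
        # Header lines start with '@' and are passed through unchanged.
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) of a SAM record is the reference name.
        if len(fields) > 2 and fields[2] in wanted_refs:
            yield line


if __name__ == "__main__":
    # Tiny made-up SAM fragment: one read on a chromosome, one on a scaffold.
    example = [
        "@SQ\tSN:chr1\tLN:1000\n",
        "@SQ\tSN:chrUn_scaffold1\tLN:500\n",
        "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
        "read2\t0\tchrUn_scaffold1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    ]
    for kept in keep_placed_alignments(example, {"chr1"}):
        print(kept, end="")
```

The point of doing it this way is that reads originating from the unplaced scaffolds still have somewhere correct to align, and you simply drop them afterwards instead of letting them be mis-placed on a chromosome.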