Question: Issue indexing reference genome FASTA files - index file does not contain chromosome numbers?
gravatar for
5.4 years ago by
United States
Irene@Sequencing.com250 wrote:

I've encountered an issue while indexing reference genomes - the file doesn't contain any chromosome numbers.

Several FASTA reference files from NCBI have sequence header information in the following format:

gi|224514618|ref|NT_077402.2| Homo sapiens chromosome 1 genomic contig, GRCh37. p13 Primary Assembly

The sequence header has the chromosome number embedded within. I tried to use the fasta file as a part of my sequencing pipeline but have encountered an issue: the aligners (BowTie2, Isaac, etc.) produce an index file which contains the contig names instead of chromosome number. Due to this the resultant VCF files produced using this indexed reference file also contains the contig data as the chromosome name.

It also seems that the reference FASTA files when used from UCSC contain the sequence header in the correct format such as: chr 1

and these UCSC FASTA files can be used without any issue (or need to reformat) via my sequencing pipeline (produces VCF files with chromosome numbers)

There are, however, many NCBI FASTA files I'd like to index as reference files that are not provided by UCSC - is there a way to format a FASTA file so that the chromosome number is extracted from the FASTA and reformatted and the FASTA file updated so that it can be used for indexing? (ie is in the UCSC style?)

reference fasta • 3.5k views
ADD COMMENTlink modified 5.4 years ago by Philipp Bayer6.8k • written 5.4 years ago by Irene@Sequencing.com250

Not tested: cat ref.fa | perl -pe 's/>.*chromosome (\S+).*/>chr$1/' > renamed.fa

ADD REPLYlink written 5.4 years ago by lh332k
gravatar for Philipp Bayer
5.4 years ago by
Philipp Bayer6.8k
Philipp Bayer6.8k wrote:

I'm not sure I understand you correctly, does your aligner report "gi|224514618|ref|NT_077402.2|" as the hit?

If so, that's one way FASTA files are defined - BioPython, for example, distinguishes "names" and "IDs" for sequences in FASTA files.  Everything before the space is the ID, everything after is the name, so for example the ID for your sequence is "gi|224514618|ref|NT_077402.2|", the name is "Homo sapiens chromosome 1 genomic contig, GRCh37. p13 Primary Assembly". By default (and I don't know right now how to override this, there doesn't seem to be an option for this) bowtie2-build just uses the ID.

The UCSC downloads have a much simpler ID and no name, the first human chromosome for example just has ">chr1", no space.

The laziest way to override this behavior is to replace spaces by "_", then it will report the entire name. Or you can write a small program that replaces the entire ID/name by the chromosome number, is that one always in the same position?

ADD COMMENTlink written 5.4 years ago by Philipp Bayer6.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1259 users visited in the last hour