Question: Are the UCSC genome assemblies non-redundant? How do I get a non-redundant genome fasta?
23 months ago by
rmartson0 wrote:

I'm looking to have a single FASTA sequence for each chromosome in an organism, but if I check the sequences in panTro5.fa (chimp) that I've downloaded from UCSC I get a ton of ids like: chr10_NW_015973889v1_random, chr10_NW_015973890v1_random, etc.

What are these and how do I get rid of them? I don't have them in my hg38.fa (human) file because you can download all the chromosomes individually and then assemble them into one fasta, but I don't think you get that option with other genomes.

I need to use the genomes to find hits for viral LTR sequences and the number of hits is important so I don't want to get the same hit in the same region of the genome twice or more.

23 months ago by
h.mon25k wrote:

The hits you are getting in these random regions are believed to be real (and distinct from each other). The sequences are not assigned a proper location, probably because they are flanked by (even more) repetitive regions. You can get chr_random sequences in the human genome as well; it depends on where you downloaded the FASTA from and whether you (or someone else) post-processed the genome after download.
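To check what a given download actually contains, you can tally the sequence names by type. This is only a rough sketch: it assumes UCSC-style names, where a `_random` suffix marks a sequence with a known chromosome but unknown position, a `chrUn` prefix marks a sequence whose chromosome is unknown, and `_alt` / `_fix` suffixes mark alternate haplotypes and fix patches. The function names `classify_name` and `count_sequence_classes` are made up for this illustration.

```python
from collections import Counter


def classify_name(name):
    """Classify a sequence name, assuming UCSC-style naming conventions."""
    if name.endswith("_random"):
        return "random"    # chromosome known, position unknown
    if name.startswith("chrUn"):
        return "unplaced"  # chromosome unknown
    if name.endswith("_alt"):
        return "alt"       # alternate haplotype
    if name.endswith("_fix"):
        return "fix"       # fix patch
    return "primary"       # everything else, e.g. chr1, chrX, chrM


def count_sequence_classes(fasta_path):
    """Count FASTA records per class by reading only the header lines."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                counts[classify_name(name)] += 1
    return counts
```

Calling `count_sequence_classes("panTro5.fa")` and then the same on your hg38 file should show quickly whether the two files were built the same way.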

I would argue you will get a more complete count of viral LTR hits if you include the chr_random sequences, but comparing counts between assemblies of different quality will be problematic.
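If you still want a chromosome-only FASTA for one of the comparisons, a minimal sketch of the filtering is below. It relies on an assumption about UCSC naming, namely that fully placed chromosomes (chr1, chr2A, chrX, chrM, ...) have no underscore in their names while `_random`, `chrUn_`, `_alt` and `_fix` sequences all do; check your own headers before trusting it. The function name `write_primary_only` is hypothetical.

```python
def write_primary_only(in_path, out_path):
    """Copy only records whose name has no underscore; return the kept names.

    Assumption: in UCSC-style FASTA files, names without an underscore
    are the fully placed chromosomes.
    """
    kept = []
    keep = False
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                # Decide once per record; sequence lines inherit the decision.
                keep = bool(name) and "_" not in name
                if keep:
                    kept.append(name)
            if keep:
                dst.write(line)
    return kept
```

The file is streamed line by line, so memory use stays small even for a whole genome. The returned list of names lets you confirm that you kept exactly the chromosomes you expected.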

Alright, I'll download the fasta file with random regions for the human genome as well then.

