Question

Are the UCSC genome assemblies non-redundant? How do I get a non-redundant genome fasta?

6.8 years ago

rmartson • 0

I'm looking to have a single FASTA sequence for each chromosome in an organism, but when I check the sequences in panTro5.fa (chimp), which I downloaded from UCSC, I get a ton of IDs like chr10_NW_015973889v1_random, chr10_NW_015973890v1_random, etc.

What are these and how do I get rid of them? I don't have them in my hg38.fa (human) file because you can download all the chromosomes individually and then assemble them into one fasta, but I don't think you get that option with other genomes.

I need to use the genomes to find hits for viral LTR sequences, and the number of hits is important, so I don't want to get the same hit in the same region of the genome two or more times.

genome ucsc blast • 1.5k views

score 1 · Answer 1 · 2017-07-22

6.8 years ago

h.mon 35k

The hits you are getting in these random regions are believed to be real (and different from each other). The sequences are not assigned a proper location, probably because they are flanked by (even more) repetitive regions. You can get chr_random sequences in the human genome as well; it depends on where you downloaded the FASTA from and whether you (or someone else) post-processed the genome after download.

I would argue you will get a better number for viral LTR sequences using chr_random sequences, but it will be problematic to compare assemblies of different qualities.
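
If you want to count hits both with and without the unlocalized sequences, one option is to split the FASTA records by name. The following is only a minimal sketch, not a UCSC tool: it assumes UCSC-style sequence names in which `_random`, `_alt`, and `_fix` suffixes and the `chrUn_` prefix mark sequences outside the main chromosome records. The function names and the `chrUn_NW_015974000v1` record in the toy input are made up for illustration.

```python
import re

# Assumption: UCSC-style names. These suffixes/prefix mark sequences that are
# not part of the main chromosome records (unlocalized, unplaced, alt, fix).
NOT_PRIMARY = re.compile(r"(_random|_alt|_fix)$|^chrUn_")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def partition_records(lines):
    """Split records into (primary, extra) lists based on their names."""
    primary, extra = [], []
    for name, seq in read_fasta(lines):
        target = extra if NOT_PRIMARY.search(name) else primary
        target.append((name, seq))
    return primary, extra


# Toy input; with a real file, pass an open file handle instead.
example = """>chr10
ACGTACGT
ACGT
>chr10_NW_015973889v1_random
GGGG
>chrUn_NW_015974000v1
TTTT
""".splitlines()

primary, extra = partition_records(example)
print([name for name, _ in primary])
print([name for name, _ in extra])
```

Keeping the two groups separate, rather than discarding the extra records, lets you report LTR hit counts for the assembled chromosomes and for the unlocalized sequences side by side.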

Alright, I'll download the fasta file with random regions for the human genome as well then.

ADD REPLY • link 6.8 years ago by rmartson • 0