Are the UCSC genome assemblies non-redundant? How do I get a non-redundant genome fasta?
3.7 years ago
rmartson • 0

I'm looking to have a single FASTA sequence for each chromosome in an organism, but if I check the sequences in panTro5.fa (chimp) that I've downloaded from UCSC I get a ton of ids like: chr10_NW_015973889v1_random, chr10_NW_015973890v1_random, etc.

What are these and how do I get rid of them? I don't have them in my hg38.fa (human) file because you can download all the chromosomes individually and then assemble them into one fasta, but I don't think you get that option with other genomes.

I need to use the genomes to find hits for viral LTR sequences and the number of hits is important so I don't want to get the same hit in the same region of the genome twice or more.

genome ucsc blast • 976 views
3.7 years ago
h.mon 32k

The hits you are getting in these random regions are believed to be real (and distinct from each other). The sequences are not assigned a proper location, probably because they are flanked by (even more) repetitive regions. You can get chr_random sequences in the human genome as well; it depends on where you downloaded the FASTA from and whether you (or someone else) post-processed the genome after download.

I would argue you will get a more accurate count of viral LTR hits if you include the chr_random sequences, but comparing counts between assemblies of different quality will be problematic.
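If you still want a FASTA with only the assembled chromosomes (or want to count hits on the random sequences separately), here is a minimal sketch in plain Python. It assumes UCSC-style names, where assembled chromosomes have no underscore (chr1, chr2A, chrX, chrM) and everything else (chrN_..._random, chrUn_..., _alt, _fix) carries an underscore suffix. The names `is_primary` and `filter_fasta` are made up for this example, and the chrUn ID in the demo is hypothetical.

```python
import re

# Assumption: UCSC naming. An assembled chromosome is "chr" followed by
# letters/digits only; any underscore marks an unlocalized, unplaced,
# alternate, or patch sequence.
PRIMARY = re.compile(r"^chr[0-9A-Za-z]+$")


def is_primary(seq_id):
    """Return True for names like chr10 or chrX, False for chr10_..._random."""
    return bool(PRIMARY.match(seq_id))


def filter_fasta(lines, keep=is_primary):
    """Yield only the lines of FASTA records whose ID passes `keep`."""
    writing = False
    for line in lines:
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after ">".
            tokens = line[1:].split()
            writing = keep(tokens[0]) if tokens else False
        if writing:
            yield line


# Small in-memory demo; the chrUn record ID is a hypothetical placeholder.
demo = [
    ">chr10\n", "ACGT\n",
    ">chr10_NW_015973889v1_random\n", "GGCC\n",
    ">chrUn_NW_000000000v1\n", "TTAA\n",
    ">chrX\n", "NNAC\n",
]
kept = list(filter_fasta(demo))

# Usage on a real file (output name is just an example):
# with open("panTro5.fa") as src, open("panTro5.primary.fa", "w") as dst:
#     dst.writelines(filter_fasta(src))
```

Passing `keep=lambda s: not is_primary(s)` would give you the complementary file, so you can run your search on both sets and report the random-region hits separately instead of discarding them.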


Alright, I'll download the fasta file with random regions for the human genome as well then.
