Question

How to get a human reference genome that is ready to use

0

6.9 years ago

AHW ▴ 90

I am looking for a human reference genome. I tried to download the genome from the link

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/

and the file I downloaded is hg38.fa. The purpose of downloading the reference genome is to align RNA-seq reads against it. When I looked into hg38.fa after downloading it, I found chromosome headers at the start, such as

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

and at the end I found something like

>chrY_KI270740v1_random
TAATAAATTTTGAAGAAAATGAAGACTGTGTTCTCAGTTCCAGGTGCTTC
ATCAGGCTCATTGTGGATCCAGACTACCAGACACAAGACATTACACATTG
TAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAG
AATTTTATAATGTTTGGAAAAATACATAGAGGCTTACTTTTTATTTTATT
TTTTTGAGATAGGAAGCCtttttttttgtttttgtttttgtttctgtttt
tgttttttgagacagagtctcaccatgtcacccagactggagtgcagtgg
tgcaatatcggcccattgcaagctccacatcccaggttcacaccattctc
ctgcctcagcctcccaagtagctgggactacaggtgcccgccaccacatc
cagctaatttttttttgtacttttagtagagacggggtatcaccatgtga
gccaagatggtctccatctcctgacctcgtgatctgcccaccttggcctc
ccaaagtgctgggattacaggggtgagccaccacgcccagGCATAGAGGC
ACTTTTAACCATAAATGAACACTGTTATGATTTGTATTACCACAGTATCA
TTATTCTGTCCTGTTTGCCTTACAttttatttatttattatactgtaagt
tctgggatacatgtgcagaatgtgcaggtttgttacagagatatatgctt
gtttgctgcacctgtcagtttttcatctacattaggtatttctcctaatg
ctattccctgttaggtccccaccctccaacagtctccagtgtttgatgtt
cccctccctatgtccatgtattctcattttacaactcccacctatgagtg
agaaattgcagtgtttgTGtgtttggaacttattccttccagtgggtttg
tggtctcgctcactgcaaaaatgaagctgtagaccgtttcggtgtgtgtt
acaactcttaaaggtggtgtgtctggagtttgctacttcacatgagctca
tggtcttgcttacttcaagaatgaagctgcagacatttacggtgagtgtt

I am not sure whether I can use this reference genome as it is, or whether any preprocessing is required before it is used for read alignment. I would also like to know where I can get RNA-seq reads that can be aligned to this reference genome.

genome RNA-Seq alignment • 2.6k views

ADD COMMENT • link updated 4.6 years ago by Biostar 20 • written 6.9 years ago by AHW ▴ 90

1

I would also like to know from where I can get RNA-seq reads.

From NCBI SRA/EBI ENA.

ADD REPLY • link 6.9 years ago by GenoMax 141k

1

6.9 years ago

Michael 54k

Just to add an explanation to what Devon wrote: what you find in an assembly like the one you downloaded is not "the genome", but a best-effort model of it, intended to represent "the genome" as a single linear string. This approach has many imperfections: gaps, ambiguity (non-ACGT characters), variation, alternative sequences, repeat regions of sometimes unknown length, and pieces of sequence that could not be placed on a chromosome, or could not be placed at an exact position, like the >chrY_KI270740v1_random sequence. An indexing method needs to be able to deal with all or most of these. With all this variation and more individual genomes being sequenced, a graph might be a much better representation of "the genome".
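To see how much of an assembly consists of gaps and lower-case sequence, one could tally the characters per FASTA record. The following is only a minimal sketch: the function names `summarize_fasta` and `_row` are made up for illustration, and the toy input stands in for a real file such as hg38.fa.

```python
import io
from collections import Counter

def summarize_fasta(handle):
    """Yield (name, length, n_count, lowercase_count) for each FASTA record."""
    name, counts = None, Counter()
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield _row(name, counts)
            # Keep only the identifier, i.e. the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts = Counter()
        else:
            counts.update(line)
    if name is not None:
        yield _row(name, counts)

def _row(name, counts):
    length = sum(counts.values())
    n_count = counts["N"] + counts["n"]
    lowercase = sum(c for base, c in counts.items() if base.islower())
    return name, length, n_count, lowercase

# Toy input; for a real file use: with open("hg38.fa") as handle: ...
demo = io.StringIO(
    ">chr1\nNNNNACGT\nacgtNN\n>chrY_KI270740v1_random\nTAATaaat\n"
)
rows = list(summarize_fasta(demo))
for row in rows:
    print(*row, sep="\t")
```

Reading line by line keeps memory use low, which matters for a file the size of a human assembly.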

ADD COMMENT • link 6.9 years ago by Michael 54k

0

Thank you for the answer. From the comments and answers above, I understand the importance of chromosome identifiers in the sequence alignment process. I also understand that the genome model has many imperfections, such as gaps, ambiguity, repeated regions, and sequences with uncertain chromosome positions like >chrY_KI270740v1_random, and that any aligner has to handle all of these.

In general, what is the sensible thing to do in each of these situations:

1. Should I ignore any gap in the reference genome?
2. Should I consider repeated regions of the reference genome if any read falls there?
3. Are identifiers like >chrY_KI270740v1_random valid identifiers, and should they be reported in the alignment results like other valid identifiers?

ADD REPLY • link 6.9 years ago by AHW ▴ 90

1

1. The general strategy is to use a lower penalty when aligning against an N, since it's not clear whether it really constitutes a mismatch.
2. Yes
3. Yes
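The idea of a lower penalty for N can be sketched with a simple ungapped scoring function. The function name `score_ungapped` and the score values (match +1, mismatch -3, N -1) are assumptions chosen for illustration only; they are not taken from any particular aligner.

```python
def score_ungapped(read, ref_window, match=1, mismatch=-3, n_penalty=-1):
    """Score a read against an equally long reference window, without gaps.

    A position where either base is N gets the milder n_penalty instead of
    the full mismatch penalty, because it may or may not be a true mismatch.
    """
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have equal length")
    score = 0
    for r, g in zip(read.upper(), ref_window.upper()):
        if r == "N" or g == "N":
            score += n_penalty
        elif r == g:
            score += match
        else:
            score += mismatch
    return score

print(score_ungapped("ACGT", "ACGT"))  # perfect match
print(score_ungapped("ACGT", "ACNT"))  # one N in the reference
print(score_ungapped("ACGT", "ACTT"))  # one real mismatch
```

With these illustrative values, a window containing an N scores between a perfect match and a true mismatch.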

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

score 5 · Accepted Answer · 2017-05-20

5

6.9 years ago

Devon Ryan 104k

The only preprocessing you need is to index the genome using the instructions appropriate for the aligner you're going to use.
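To give a rough idea of what "indexing" means, here is a toy k-mer lookup table. This is a deliberately simplified stand-in: widely used aligners build far more compact structures (for example FM-indexes or suffix arrays) with their own indexing commands. The function name `build_kmer_index`, the value of `k`, and the toy sequences are all assumptions for illustration.

```python
from collections import defaultdict

def build_kmer_index(sequences, k=4):
    """Map each k-mer to a list of (sequence_name, 0-based position)."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        seq = seq.upper()  # ignore soft-masking for lookup purposes
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            if "N" in kmer:
                continue  # gap positions carry no sequence information
            index[kmer].append((name, pos))
    return index

# Toy reference with two named sequences.
toy = {"chr1": "NNACGTACGT", "chrY_KI270740v1_random": "TTACGTAA"}
index = build_kmer_index(toy, k=4)
hits = index["ACGT"]
print(hits)
```

Note that every hit carries the sequence name alongside the position, so a lookup can say which sequence a match is on, not just the offset.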

You can get RNA-seq reads from the European Nucleotide Archive (ENA) or the Sequence Read Archive (SRA). Just make sure to select human samples.

ADD COMMENT • link 6.9 years ago by Devon Ryan 104k

0

Thank you for your comments. What about headers like chr1 and chrY_KI270740v1_random, as shown above? Do you mean that they are ignored by any aligner while creating an index? Can I ignore characters like NNNN if I want, and repeats like tttttttttgtttttgtttttgtttctgtttttgttttttgagacagagtctcaccatgtcacccagactggagtgcagtgg? Can you also please comment on why some nucleotides are upper case and some are lower case?

ADD REPLY • link 6.9 years ago by AHW ▴ 90

1

Do you mean that they are ignored while creating an index by any aligner

No, they are the identifiers for the chromosomes/reference sequences and will be used for alignments. Runs of N represent parts of the genome that have not been sequenced (due to limitations of current sequencing technologies) but are there. Identified repeats are generally shown in lower-case characters (see this).
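Assuming the conventions just described (N for unsequenced gaps, lower case for identified repeats), the two kinds of region can be located with regular expressions. This is a small sketch with made-up function names and a made-up toy sequence, not a replacement for proper annotation tracks.

```python
import re

def find_runs(seq, pattern):
    """Return half-open (start, end) intervals, 0-based, for each match."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, seq)]

def gap_runs(seq):
    """Runs of N (or n): sequence that exists but is not known."""
    return find_runs(seq, r"[Nn]+")

def soft_masked_runs(seq):
    """Runs of lower-case a/c/g/t: sequence flagged as repeat."""
    return find_runs(seq, r"[acgt]+")

toy = "NNNNACGTacgtttACGNNacg"
gaps = gap_runs(toy)
repeats = soft_masked_runs(toy)
print("gaps:", gaps)
print("soft-masked:", repeats)
```

The lower-case bases are still real sequence, so an aligner can use them; the case only records that the region was flagged as repetitive.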

If you want ready-to-use aligner indexes, then check out Illumina's iGenomes page. The downloads contain sequence, indexes for bwa/bowtie/bowtie2, and annotation bundles for many model organisms.

ADD REPLY • link 6.9 years ago by GenoMax 141k

0

If you want ready-to-use aligner indexes, then check out Illumina's iGenomes page.

Actually, I am not interested in using prebuilt aligner indexes; I am interested in building one myself by trying something different.

I think whether these identifiers are considered during alignment depends on the aligner, and they can simply be ignored if not needed, though I may be wrong.

ADD REPLY • link 6.9 years ago by AHW ▴ 90

1

Ignoring the identifiers would make the results largely worthless.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Well, can you please point me to some links that discuss the role of identifiers in sequence alignment? I am not a biologist.

ADD REPLY • link 6.9 years ago by AHW ▴ 90

1

Aligning without specifying the 'identifier' would be equivalent to giving directions to your house by giving just the house number and not the street name.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

If you don't have a handle for the sequence that a particular read aligns to, how would you represent that in the alignment results? A biologist needs to know which sequence (location/identifier) the alignment belongs to, along with coordinates that indicate the exact location on that 1D stretch of sequence.
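For someone building their own index over one concatenated string, this point can be made concrete: a raw offset only becomes meaningful once it is translated back to a sequence name plus a position within that sequence. The class name `GenomeCoordinates` and the toy lengths below are hypothetical, purely for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

class GenomeCoordinates:
    """Translate an offset in a concatenated genome to (name, position)."""

    def __init__(self, lengths):
        # lengths: dict of sequence name -> length, in FASTA order
        self.names = list(lengths)
        ends = list(accumulate(lengths.values()))
        self.starts = [0] + ends[:-1]  # global start offset of each sequence
        self.total = ends[-1] if ends else 0

    def locate(self, offset):
        if not 0 <= offset < self.total:
            raise IndexError("offset outside the concatenated genome")
        # Last sequence whose start offset is <= the requested offset.
        i = bisect_right(self.starts, offset) - 1
        return self.names[i], offset - self.starts[i]

# Toy lengths; real values would come from the FASTA file itself.
coords = GenomeCoordinates(
    {"chr1": 10, "chr2": 8, "chrY_KI270740v1_random": 5}
)
print(coords.locate(0))
print(coords.locate(10))
print(coords.locate(20))
```

Without the name table, offset 20 alone could not tell anyone that the hit lies on an unplaced chrY sequence rather than on chr1.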

ADD REPLY • link 6.9 years ago by GenoMax 141k

0

Great. Is it not enough to give the location where a particular read aligns, including forward or reverse strand information? For example:

Reference: AGCTGGCATGCAAAGTCAGTCAAATGCGTACGTCA
Read :     AGCTGGCAT

and then the alignment results would say that the read matches at location 0 of the reference genome on the forward strand, represented as +, and maybe include some more information like read name, read quality, number of copies and so on. What else is needed?

ADD REPLY • link 6.9 years ago by AHW ▴ 90

1

No, such a result without knowing exactly where that is in the genome is useless.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k