Question: How to get a human reference genome that is ready to use
0
3.2 years ago by
AHW40
India
AHW40 wrote:

I am looking for a human reference genome. I tried to download the genome from the link

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/

and the file I downloaded is hg38.fa. I want to use the reference genome to align RNA-seq reads against it. Looking into hg38.fa after the download, I found chromosome headers at the start of the file, such as

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

and at the end of the file I found something like

>chrY_KI270740v1_random
TAATAAATTTTGAAGAAAATGAAGACTGTGTTCTCAGTTCCAGGTGCTTC
ATCAGGCTCATTGTGGATCCAGACTACCAGACACAAGACATTACACATTG
TAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAG
AATTTTATAATGTTTGGAAAAATACATAGAGGCTTACTTTTTATTTTATT
TTTTTGAGATAGGAAGCCtttttttttgtttttgtttttgtttctgtttt
tgttttttgagacagagtctcaccatgtcacccagactggagtgcagtgg
tgcaatatcggcccattgcaagctccacatcccaggttcacaccattctc
ctgcctcagcctcccaagtagctgggactacaggtgcccgccaccacatc
cagctaatttttttttgtacttttagtagagacggggtatcaccatgtga
gccaagatggtctccatctcctgacctcgtgatctgcccaccttggcctc
ccaaagtgctgggattacaggggtgagccaccacgcccagGCATAGAGGC
ACTTTTAACCATAAATGAACACTGTTATGATTTGTATTACCACAGTATCA
TTATTCTGTCCTGTTTGCCTTACAttttatttatttattatactgtaagt
tctgggatacatgtgcagaatgtgcaggtttgttacagagatatatgctt
gtttgctgcacctgtcagtttttcatctacattaggtatttctcctaatg
ctattccctgttaggtccccaccctccaacagtctccagtgtttgatgtt
cccctccctatgtccatgtattctcattttacaactcccacctatgagtg
agaaattgcagtgtttgTGtgtttggaacttattccttccagtgggtttg
tggtctcgctcactgcaaaaatgaagctgtagaccgtttcggtgtgtgtt
acaactcttaaaggtggtgtgtctggagtttgctacttcacatgagctca
tggtcttgcttacttcaagaatgaagctgcagacatttacggtgagtgtt

I am not sure whether I can use this reference genome as it is, or whether any preprocessing is required before using it for read alignment. I would also like to know where I can get RNA-seq reads that can be aligned to this reference genome.

rna-seq alignment genome • 1.3k views
ADD COMMENTlink modified 10 months ago by Biostar ♦♦ 20 • written 3.2 years ago by AHW40
1

I would also like to know from where I can get RNA-seq reads.

From NCBI SRA/EBI ENA.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax87k
5
3.2 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

The only preprocessing you need is to index the genome using the instructions appropriate for the aligner you're going to use.

You can get RNA-seq reads from the European Nucleotide Archive (ENA) or the Sequence Read Archive (SRA). Just make sure to select human samples.

ADD COMMENTlink written 3.2 years ago by Devon Ryan96k

Thank you for your comments. What about the headers like chr1 and chrY_KI270740v1_random shown above? Do you mean that they are ignored by any aligner while creating an index? Can I ignore characters like NNNN if I want, and repetitions like tttttttttgtttttgtttttgtttctgtttttgttttttgagacagagtctcaccatgtcacccagactggagtgcagtgg? Can you also please comment on why some nucleotides are upper case and some are lower case?

ADD REPLYlink written 3.2 years ago by AHW40
1

Do you mean that they are ignored while creating an index by any aligner

No, they are the identifiers for the chromosomes/reference sequences and will be used in the alignments. Runs of N represent parts of the genome that have not been sequenced (due to limitations of current sequencing technologies) but are there. Identified repeats are generally shown in lower-case characters (see this).

If you want ready-to-use aligner indexes, then check out Illumina's iGenomes page. The downloads contain sequences, indexes for bwa/bowtie/bowtie2, and annotation bundles for many model organisms.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax87k
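To make the points above concrete, here is a minimal Python sketch (not taken from any aligner) that reads FASTA records, keeps the header as the sequence identifier, and counts N bases and lower-case (soft-masked repeat) bases per record. The helper names `read_fasta` and `summarize` and the toy input are assumptions made for illustration; a real hg38.fa would be opened with `open("hg38.fa")` instead of the in-memory string.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first word after '>'.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(seq):
    """Count total length, N bases, and lower-case (soft-masked) bases."""
    n_bases = seq.count("N") + seq.count("n")
    soft_masked = sum(1 for base in seq if base.islower() and base != "n")
    return {"length": len(seq), "n_bases": n_bases, "soft_masked": soft_masked}


# Toy stand-in for hg38.fa (hypothetical content, real header style).
toy = StringIO(">chr1\nNNNNACGT\nacgtACGT\n>chrY_KI270740v1_random\nTTTTtttt\n")
summary = {name: summarize(seq) for name, seq in read_fasta(toy)}
for name, stats in summary.items():
    print(name, stats)
```

Lower-case bases are still ordinary sequence; the case only marks them as repeats, so most tools treat `a` and `A` as the same base.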

If you want ready-to-use aligner indexes, then check out Illumina's iGenomes page.

Actually, I am not interested in using prebuilt aligner indexes; I want to build one myself by trying something different.

I think it depends on the aligner whether these identifiers are considered during alignment, and that they can simply be ignored if not needed, though I may be wrong.

ADD REPLYlink written 3.2 years ago by AHW40
1

Ignoring the identifiers would make the results largely worthless.

ADD REPLYlink written 3.2 years ago by Devon Ryan96k

Well, can you please provide some links that discuss the role of identifiers in sequence alignment? I am not a biologist.

ADD REPLYlink written 3.2 years ago by AHW40
1

Aligning without specifying the 'identifier' would be equivalent to giving directions to your house by giving only the house number and not the street name.

ADD REPLYlink written 3.2 years ago by WouterDeCoster44k

If you don't have a handle that says which sequence a particular read aligned to, how would you represent that in the alignment results? A biologist needs to know which sequence (location/identifier) the alignment belongs to, along with coordinates that indicate the exact location on that 1D stretch of sequence.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax87k

Great. Is it not enough to give the location where a particular read aligns, including forward or reverse strand information? For example:

Reference: AGCTGGCATGCAAAGTCAGTCAAATGCGTACGTCA
Read :     AGCTGGCAT

Then the alignment results would say that the read matches at location 0 of the reference genome on the forward strand, represented as +, and maybe include some more information like read name, read quality, number of copies and so on. What else is needed?

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by AHW40
1

No, such a result without knowing exactly where that is in the genome is useless.

ADD REPLYlink written 3.2 years ago by Devon Ryan96k
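A small sketch of why the identifier matters, assuming a toy exact-match search rather than a real aligner: the reference is a dictionary of named sequences, and every hit is reported as (sequence name, 0-based position, strand). The `chr1` sequence is the example reference from the question; `chr2`, `chr3`, and the helper names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_hits(read, reference):
    """Report every exact match as (sequence name, 0-based start, strand)."""
    hits = []
    queries = (("+", read.upper()), ("-", reverse_complement(read)))
    for name, seq in reference.items():
        seq = seq.upper()
        for strand, query in queries:
            start = seq.find(query)
            while start != -1:
                hits.append((name, start, strand))
                start = seq.find(query, start + 1)
    return hits


reference = {
    "chr1": "AGCTGGCATGCAAAGTCAGTCAAATGCGTACGTCA",  # example from the question
    "chr2": "AGCTGGCATTTTTT",                       # hypothetical
    "chr3": "TTTTATGCCAGCTTTTT",                    # hypothetical, reverse strand
}
hits = find_hits("AGCTGGCAT", reference)
for name, start, strand in hits:
    print(name, start, strand)
```

Here the read matches at position 0 on two different sequences, so "location 0, forward strand" alone cannot say which one is meant. Alignment formats such as SAM record the reference sequence name next to the position for this reason.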
1
3.2 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Just to add an explanation to what Devon wrote: what you find in an assembly like the one you downloaded is not "the genome", but a best-effort model of the genome that tries to represent "the genome" in a single linear string. This approach has many imperfections: gaps, ambiguity (non-ACGT characters), variation, alternative sequences, regions with repeats of sometimes unknown length, and bits of sequence that could not be placed on a chromosome, or not at an exact position, like the >chrY_KI270740v1_random sequence. An indexing method needs to be able to deal with all or most of these. With all this variation and more individual genomes being sequenced, a graph might be a much better representation of "the genome".

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Michael Dondrup47k
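As a rough illustration of the sequence types mentioned above, the following sketch groups identifiers by the naming pattern used in UCSC's hg38 files: plain chromosomes, `_random` (chromosome known, exact position not), `chrUn_` (chromosome unknown), and `_alt` (alternative sequences). The function name `classify` and the category labels are assumptions for this example, and apart from chrY_KI270740v1_random the names in the demo list are only illustrative.

```python
import re

# chr1..chr22, chrX, chrY, chrM written as simple names.
PRIMARY = re.compile(r"^chr([0-9]{1,2}|X|Y|M)$")


def classify(name):
    """Assign a coarse category to a UCSC-style sequence identifier."""
    if PRIMARY.match(name):
        return "primary"
    if name.endswith("_random"):
        return "unlocalized"  # chromosome known, exact position unknown
    if name.startswith("chrUn_"):
        return "unplaced"     # chromosome unknown
    if name.endswith("_alt"):
        return "alternate"    # alternative sequence for a region
    return "other"


names = ["chr1", "chrM", "chrY_KI270740v1_random",
         "chrUn_KI270302v1", "chr1_KI270762v1_alt"]
categories = {name: classify(name) for name in names}
for name, category in categories.items():
    print(f"{name}\t{category}")
```

Whether to keep or drop the non-primary sequences before indexing is a choice that depends on the analysis; the sketch only shows how they can be told apart by name.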

Thank you for the answer. From the above comments and answers, I now understand the importance of having the chromosomes as identifiers in the sequence alignment process. I also understand that the genome model has many faults, such as gaps, ambiguity, repeated regions, and sequences with uncertain chromosome positions like >chrY_KI270740v1_random, and that any aligner should take care of all this.

In general, what is sensible to do with all these situations:

  1. Should I ignore any gap in the reference genome?
  2. Should I consider repeated regions of the reference genome if any read falls there?
  3. Are identifiers like >chrY_KI270740v1_random valid identifiers and should be reported in the alignment results like other valid identifiers?
ADD REPLYlink written 3.2 years ago by AHW40
1
  1. The general strategy is to use a lower penalty when aligning against an N, since it's not clear whether it really constitutes a mismatch.
  2. Yes
  3. Yes
ADD REPLYlink written 3.2 years ago by Devon Ryan96k
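The first point can be sketched as an ungapped scoring function in which a position involving N costs less than a true mismatch. The function name and the score values (`match=1`, `mismatch=-4`, `n_penalty=-1`) are hypothetical choices for illustration, not the defaults of any particular aligner.

```python
def score_ungapped(read, ref_window, match=1, mismatch=-4, n_penalty=-1):
    """Score a read against an equally long reference window without gaps.

    Positions where either base is N get the milder n_penalty, because it
    is unknown whether they are really mismatches.
    """
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have equal length")
    score = 0
    for read_base, ref_base in zip(read.upper(), ref_window.upper()):
        if read_base == "N" or ref_base == "N":
            score += n_penalty
        elif read_base == ref_base:
            score += match
        else:
            score += mismatch
    return score


perfect = score_ungapped("ACGT", "ACGT")    # four matches
against_n = score_ungapped("ACGT", "ACNT")  # three matches, one N
mismatch = score_ungapped("ACGT", "ACTT")   # three matches, one mismatch
print(perfect, against_n, mismatch)
```

With these values, the window containing an N ranks below a perfect match but above a window with a real mismatch. Upper-casing both strings also means soft-masked (lower-case) repeats are scored like any other sequence, in line with the second point.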