Question

Questions about reference genome and alignment

0

Entering edit mode

3.7 years ago

nhaus ▴ 300

Hello everybody,

I am fairly new to bioinformatics, so I have some basic concept questions that I hope you guys can help me to understand.

I hope I'll manage to express myself properly, so you know what my questions are about.

Let's say I download a reference genome, e.g. the new GRCh38. What I get is essentially a huge string (plus headers > 1 ….) of bases and N.

If I understand it correctly, this is 5’ to 3’ nucleotide sequence of the respective chromosome. So here is my first question: Given that the DNA we are sequencing is a double strand, which single strand is depicted on the reference genome? Adding to that question: are different reference sequences e.g. hg38 and GRCh38 using the same single strand?

If so, how was it determined which strand is the “reference stand”.

Now to my next question. Imagine we are doing paired-end DNA seq, an prepare our library, so that the first read should align to the forward strand and the second read should align to the reverse strand (if I am not mistaken it gets represented like this F1R2).

When aligning the reads to my reference sequence, the first read of the pair, should be exactly present in the reference sequence, base for base (given no SNP/INDEL). Theoretically, I could search through my GRCh38.fa file (e.g. with Strg+F) and find the exact match (again, given no SNP/INDEL/softclipping). Is that correct?

For the read 2 of the pair it gets trickier. If I understand it correctly, the reverse complement of the sequenced read 2 should be present in the reference genome. If this is correct, does that mean when using an aligner like bwa mem, read 2 will get converted to the reverse complement on the fly and then aligned?

If this is the case, why don’t we have to tell bwa mem about our library layout (e.g. specify F1R2)? Can the program interfere the layout by trying to align a subset of the reads and look at the alignment rates?

If anybody could help me answer even some of these questions, I’d be extremely thankful!

Cheers!

genome alignment assembly • 2.8k views

ADD COMMENT • link updated 3.7 years ago by JC 13k • written 3.7 years ago by nhaus ▴ 300

1

Entering edit mode

Given that the DNA we are sequencing is a double strand, which single strand is depicted on the reference genome? Adding to that question: are different reference sequences e.g. hg38 and GRCh38 using the same single strand?

Sequencing always happens in 5' --> 3' direction. Strand present in reference happens to be the strand that was assembled from data. There is only one reference sequence build for major model organisms (human/mouse). It is produced by Genome Reference Consortium. GENCODE uses that sequence to annotate official gene features. Places like Ensembl, UCSC may provide additional annotation but the underlying reference sequence is the same everywhere. Only major genome build releases change chromosome co-ordinates. Minor/patches releases don't.

Theoretically, I could search through my GRCh38.fa file (e.g. with Strg+F) and find the exact match (again, given no SNP/INDEL/softclipping). Is that correct?

Since DNA is anti-parallel, each strand has an equal probability of getting sequenced (you are fragmenting DNA before making libraries). So you could do what you are proposing, as long as you do that for both strands in 5' --> 3' direction. Aligners automatically rev-comp your sequences when aligning to reference.

ADD REPLY • link 3.7 years ago by GenoMax 141k

score 3 · Accepted Answer · 2020-08-12

3

Entering edit mode

3.7 years ago

JC 13k

Genomes are reported as you pointed, from 5'->3' in the main chain.
hg38 and GRCh38 are almost the same, but not exactly, you can learn about the differences in assemblies here
Yes, short read alignment checks if the read 1 is on the sequence, but actually it checks both strands because common sequencing is a random selection (so, no strand-specific reads unless you use some special protocols).
After finding a match in the first read, it tries to find a match for the second read, preferably as the reverse chain.
BWA can infer the read orientations sampling some reads

Edited to remove gene orientation preference

ADD COMMENT • link 3.7 years ago by JC 13k

0

Entering edit mode

Thank you very much for your answer! This clears up a lot of things.

Genomes are reported as you pointed, from 5'->3' in the main chain, because there is a bias in the way genes are present in a chromosome, genes prefer one chain over the other, so the '+' chain is the one with more genes in the same orientation.

Regarding this: Did I understand it correctly that the string I am looking at in the reference fasta is the '+' chain, i.e. the strand that has the most genes in the same orientation?

Because I read somewhere else that the sequence in the reference starts at the telomer of the p-arm.

Is that wrong or is it just by chance the same as the '+' strand?

Also, if the '+' strand is always the strand which has the most genes in one orientation, does the direction (e.g. p-arm to q-arm) change between chromosomes?

E.g. the '+' strand of chr1 goes from p-arm to the q-arm, but the '+' strand of chr3 goes from the q-arm to the p-arm

ADD REPLY • link 3.7 years ago by nhaus ▴ 300

0

Entering edit mode

The p-arm and q-arm are defined as the short and long arms in a chromatid, not by the sequence. Here is a good reading to understand this. In addition, you can find more about the GC-skew in DNA.

ADD REPLY • link 3.7 years ago by JC 13k

0

Entering edit mode

Thank you for linking that paper. But if I understood it correctly, the authers propose that the reference strand should be the watson strand. And the watson strand is the strand which has the 5' end at the short arm (p-arm).

So isn't it correct if I am saying that the reference sequence starts at the 5' end of the p-arm for each respective chromosome?

ADD REPLY • link 3.7 years ago by nhaus ▴ 300

0

Entering edit mode

I mean, each consortium/who-sequence-the-organism can decide which is the reported strand

ADD REPLY • link 3.7 years ago by JC 13k

0

Entering edit mode

Genomes are reported as you pointed, from 5'->3' in the main chain, because there is a bias in the way genes are present in a chromosome

@JC This is roughly true for human genome but does not seem to be true for mouse. Do you have a reference for that?

Human genome latest GTF from GENCODE:

$ awk -F "\t" '{if ($3 == "gene" && $7 == "+") {print $0}}' gencode.v34.annotation.gtf | grep "protein_coding" | wc -l
10093
$ awk -F "\t" '{if ($3 == "gene" && $7 == "-") {print $0}}' gencode.v34.annotation.gtf | grep "protein_coding" | wc -l
9866

Mouse genome latest GTF from GENCODE

$ awk -F "\t" '{if ($3 == "gene" && $7 == "+") {print $0}}' gencode.vM25.annotation.gtf | grep "protein_coding" | wc -l
10913
$ awk -F "\t" '{if ($3 == "gene" && $7 == "-") {print $0}}' gencode.vM25.annotation.gtf | grep "protein_coding" | wc -l
10946

ADD REPLY • link 3.7 years ago by GenoMax 141k

0

Entering edit mode

There is a mutation pressure caused by the GC skew (I remember Green's paper about it), the fact is that some genome consortiums assign chromosomes order and orientation based on different criteria.