Question: Different go terms for hg19 vs ensembl
2
gravatar for gregory.l.stone
2.1 years ago by
gregory.l.stone30 wrote:

I am conducting a differential gene expression analysis to determine the difference in the change in gene expression in each sex due to a treatment in human samples. For this type of study would you recommend UCSC or ensembl? I am concerned by the significant difference in GO terms and kegg pathway results that I get from alignment and differential expression analysis using GRCh37 ensembl genome vs hg19 genome for human samples. There was a significant difference between the two all the way through the workflow, which was to be expected since the two reference genomes and annotations are different. I was wondering which one is most reliable? My results are so different between the two that I'm finding it hard to make sense of it all. Any help would be greatly appreciated.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by gregory.l.stone30
2
gravatar for Petr Ponomarenko
2.1 years ago by
United States / Los Angeles / ALAPY.com
Petr Ponomarenko2.6k wrote:

Could you please be more specific with your workflow, tools used and input files. Maybe you can share some of the input files and intermediate results with us? Overal, huge difference in GO terms for same genes in different annotations is strange.

You obviously looked at GRCh37/38(NCBI) vs hg19/hg38(UCSC) , right?

ADD COMMENTlink written 2.1 years ago by Petr Ponomarenko2.6k

Thank you for your response. I did look at that, but have also read other posts that suggest the results shouldnt be so different. I aligned my reads with STAR, and got dissimilar mapping. The following is samtools flagstat output for hg19 then ensembl aligned reads:

29991285 + 702821 in total (QC-passed reads + QC-failed reads)
2278391 + 64385 secondary
0 + 0 supplementary
0 + 0 duplicates
29991285 + 702821 mapped (100.00% : 100.00%)
27712894 + 638436 paired in sequencing
13871636 + 315364 read1
13841258 + 323072 read2
27665798 + 627952 properly paired (99.83% : 98.36%)
27665798 + 627952 with itself and mate mapped
47096 + 10484 singletons (0.17% : 1.64%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)



30213813 + 705544 in total (QC-passed reads + QC-failed reads)
2286418 + 64605 secondary
0 + 0 supplementary
0 + 0 duplicates
30213813 + 705544 mapped (100.00% : 100.00%)
27927395 + 640939 paired in sequencing
13978944 + 316580 read1
13948451 + 324359 read2
27880148 + 630366 properly paired (99.83% : 98.35%)
27880148 + 630366 with itself and mate mapped
47247 + 10573 singletons (0.17% : 1.65%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I then counted genes with featureCounts, and noticed that the ensembl annotation has way more info. The first is output from a couple read summaries with hg19 annotation and then ensembl annotation:

|| Load annotation file /groups/shared_databases/igenome/Homo_sapiens/UCS ... ||
||    Features : 392959                                                       ||
||    Meta-features : 23228                                                   ||
||    Chromosomes/contigs : 47                                                ||
||                                                                            ||
|| Process BAM file  ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 33961661                                              ||
||    Successfully assigned fragments : 14868327 (43.8%)                      ||
||    Running time : 3.29 minutes                                             ||
||                                                                            ||
|| Process BAM file  ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 33899360                                              ||
||    Successfully assigned fragments : 18155450 (53.6%)                      ||
||    Running time : 3.44 minutes                                             ||


|| Load annotation file /groups/shared_databases/igenome/Homo_sapiens/Ens ... ||
||    Features : 1309155                                                      ||
||    Meta-features : 62069                                                   ||
||    Chromosomes/contigs : 244                                               ||
||                                                                            ||
|| Process BAM file ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 34164741                                              ||
||    Successfully assigned fragments : 27688203 (81.0%)                      ||
||    Running time : 0.52 minutes                                             ||
||                                                                            ||
|| Process BAM file ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 34067700                                              ||
||    Successfully assigned fragments : 28410464 (83.4%)                      ||
||    Running time : 0.65 minutes                                             ||

The big difference is obviously that the ensembl annotation contains way more chromosomes/contigs and features. Are these added pieces of information obfuscating my results? Are they unnecessary, or added benefits?

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by gregory.l.stone30
1

could you please use buttons for formatting on top of the message area like "101 010" to format your text? It is very hard to read it at the moment. Thank you

ADD REPLYlink written 2.1 years ago by Petr Ponomarenko2.6k
1

Im sorry, this is my first post!I just made edits for clarity, sorry again

ADD REPLYlink written 2.1 years ago by gregory.l.stone30
1

Now much better =) Thank you

ADD REPLYlink written 2.1 years ago by Petr Ponomarenko2.6k
1

What biological questions are you trying to answer with this analysis?

Ensebl has an annotation with many more features. This is well known and ok. You may open say RB1 gene in NSBI or UCSC and compare how many more data is in Ensembl for it. This is ok when you know what and why you are doing. Say for clinical genetics you are most interested only in one very particular transcript of RB1 and it is present in both annotations and almost identical. So all of other data is irrelevant for the first run of analysis in both annotations.

ADD REPLYlink written 2.1 years ago by Petr Ponomarenko2.6k

I am conducting a differential gene expression analysis to determine the difference in the change in gene expression in each sex due to a treatment. For this type of study would you recommend UCSC or ensembl?

ADD REPLYlink written 2.1 years ago by gregory.l.stone30
1

humans? If you want to publish with others able to use it in clinical research or settings, I would recommend reading about ACMG guidelines, about LRG and, personally, I would go with longest transcripts from RefSeq for each protein-coding genes first. If nothing interesting or meaningful will be found ask Biostars once again. How many reads you have? What is the treatment? What is the coverage? Be aware of repetitive and low-complexity regions.

ADD REPLYlink written 2.1 years ago by Petr Ponomarenko2.6k

Yes, human samples. I have a paired study, with an average of 50 million paired end reads per sample. Thank you very much for your thoughtful answers and for being so generous with your time!

ADD REPLYlink written 2.1 years ago by gregory.l.stone30

you are welcome, gregory.l.stone =)

ADD REPLYlink written 2.1 years ago by Petr Ponomarenko2.6k

I think your question might be useful for many others. Could you please edit your question so it is easier for others to understand the scope of it and mark it one of the answers as accepted so others now that it was resolved. Maybe you can also summarize what you will find in ACMG guidelines and LRG later and add it here as well for others.

Also, I would love if others double-check my answers. My understanding can be not the best. But from my experience RefSeq and HGVS are very important for clinical people to understand any findings.

Thank you.

ADD REPLYlink written 2.1 years ago by Petr Ponomarenko2.6k

Will do, thank you again for the help

ADD REPLYlink written 2.1 years ago by gregory.l.stone30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 901 users visited in the last hour