Question: how to annotate RNA-seq
0
gravatar for Learner
5 weeks ago by
Learner 80
Learner 80 wrote:

I have many files like this

ENSG00000000003.13  366
ENSG00000000005.5   26
ENSG00000000419.11  1905
ENSG00000000457.12  775
ENSG00000000460.15  377
ENSG00000000938.11  316
ENSG00000000971.14  272
ENSG00000001036.12  1726
ENSG00000001084.9   2479
ENSG00000001167.13  1166
ENSG00000001460.16  38
ENSG00000001461.15  298

They are all htseq-count.

I have two questions

1- how can I convert the ID to gene names? do you have any solution available (I can use any language) ? 2- what would you use for normalizing them?

rna-seq • 282 views
ADD COMMENTlink modified 5 weeks ago by Devon Ryan76k • written 5 weeks ago by Learner 80
1

The normalization will depend of what you want to do next. By exemple, if you want to apply a differential expression analysis, some tools have there own normalization (Deseq2 use RLE, EdgeR use TMM...)

ADD REPLYlink written 5 weeks ago by Bastien Hervé180

Hi Learner,

Please provide some feedback on your previous threads before opening new threads.
If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLYlink written 5 weeks ago by WouterDeCoster26k

@WouterDeCoster please give a proper solution and I will absolutely like and accept it. I don't accept stuff which does not help. the main purpose of this web is to learn and help (at least this is in my case)

ADD REPLYlink written 5 weeks ago by Learner 80
2
gravatar for erwan.scaon
5 weeks ago by
erwan.scaon410
Limoges - CBRS - France
erwan.scaon410 wrote:

To answer your first question :

Create a TSV file with your input => list.tsv

Remove ".13" etc from your ENSG (list.tsv) => list_ensembl_gid.tsv

awk -F '\t' -v OFS='\t' '{sub(/\.[0-9]*/, "", $1)} 1' list.tsv > list_ensembl_gid.tsv;

Then use R & Biomart to retrieve corresponding genes names

R
data = read.table("list_ensembl_gid.tsv")
colnames(data)[1] <- "ensembl_gene_id"
colnames(data)[2] <- "counts"

library('biomaRt')

hsapiens = useMart("ensembl",
                    dataset="hsapiens_gene_ensembl")

hsapiens_infos <- getBM(attributes=c('ensembl_gene_id',
                                     'external_gene_name'),
                        mart = hsapiens)

merge_infos <- merge(x = data,
                     y = hsapiens_infos,
                     by = "ensembl_gene_id",
                     all.x = TRUE)

head(merge_infos)

ensembl_gene_id counts external_gene_name
ENSG00000000003 366 TSPAN6
ENSG00000000005 26 TNMD
ENSG00000000419 1905 DPM1
ENSG00000000457 775 SCYL3
ENSG00000000460 377 C1orf112
ENSG00000000938 316 FGR

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by erwan.scaon410

@erwan.scaon I use the same approach. However, I am losing many genes I have about 200000 gene ID and when I use it, it will shrink to 15000 so I lose like 5000. do you have any idea about this?

ADD REPLYlink written 5 weeks ago by Learner 80
0
gravatar for Devon Ryan
5 weeks ago by
Devon Ryan76k
Freiburg, Germany
Devon Ryan76k wrote:
  1. You can download the mapping from biomart (you want the "gene stable ID version" and the "gene name"). You can merge them in R or any other language.
  2. You probably want to use something like DESeq2, but without more details it's impossible to know what normalization would be appropriate.
ADD COMMENTlink written 5 weeks ago by Devon Ryan76k

@Devon Ryan thank you very much for your reply. The problem with biomart is that I have about 200000 gene ID and when I use it, it will shrink to 15000 so I lose like 5000 (I dont know why).

I basically want to find upregulated and down regulated genes from transcriptome (based on htseq-count) what do you suggest now?

ADD REPLYlink written 5 weeks ago by Learner 80

There aren't 200000 genes, perhaps you meant 20000? Either way, presumably you're using an old annotation (e.g., from hg19), in which case you'll have to use the appropriate archive version of biomart.

ADD REPLYlink written 5 weeks ago by Devon Ryan76k

@Devon Ryan Yes 20000 roughly. I use hsapiens_gene_ensembl , can you please tell me how to find the newest version? and also how would you normalize them?

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Learner 80

I linked to the most recent version, you're likely using an old one. Post some of the IDs that don't have a match and we can have a look.

ADD REPLYlink written 5 weeks ago by Devon Ryan76k

@Devon Ryan can you please tell me how to identify which one is recognised and which one not? I am using the above R code. If you can tell me I will post some of the IDs

ADD REPLYlink written 5 weeks ago by Learner 80

some_data_frame$ID[which(!(some_data_frame$ID %in% annotation$ID))] or something along those lines.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Devon Ryan76k

@Devon Ryan for example these ones

ENSG00000122718
ENSG00000130201
ENSG00000150076
ENSG00000150526
ENSG00000155640
ENSG00000166748
ENSG00000168260
ENSG00000168787
ENSG00000170590
ENSG00000170803
ENSG00000171484
ENSG00000172381
ENSG00000172774
ADD REPLYlink written 5 weeks ago by Learner 80

As I suspected, you're using a very old annotation.

ADD REPLYlink written 5 weeks ago by Devon Ryan76k

@Devon Ryan can you please tell me how to make it new ? how to use the new one?

ADD REPLYlink written 5 weeks ago by Learner 80

Just download the new version and use it. Note that you'll presumably need to realign your data if you aligned it to hg19 as well. If you don't want to do that then use a GTF file from hg19.

ADD REPLYlink written 5 weeks ago by Devon Ryan76k

@Devon Ryan newer version ? I downloaded the latest version and still the same issue. I cannot download the raw files to reanalyse. I only have access to htseq-count. As you know there is no way to annotate them, I have been asking for this from various people and no one had a clue how to annotate them. It seems like there is no solution for this problem at the moment.

ADD REPLYlink written 4 weeks ago by Learner 80

If you've downloaded htseq-count files with hg19 IDs then you just need to download the hg19 ID->Gene mapping from the archived version of biomart.

ADD REPLYlink written 4 weeks ago by Devon Ryan76k

@Devon Ryan can you please show me how you convert the above genes? I appreciate your comment but I have been trying to do all what you said with no sucess. So I would appreciate if you can show me how you do it

ADD REPLYlink written 4 weeks ago by Learner 80
  1. Go here.
  2. Select "Ensembl Genes 91", then "Human genes"
  3. Under "Attributes", deselect "transcript ID" and select "gene name".
  4. Save the results

Note that Ensembl seems to be having technical issues at the moment, so you might have to wait until the morning here in Europe.

ADD REPLYlink written 4 weeks ago by Devon Ryan76k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 725 users visited in the last hour