how to annotate RNA-seq
2
1
Entering edit mode
4.7 years ago
Learner ▴ 250

I have many files like this

ENSG00000000003.13  366
ENSG00000000005.5   26
ENSG00000000419.11  1905
ENSG00000000457.12  775
ENSG00000000460.15  377
ENSG00000000938.11  316
ENSG00000000971.14  272
ENSG00000001036.12  1726
ENSG00000001084.9   2479
ENSG00000001167.13  1166
ENSG00000001460.16  38
ENSG00000001461.15  298


They are all htseq-count.

I have two questions

1- how can I convert the ID to gene names? do you have any solution available (I can use any language) ? 2- what would you use for normalizing them?

RNA-Seq • 4.1k views
1
Entering edit mode

The normalization will depend of what you want to do next. By exemple, if you want to apply a differential expression analysis, some tools have there own normalization (Deseq2 use RLE, EdgeR use TMM...)

0
Entering edit mode

Hi Learner,

0
Entering edit mode

@WouterDeCoster please give a proper solution and I will absolutely like and accept it. I don't accept stuff which does not help. the main purpose of this web is to learn and help (at least this is in my case)

3
Entering edit mode
4.7 years ago
erwan.scaon ▴ 900

Create a TSV file with your input => list.tsv

Remove ".13" etc from your ENSG (list.tsv) => list_ensembl_gid.tsv

awk -F '\t' -v OFS='\t' '{sub(/\.[0-9]*/, "", $1)} 1' list.tsv > list_ensembl_gid.tsv;  Then use R & Biomart to retrieve corresponding genes names R data = read.table("list_ensembl_gid.tsv") colnames(data)[1] <- "ensembl_gene_id" colnames(data)[2] <- "counts" library('biomaRt') hsapiens = useMart("ensembl", dataset="hsapiens_gene_ensembl") hsapiens_infos <- getBM(attributes=c('ensembl_gene_id', 'external_gene_name'), mart = hsapiens) merge_infos <- merge(x = data, y = hsapiens_infos, by = "ensembl_gene_id", all.x = TRUE) head(merge_infos)  ensembl_gene_id counts external_gene_name ENSG00000000003 366 TSPAN6 ENSG00000000005 26 TNMD ENSG00000000419 1905 DPM1 ENSG00000000457 775 SCYL3 ENSG00000000460 377 C1orf112 ENSG00000000938 316 FGR ADD COMMENT 0 Entering edit mode @erwan.scaon I use the same approach. However, I am losing many genes I have about 200000 gene ID and when I use it, it will shrink to 15000 so I lose like 5000. do you have any idea about this? ADD REPLY 1 Entering edit mode 4.7 years ago 1. You can download the mapping from biomart (you want the "gene stable ID version" and the "gene name"). You can merge them in R or any other language. 2. You probably want to use something like DESeq2, but without more details it's impossible to know what normalization would be appropriate. ADD COMMENT 0 Entering edit mode @Devon Ryan thank you very much for your reply. The problem with biomart is that I have about 200000 gene ID and when I use it, it will shrink to 15000 so I lose like 5000 (I dont know why). I basically want to find upregulated and down regulated genes from transcriptome (based on htseq-count) what do you suggest now? ADD REPLY 0 Entering edit mode There aren't 200000 genes, perhaps you meant 20000? Either way, presumably you're using an old annotation (e.g., from hg19), in which case you'll have to use the appropriate archive version of biomart. ADD REPLY 0 Entering edit mode @Devon Ryan Yes 20000 roughly. I use hsapiens_gene_ensembl , can you please tell me how to find the newest version? and also how would you normalize them? ADD REPLY 0 Entering edit mode I linked to the most recent version, you're likely using an old one. Post some of the IDs that don't have a match and we can have a look. ADD REPLY 0 Entering edit mode @Devon Ryan can you please tell me how to identify which one is recognised and which one not? I am using the above R code. If you can tell me I will post some of the IDs ADD REPLY 0 Entering edit mode some_data_frame$ID[which(!(some_data_frame$ID %in% annotation$ID))] or something along those lines.

0
Entering edit mode

@Devon Ryan for example these ones

ENSG00000122718
ENSG00000130201
ENSG00000150076
ENSG00000150526
ENSG00000155640
ENSG00000166748
ENSG00000168260
ENSG00000168787
ENSG00000170590
ENSG00000170803
ENSG00000171484
ENSG00000172381
ENSG00000172774

0
Entering edit mode

As I suspected, you're using a very old annotation.

0
Entering edit mode

@Devon Ryan can you please tell me how to make it new ? how to use the new one?

0
Entering edit mode

Just download the new version and use it. Note that you'll presumably need to realign your data if you aligned it to hg19 as well. If you don't want to do that then use a GTF file from hg19.

0
Entering edit mode

0
Entering edit mode

If you've downloaded htseq-count files with hg19 IDs then you just need to download the hg19 ID->Gene mapping from the archived version of biomart.

0
Entering edit mode

@Devon Ryan can you please show me how you convert the above genes? I appreciate your comment but I have been trying to do all what you said with no sucess. So I would appreciate if you can show me how you do it

0
Entering edit mode
1. Go here.
2. Select "Ensembl Genes 91", then "Human genes"
3. Under "Attributes", deselect "transcript ID" and select "gene name".
4. Save the results

Note that Ensembl seems to be having technical issues at the moment, so you might have to wait until the morning here in Europe.