Hi
These are raw read counts produced by HTSeq from STAR alignments:
> head(counts[1:2,1:4])
                  Sample1 Sample2 Sample3
1 ENSG00000000003     115     437     380
2 ENSG00000000005       0       0       0
> dim(counts)
[1] 58735 17
>
With the following code I tried to find the matching gene symbol for each Ensembl gene ID:
> library(biomaRt)
> ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> values = counts[, 1]  # gene IDs are in the first column (row names are just 1, 2, ...)
> data <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id", values = values, mart = ensembl)
> merged_data = merge(x = data, y = counts, by.x = colnames(data)[1], by.y = colnames(counts)[1], all = TRUE)
But in the resulting matrix, many Ensembl gene IDs have no gene symbol. When I removed them, I ended up with a much smaller matrix:
> dim(new_counts)
[1] 25052 16
>
In your experience, is this normal? What should I do? Please help me get the right matrix for downstream analysis.
Thank you
See the answer here: What is the difference between transcript id and Ensembl gene id
Do you have transcript variants for genes in your list? That can explain the larger numbers you initially had.
Also see: ENSEMBL annotation file for quantification: which file to use?
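A rough way to check the points above before dropping any rows, sketched on toy data rather than the real table. The column name gene_id, the two ENSG99999999991/ENSG99999999992 IDs and all count values are made up for illustration. The annot data frame only imitates what a getBM() result might look like; as far as I know, biomaRt returns an empty string when a gene has no HGNC symbol, but treat that as an assumption and verify it on your own output.

```r
# Toy stand-ins for the real objects (hypothetical values, not the poster's data).
counts <- data.frame(
  gene_id = c("ENSG00000000003", "ENSG00000000005",
              "ENSG99999999991", "ENSG99999999992"),  # last two IDs are made up
  Sample1 = c(115, 0, 3, 7),
  Sample2 = c(437, 0, 1, 2),
  stringsAsFactors = FALSE
)

# Imitation of an annotation result: one ID comes back with an empty symbol,
# and one ID from counts is not returned at all.
annot <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005", "ENSG99999999991"),
  hgnc_symbol     = c("TSPAN6", "TNMD", ""),
  stringsAsFactors = FALSE
)

# 1. Are these gene IDs (ENSG prefix) or transcript IDs (ENST prefix)?
id_type <- table(substr(counts$gene_id, 1, 4))
print(id_type)

# 2. Does any ID occur more than once in the count table?
n_dup <- sum(duplicated(counts$gene_id))
print(n_dup)

# 3. Keep every row of counts, then classify why a symbol is missing.
merged <- merge(annot, counts,
                by.x = "ensembl_gene_id", by.y = "gene_id",
                all.y = TRUE)
status <- ifelse(is.na(merged$hgnc_symbol), "not_returned",
          ifelse(merged$hgnc_symbol == "", "no_symbol", "has_symbol"))
status_counts <- table(status)
print(status_counts)
```

If nearly all of the removed rows fall into the "no_symbol" group, the IDs were found in Ensembl and simply lack an HGNC symbol. A large "not_returned" group would instead suggest that the IDs do not match the Ensembl release being queried.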