Question: How I deal with raw read counts of RNA-seq
0
gravatar for Angel
11 days ago by
Angel3.5k
Angel3.5k wrote:

Hi

This is raw read counts extracted by htseq from STAR alignment

> head(counts[1:2,1:4])
                 Sample1 Sample2 Sample3
1 ENSG00000000003  115  437  380
2 ENSG00000000005    0    0    0
> dim(counts)
[1] 58735    17
>

By this code I tried to find matched gene symbol for ensmbl gene id

>   ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
>   values=rownames(counts)
>    data <- getBM(attributes=c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id", values = values, mart= ensembl)


> merged_data = merge(x = data, y = counts, by.x = colnames(data)[1], by.y = colnames(counts)[1], all = T)

But in resulted matrix for a lot of ensmbl gene id I don't have gene symbol, when I removed them I finished with a smaller matrix

> dim(new_counts)
[1] 25052    16
>

In your experience is it normal? what should I do then? Please help me to get the right matrix for down stream analysis

Thank you

rna-seq star deseq2 • 154 views
ADD COMMENTlink modified 11 days ago by swbarnes26.5k • written 11 days ago by Angel3.5k

See the answer here: What is the difference between transcript id and Ensembl gene id

Do you have transcript variants for genes in your list? That can explain the larger numbers you initially had.

Also see: ENSEMBL annotation file for quantification: which file to use?

ADD REPLYlink modified 11 days ago • written 11 days ago by genomax71k
1
gravatar for dextereyes88
11 days ago by
dextereyes8810
dextereyes8810 wrote:

I prefer to use the bitr function in the clusterProfiler package for mapping gene symbols to ensembl IDs. It will tell you if there is 1:1 matching or many:1 matching, and will print the percentage of non-mapped IDs. https://yulab-smu.github.io/clusterProfiler-book/chapter14.html#bitr

It is certainly possible that the dimensions will change. Especially if the Ensembl IDs in your data set come from a different release of the Ensembl annotation used to map those IDs. Why are you mapping the gene symbols first? Are you doing a targeted analysis? If you are doing a genome wide analysis I would do your DESEq analysis using the ensembl IDs and filter the genes based on some expression threshold across your samples, then add gene annotation using something like BioMart.

ADD COMMENTlink written 11 days ago by dextereyes8810

I know I have used Homo_sapiens.GRCh38.dna.primary_assembly.gtf for extracting raw read counts, then how I tell biomaRt to use the same version for id conversion?

ADD REPLYlink modified 11 days ago • written 11 days ago by Angel3.5k
> ids <- bitr(hg, fromType="ENSEMBL", toType=c("SYMBOL"), OrgDb="org.Hs.eg.db")
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(hg, fromType = "ENSEMBL", toType = c("SYMBOL"), OrgDb = "org.Hs.eg.db") :
  55.7% of input gene IDs are fail to map...

Also htseq returns gene name by --idattr=gene_name , again a lot of non coding genes like AL390728.6. I am not sure should I ignore them like I am doing with id conversion or not

Don't you think I am losing too much information by this conversion?

ADD REPLYlink modified 11 days ago • written 11 days ago by Angel3.5k
1
gravatar for swbarnes2
11 days ago by
swbarnes26.5k
United States
swbarnes26.5k wrote:

I'd try "external_gene_name" instead of hgnc symbols. With 25k things having matching names, it looks like hgnc names might only match protein coding genes, but ensembl IDs will include a lot more than that.

ADD COMMENTlink modified 11 days ago • written 11 days ago by swbarnes26.5k

@ swbarnes2 I done by external_gene_name now I have a lot of AC120036.5 and so on

Do you think I can ignore them and rely on my data gotten by hgnc_symbol

ADD REPLYlink written 11 days ago by Angel3.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 935 users visited in the last hour