Question

How to find specific genes in an RNA-Seq dataset?

0

4.0 years ago

adi441994 • 0

Hello, I am new to bioinformatics, so apologies if this question is too obvious.

I need to analyze the RNA-Seq data from GSE98455.

This is an RNA-Seq dataset for rice, in the following format:

|---------------------|------------------|
|    Some Id          |      Counts      | 
|---------------------|------------------|
|    13101.t00001     |       392        |
|---------------------|------------------|
|    13101.t00002     |        20        |
|---------------------|------------------|

The platform for this data is Illumina HiSeq 2500.

My question is: how do I map a particular rice gene to the ID column so that I can extract the corresponding counts? For example, if I want to find the count for the gene OsNAC6, how do I map it to the ID column?

Thank you for your insights.

RNA-Seq • 1.2k views

updated 4.0 years ago by piyushjo ▴ 700 • written 4.0 years ago by adi441994 • 0

1

You can find the gene name and gene ID in the annotation that was used for the alignment. I couldn't see in the associated paper which annotation they used, but they mention "Oryza sativa japonica reference genome v 6". Try finding the genome annotation associated with that, or contact the authors and ask them for the gene ID and gene name file.

4.0 years ago by piyushjo ▴ 700

0

I see. I will try to look for the genome; if I can't find it, I will get in touch with the authors. Thank you for the directions, I wouldn't have figured this out myself.

4.0 years ago by adi441994 • 0

1

I looked a little more. These IDs seem to come from the MSU v6 annotation (MSU Rice Genome Annotation Project osa1r6). I couldn't find an annotation file for it, and I tried looking in the rice database with no luck. I think approaching the authors would be the easiest and fastest way. Good luck!

4.0 years ago by piyushjo ▴ 700

0

I searched for it as well, with no luck. I reached out to the author; hopefully they can help me out. Is it usual to provide datasets without such key files? Or is it a security thing?

4.0 years ago by adi441994 • 0

1

They are required to upload raw and processed data, but nobody checks whether the processed data is actually useful by itself. Since it is the rice genome, the information is very hard to find. When I googled the ID you mentioned, I found a post where that ID was listed for the rice genome and came from an annotation file (.gff), but I couldn't locate the .gff file on Ensembl or MSU. If you don't get a response from the authors, try posting again with a new title asking where to get the GFF or GTF file for MSU v6; I think that will help you. If a bioinformatician, or you yourself, can re-align the raw sequencing files, then you don't need to find this specific annotation file.

4.0 years ago by piyushjo ▴ 700

0

Thank you for the detailed response, you have been kind and generous. I will update here if I get a response from the author. I will also reach out to the bioinformatician on our team; perhaps she can help out. It is sad that there is data, but it is not usable.

4.0 years ago by adi441994 • 0

score 0 · Answer 1 · 2020-05-02

Found it.

This is the link to the genome v6 data: http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.0/all.dir/

The annotation file, which contains the gene ID and gene name, can be downloaded here:

http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_6.0/all.dir/all.gff3.gz
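If you want to pull the IDs and names out yourself, here is a minimal Python sketch. It assumes the file follows the standard GFF3 layout (nine tab-separated columns, with `ID=` and `Name=` keys in the last column). I have not checked that the v6 file uses exactly these keys or the `gene`/`mRNA` feature types, so treat those as assumptions and look at a few lines of the file first. The function names are just illustrative. Also note that `Name` may hold a description rather than a symbol like OsNAC6.

```python
import gzip
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes special characters (e.g. %20 for a space).
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def gene_table(lines, feature_types=("gene", "mRNA")):
    """Yield (ID, Name) pairs for rows whose type column is in feature_types.

    The feature types and the ID/Name keys are assumptions; adjust them
    after inspecting the real file.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            yield attrs["ID"], attrs.get("Name", "")


def read_gff3(path):
    """Stream (ID, Name) pairs from a gzip-compressed GFF3 file."""
    with gzip.open(path, "rt") as handle:
        yield from gene_table(handle)


# Example use, after downloading all.gff3.gz to the working directory:
# for gene_id, name in read_gff3("all.gff3.gz"):
#     print(gene_id, name, sep="\t")
```

Redirecting the printed output to a file gives a two-column, tab-delimited ID-to-name table.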

Ask a bioinformatician to extract the gene ID and gene name into a .csv or tab-delimited file that you can open in Excel. Beware: Excel sometimes recognizes values as dates and converts them. For example, it converts the human gene MARCH3 into a date, so it is no longer text. If you see dates in Excel such as 1-Mar or 7-Sep, they were probably gene names that resemble date formats.
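Once the gene ID and gene name are in a tab-delimited file, the lookup from the question can also be done without Excel. The sketch below is only an illustration: it assumes both the mapping and the counts table are two-column, tab-separated text, and the gene names `exampleGeneA` and `exampleGeneB` are made up (the two IDs and counts are the ones shown in the question). I do not know which ID OsNAC6 actually corresponds to.

```python
import csv
import io


def load_mapping(handle):
    """Read tab-separated (ID, name) rows into a dict: lowercased name -> list of IDs."""
    by_name = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        gene_id, name = row[0].strip(), row[1].strip()
        by_name.setdefault(name.lower(), []).append(gene_id)
    return by_name


def load_counts(handle):
    """Read tab-separated (ID, count) rows; rows without an integer count are skipped."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[1].strip().isdigit():
            counts[row[0].strip()] = int(row[1])
    return counts


def counts_for_gene(name, by_name, counts):
    """Return {ID: count} for every ID annotated with the given name."""
    ids = by_name.get(name.lower(), [])
    return {gene_id: counts[gene_id] for gene_id in ids if gene_id in counts}


# In-memory stand-ins for the two files; the gene names are hypothetical.
demo_mapping = io.StringIO("13101.t00001\texampleGeneA\n13101.t00002\texampleGeneB\n")
demo_counts = io.StringIO("Some Id\tCounts\n13101.t00001\t392\n13101.t00002\t20\n")

name_to_ids = load_mapping(demo_mapping)
id_to_count = load_counts(demo_counts)
print(counts_for_gene("exampleGeneA", name_to_ids, id_to_count))
```

With real files, pass `open("mapping.tsv", newline="")` and `open("counts.tsv", newline="")` instead of the `StringIO` objects (both file names are placeholders). Reading the table this way keeps gene names as plain text, so nothing is converted to a date.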

If you plan to realign the data, use the newer v7 genome and annotation.

Hope this helps! I posted this as an answer now, because this is my answer :D