Hi all, I am working on a microarray data analysis pipeline. I want to create a database of protein-coding genes, i.e. all human RefSeq genes. Where can I get these genes, or is there a way to build such a database myself? Please help me.
The easiest way to get symbols and information on protein-coding genes is through the NCBI Gene resource page. Here is a link to the RefSeq protein-coding genes, which you can regenerate with this query:
"Homo sapiens"[Organism] AND ("genetype protein coding"[Properties] AND "srcdb refseq"[Properties] AND alive[prop])
to download the output table as a file, just click
Send to: at the top of the page and select
If you are trying to download the sequences of all protein-coding transcripts, go to this page and use the 'Download Assembly' button, choose 'RefSeq' as the source, and download the 'RNA FASTA (.fna)' file. This file contains both non-coding and coding transcript sequences. You can then use seqkit to extract the protein-coding transcripts, whose RefSeq accessions start with NM_ (curated mRNA) or XM_ (predicted mRNA), as follows:
seqkit grep -r -p '[NX]M_\d+\.\d+' GCF_000001405.39_GRCh38.p13_rna.fna.gz -o protein_coding_tx.fna
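If seqkit is not installed, roughly the same filter can be sketched with awk. This is only an illustration: it assumes every header line begins with the accession right after the '>' (as in the RefSeq RNA FASTA), and the function name filter_coding_tx is made up for this example.

```shell
# Keep only FASTA records whose header starts with an NM_ or XM_ accession.
# Reads FASTA on stdin and writes the kept records to stdout.
filter_coding_tx() {
  awk '
    # On each header line, decide whether the record should be kept.
    /^>/ { keep = ($0 ~ /^>[NX]M_[0-9]+\.[0-9]+/) }
    # Print the header and all of its sequence lines while keep is true.
    keep
  '
}

# Possible usage with the file from the command above (not run here):
#   gzip -dc GCF_000001405.39_GRCh38.p13_rna.fna.gz | filter_coding_tx > protein_coding_tx.fna
```

Counting the '>' lines in the output (for example with grep -c '^>') gives a quick check on how many transcripts were kept.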