batch sequence retrieval using exon coordinates
16 months ago
shallot • 0

Hello, I am new to bioinformatics and to this forum. I have a list of 1400+ gene names with their chromosome numbers and exon coordinates (3 exons per gene), and I am looking for a way to get the domains associated with all of these exons. Is there a way to retrieve the sequences for each exon (for all genes) in batch and feed them to another tool (*) to get the associated domains? Is there an existing thread on this?

(*) I only know of Batch CD-Search, which requires a protein query, so please let me know if there is another batch CD tool that works with nucleotide queries.

retrieval sequence • 1.2k views

You can probably use NCBI datasets to download this information in batch. See: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genes/download-gene-data-package/


You can take a look at an old post that I wrote: https://crazyhottommy.blogspot.com/2015/04/get-all-promoter-sequences-of-human.html

Just change the promoter coordinates to the exon coordinates.
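One common way to go from coordinates to sequences is to write the exons as a BED file and pass it to bedtools getfasta. The sketch below is only an illustration under stated assumptions: the input file name exons.tsv, its column order (gene, chromosome, start, end), the 1-based inclusive coordinate convention, and the coordinates themselves are all made up, and genome.fa is a hypothetical path to a genome FASTA matching your assembly.

```shell
# Dummy input (made-up coordinates): gene, chromosome, exon start, exon end,
# tab-separated, one exon per line, assumed 1-based and inclusive.
printf 'GENE1\tchr1\t100\t200\nGENE1\tchr1\t300\t450\nGENE2\tchr2\t1000\t1100\n' > exons.tsv

# BED is 0-based and half-open: subtract 1 from the start, keep the end.
# The fourth column is a name that ends up in the FASTA header.
awk 'BEGIN { FS = OFS = "\t" } { print $2, $3 - 1, $4, $1 "_" $3 "_" $4 }' exons.tsv > exons.bed

cat exons.bed

# Then, with bedtools installed and a genome FASTA (hypothetical path genome.fa):
# bedtools getfasta -fi genome.fa -bed exons.bed -name -fo exons.fa
```

If your coordinates are already 0-based (for example, copied from a BED file), drop the "- 1". Also check that the chromosome names in the BED file match those in the FASTA ("chr1" vs "1").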


GenoMax Ming Tang @mohammadhassanj Thanks for the replies. I tried some of the solutions, especially biomart, but they didn't work out. Eventually I figured out how to use the table browser and got what I wanted. I will definitely try these solutions again in my free time. Thanks a bunch.

16 months ago

Hi, you can do this with the biomaRt package. For more options, read its documentation at the link below:

pdf: https://bioconductor.org/packages/release/bioc/manuals/biomaRt/man/biomaRt.pdf

# If gene_list.txt looks like this:

    # Symbol 
    # A1BG
    # A2M
    # A2MP1
    # NAT1
    # NAT2
    # NATP
    #   .
    #   .
    #   .

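library(biomaRt)
# Read the gene symbols; the first line of the file is the header "Symbol"
gene.names <- read.delim("gene_list.txt", stringsAsFactors = FALSE)
mart <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", mirror = "uswest")
# getSequence() accepts a vector of IDs, so a single query covers all genes
# (looping over the data frame would iterate over its columns, not its rows)
seq <- getSequence(id = gene.names$Symbol,
                   type = "hgnc_symbol",
                   seqType = "gene_exon",
                   mart = mart)
# One row per exon: exon sequence and gene symbol, tab-separated
write.table(seq, "exon_per_genes.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)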
16 months ago
GenoMax 142k

I am looking for a way to get the domains associated with all these exons.

Using Entrez Direct (EDirect):

 esearch -db gene -query "A2M [GENE] AND human [ORGN]" | elink -target cdd | esummary | xtract -pattern DocumentSummary -element Accession,Title,Subtitle,Database
cl11960 Ig      Immunoglobulin domain   Cdart
pfam17791       MG3     Macroglobulin domain MG3        Pfam
pfam17789       MG4     Macroglobulin domain MG4        Pfam
pfam07703       A2M_N_2 Alpha-2-macroglobulin family N-terminal region  Pfam
pfam07678       A2M_comp        A-macroglobulin complement component    Pfam
pfam07677       A2M_recep       A-macroglobulin receptor        Pfam
pfam01835       A2M_N   MG2 domain      Pfam
pfam00207       A2M     Alpha-2-macroglobulin family    Pfam
cl08267 ISOPREN_C2_like Cdart
cd05768 IgC1_CH3_IgAGD_CH4_IgAEM        CH3 domain (third constant Ig domain of the heavy chain) in immunoglobulin heavy alpha, gamma, and delta chains, and CH4 domain (fourth constant Ig domain of the heavy chain) in immunoglobulin heavy alpha, epsilon, and mu chains; member of the C1-set of Ig superfamily (IgSF) domains     Cdd
cd04986 IgC1_CH2_IgA    CH2 domain (second constant Ig domain of the heavy chain) in immunoglobulin heavy alpha chain; member of the C1-set of Ig superfamily (IgSF) domains        Cdd
cd04981 IgV_H   Immunoglobulin (Ig) heavy chain (H), variable (V) domain        Cdd
cd02897 A2M_2   Cdd
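To run the same lookup over a whole gene list, the command above could be wrapped in a small function and called in a loop. This is a rough sketch, not a tested pipeline: the function names build_query and domains_for_gene and the files gene_list.txt and domains.tsv are hypothetical, and it assumes EDirect is on the PATH with network access. Note that it reports domains per gene, not per exon.

```shell
# Build the Entrez query string for one gene symbol (human only, as in the example above).
build_query() {
  printf '%s [GENE] AND human [ORGN]' "$1"
}

# Print one tab-separated line per domain, prefixed with the gene symbol.
# Needs EDirect on the PATH and network access.
# "< /dev/null" keeps esearch from consuming the gene list when called inside a read loop.
domains_for_gene() {
  esearch -db gene -query "$(build_query "$1")" < /dev/null |
    elink -target cdd |
    esummary |
    xtract -pattern DocumentSummary -element Accession,Title,Subtitle,Database |
    awk -v gene="$1" 'BEGIN { OFS = "\t" } { print gene, $0 }'
}

# Usage (gene_list.txt is a hypothetical file with one symbol per line):
# while read -r gene; do domains_for_gene "$gene"; done < gene_list.txt > domains.tsv
```

With 1400+ genes this makes several requests per gene, so it will take a while; pausing between genes or setting an NCBI API key may help avoid rate limits.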