How to extract the DNA sequence 1000 bp around the Transcription start site (TSS) of a gene symbol in NCBI gene with R?
1
0
Entering edit mode
12 months ago
sunyeping ▴ 110

I am trying to extract the DNA sequence 1000 bp around the Transcription start site (TSS) of a gene, and I get a code as follows:

   # Load the biomaRt package
    library(biomaRt)

    # Connect to the Ensembl database through the biomaRt interface
    ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

    # Enter the gene symbol of interest
    gene_symbol <- "ITGAE"

    # Get the Ensembl gene ID for the gene symbol of interest
    gene_id <- getBM(filters = "external_gene_name", 
                     values = gene_symbol, 
                     attributes = "ensembl_gene_id", 
                     mart = ensembl)

    # Get the TSS position for the gene of interest
    tss <- getBM(filters = "ensembl_gene_id", 
                 values = gene_id, 
                 attributes = c("chromosome_name", "start_position", "strand"), 
                 mart = ensembl)

    # Extract the DNA sequence 1000 bp around the TSS
    seq <- getSequence(id = tss$chromosome_name, 
                       start = tss$start_position - 1000, 
                       end = tss$start_position + 1000,
                       type = 'ensembl_transcript_id',
                       seqType = 'transcript_flank',
                       mart = ensembl)

    # Print the DNA sequence
    cat(seq[[1]])

This can get the correct gene id, but cannot get the correct sequence. Do you have any idea how to correct this code? Or is there a better way to achieve the goal?

Any help will be deeply appreciated.

Transcription-start-site R biomart • 1.0k views
ADD COMMENT
0
Entering edit mode

It looks like the strand is not getting specified when running getSequence (and that there isn't an option for specifying strand) so are you getting the complimentary sequence?

ADD REPLY
0
Entering edit mode

Additional info from getSequence

In MySQL mode the getSequence function is more limited and the sequence that is returned is the 5' to 3'+ strand of the genomic sequence, given a chromosome, as start and an end position. So if the sequence of interest is the minus strand, one has to compute the reverse complement of the retrieved sequence, which can be done using functions provided in the matchprobes package. The biomaRt vignette contains more examples on how to use this function.

ADD REPLY
0
Entering edit mode
12 months ago
Darked89 4.6k

There is a bit of a problem with either getting one TSS per gene simply taking gene start and with one TSS per transcript. The first can give you TSS of a very rare transcript, the second a bunch of TSSs at the same position or off by few/10+ of bps.

Using ensembl canonical transcripts (one per gene) fixes "super rare transcript first exon gene start" but it may not be what you want.

Well, you can download GFF from ENSEMBL, select aset of genes, use https://github.com/junjunlab/GetTss to get BED with TSSs, then bedtools to extend it and extract fasta

ADD COMMENT

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6