Get -400bp to +50bp sequence of gene from gff3 file
2
0
Entering edit mode
3.3 years ago
Biogeek ▴ 420

Dear all,

I have used bedops and bedtools to obtain -2000bp upstream of my gene start (based on gff3 coordinates) file and the scaffold fasta file of the genome.While I know how to get upstream, I do; however want to get per say +50bp upstream of the gene start too. Is there a way of getting the region per say, -400bp to +50bp? I've seen papers taking the 'promoter' region as this 450bp region. I am assuming that selection of region size is purely arbitrary?

I've used MEME to scan for common 'upstream' promoter regions with my -2000bp upstream regions of the genes of interest and one motif I get back is about 30bp long with high % of Adenine(70-80%). I am thinking that this may be an overlapping polyA tail of a gene upstream? I am not sure wether to discard this as such.

Many thanks.

upstream downstream promoter motifs • 2.0k views
ADD COMMENT
2
Entering edit mode

Polyadenylation is a posttranscriptional modification which means there is no polyA-tail in the genome for every transcript. PolyA tails are added by special enzymes to the transcript but not transcriped directly from the genome. Therefore that cannot be the cause for any nucleotide enrichment you see. Maybe some more details about the motif would help.

ADD REPLY
1
Entering edit mode

Get upstream flank of gene AND downstream of gene start

Title of this question is misleading in the context of the contents of the post. You may want to change that to reflect the actual question.

It sounds like the question you have is about interval selection being of arbitrary size.

ADD REPLY
0
Entering edit mode
3.3 years ago
shawn.w.foley ★ 1.3k

You can extract out any arbitrary window using awk (or any coding language). Essentially you're asking to find the transcription start site (TSS) - 2000 and TSS + 50. You just need to be mindful of the strand a gene is coded on. GFF and BED files are formatted such that the "start" coordinate is always less than the "end" coordinate. Therefore the TSS for a gene on the minus strand corresponds to the "end" coordinate.

For a gff3 file formatted like:

1 ensembl five_prime_UTR 925741 925800 . + . Parent=transcript:ENST00000616125

You can generate a BED output like:

1 923741 925791 Parent=transcript:ENST00000616125 . +

By using this code:

awk 'BEGIN{OFS="\t";FS="\t"} {if ($7 == "+") print $1,$4-2000,$4+50,$9,".",$7; else print $1,$5-2000,$5+50,$9,".",$7}'

This command takes a tab-separated input and generates a tab separated output. If the gene is on the plus strand ($7 == "+"), it prints the 4th column - 2000, then the 4th column + 50. If it is not on the plus strand, it will perform these operations on the 5th column.

ADD COMMENT
1
Entering edit mode

You should make sure that no out-of-bounds, so start < 0 and end > length(chromosome) can happen.

ADD REPLY
0
Entering edit mode
3.3 years ago

You can do an asymmetric padding around strand-separated TSSs with bedops --range:

$ gff2bed < genes.gff > genes.bed

$ awk -vFS="\t" -vOFS="\t" '($6 == "+"){ print $1, ($2 - 1), $2, $4, $5, $6; }' genes.bed \
    | bedops --range -400:50 --everything - \
    > promoters.for.bed

$ awk -vFS="\t" -vOFS="\t" '($6 == "-"){ print $1, $3, ($3 + 1), $4, $5, $6; }' genes.bed \
    | bedops --range -50:400 --everything - \
    > promoters.rev.bed

$ bedops --everything promoters.for.bed promoters.rev.bed > promoters.bed

One advantage of using a toolkit like BEDOPS over awk is that bedops deals with the left-edge case, where the upstream edge would be reported as less-than-zero, without bounds checking.

The right edge rarely matters, although you could pipe results to a genome-specific bounds file generated with UCSC Kent Utilities fetchChromSizes, in order to filter out results that go outside the chromosome bounds. For instance, for hg38:

$ fetchChromSizes hg38 \
    | awk -vOFS="\t" '{ print $1, "0", $2; }' \
    | grep -vE '_' \
    | sort-bed - \
    > hg38.nuc.bed

$ bedops --element-of 100% promoters.bed hg38.nuc.bed > promoters.filtered.bed

BEDOPS is (purposefully) agnostic about genomes, which keep changing. Doing this manually is a little more work, but then you know exactly how your data was generated, which leads to fewer questions about results.

You would then run your BED-formatted promoters —filtered or not— through something like scripted calls to samtools faidx, to convert to strand-aware, FASTA-formatted sequence, using your assembly of choice. I outline one such approach in my Stack Exchange answer here: https://bioinformatics.stackexchange.com/a/5374/776

You can define your promoters any way that you like. I've seen literature go 1kb to 5kb upstream and 0 to 500nt downstream of the TSS. It is up to you and the problem you are trying to explore.

ADD COMMENT
0
Entering edit mode

Hi, I'm encountering the same situation and the answer above was really helpful. Just one small thing to confirm. Since gtf/gff format is 1-base start and 1-base end while bed format is 0-base start and 1-base end, shouldn't we use the below instead of your suggestion even in the minus strand?

awk -vFS="\t" -vOFS="\t" '($6 == "-"){ print $1, ($3-1), $3, $4, $5, $6; }' genes.bed \
| bedops --range -50:400 --everything - \
> promoters.rev.bed

I'm reading this post and got a little bit confused now. I'm still new in this field and I'm sorry if suggesting wrong points.

ADD REPLY

Login before adding your answer.

Traffic: 2204 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6