Question: How to make bed file of TSS (approximately 2kb, 1 kb upstream and 1kb downstream) of genes
0
gravatar for genomics_student
5 weeks ago by
genomics_student0 wrote:

Hello,

I am a beginner in bioinformatics in general, so please excuse my question if it seems easy to you.

I have identified differentially expressed genes using Deseq2. Next, I want to create a bed file containing these genes and around 2kb region of the promoter region, 1kb upstream and 1 kb downstream of the transcription start site using another gene reference file like this.

If the gene is in (-) position, the start site is column3 and the end site is column2, while if a gene is in (+) position, the start site is column2 and the end site is column 3.

Could you help me with the bioinformatics command/script using perl? Thank you very much in advance.

Chr           Column2          Column3      Gene
chr1    134199214       134235457       Adora1(-)
chr1    134199214       134235457       Adora1(-)
chr1    134199214       134235457       Adora1(-)
chr1    134199214       134235457       Adora1(-)
chr1    134199218       134235052       Adora1(-)
chr1    8359738 8679095 Sntg1(-)
chr1    8359738 9250479 Sntg1(-)
chr1    8359738 9250481 Sntg1(-)
chr1    8359738 9299877 Sntg1(-)
chr1    8359738 9299877 Sntg1(-)
chr1    8359738 9299877 Sntg1(-)
chr1    8359738 9299877 Sntg1(-)
chr1    8359738 9301062 Sntg1(-)
chr1    8359738 9301063 Sntg1(-)
chr1    8359738 9301064 Sntg1(-)
chr1    8359738 9301064 Sntg1(-)
chr1    25067475        25829707        Adgrb3(-)
chr1    25067549        25830205        Adgrb3(-)
chr1    25067675        25531950        Adgrb3(-)
chr1    25067675        25559840        Adgrb3(-)
chr1    33453807        33669794        Prim2(-)
chr1    33540151        33669789        Prim2(-)
chr1    33540152        33669789        Prim2(-)
chr1    33540152        33669789        Prim2(-)
chr1    58711306        58758884        Cflar(+)
chr1    58711306        58758884        Cflar(+)
chr1    58711490        58759209        Cflar(+)
chr1    58711490        58759209        Cflar(+)
chr1    58711598        58758884        Cflar(+)
chr1    58712172        58758884        Cflar(+)
chr1    58712445        58758884        Cflar(+)
chr1    58713181        58758884        Cflar(+)
chr1    58713183        58758884        Cflar(+)
chr1    58713285        58733227        Cflar(+)
chr1    58713285        58759209        Cflar(+)
chr1    75485824        75506452        Obsl1(-)
chr1    75485824        75506658        Obsl1(-)
chr1    75485824        75506658        Obsl1(-)
chr1    75486776        75506658        Obsl1(-)
chr1    75486865        75506658        Obsl1(-)
chr1    75492631        75506658        Obsl1(-)
chr1    75496336        75506452        Obsl1(-)
chr1    125676995       125873862       Gpr39(+)
chr1    167688899       167848741       Lmx1a(+)
chr1    167688905       167848741       Lmx1a(+)
chr1    167689557       167848733       Lmx1a(+)
chr1    175962300       176214433       Pld5(-)
chr1    175962300       176276850       Pld5(-)
chr1    175962300       176276850       Pld5(-)
chr1    175962300       176276850       Pld5(-)
chr1    175962305       176213942       Pld5(-)
chr1    175962305       176275312       Pld5(-)
chr1    175963071       176274979       Pld5(-)
chr1    175963091       176213932       Pld5(-)
chr1    184527840       184557691       1700112H15Rik(-)
chr1    192885673       193035442       Syt14(-)
chr1    192885673       193035616       Syt14(-)
chr1    192885673       193035616       Syt14(-)
chr1    192891232       192980428       Syt14(-)
chr1    192891232       192986803       Syt14(-)
chr1    192891232       192986803       Syt14(-)
rna-seq tutorial • 106 views
ADD COMMENTlink modified 5 weeks ago by Istvan Albert ♦♦ 84k • written 5 weeks ago by genomics_student0
1

Can you give a bit more info as to your final goal here? This will help us to help you more effectively.

Also, when you say you want the "whole gene" do you mean the boundaries of the whole gene region on the chromosome (including introns, alternative transcripts, UTRs)? This is going to be tens of kilobases for most genes.

ADD REPLYlink written 5 weeks ago by MaxF70

Hello Sir,

Thank yo for taking time in reading my question. Let's say I have identified Adora1(-) and Cflar(+) as DEGs. Next, I want to make a promoter bed file of these genes with 2kb length, 1kb upstream and 1kb downstream from TSS. Thus, I want to have a script which shows subtraction from chr start and chr end that is 1 1kb upstream from TSS and 1kb downstream from TSS. Since the genes have different signs, then Adora(-) chr start site is column3 and end site at column2, and the list show show 5 values for Adora(-) since it has 5 promoter regions:chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199218 134235052 Adora1(-) For Cplar, the start site should be column2 while end site should be column1, since it has a positive sign, and 11 values since it has 11 promoter regions. chr1 58711306 58758884 Cflar(+) chr1 58711306 58758884 Cflar(+) chr1 58711490 58759209 Cflar(+) chr1 58711490 58759209 Cflar(+) chr1 58711598 58758884 Cflar(+) chr1 58712172 58758884 Cflar(+) chr1 58712445 58758884 Cflar(+) chr1 58713181 58758884 Cflar(+) chr1 58713183 58758884 Cflar(+) chr1 58713285 58733227 Cflar(+) chr1 58713285 58759209 Cflar(+) I hope this clarifies better. Thank you in advance

ADD REPLYlink written 5 weeks ago by genomics_student0

I'm not sure if you were trying to format your lines in some specific way, but the website didn't insert any line breaks so it's hard to understand.

I know how to make a bedfile like you're describing using python, but I don't know Perl very well.

Both python and perl have bio-libraries that will come in handy for this kind of thing, especially if you eventually want fasta files from them.

In either language, you should be able to use a simple if - else statement like below (i haven't checked this code, it's just a suggested starting point).

for line in inputFile:
    lineParsed = line.rstrip().split("\t") #removes whitespace from the end of your line and then splits it on the tab character
    chrom = line[0]
    5pSite = line[1]
    3pSite = line[2]
    name = line[3].split("(")[0]  #this gets the gene name
    sign = line[3].split("(")[1][0] #this should get the sign, + or -
    if (sign == "+"):
        kbUp = 5pSite + 1000
        kbDown = 3pSite - 10000
    elif (sign == "-"):
        kbUp = 3pSite - 1000
        kbDown = 5pSite + 10000
ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by MaxF70

Good day Sir,

Thank you again for your reply. I understood what you're saying and tried to write it in a perl script: module load perl

open (IN, "mm10_NCBI-RefSeq.bed"); open (OUT, ">DEGs_promoters.bed");

while (<in>) { @data = split(/\t/,$_); @strand = split(/(|)/,$data[-1]); if ($strand[1] eq "+") {$start = $data[1];} else {$start = $data[2];}

    $pos = "$data[0]:$start:$strand[1]";
    if ($check{$pos} == 1) {next;}
    $check{$pos}=1;
    if ($strand[-2] eq "+") {
            $s = $data[1] + 1000;
            $e = $data[2] - 1000;
    } elsif ($strand[-2] eq "-") {
            $s = $data[1] - 1000;
            $e = $data[2] + 1000;
    }

}

close (IN);

open (CSV, "genes.csv"); @gene = split(/\t/,$_); $filter_genes = $gene[0]; if $filter_gene = $data[-1]; print OUT "$data[0]\t$s\t$e\t$data[-1]";

close (CSV); close (OUT);

My next problem is I have to make sure that I only get an output bed file of promoters only for my genes of interest, contained in column 1 or data[0] of my csv file. I cannot figure out how to program it. Thank you again.

ADD REPLYlink written 5 weeks ago by genomics_student0

You lost code formatting for the last few lines. Try writing out your code in a text editor, pasting it into the text box, highlighting the whole thing and then pushing the "code" button in the formatting bar (looks like 1010101).

For your specific question, you just need to make a variable that contains only your genes of interest. If you have a text file with the genes you want, you can read that in and add those values to a set or dictionary (this is very easy to do in python, you'll have to figure out how to do it in perl).

Once you've got the genes of interest in there, you can just add an if statement. Here's how I'd do it in python:

genesOfInterest = set()

listOfGenes = open("File_name_for_gene_list.txt","r")
for line in listOfGenes:
    genesOfInterest.add(line.rstrip())

for line in bedFile:
    parseLine = line.someParsingFunction()
    geneName = parseLine[X]
    if (geneName in genesOfInterest):
        writeOutput
    else:
        continue
ADD REPLYlink written 5 weeks ago by MaxF70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 802 users visited in the last hour