How to make bed file of TSS (approximately 2kb, 1 kb upstream and 1kb downstream) of genes
0
0
Entering edit mode
2.7 years ago

Hello,

I am a beginner in bioinformatics in general, so please excuse my question if it seems easy to you.

I have identified differentially expressed genes using Deseq2. Next, I want to create a bed file containing these genes and around 2kb region of the promoter region, 1kb upstream and 1 kb downstream of the transcription start site using another gene reference file like this.

If the gene is in (-) position, the start site is column3 and the end site is column2, while if a gene is in (+) position, the start site is column2 and the end site is column 3.

Could you help me with the bioinformatics command/script using perl? Thank you very much in advance.

Chr           Column2          Column3      Gene
chr1    8359738 8679095 Sntg1(-)
chr1    8359738 9250479 Sntg1(-)
chr1    8359738 9250481 Sntg1(-)
chr1    8359738 9299877 Sntg1(-)
chr1    8359738 9299877 Sntg1(-)
chr1    8359738 9299877 Sntg1(-)
chr1    8359738 9299877 Sntg1(-)
chr1    8359738 9301062 Sntg1(-)
chr1    8359738 9301063 Sntg1(-)
chr1    8359738 9301064 Sntg1(-)
chr1    8359738 9301064 Sntg1(-)
chr1    33453807        33669794        Prim2(-)
chr1    33540151        33669789        Prim2(-)
chr1    33540152        33669789        Prim2(-)
chr1    33540152        33669789        Prim2(-)
chr1    58711306        58758884        Cflar(+)
chr1    58711306        58758884        Cflar(+)
chr1    58711490        58759209        Cflar(+)
chr1    58711490        58759209        Cflar(+)
chr1    58711598        58758884        Cflar(+)
chr1    58712172        58758884        Cflar(+)
chr1    58712445        58758884        Cflar(+)
chr1    58713181        58758884        Cflar(+)
chr1    58713183        58758884        Cflar(+)
chr1    58713285        58733227        Cflar(+)
chr1    58713285        58759209        Cflar(+)
chr1    75485824        75506452        Obsl1(-)
chr1    75485824        75506658        Obsl1(-)
chr1    75485824        75506658        Obsl1(-)
chr1    75486776        75506658        Obsl1(-)
chr1    75486865        75506658        Obsl1(-)
chr1    75492631        75506658        Obsl1(-)
chr1    75496336        75506452        Obsl1(-)
chr1    125676995       125873862       Gpr39(+)
chr1    167688899       167848741       Lmx1a(+)
chr1    167688905       167848741       Lmx1a(+)
chr1    167689557       167848733       Lmx1a(+)
chr1    175962300       176214433       Pld5(-)
chr1    175962300       176276850       Pld5(-)
chr1    175962300       176276850       Pld5(-)
chr1    175962300       176276850       Pld5(-)
chr1    175962305       176213942       Pld5(-)
chr1    175962305       176275312       Pld5(-)
chr1    175963071       176274979       Pld5(-)
chr1    175963091       176213932       Pld5(-)
chr1    184527840       184557691       1700112H15Rik(-)
chr1    192885673       193035442       Syt14(-)
chr1    192885673       193035616       Syt14(-)
chr1    192885673       193035616       Syt14(-)
chr1    192891232       192980428       Syt14(-)
chr1    192891232       192986803       Syt14(-)
chr1    192891232       192986803       Syt14(-)

RNA-Seq • 1.7k views
1
Entering edit mode

Also, when you say you want the "whole gene" do you mean the boundaries of the whole gene region on the chromosome (including introns, alternative transcripts, UTRs)? This is going to be tens of kilobases for most genes.

0
Entering edit mode

Hello Sir,

Thank yo for taking time in reading my question. Let's say I have identified Adora1(-) and Cflar(+) as DEGs. Next, I want to make a promoter bed file of these genes with 2kb length, 1kb upstream and 1kb downstream from TSS. Thus, I want to have a script which shows subtraction from chr start and chr end that is 1 1kb upstream from TSS and 1kb downstream from TSS. Since the genes have different signs, then Adora(-) chr start site is column3 and end site at column2, and the list show show 5 values for Adora(-) since it has 5 promoter regions:chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199214 134235457 Adora1(-) chr1 134199218 134235052 Adora1(-) For Cplar, the start site should be column2 while end site should be column1, since it has a positive sign, and 11 values since it has 11 promoter regions. chr1 58711306 58758884 Cflar(+) chr1 58711306 58758884 Cflar(+) chr1 58711490 58759209 Cflar(+) chr1 58711490 58759209 Cflar(+) chr1 58711598 58758884 Cflar(+) chr1 58712172 58758884 Cflar(+) chr1 58712445 58758884 Cflar(+) chr1 58713181 58758884 Cflar(+) chr1 58713183 58758884 Cflar(+) chr1 58713285 58733227 Cflar(+) chr1 58713285 58759209 Cflar(+) I hope this clarifies better. Thank you in advance

0
Entering edit mode

I'm not sure if you were trying to format your lines in some specific way, but the website didn't insert any line breaks so it's hard to understand.

I know how to make a bedfile like you're describing using python, but I don't know Perl very well.

Both python and perl have bio-libraries that will come in handy for this kind of thing, especially if you eventually want fasta files from them.

In either language, you should be able to use a simple if - else statement like below (i haven't checked this code, it's just a suggested starting point).

for line in inputFile:
lineParsed = line.rstrip().split("\t") #removes whitespace from the end of your line and then splits it on the tab character
chrom = line[0]
5pSite = line[1]
3pSite = line[2]
name = line[3].split("(")[0]  #this gets the gene name
sign = line[3].split("(")[1][0] #this should get the sign, + or -
if (sign == "+"):
kbUp = 5pSite + 1000
kbDown = 3pSite - 10000
elif (sign == "-"):
kbUp = 3pSite - 1000
kbDown = 5pSite + 10000

0
Entering edit mode

Good day Sir,

Thank you again for your reply. I understood what you're saying and tried to write it in a perl script: module load perl

open (IN, "mm10_NCBI-RefSeq.bed"); open (OUT, ">DEGs_promoters.bed");

while (<in>) { @data = split(/\t/,$_); @strand = split(/(|)/,$data[-1]); if ($strand[1] eq "+") {$start = $data[1];} else {$start = $data[2];} $pos = "$data[0]:$start:$strand[1]"; if ($check{$pos} == 1) {next;}$check{$pos}=1; if ($strand[-2] eq "+") {
$s =$data[1] + 1000;
$e =$data[2] - 1000;
} elsif ($strand[-2] eq "-") {$s = $data[1] - 1000;$e = $data[2] + 1000; }  } close (IN); open (CSV, "genes.csv"); @gene = split(/\t/,$_); $filter_genes =$gene[0]; if $filter_gene =$data[-1]; print OUT "$data[0]\t$s\t$e\t$data[-1]";

close (CSV); close (OUT);

My next problem is I have to make sure that I only get an output bed file of promoters only for my genes of interest, contained in column 1 or data[0] of my csv file. I cannot figure out how to program it. Thank you again.

0
Entering edit mode

You lost code formatting for the last few lines. Try writing out your code in a text editor, pasting it into the text box, highlighting the whole thing and then pushing the "code" button in the formatting bar (looks like 1010101).

For your specific question, you just need to make a variable that contains only your genes of interest. If you have a text file with the genes you want, you can read that in and add those values to a set or dictionary (this is very easy to do in python, you'll have to figure out how to do it in perl).

Once you've got the genes of interest in there, you can just add an if statement. Here's how I'd do it in python:

genesOfInterest = set()

listOfGenes = open("File_name_for_gene_list.txt","r")
for line in listOfGenes:

for line in bedFile:
parseLine = line.someParsingFunction()
geneName = parseLine[X]
if (geneName in genesOfInterest):
writeOutput
else:
continue