Question: Enrichment Analysis Of Transcription Factors
Assa Yeroslaviz wrote:

Hi everybody,

I have a huge text file in this form:

Inspecting sequence ID   chrI:846767-847266

V$ELK1_01             |      254 (+) |  1.000 |  0.885 | aaaacaGGAAGcagga
I$UBX_01              |      142 (+) |  1.000 |  0.992 | ggggcgtTAATGggttact
I$KR_01               |      150 (+) |  1.000 |  0.975 | aatGGGTTac
I$HB_01               |      478 (+) |  1.000 |  0.992 | gcaaAAAAAa
V$E2F_01              |      378 (-) |  1.000 |  0.846 | aatTTTTCgcccaaa
V$E2F_01              |      468 (-) |  1.000 |  0.835 | tttTTTTCgggcaaa
I$HSF_01              |      200 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01              |      210 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01              |      306 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01              |      490 (-) |  1.000 |  1.000 | TTTCT
F$HSF_01              |       30 (+) |  1.000 |  1.000 | AGAAC

Inspecting sequence ID   chrI:1344697-1345196

V$ATF_01               |      157 (+) |  1.000 |  0.976 | ttgTGACGtcagca
V$ATF_01               |      157 (-) |  1.000 |  0.963 | ttgtgaCGTCAgca
V$ATF_01               |      326 (+) |  1.000 |  0.967 | tgcTGACGtcacat
V$ATF_01               |      326 (-) |  1.000 |  0.977 | tgctgaCGTCAcat
I$HSF_01               |      150 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01               |      213 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01               |      343 (-) |  1.000 |  1.000 | TTTCT
F$HSF_01               |      174 (-) |  1.000 |  1.000 | GTTCT
F$HSF_01               |      274 (-) |  1.000 |  1.000 | GTTCT
V$CREBP1CJUN_01        |      160 (+) |  1.000 |  1.000 | tGACGTca
V$CREBP1CJUN_01        |      160 (-) |  1.000 |  1.000 | tgACGTCa

Inspecting sequence ID   chrI:2689476-2689975

I$HB_01                |      368 (+) |  1.000 |  0.984 | ccaaAAAAAa
V$RSRFC4_01            |      254 (+) |  1.000 |  0.906 | agtCTATTtttaattt
I$HSF_01               |        3 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01               |       34 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01               |       96 (+) |  1.000 |  1.000 | AGAAA
V$COMP1_01             |       77 (+) |  1.000 |  0.866 | ccacttGATTGacggaaggagaaa
V$PAX6_01              |      153 (-) |  1.000 |  0.760 | ttcccagcattCGTGAacgtt
V$GFI1_01              |      270 (+) |  1.000 |  0.994 | ttttttcaAATCAcagcaactgag

...

For each of the chromosomal positions, I would like to count the occurrences of each of the motifs listed in the first column.

The results should be something like:

chrI:846767-847266:
V$ELK1_01 -  1
I$UBX_01 -  1
V$E2F_01 - 2
I$HSF_01 - 4 
...

I would like to count the occurrences of each motif at each position and then calculate the enrichment of these occurrences against the complete list of motifs.

I wrote a Perl script to split the big file into separate single files for each position, and a second script to count the occurrences in each of these files. The second script has to be run separately for each single-position file, which is unfortunately very inefficient.

Is there a way to parse this huge text file in R and calculate the motif counts per position?
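
In R I imagine something roughly along these lines, but this is only an untested sketch of the counting step plus one possible Fisher's-exact enrichment contrast (motif hits in one region vs. hits in all other regions). The file name is a placeholder, and it assumes the motif name is always the first whitespace-separated field and that the file starts with an "Inspecting sequence ID" line:

# rough, untested R sketch
lines <- readLines("match_output.txt")              # placeholder file name

is_header <- grepl("^Inspecting sequence ID", lines)
pos       <- cumsum(is_header)                      # block index of every line
region    <- sub("^Inspecting sequence ID\\s+", "", lines[is_header])[pos]
motif     <- sub("\\s.*$", "", lines)               # first field = motif name

keep   <- !is_header & nzchar(trimws(lines))        # drop headers and blank lines
counts <- table(region[keep], motif[keep])          # positions x motifs count matrix

counts["chrI:846767-847266", "I$HSF_01"]            # should give 4 for the sample above

# enrichment of one motif in one region against all regions (Fisher's exact test)
m <- "I$HSF_01"; r <- "chrI:846767-847266"
in_region       <- counts[r, m]
in_region_other <- sum(counts[r, ]) - in_region
elsewhere       <- sum(counts[, m]) - in_region
elsewhere_other <- sum(counts) - in_region - in_region_other - elsewhere
fisher.test(matrix(c(in_region, elsewhere,
                     in_region_other, elsewhere_other), nrow = 2))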

I would appreciate any help you can give me.

thanks

Assa


UnivStudent replied:

Am I right to assume this is output you created with Transfac Match? How did you get it into tab-delimited text like the format above?

Assa Yeroslaviz replied:

Yes, this is the command-line output of Match. Getting it into a tab-delimited format was easy enough with a text editor.

UnivStudent replied:

Thanks! I'll try it on the command line; I was using the online client, and it gives poorly formatted output.
JC wrote:

in perl:

#!/usr/bin/perl
# countMotifs.pl
use strict;
use warnings;

my %counts = ();

while (<>) {
    chomp;
    if (m/^Inspecting sequence ID\s+(.+)/) {
        my $id = $1;
        # print the counts collected for the previous position
        while (my ($motif, $total) = each %counts) {
            print "$motif - $total\n";
        }
        print $id, ":\n";
        %counts = ();    # restart counts for the new position
    }
    elsif (m/^(\S+)\s/) {
        $counts{$1}++;
    }
    # empty lines are ignored
}

# don't forget the counts of the last position
while (my ($motif, $total) = each %counts) {
    print "$motif - $total\n";
}

This parses your file line by line; the run time depends mostly on how fast the file can be read (a few seconds in most cases).

perl countMotifs.pl < raw.data > counts.txt

and it also works with compressed files:

gunzip -c raw.data.gz | perl countMotifs.pl > counts.txt
cl.parallel wrote:

for kicks, in awk :)

$ awk '{
    if ($1 == "Inspecting") {
        # print the counts collected for the previous block
        if (header) {
            print header
            for (i in counts)
                print i" - "counts[i]
            delete counts
        }
        header = $4
    } else if ($1 != "") {
        counts[$1]++
    }
}
END {
    # print the counts for the last block
    print header
    for (i in counts)
        print i" - "counts[i]
}' in_file.txt > out_file.txt

or python:

# Python 2
import sys

def gen_block(stream):
    """Yield (header, block) pairs, where block is the list of motif names."""
    block = []
    header = stream.readline().strip().split()[-1]  # first line is a header
    for line in stream:
        line = line.strip()
        if line.startswith('Inspecting'):
            yield header, block
            header = line.split()[-1]
            block = []
        elif len(line):
            block.append(line.split()[0].strip())
    yield header, block  # last block

def count_by_block(filename=None):
    o_file = sys.stdin if not filename else open(filename)
    for header, block in gen_block(o_file):
        print header
        for c in set(block):
            print c, '-', block.count(c)

if __name__ == '__main__':
    count_by_block((sys.argv[1] if len(sys.argv) == 2 else None))

In either case, you can use zcat or gzip -cd to stream in a compressed file.

Damian Kao wrote:

Here is a pretty dirty python script you can use on your huge text file (not on the single files):

# Python 2
import sys

inFile = open(sys.argv[1],'r')

block = []
header = inFile.next().strip().split()[-1]  # first line holds the first position
for line in inFile:
    if line.strip() != '':
        data = line.strip().split()
        if data[0] == "Inspecting":
            # new position reached: print the counts for the previous block
            print header
            count = {}
            for d in block:
                if not count.has_key(d[0]):
                    count[d[0]] = 1
                else:
                    count[d[0]] += 1

            for motif, num in count.items():
                print motif + "\t" + str(num)
            block = []
            header = data[-1]
        else:
            block.append(data)

# don't forget the last block
print header
count = {}
for d in block:
    if not count.has_key(d[0]):
        count[d[0]] = 1
    else:
        count[d[0]] += 1

for motif, num in count.items():
    print motif + "\t" + str(num)

Save it as yourName.py and run it with: python yourName.py yourHugeTextFile.txt

Woa wrote:

Another Perl hack, but since this 'slurps' the whole input file, it may crash or become too slow on a very big file.

use strict;
use warnings;

my $pattern  = "Inspecting sequence ID   ";
my $filename = "mydata.txt";    # big data file
my @chunks   = ();

open FH, '<', $filename or die $!;
{
    local $/ = undef;           # slurp the whole file at once
    @chunks = split(/$pattern/m, <FH>);
}
close FH or die $!;

shift @chunks;                  # drop the (empty) chunk before the first header
foreach my $i (@chunks) {
    my @lines = grep { defined $_ } split(/\n+/, $i);
    my $chr   = shift(@lines);  # first line of each chunk is the position
    my %motif_count = ();
    map { my ($motif) = split(/\s+/, $_); $motif_count{$motif}++; } @lines;
    print $chr, "\n";
    map { print $_, " - ", $motif_count{$_}, "\n"; } keys %motif_count;
    print "\n";
}

Output:

chrI:846767-847266
I$HB_01 - 1
I$UBX_01 - 1
I$KR_01 - 1
V$ELK1_01 - 1
V$E2F_01 - 2
F$HSF_01 - 1
I$HSF_01 - 4

chrI:1344697-1345196
V$ATF_01 - 4
V$CREBP1CJUN_01 - 2
F$HSF_01 - 2
I$HSF_01 - 3

chrI:2689476-2689975
I$HB_01 - 1
V$GFI1_01 - 1
V$COMP1_01 - 1
V$RSRFC4_01 - 1
V$PAX6_01 - 1
I$HSF_01 - 3

Note: there was a formatting error in the original post; the split line should read: @chunks = split(/$pattern/m, <FH>);
