Enrichment Analysis Of Transcription Factors
8.7 years ago
Assa Yeroslaviz ★ 1.5k

Hi everybody,

I have a huge text file in this form:

Inspecting sequence ID   chrI:846767-847266

V$ELK1_01             |      254 (+) |  1.000 |  0.885 | aaaacaGGAAGcagga
I$UBX_01              |      142 (+) |  1.000 |  0.992 | ggggcgtTAATGggttact
I$KR_01               |      150 (+) |  1.000 |  0.975 | aatGGGTTac
I$HB_01               |      478 (+) |  1.000 |  0.992 | gcaaAAAAAa
V$E2F_01              |      378 (-) |  1.000 |  0.846 | aatTTTTCgcccaaa
V$E2F_01              |      468 (-) |  1.000 |  0.835 | tttTTTTCgggcaaa
I$HSF_01              |      200 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01              |      210 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01              |      306 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01              |      490 (-) |  1.000 |  1.000 | TTTCT
F$HSF_01              |       30 (+) |  1.000 |  1.000 | AGAAC

Inspecting sequence ID   chrI:1344697-1345196

V$ATF_01               |      157 (+) |  1.000 |  0.976 | ttgTGACGtcagca
V$ATF_01               |      157 (-) |  1.000 |  0.963 | ttgtgaCGTCAgca
V$ATF_01               |      326 (+) |  1.000 |  0.967 | tgcTGACGtcacat
V$ATF_01               |      326 (-) |  1.000 |  0.977 | tgctgaCGTCAcat
I$HSF_01               |      150 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01               |      213 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01               |      343 (-) |  1.000 |  1.000 | TTTCT
F$HSF_01               |      174 (-) |  1.000 |  1.000 | GTTCT
F$HSF_01               |      274 (-) |  1.000 |  1.000 | GTTCT
V$CREBP1CJUN_01        |      160 (+) |  1.000 |  1.000 | tGACGTca
V$CREBP1CJUN_01        |      160 (-) |  1.000 |  1.000 | tgACGTCa

Inspecting sequence ID   chrI:2689476-2689975

I$HB_01                |      368 (+) |  1.000 |  0.984 | ccaaAAAAAa
V$RSRFC4_01            |      254 (+) |  1.000 |  0.906 | agtCTATTtttaattt
I$HSF_01               |        3 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01               |       34 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01               |       96 (+) |  1.000 |  1.000 | AGAAA
V$COMP1_01             |       77 (+) |  1.000 |  0.866 | ccacttGATTGacggaaggagaaa
V$PAX6_01              |      153 (-) |  1.000 |  0.760 | ttcccagcattCGTGAacgtt
V$GFI1_01              |      270 (+) |  1.000 |  0.994 | ttttttcaAATCAcagcaactgag

...

For each of the chromosomal positions I would like to count the occurrences of each motif in the left-hand column.

The results should be something like:

chrI:846767-847266:
V$ELK1_01 -  1
I$UBX_01 -  1
V$E2F_01 - 2
I$HSF_01 - 4 
...

I would like to count the occurrences of each motif at each position and then calculate the enrichment of these occurrences against the complete list of motifs.
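To make the enrichment step concrete, here is roughly what I have in mind (a sketch only: the hypergeometric test is one possible choice, and the numbers below are made up for illustration):

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): probability of seeing
    k or more hits of a motif among the n hits in one region, given K
    hits of that motif among N hits over all regions."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# made-up example: I$HSF_01 hits 4 times among the 11 hits in one region,
# and 10 times among 60 hits over all regions
p = hypergeom_pvalue(4, 11, 10, 60)
print(p)
```

A smaller p suggests the motif is over-represented in that region relative to the background.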

I wrote a Perl script to split the big file into separate files, one per position, and a second script to count the occurrences in each of them. The second script needs to be run separately on every single-position file, which is unfortunately very inefficient.

Is there a way to parse this huge txt file in R to calculate the numbers for motifs per position?

I would appreciate any help you can give me.

thanks

Assa


Am I right to assume this is output you created with Transfac Match? How did you get it into tab-delimited text like the format above?


Yes, this is the command line output of match. A tab-delimited format was easy enough with a text editor.


Thanks! I'll try it on the command line; I was using the online client, and it gives poorly formatted output.

8.7 years ago
JC 12k

in perl:

 #!/usr/bin/perl
 # countMotifs.pl - count motif occurrences per "Inspecting sequence ID" block

 use strict;
 use warnings;

 my %counts = ();

 while (<>) {
     chomp;
     if (m/^Inspecting sequence ID\s+(.+)/) {
         my $id = $1;
         while (my ($motif, $total) = each %counts) {
             print "$motif - $total\n";
         }
         print $id, ":\n";
         %counts = (); # restart counts for the new block
     }
     elsif (m/^(\S+)/) {
         $counts{$1}++;
     }
     # empty lines are skipped
 }

 # don't forget to flush the counts of the last block
 while (my ($motif, $total) = each %counts) {
     print "$motif - $total\n";
 }

This parses your file line by line; how long it takes depends mostly on how fast the file can be read (a few seconds in most cases).

perl countMotifs.pl < raw.data > counts.txt

and can work with compressed files:

gunzip -c raw.data.gz | perl countMotifs.pl > counts.txt
8.7 years ago
cl.parallel ▴ 140

for kicks, in awk :)

$ awk '{
    if ($1 == "Inspecting") {
        if (header) {
            print header
            for (i in counts)
                print i" - "counts[i]
            delete counts
        }
        header = $4
    } else if ($1 != "") {
        counts[$1]++
    }
}
END {
    print header
    for (i in counts)
        print i" - "counts[i]
}' in_file.txt > out_file.txt

or python:

import sys

def gen_block(stream):
    block = []
    header = stream.readline().strip().split()[-1]
    for line in stream:
        line = line.strip()
        if line.startswith('Inspecting'):
            yield header, block
            header = line.split()[-1]
            block = []
        elif len(line):
            block.append(line.split()[0].strip())
    yield header, block

def count_by_block(filename=None):
    o_file = sys.stdin if not filename else open(filename)
    for header, block in gen_block(o_file):
        print(header)
        for c in set(block):
            print(c, '-', block.count(c))

if __name__ == '__main__':
    count_by_block(sys.argv[1] if len(sys.argv) == 2 else None)

In either case, you can use zcat or gzip -cd to stream in a compressed file.
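In the Python version you can also skip the external process entirely: gzip.open in text mode yields decoded lines just like a plain file object, so the same parser works on both (a small sketch; open_maybe_gzipped is just an illustrative helper name):

```python
import gzip

def open_maybe_gzipped(path):
    """Open a plain or gzipped text file transparently.
    gzip.open(..., "rt") returns a file-like object that iterates
    over decoded lines, just like open() on an uncompressed file."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)
```

You would then pass the returned stream to the same block parser as before.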

8.7 years ago

Here is a pretty dirty python script you can use on your huge text file (not on the single files):

import sys

inFile = open(sys.argv[1], 'r')

block = []
header = next(inFile).strip().split()[-1]
for line in inFile:
    if line.strip() != '':
        data = line.strip().split()
        if data[0] == "Inspecting":
            print(header)
            count = {}
            for d in block:
                count[d[0]] = count.get(d[0], 0) + 1
            for motif, num in count.items():
                print(motif + "\t" + str(num))
            block = []
            header = data[-1]
        else:
            block.append(data)

# handle the last block, which has no "Inspecting" line after it
print(header)
count = {}
for d in block:
    count[d[0]] = count.get(d[0], 0) + 1
for motif, num in count.items():
    print(motif + "\t" + str(num))

Save it as yourName.py and run it with: python yourName.py yourHugeTextFile.txt
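For what it's worth, the same per-block counting can be written more compactly with collections.Counter (a Python 3 sketch, same input format as above; count_motifs is just an illustrative name):

```python
import sys
from collections import Counter

def count_motifs(stream):
    """Yield (region_header, Counter of motif names) for each
    'Inspecting sequence ID' block in the stream."""
    header, counts = None, Counter()
    for line in stream:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "Inspecting":
            if header is not None:
                yield header, counts
            header, counts = fields[-1], Counter()
        else:
            counts[fields[0]] += 1
    if header is not None:
        yield header, counts  # flush the last block

if __name__ == "__main__":
    for header, counts in count_motifs(sys.stdin):
        print(header)
        for motif, num in counts.items():
            print(motif, "-", num)
```

Run it as: python countMotifs.py < yourHugeTextFile.txt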

8.7 years ago
Woa ★ 2.8k

Another Perl hack. Note that this 'slurps' the whole input file at once, so on a very big file it may become slow or run out of memory.

use strict;
use warnings;

my $pattern  = "Inspecting sequence ID   ";
my $filename = "mydata.txt"; # big data file
my @chunks   = ();

open(FH, '<', $filename) or die($!);
{
    local $/ = undef; # slurp the whole file
    @chunks = split(/$pattern/m, <FH>);
}
close(FH) or die($!);

shift @chunks; # drop anything before the first header
foreach my $i (@chunks) {
    my @lines = grep { defined $_ } split(/\n+/, $i);
    my $chr   = shift(@lines);
    my %motif_count = ();
    map { my ($motif) = split(/\s+/, $_); $motif_count{$motif}++; } @lines;
    print $chr, "\n";
    map { print $_, " - ", $motif_count{$_}, "\n"; } keys %motif_count;
    print "\n";
}

Output:

chrI:846767-847266
I$HB_01 - 1
I$UBX_01 - 1
I$KR_01 - 1
V$ELK1_01 - 1
V$E2F_01 - 2
F$HSF_01 - 1
I$HSF_01 - 4

chrI:1344697-1345196
V$ATF_01 - 4
V$CREBP1CJUN_01 - 2
F$HSF_01 - 2
I$HSF_01 - 3

chrI:2689476-2689975
I$HB_01 - 1
V$GFI1_01 - 1
V$COMP1_01 - 1
V$RSRFC4_01 - 1
V$PAX6_01 - 1
I$HSF_01 - 3

Note that the site's formatting swallowed the filehandle in the split line; it should read: @chunks = split(/$pattern/m, <FH>);
