Forum: (Closed) Parsing in perl - separate output to two files
gravatar for eric.oswald
4.4 years ago by
eric.oswald0 wrote:

This may be the wrong webpage to be posting this kind of question since it relates specifically to Perl, but in trying to find an answer to my question, Google showed me several pages on your website, so I thought I'd give it a shot.

Was given "test data" in a .csv file to write a Perl script against - needed to "scrub" the data for it to later be loaded into MySQL. Wrote a Perl script that worked just fine. Then I got the actual data from another dept and it turns out there's quite a few incorrect records in these larger .csv files. These records are missing most of the data, but were given to me as is. As a result, I now need to check if a certain field is null/empty/blank and if so, write it out to an error file. If that field is not empty, I scrub the data of that record and write it out to a different file.

So I threw in an IF/ELSE statement to check for that empty field and ran my script against the new data. It did write out the correct, scrubbed records to my "clean" file, eliminating the "error" records. But somehow I'm getting over 8 million records in my "errors" file when there are only approx 65,000 records in the original data file I'm testing!

Was hoping someone here might have a keen eye as to seeing maybe I have a bracket in the wrong place? Something along those lines?

Here is a sample of four lines from the file with the last line being an "error" file since $fields[28] is empty -

650096571,1,1,used as store paint,14,IFC 8012NP,Standalone-9,3596,56,1/31/2015,80813,A97W01251,,1/16/2015,0.25,0.25,,SW,CUSTOM MATCH,TRUE,O,xts,,,,,,,1568,61006,1,FALSE
650368376,1,3,Tinted Wrong Color,16,IFC 8012NP,01DX8015206,,6,1/31/2015,160720,A87W01151,MATCH,1/31/2015,1,1,ENG,CUST,CUSTOM MATCH,TRUE,O,Ci52,,,,,,,1584,137252,1,FALSE
650175433,3,1,not tinted - e.w.,16,COROB MODULA HF,Standalone-7,,2,1/31/2015,95555,B20W02651,,1/29/2015,3,3,,COMP,CUSTOM MATCH,TRUE,P,xts,,,,,,,1627,68092,5,FALSE
650187016,2,1,checked out under cash ,,,,,,,,,,,,,,,,,,,,,,,,,,,,

And here is my script -


use strict;
use warnings;
use Data::Dumper;
use Time::Piece;

my $filename = '1502_comma.csv';

# Open input file
open my $FH, $filename
  or die "Could not read from $filename <$!>, program halting.";

# Open error handling file
open ( my $ERR_FH, '>', "errorFiles.csv" ) or die $!;

# Read the header line of the input file and print to screen.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
print Dumper(@fields), $/;

my @data;

# Read the lines one by one.
while($line = <$FH>) {


# Scrub data of characters that cause scripting problems down the line.
    $line =~ s/[\'\\]/ /g;

# split the fields of each record

    my @fields = split(/,/, $line);

# Check if the storeNbr field is empty.  If so, write record to error file.
    if (!length $fields[28]) {
        print $ERR_FH join (',', @$_), $/ for @data;

# Concatenate the first three fields and add to the beginning of each record
    unshift @fields, join '_', @fields[28..30];

# Format the DATE fields for MySQL
    $_ = join '-', (split /\//)[2,0,1] for @fields[10,14,24,26];

# Scrub colons from the data
    $line =~ s/:/ /g;

# If Spectro_Model is "UNKNOWN", change
    if($fields[22] eq "UNKNOWN"){
        $_ = 'UNKNOW' for $fields[22];

# If tran_date is blank, insert 0000-00-00
    if(!length $fields[10]){
        $_ = '0000-00-00' for $fields[10];

# If init_tran_date is blank, insert 0000-00-00
    if(!length $fields[14]){
        $_ = '0000-00-00' for $fields[14];

# If update_tran_date is blank, insert 0000-00-00
    if(!length $fields[24]){
        $_ = '0000-00-00' for $fields[24];

# If cancel_date is blank, insert 0000-00-00
    if(!length $fields[26]){
        $_ = '0000-00-00' for $fields[26];

# Format the PROD_NBR field by deleting any leading zeros before decimals.
    $fields[12] =~ s/^\s*0\././;

# put the records back
    push @data, \@fields;

close $FH;
close $ERR_FH;

print "Unsorted:\n", Dumper(@data); #, $/;

#Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
@data = sort {
    $a->[0] cmp $b->[0] ||
    $a->[14] cmp $b->[14] ||
    $a->[26] cmp $b->[26] ||
    $a->[27] cmp $b-> [27]
} @data;

open my $OFH, '>', '/path/cleaned1502.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;

parse forum perl • 2.1k views
ADD COMMENTlink written 4.4 years ago by eric.oswald0

Hello eric.oswald!

We believe that this post does not fit the main topic of this site.

I'm sorry, this is not a Perl forum. Please ask stackoverflow.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.


ADD REPLYlink written 4.4 years ago by RamRS30k
Please log in to add an answer.
The thread is closed. No new answers may be added.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1191 users visited in the last hour