Join Similar CSV Using Perl
When exporting logs in Check Point using the smart log feature, sometimes the column order is randomized. If you have multiple files and need to join them, especially if they have different column orders, you don’t want to do this by hand.
1 Million Record Limit
You have a limit of 1 Million records when you export log files using the Check Point Smart Log application. Excel similarly limits file sizes to 1 Million records. However, many times you will find multiple CSV files stored as flat text databases and you need to parse them and they will total more than 1 Million records.
Text::CSV_XS to Join CSVs
If you want to join multiple CSV together by hand, you would have to open each CSV, make sure the column order was correct, rearrange bad columns, then paste them. Excel has a limit of 1 Million records. What happens if your CSV file has 15 million records? This Perl script was written to automate this tedious puzzle without the use of Excel.
Join CSV Files with Different Columns
The first version simply opened a set of files and then appended them to a new file, skipping any headers after the first one was found. This version uses a hash to map the columns then rewrites all CSV files to the output format you specify.
Error Checking on CSV
This Perl script does not check the validity of your CSV, and trusts Text::CSV_XS to do that job. If there are encoding errors, it might choke. For the purposes of joining multiple Check Point Log exports into 1 single CSV, it performs very well. You could easily add a few lines that would remove commas or massage the data as needed if that was required. Luckily, it wasn’t required for this Perl script or the data sets we were using.
Would you like Custom Perl Scripting for your projects? We can help you parse data and extract the information you want, just like this Perl script does:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 |
#!/usr/bin/perl use warnings; use strict; use Text::CSV_XS; use 5.010; use Benchmark; # start timer my $t0 = new Benchmark; # output name my $outfile = $ARGV[0] // 'X'; if (length $outfile < 3) { die "Need a file name for 'output'\n"; } # only print header once my $header = 0; # find our list of csvs first, then loop over them my @csvs = glob "./*.csv"; foreach my $in (@csvs) { my $t1 = new Benchmark; print "FILE: $in\n"; # this is what it should be # we push into array of what is actually is. #$[0] is proper order #$[1] is just a counter for seen/not #$[2] will be pushed with what input file is my %base = ( 'Time' => ['A',0,0], 'Description' => ['B',1,0], 'Destination' => ['K',10,0], 'Policy Name' => ['Q',16,0], 'Protocol' => ['T',19,0], 'Source Port' => ['W',22,0], 'Destination Port' => ['X',23,0], 'Service Name' => ['Z',25,0], 'Source' => ['AA',26,0], 'ICMP Type' => ['AE',30,0], ); # don't want to type up a manual map. This does enough cols to match # extra columns are ignored my %map; my $abc = 0; foreach my $abc1 ('A'..'Z') { $map{$abc1} = $abc; $abc++; } foreach my $abc1 ('A'..'Z') { my $k = 'A'.$abc1; $map{$k} = $abc; $abc++; } foreach my $abc1 ('A'..'Z') { my $k = 'B'.$abc1; $map{$k} = $abc; $abc++; } foreach my $abc1 ('A'..'Z') { my $k = 'C'.$abc1; $map{$k} = $abc; $abc++; } my %rmap = reverse %map; my %in; # maps the keys on the inbound file # get headers of each input file my @fields; my $csv = Text::CSV_XS->new ( {binary => 1, auto_diag => 1 }); # new CSV object in binary mode open my $fh, "<:encoding(utf8)", $in or die "$!\n"; # open file handle ($fh) for reading (<:) while (my $row = $csv->getline ($fh)) { @fields = @$row; last; } close $fh; # all fields, raw, even ones we don't want # show me where the ones I want are, and do they match my %base? my $map = 0; # loop over all, even bad ones foreach my $f (@fields) { # if it's a keeper, do something if (exists $base{$f}) { # but only keep the first one found, if ($base{$f}[2] > 0) { next } # not the first one, so, put in our list and keep the $map id too push @{$base{$f}},$map; # tracks if it's first or not $base{$f}[2]++; } # what column are we on? This is the "input column" $map++; } # %base isn't sorted like we want, so sort by something we like my @keys = sort { $base{$a}[1] <=> $base{$b}[1] } keys %base; # ok, verify rewrite of our keeper fields, now in order foreach my $f (@keys) { print "BASE: $base{$f}[0] $base{$f}[1]\tIN:$base{$f}[3]\t$f\n"; } # we have the proper map now in $base{$f}[3], open CSV to print out # and print it out properly, append mode, utf8 encoding open my $fh3, ">>:encoding(utf8)", $outfile or die "$!\n"; # open CSV to read in, loop line by line and print to fh my $csv2 = Text::CSV_XS->new ( { binary => 1, allow_loose_quotes => 1, # required for lines that have quotes randomly inside auto_diag => 1 }); # new CSV object in binary mode open my $fh2, "<:encoding(utf8)", $in or die "$!\n"; # open file handle ($fh) for reading (<:) while (my $row = $csv2->getline ($fh2)) { # skip header, if it's been tripped if ($header == 1){ $header++; next} # gather info, mapped my $time = $row->[$map{$rmap{$base{'Time'}[3]}}]; my $desc = $row->[$map{$rmap{$base{'Description'}[3]}}]; my $dest = $row->[$map{$rmap{$base{'Destination'}[3]}}]; my $policy = $row->[$map{$rmap{$base{'Policy Name'}[3]}}]; my $proto = $row->[$map{$rmap{$base{'Protocol'}[3]}}]; my $sport = $row->[$map{$rmap{$base{'Source Port'}[3]}}]; my $dport = $row->[$map{$rmap{$base{'Destination Port'}[3]}}]; my $service = $row->[$map{$rmap{$base{'Service Name'}[3]}}]; my $source = $row->[$map{$rmap{$base{'Source'}[3]}}]; my $icmp = $row->[$map{$rmap{$base{'ICMP Type'}[3]}}] // " "; # sometimes the field is blank and gives obnoxious error # print out to output print $fh3 qq~"$time","$desc","$dest","$policy","$proto","$sport","$dport","$service","$source","$icmp"\n~; } close $fh2; close $fh3; # header is 1 after every try, then goes past 1, then reset to 1 on each new open $header = 1; my $t2 = new Benchmark; my $td = timediff($t2,$t1); print "FILE DONE!! took '",timestr($td), "' seconds\n"; } my $t1 = new Benchmark; my $td = timediff($t1,$t0); print " ALL DONE!! took '",timestr($td), "' seconds\n"; |