Fix Encoding Issues With Perl
Text files such as log exports sometimes contain stray bytes left over from a mismatched encoding. The Perl script at the end of this post is one way to clean them up.
It opens each file as raw bytes, replaces the offending bytes with spaces, and writes the result out as ASCII. To see what Linux thinks a file's type and encoding are, run the “file” command with the -i option. It reports whether the file is binary, UTF-8, or ASCII.
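If you would rather do a similar check from inside Perl, here is a rough sketch. The guess_encoding helper is hypothetical (it is not part of the script in this post), and its labels are only modeled on the kind of charset names "file -i" prints. It assumes that a NUL byte means binary, that no bytes above 0x7F means ASCII, and that anything which decodes strictly as UTF-8 is UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# guess_encoding() is a hypothetical helper for illustration only.
# It takes a string of raw bytes and returns a rough label.
sub guess_encoding {
    my ($bytes) = @_;

    # assumption: a NUL byte does not appear in plain text, so call it binary
    return 'binary' if $bytes =~ /\x00/;

    # no byte above 0x7F means every byte is plain ASCII
    return 'us-ascii' unless $bytes =~ /[\x80-\xFF]/;

    # strict decode: FB_CROAK dies on malformed UTF-8, LEAVE_SRC keeps
    # $bytes unmodified; the eval turns that die into a false value
    my $ok = eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
        1;
    };
    return $ok ? 'utf-8' : 'unknown-8bit';
}

# report on every file named on the command line
for my $path (@ARGV) {
    open( my $fh, '<:raw', $path ) or die "Cannot open '$path': $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    print "$path: ", guess_encoding( $bytes // '' ), "\n";
}
```

Save it under any name you like and pass it the files to inspect. Files that come back as neither ASCII nor UTF-8 are the ones most likely to need the cleanup script.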
As always, if you want help with your Perl programming projects, please contact us for a free quote!
Here is the Perl script. It replaces a specific set of problem bytes with spaces and writes the result in ASCII encoding:
#!/usr/bin/perl
# fixer: fixes weird chars in a bunch of text files

use warnings;   # standard best practices
use strict;     # standard best practices
use 5.010;      # for the

opendir( my $dir, "." ) or die "Cannot open directory: $!";

while ( defined( my $file = readdir($dir) ) ) {

    # skip certain types/etc
    #next if $file =~ /^\./;
    #next if $file =~ /\.zip$/;
    #next if $file =~ /\.pl$/;

    # or just pick a certain type
    next unless $file =~ /\.csv$/;

    # renamed because of glob originally, opendir works fine, not required
    my $renamed = $file;
    $renamed =~ s/\s+/_/g;
    if ( $renamed ne $file ) {
        rename( $file, $renamed ) or die "Cannot rename '$file': $!";
    }
    print "FILE: '$file' ($renamed)\n";

    # open the input file as raw bytes (three-argument open, so odd
    # characters in the file name are never treated as an open mode)
    open( my $in, '<:raw', $renamed )
        or die "Cannot open '$renamed': $!";

    # open the output file (input name plus a trailing "-") as ASCII
    open( my $out, '>:encoding(ascii)', "${renamed}-" )
        or die "Cannot open '${renamed}-': $!";

    # read file line by line
    while ( my $line = <$in> ) {
        chomp $line;

        # replace each problem byte with a space
        # C2 A0    = UTF-8 encoding of a no-break space
        # E2 80 93 = UTF-8 encoding of an en dash
        $line =~ s/\xC2/ /g;
        $line =~ s/\xA0/ /g;
        $line =~ s/\x93/ /g;
        $line =~ s/\x80/ /g;
        $line =~ s/\xE2/ /g;
        $line =~ s/\x0a/ /g;
        $line =~ s/\x0d/ /g;

        print $out "$line\r\n";    # make it end in windows format
        #print $out "$line\n";     # make it end in linux format
    }

    # close file handles
    close $in;
    close $out or die "Cannot close '${renamed}-': $!";
}

closedir $dir;

# 0d \r = CR = carriage return = ASCII code 13 (decimal), 015 (octal), 0d (hex)
# 0a \n = LF = line feed       = ASCII code 10 (decimal), 012 (octal), 0a (hex)
# 0d0a = windows new line
# 0a   = linux new line
# On Windows, the combination of those two control characters, i.e. \r\n,
# is used to indicate a newline, while on Linux/Unix, a single \n is used as newline.
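The newline notes at the end of the script can be turned into a small standalone helper. This is only a sketch: normalize_newlines and its 'windows' / 'linux' style names are made up for illustration, and it assumes the input uses only \r\n or \n line endings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# normalize_newlines() is a hypothetical helper for illustration only.
# $style is 'windows' for \r\n (0d0a); anything else gives \n (0a).
sub normalize_newlines {
    my ( $text, $style ) = @_;
    my $eol = $style eq 'windows' ? "\r\n" : "\n";

    # match an optional CR followed by LF, so that both Windows and
    # Linux line endings are rewritten to the requested style
    $text =~ s/\r?\n/$eol/g;
    return $text;
}

# example: a string with mixed endings, converted both ways
my $mixed   = "one\r\ntwo\nthree\n";
my $windows = normalize_newlines( $mixed, 'windows' );
my $linux   = normalize_newlines( $mixed, 'linux' );
printf "windows: %d bytes, linux: %d bytes\n",
    length($windows), length($linux);
```

Doing the conversion in one substitution keeps the choice of line ending in a single place, instead of swapping between two commented-out print statements.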