Introduce a parsing loop / refactor ugly parsing code

I am writing a script that reads from a binary file, converts to ASCII, extracts/delimits 2 columns, and pipes it out to a txt.
I looked at this post to implement the binary-to-ASCII step, but, as implemented in my script, it seems to perform the above process only on the first row of the file.
How would I re-write this to loop through all rows in the file?
My code is below.
# run the command script to extract the file
script.cmd
# Read the entire file to an array of bytes.
$bytes = [System.IO.File]::ReadAllBytes("filePath")
# Decode first 'n' number of bytes to a text assuming ASCII encoding.
$text = [System.Text.Encoding]::ASCII.GetString($bytes, 0, 999999)|
# only keep columns 0-22; 148-149; separate with comma delimiter
%{ "$($_[0..22] -join ''),$($_[147..147] -join '')"} |
# convert the file to .txt
set-content path\file.txt
Also, what is a more elegant way of writing this part so it just reads the length of the string, instead of pulling in up to 999999 bytes?
$text = [System.Text.Encoding]::ASCII.GetString($bytes, 0, 999999)|

You don't need to specify index and count. Simply use
[System.Text.Encoding]::ASCII.GetString($bytes).Split("`r`n",[System.StringSplitOptions]::RemoveEmptyEntries)
or
[System.Text.Encoding]::ASCII.GetString([System.IO.File]::ReadAllBytes("filePath")).Split("`r`n",[System.StringSplitOptions]::RemoveEmptyEntries)
I'm not sure why you would want to read it as bytes, when you could simply use Get-Content.
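As a complete sketch of that loop (assuming ASCII content, that "filePath" and path\file.txt stand for your real paths, and that every row you want is at least 148 characters wide):
# read line by line, keep chars 0-22 and the char at index 147, write as CSV
Get-Content "filePath" |
    Where-Object { $_.Length -ge 148 } |
    ForEach-Object { "$($_.Substring(0,23)),$($_.Substring(147,1))" } |
    Set-Content "path\file.txt"
Substring(0,23) covers columns 0-22, and Substring(147,1) matches the $_[147..147] in your code.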

0x85 Windows-1252 breaks line if file opened with UTF-8 encoding

I have a file with an old format from the 70s used in Companies House (UK company registry).
I inherited a parser written 6 years ago which goes line by line and according to a set of conditions extracts the information from the line and inserts them into a dictionary.
There is a weird character that is breaking a line.
I copied this line to a new file with awk '{if(NR==33411) print $0}' PROD216_1950_ew_1.dat > broken and opened broken in vim.
It turns out that the weird character is rendered by vim as <85>.
The result is that everything after MAYFIELD is read as a new line.
Below is the line in question:
000376702103032986930001 1993010119941024 193709 0105<BARRY ALEXANDER<GROSVENOR<<<<MAYFIELD 3<41 PLANTATION ROAD<THE PEAK<<HONG KONG<BANK EXECUTIVE<BRITISH<<
in vim becomes
000376702103032986930001 1993010119941024 193709 0105<BARRY ALEXANDER<GROSVENOR<<<<MAYFIELD <85>3<41 PLANTATION ROAD<THE PEAK<<HONG KONG<BANK EXECUTIVE<BRITISH<<
I am using codecs to read this file with a context manager, which I thought was the way to go about it:
with codecs.open(filepath, 'r', 'utf-8') as fh:
    for line in fh:
        linetype = determine_line_type(line)
        if linetype == 'header':
            continue
        elif linetype == 'company':
            ...  # do stuff
        elif linetype == 'officer':
            ...  # do stuff
Is there anything I am missing? What is that <85>?
vim shows <85> to indicate a hex 85 byte that is invalid in the current encoding (i.e., the encoding it's using to decode the file).
My guess is that the file's encoding is Windows-1252, in which hex 85 denotes the ellipsis character.
So the solution for your parser might be as simple as changing 'utf-8' to 'cp1252' in the codecs.open call.
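A minimal sketch of that change (everything else stays as in your loop):
# open with Windows-1252 instead of UTF-8; hex 85 then decodes cleanly
with codecs.open(filepath, 'r', 'cp1252') as fh:
    for line in fh:
        linetype = determine_line_type(line)
        # ... proceed as before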
After going around for some time here and here I came up with this solution, which works.
with open(filepath, encoding='utf-8') as fh:
    for line in fh:
        byteline = bytearray(line, encoding='utf-8').replace(b'\xc2\x85', b'')
        line_clean = byteline.decode(encoding='utf-8')
        # do stuff with clean line.
The key was knowing that the byte sequence that breaks the string is b'\xc2\x85' (it is interpreted as an ellipsis character).
First encode the string to an array of bytes with bytearray, then use the replace method of the bytearray class, and finally decode the clean line with the decode method, which returns the string without the weird character.
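As a side note, b'\xc2\x85' is the UTF-8 encoding of U+0085, so an equivalent sketch can do the cleanup on the decoded string directly, without the round trip through bytearray:
line_clean = line.replace('\u0085', '')  # strip U+0085 (the <85> byte) from the text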

UniVocity CSV parser with varying row lengths?

I have a 26-million-row dataset, and when I try parsing it with the uniVocity parser it reads only 18 million rows.
My rows' field count varies from 158 to 162, with ASCII '\u0001' as the delimiter.
wc -l output from Linux:
wc -l withHeader.dat
26351323 withHeader.dat
But the parser reads it as Total # of rows in file = 18554088 (the size of the list returned by parser.parseAll()).
Can someone explain what the issue could be?
These are my parser settings:
CsvParserSettings settings = new CsvParserSettings();
settings.getFormat().setLineSeparator("\n");
settings.selectFields("acctId","tcat", "transCode");
settings.getFormat().setDelimiter('\u0001');
//settings.setAutoConfigurationEnabled(true);
//settings.setMaxColumns(86);
settings.setHeaderExtractionEnabled(false);
// creates a CSV parser
CsvParser parser = new CsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(newReader(filePath));
System.out.println("Total # of rows in file = " + allRows.size());
If your values can contain line separators, then the number of parsed records won't be equal to the number of lines.
If that's not the case, then it's likely you are not configuring the format correctly. You might need to configure quotes, quote escapes, etc.
My first suggestion is to try to detect the format automatically with:
settings.detectFormatAutomatically();
After parsing, check if you got the row count you expect to find. You can get what has been detected by calling:
CsvFormat detectedFormat = parser.getDetectedFormat();
Keep in mind this process is not guaranteed to work but in the majority of cases it does the trick. These features are available as of version 2.0.0.
If nothing helps, please attach (part of) your input file so I can take a look and update my answer.
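Putting the suggestion together, a minimal self-contained sketch (univocity-parsers 2.x; filePath stands for your input file):
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.FileReader;
import java.util.List;

public class DetectFormat {
    public static void main(String[] args) throws Exception {
        String filePath = args[0];
        CsvParserSettings settings = new CsvParserSettings();
        settings.detectFormatAutomatically();  // guesses delimiter, quote and line separator
        CsvParser parser = new CsvParser(settings);
        List<String[]> rows = parser.parseAll(new FileReader(filePath));
        System.out.println("Detected format: " + parser.getDetectedFormat());
        System.out.println("Total # of rows in file = " + rows.size());  // compare against wc -l
    }
}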

Split file into two tempfiles in Ruby

I have a large file with 2 different formats separated by a dashed line. How can I split the file into two tempfiles for processing?
Example:
yaml:format
yaml:format
yaml:format
---------
csv,format
csv,format
etc.
Split at exactly twelve dashes:
yaml, csv = input.split('------------', 2)
or at a variable number of dashes:
yaml, csv = input.split(/^-+$/, 2)
This will leave empty lines around the delimiter (at the end of the YAML part and the start of the CSV part); if you want to get rid of them, you can do
yaml, csv = input.split(/[\r\n]+^-+$[\r\n]+/, 2)
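To get the two parts into actual tempfiles, a minimal sketch (the 'combined.dat' path and the tempfile basenames are illustrative):
require 'tempfile'

input = File.read('combined.dat')                # hypothetical input path
yaml_part, csv_part = input.split(/[\r\n]+^-+$[\r\n]+/, 2)

yaml_file = Tempfile.new(['upper', '.yml'])
csv_file  = Tempfile.new(['lower', '.csv'])
yaml_file.write(yaml_part)
csv_file.write(csv_part)
[yaml_file, csv_file].each(&:rewind)             # rewind so each part can be read back for processing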

How to convert .txt files to .xls files using Informix 4GL code

I have a question to discuss. I am working on INFORMIX 4GL programs that produce output text files. This is an example of the output:
Lot No|Purchaser name|Billing|Payment|Deposit|Balance|
J1006|JAUHARI BIN HAMIDI|5285.05|4923.25|0.00|361.80|
J1007|LEE, CHIA-JUI AKA LEE, ANDREW J. R.|5366.15|5313.70|0.00|52.45|
J1008|NAZRIN ANEEZA BINTI NAZARUDDIN|5669.55|5365.30|0.00|304.25|
J1009|YAZID LUTFI BIN AHMAD LUTFI|3180.05|3022.30|0.00|157.75|
From those output text (.txt) files, we can open the data manually in Excel (.xls). Is there any 4GL code or command we can use to open the text files in Microsoft Excel automatically, right after we run the program? If there are any ideas, please share with me. Thank you.
The output shown is in the normal Informix UNLOAD format, using the pipe as a delimiter between fields. The nearest approach to this for Excel is a CSV file with comma-separated values. Generating one of those from that output is a little fiddly. You need to enclose fields containing a comma inside double quotes. You need to use commas in place of pipes. And you might have to worry about backslashes too.
It is a moot point whether it is easier to do the conversion in I4GL or whether to use a program to do the conversion. I think the latter, so I wrote this script a couple of years ago:
#!/usr/bin/env perl
#
# @(#)$Id: unl2csv.pl,v 1.1 2011/05/17 10:20:09 jleffler Exp $
#
# Convert Informix UNLOAD format to CSV
use strict;
use warnings;
use Text::CSV;
use IO::Wrap;
my $csv = new Text::CSV({ binary => 1 }) or die "Failed to create CSV handle ($!)";
my $dlm = defined $ENV{DBDELIMITER} ? $ENV{DBDELIMITER} : "|";
my $out = wraphandle(\*STDOUT);
my $rgx = qr/((?:[^$dlm]|(?:\\.))*)$dlm/sm;
# $csv->eol("\r\n");
while (my $line = <>)
{
    print "1: $line";
    MultiLine:
    while ($line eq "\\\n" || $line =~ m/[^\\](?:\\\\)*\\$/)
    {
        my $extra = <>;
        last MultiLine unless defined $extra;
        $line .= $extra;
    }
    my @fields = split_unload($line);
    $csv->print($out, \@fields);
}

sub split_unload
{
    my($line) = @_;
    my @fields;
    print "$line";
    while ($line =~ m/$rgx/g)
    {
        printf "%d: %s\n", scalar(@fields), $1;
        push @fields, $1;
    }
    return @fields;
}
__END__
=head1 NAME
unl2csv - Convert Informix UNLOAD to CSV format
=head1 SYNOPSIS
unl2csv [file ...]
=head1 DESCRIPTION
The unl2csv program converts a file from Informix UNLOAD file format to
the corresponding CSV (comma separated values) format.
The input delimiter is determined by the environment variable
DBDELIMITER, and defaults to the pipe symbol "|".
It is not assumed that each input line is terminated with a delimiter
(there are two variants of the UNLOAD format, one with and one without
the final delimiter).
=head1 EXAMPLES
Input:
10|12|excessive|cost \|of, living|
20|40|bou\\ncing tigger|grrrrrrrr|
Output:
10,12,"excessive","cost |of, living"
20,40,"bou\ncing tigger",grrrrrrrr
=head1 RESTRICTIONS
Since the csv2unl program does not know about binary blob data, it
cannot convert such data into the hex-encoded format that Informix
requires.
It can and does handle text blob data.
=head1 PRE-REQUISITES
Text::CSV_XS
=head1 AUTHOR
Jonathan Leffler <jleffler@us.ibm.com>
=cut
I generate Excel files from 4GL code by writing XML with the Excel progid (<?mso-application progid="Excel.Sheet"?>) so Excel opens it as such.
It's like writing HTML from 4GL: you just write HTML code to a file. But with Excel you write XML.
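For illustration, a minimal sketch of such an XML file (SpreadsheetML; the sheet name is illustrative and the cell values are taken from the sample output above). Saved with an .xml extension, Excel opens it as a workbook:
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Billing">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">J1006</Data></Cell>
        <Cell><Data ss:Type="String">JAUHARI BIN HAMIDI</Data></Cell>
        <Cell><Data ss:Type="Number">5285.05</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>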

list of garbage characters like ’

I am using librets to retrieve data from my RETS server. Somehow librets' encoding method is not working and I am receiving some weird characters in my output. I noticed that characters like '’' are replaced with ’. I am unable to find a fix for librets, so I decided to replace such garbage characters with the actual values after downloading the data. What I need is a list of such garbage strings and their equivalent characters. I googled for this but did not find any resource. Can anyone point me to a list of such garbage letters and their actual values, or to a piece of code which can generate such a list?
Thanks
Search for the term "UTF-8", because that's what you're seeing.
UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used all in human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as ’. (Search for "CP-1252" to see a table of bytes and characters.)
I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.
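The whole chain can be demonstrated in a few lines (a Python sketch, used here only to illustrate the byte-level accident; the explanation above is language-agnostic):
print("\u2019".encode("utf-8"))                # b'\xe2\x80\x99'  (bytes 226, 128, 153)
print(b'\xe2\x80\x99'.decode("cp1252"))        # ’  (the "garbage" you are seeing)
print("’".encode("cp1252").decode("utf-8"))  # ’  (reversing the accident repairs the text)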
Question reminder:
"...I noticed characters like '’' is replaced with ’... i decided to
replace such garbage characeters with actual values after downloading
data. What I need is a list of such garbage string and their
equivalent characters."
Strictly dealing with this part:
"What I need is a list of such garbage string and their equivalent
characters."
Using PHP, you can generate these characters and their equivalents. Working with all 1,111,998 Unicode points or 109,449 UTF-8 symbols is impractical. You may use the ASCII range in the following loop, between &#128 and &#258, or another range that is more relevant to your context.
<?php
$tmp1 = "";
for ($i = 128; $i < 258; $i++)
    $tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";", ENT_NOQUOTES, "utf-8")."</td><td>&#".$i.";</td></tr>";
echo "<table border=1>
<tr><td>&#</td><td>\"Garbage\"</td><td>symbol</td></tr>";
echo $tmp1;
echo "</table>";
?>
From experience, in an ASCII context, most "garbage" symbols originate in the range &#128 to &#257, plus (seldom) &#8129 to &#8246.
In order for the "garbage" symbols to display, the HTML page charset must be set to ISO-8859-1 or whichever other charset caused the problem in the first place. They will not show if the charset is set to UTF-8.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
"i decided to replace such garbage characeters with actual values
after downloading data"
You CANNOT undo the "garbage" with PHP's utf8_decode(), which would actually create more "garbage" on top of what is already "garbage". But you may use the simple and fast search-and-replace str_replace() function.
First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array is the Search term:
<?php
//ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
$tmp1 = "\$SearchArr = array(";
for ($i = 128; $i < 258; $i++)
    $tmp1 .= "\"".html_entity_decode("&#".$i.";", ENT_NOQUOTES, "utf-8")."\", ";
$tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
$tmp1 .= ");";
$tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>
The second array is the replace term:
<?php
//Adapt for your relevant range.
$tmp2 = "\$ReplaceArr = array(\n";
for ($i = 128; $i < 258; $i++)
    $tmp2 .= "\"&#".$i.";\", ";
$tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
$tmp2 .= ");";
echo $tmp1."\n<br><br>\n";
echo $tmp2."\n";
?>
Now, you've got 2 arrays that you can copy and paste to use and reuse to clean any of your infected strings like this:
$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
Note: utf8_decode() is of no help for cleaning up "garbage" symbols, but it can be used to prevent further contamination. Alternatively, an mb_* function can be useful.
