Does the uniVocity CSV parser handle varying-length rows?

I have a 26-million-row dataset, but when I try parsing it with the uniVocity parser it reads only 18 million rows.
The field count per row varies from 158 to 162, and the delimiter is ASCII '\u0001'.
Output of wc -l on Linux:
wc -l withHeader.dat
26351323 withHeader.dat
But the parser reads it as Total # of rows in file = 18554088 (the size of the list returned by parser.parseAll()).
Can someone explain what the issue could be?
These are my parser settings:
settings.getFormat().setLineSeparator("\n");
settings.selectFields("acctId","tcat", "transCode");
settings.getFormat().setDelimiter('\u0001');
//settings.setAutoConfigurationEnabled(true);
//settings.setMaxColumns(86);
settings.setHeaderExtractionEnabled(false);
// creates a CSV parser
CsvParser parser = new CsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(newReader(filePath));
System.out.println("Total # of rows in file = " + allRows.size());

If your values can contain line separators, then the number of parsed records won't be equal to the number of lines.
If that's not the case, then it's likely you are not configuring the format correctly. You might need to configure quotes, quote escapes, etc.
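To see how that happens, here is a minimal sketch using Python's csv module (the input line is made up for illustration): one record whose quoted field contains a line break spans two physical lines, so a line count and a record count disagree.
import csv, io

# Hypothetical input: a single record whose quoted middle field
# contains a line break. wc -l would report 2 lines here.
data = 'acct1\x01"first part\nsecond part"\x01code1\n'

rows = list(csv.reader(io.StringIO(data), delimiter='\x01'))
print(len(data.splitlines()))  # 2 physical lines
print(len(rows))               # 1 parsed record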
My first suggestion is to try to detect the format automatically with:
settings.detectFormatAutomatically();
After parsing, check if you got the row count you expect to find. You can get what has been detected by calling:
CsvFormat detectedFormat = parser.getDetectedFormat();
Keep in mind this process is not guaranteed to work but in the majority of cases it does the trick. These features are available as of version 2.0.0.
If nothing helps, please attach (part of) your input file so I can take a look and update my answer.

Related

How to make a variable non-delimited file into a delimited one

Hello guys, I want to convert my non-delimited file into a delimited file. An example of the file follows.
Name. CIF Address line 1 State Phn Address line 2 Country Billing Address line 3
Alex. 44A. Biston NJ 25478163 4th,floor XY USA 55/2018 kenning
And so on; all the data is in this format.
The first three lines are metadata, and then the data follows.
How can I make it properly delimited programmatically?
There are two parts to this problem:
how to find the column widths
how to split each line into fields and output a new line with delimiters
I could not propose an automated solution for the first part, because, not knowing anything about the metadata format, there is no clear way to find where one column ends and the next begins. Some of the column headings contain multiple space-separated words, and space is also used as the separator between headings (and apparently one cannot use the rule "more than one space means the end of a heading name", because there is only one space between "Address line 2" and "Country", yet they are clearly separate columns). Finding the correct column widths requires understanding English, and that is not something you can write a program for.
For the second problem, things are much easier once you have the column positions. If you work out the column positions manually (or programmatically, if you know something about the metadata format that I don't and have a simple way of identifying column headings), then a short AWK program can do it, for example:
cols="8,15,32,40,53,66,83,105"
awk_prog='BEGIN {
nt=split(cols,tabs,",")
delim=","
ORS=""
}
{ o=1 ;
for (i in tabs) { t=tabs[i] ; f=substr($0,o,t-o); sub(" *$","",f) ; print f
delim ; o=t } ;
print substr($0, o) "\n"
}'
awk -v cols="$cols" "$awk_prog" input_file
NOTE that the above program does not deal correctly with the case where the separator character (e.g. ",") appears inside the data. If you decide to use it as-is, be sure to choose a separator that is not present in the input data. It may be better to modify the code to escape any separator characters found in the data (there are different ways to do this, depending on what you plan to feed the output file to).
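If you would rather sidestep that problem, here is a rough Python equivalent of the AWK program; it is only a sketch, the column positions are the same assumed values as above, and input_file/output.csv are placeholder names. The csv module quotes any field that happens to contain the separator:
import csv

cols = [8, 15, 32, 40, 53, 66, 83, 105]   # assumed 1-based column start positions

with open("input_file") as src, open("output.csv", "w", newline="") as dst:
    writer = csv.writer(dst)               # quotes fields that contain commas
    for line in src:
        line = line.rstrip("\n")
        starts = [1] + cols
        ends = cols + [len(line) + 1]
        # slice each field, trimming trailing padding spaces
        fields = [line[s - 1:e - 1].rstrip() for s, e in zip(starts, ends)]
        writer.writerow(fields)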

Lua Pattern Matching issue

I'm trying to parse a text file using Lua and store the results in two arrays. I thought my pattern was correct, but this is the first time I've done anything of the sort.
fileio.lua:
questNames = {}
questLevels = {}
lineNumber = 1
file = io.open("results.txt", "w")
io.input(file)
for line in io.lines("questlist.txt") do
    questNames[lineNumber], questLevels[lineNumber] = string.match(line, "(%a+)(%d+)")
    lineNumber = lineNumber + 1
end
for i = 1, lineNumber do
    if (questNames[i] ~= nil and questLevels[i] ~= nil) then
        file:write(questNames[i])
        file:write(" ")
        file:write(questLevels[i])
        file:write("\n")
    end
end
io.close(file)
Here's a small snippet of questlist.txt:
If the dead could talk16
Forgotten soul16
The Toothmaul Ploy9
Well-Armed Savages9
And here's a matching snippet of results.txt:
talk 16
soul 16
Ploy 9
Savages 9
What I'm after in results.txt is:
If the dead could talk 16
Forgotten soul 16
The Toothmaul Ploy 9
Well-Armed Savages 9
So my question is, which pattern do I use in order to select all text up to a number?
Thanks for your time.
%a matches letters. It does not match spaces.
If you want to match everything up to a sequence of digits you want (.-)(%d+).
If you want to match a leading sequence of non-digits then you want ([^%d]+)(%d+).
That being said, if all you want to do is insert a space before a sequence of digits, then you can just use line:gsub("%d+", " %0", 1) (the 1 limits it to the first match; leave it off to do it for every match on the line).
As an aside, I don't think io.input(file) is doing anything useful for you (or what you might expect): it replaces the default standard input file handle with the file handle file.

introduce parsing loop / refactor ugly code

I am writing a script that reads from a binary file, converts it to ASCII, extracts and delimits 2 columns, and pipes the result out to a .txt file.
I looked at this post to implement the binary-to-ASCII step but, as implemented in my script, it seems to perform the above process only on the first row of the file.
How would I rewrite this to loop through all rows in the file?
My code is below.
# run the command script to extract the file
script.cmd
# Read the entire file to an array of bytes.
$bytes = [System.IO.File]::ReadAllBytes("filePath")
# Decode first 'n' number of bytes to a text assuming ASCII encoding.
$text = [System.Text.Encoding]::ASCII.GetString($bytes, 0, 999999)|
# only keep columns 0-22; 148-149; separate with comma delimiter
%{ "$($_[$0..22] -join ''),$($_[147..147] -join '')"} |
# convert the file to .txt
set-content path\file.txt
Also, what is a more elegant way of writing this part so that it just reads the actual length of the data, instead of pulling in up to 999999 bytes?
$text = [System.Text.Encoding]::ASCII.GetString($bytes, 0, 999999)|
You don't need to specify index and count. Simply use
[System.Text.Encoding]::ASCII.GetString($bytes).Split("`r`n",[System.StringSplitOptions]::RemoveEmptyEntries)
or
[System.Text.Encoding]::ASCII.GetString([System.IO.File]::ReadAllBytes("filePath")).Split("`r`n",[System.StringSplitOptions]::RemoveEmptyEntries)
I'm not sure why you would want to read it as bytes, when you could simply use Get-Content.

parsing a text file where each record spans more than 1 line

I need to parse a text file that contains hundreds of records that span more than 1 line each. I'm new to Python and have been trying to do this with grep and awk in several complex ways but no luck yet.
The file contains records that look like this:
409547095517 911033 00:47:41 C44 00:47:46 D44 00:47:53 00:47:55
(555) 555-1212 00:47 10/31 100 Main Street - NW
Some_City TX 323 WRLS METRO PCS
P# 122-5217 ALT# 555-555-1212 LEC:MPCSI WIRELESS CALL Q
UERY CALLER FOR LOCATION QUERY CALLER FOR PHONE #*
Really, I can do everything I need if I can just get these multi-line records condensed to one line per record. Each record will always begin with "40", or I could let 9110 indicate the start, as both will always be there and are unique provided "40" is at the beginning of the line. Using a hex editor I found that I could remove all line feeds (hex 0D0A), but this is no better than manually editing the files, and programmatically I would need to avoid removing the last one per record. Some records will be only 2 lines, but most will be 5, like this one.
Is there a way, in Python or otherwise, to concatenate the lines that make up a record into one line, using "40" (or maybe the better choice, 9110) to indicate the start of the record?
Any ideas or pointers will be much appreciated. I've got Python and a good IDE, and I'm good with grep and find, but I'm still learning awk (don't laugh)...
awk will do it. You need to identify the line that starts a record; in this case it is the one beginning 409547095517.
So, to be safe, let's assume that a line starting with 8 digits is the start of a record.
awk 'NR > 1 && /^[0-9]{8}/ { printf("\n") }
     { printf("%s", $0) }
     END { printf("\n") }' filename > newfilename
Change the {8} to any number that works for you.
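Since you mentioned you have Python, here is a sketch of the same idea there; it relies on the same 8-digit assumption, and filename/newfilename are placeholders:
import re

record_start = re.compile(r"^[0-9]{8}")   # same assumption as the awk version

with open("filename") as src, open("newfilename", "w") as dst:
    record = ""
    for line in src:
        line = line.rstrip("\n")
        if record_start.match(line) and record:
            dst.write(record + "\n")      # flush the previous record
            record = line
        else:
            record += line                # glue continuation lines on
    if record:
        dst.write(record + "\n")          # don't forget the last record
As with the awk version, adjust the {8} in the pattern to whatever reliably marks a record start.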

Funny CSV format help

I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon it is "escaped" by wrapping it in doublequotes, like this ";".
I have been assured that there will never be two adjacent fields with trailing/leading doublequotes, so this format should technically be OK.
Now, for parsing it in VBScript, I was thinking of:
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
One approach I've used for this type of data before is to just split on the semicolons regardless and then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing field 4's closing quote with a semicolon, removing field 5's opening quote, and joining the two pieces.
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
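Written out in Python for illustration, that combining pass is short; this is only a sketch and assumes, per the question, that adjacent quotes only ever come from the ";" escape:
def parse_line(line):
    fields = []
    for part in line.split(";"):
        # If the previous piece ended with '"' and this one starts with '"',
        # the ';' we split on was escaped: rejoin the two pieces.
        if fields and fields[-1].endswith('"') and part.startswith('"'):
            fields[-1] = fields[-1][:-1] + ";" + part[1:]
        else:
            fields.append(part)
    return fields

print(parse_line('Pax;is;a;good;guy";" so;says;his;wife.'))
# ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
The pseudo-code below takes a different route, detecting the escape during a single left-to-right scan instead of fixing things up after the split.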
In pseudo-code, given:
input: a string; the first character is input[0] and the last is input[length]. Further, assume one dummy character, input[length+1]; it can be anything except ; and ". This string is one line of the "CSV" file.
length: a positive integer, the number of characters in input
Do this:
set start = 0
if input[0] = ';':
    you have a blank field at the beginning; do whatever with it
    set start = 1
endif
for each c between 1 and length:
    next iteration unless input[c] = ';'
    if input[c-1] ≠ '"' or input[c+1] ≠ '"':   // test for the escape sequence ";"
        found a field consisting of the half-open range [start, c); do
        whatever with it. Note that in the case of empty fields, start ≥ c,
        leaving an empty range
        set start = c+1
    endif
end foreach
Untested, of course. Debugging code like this is always fun...
The special case for input[0] is to make sure we never look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data (and your parsing) from input[1].
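For what it's worth, here is that scan rendered as runnable Python; treat it as a sketch, with the per-field handling reduced to collecting the fields into a list and unescaping them at the end:
def split_line(line):
    fields = []
    start = 0
    for c in range(len(line)):
        if line[c] != ";":
            continue
        prev_quote = c > 0 and line[c - 1] == '"'
        next_quote = c + 1 < len(line) and line[c + 1] == '"'
        if not (prev_quote and next_quote):   # not the ";" escape
            fields.append(line[start:c])
            start = c + 1
    fields.append(line[start:])
    # turn each embedded ";" escape back into a bare semicolon
    return [f.replace('";"', ";") for f in fields]

print(split_line('Pax;is;a;good;guy";" so;says;his;wife.'))
# ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']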
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
List<string> ret = new List<string>();
Regex r = new Regex(@"[^""];[^""]");
Match m;
while ((m = r.Match(line)).Success)
{
    ret.Add(line.Substring(0, m.Index + 1));
    line = line.Substring(m.Index + 2);
}
ret.Add(line); // whatever remains after the last delimiter is the final field
(Sorry about the C#; I don't know VBScript.)
Using quotes is normal for .csv files. If a field itself contains quotes, you may see the opening quote, the closing quote, and the embedded quote all strung together, two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not part of a ";" and change it to something else that does not appear in your fields. Then go through and replace each ";" with ;.
Now you have your fields with the correct data.
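In Python, for instance, the whole protect/split/restore dance fits in a few lines; the placeholder character below is an assumption, so use anything you know is absent from your data:
PLACEHOLDER = "\x00"  # assumed absent from the data

def parse_with_placeholder(line):
    protected = line.replace('";"', PLACEHOLDER)  # protect escaped semicolons
    return [f.replace(PLACEHOLDER, ";") for f in protected.split(";")]

print(parse_with_placeholder('Pax;is;a;good;guy";" so;says;his;wife.'))
# ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']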
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
Jacob
