I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon it is "escaped" by wrapping it in doublequotes, like this ";".
I have been assured that there will never be two adjacent fields with trailing/ leading doublequotes, so this format should technically be ok.
Now, for parsing it in VBScript I was thinking of
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
On approach I've used for this type of data before is to just split on the semi-colons regardless then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
In pseudo-code, given:
input: A string, first character is input[0]; last
character is input[length]. Further, assume one dummy
character, input[length+1]. It can be anything except
; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input
Do this:
set start = 0
if input[0] = ';':
you have a blank field in the beginning; do whatever with it
set start = 2
endif
for each c between 1 and length:
next iteration unless string[c] = ';'
if input[c-1] ≠ '"' or input[c+1] ≠ '"': // test for escape sequence ";"
found field consting of half-open range [start,c); do whatever
with it. Note that in the case of empty fields, start≥c, leaving
an empty range
set start = c+1
endif
end foreach
Untested, of course. Debugging code like this is always fun….
The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
List<string> ret = new List<string>();
Regex r = new Regex(#"[^""];[^""]");
Match m;
while((m = r.Match(line)).Success)
{
ret.Add(line.Substring(0,m.Index + 1);
line = line.Substring(m.Index + 2);
}
(Sorry about the C#, I don't known VBScript)
Using quotes is normal for .csv files. If you have quotes in the field then you may see opening and closing and the embedded quote all strung together two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not a part of
";" and change it to something else
that does not appear in your fields.
Then go through and replace ";" with ;
Now you have your fields with the correct data.
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
Jacob
Related
how can I extract a few words separated by symbols in a string so that nothing is extracted if the symbols change?
for example I wrote this code:
function split(str)
result = {};
for match in string.gmatch(str, "[^%<%|:%,%FS:%>,%s]+" ) do
table.insert(result, match);
end
return result
end
--------------------------Example--------------------------------------------
str = "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
my_status={}
status=split(str)
for key, value in pairs(status) do
table.insert(my_status,value)
end
print(my_status[1]) --
print(my_status[2]) --
print(my_status[3]) --
print(my_status[4]) --
print(my_status[5]) --
print(my_status[6]) --
print(my_status[7]) --
output :
busy
MPos
-750.222
900.853
1450.808
2
10
This code works fine, but if the characters and text in the str string change, the extraction is still done, which I do not want to be.
If the string change to
str = "Hello stack overFlow"
Output:
Hello
stack
over
low
nil
nil
nil
In other words, I only want to extract if the string is in this format: "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
In lua patterns, you can use captures, which are perfect for things like this. I use something like the following:
--------------------------Example--------------------------------------------
str = "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
local status, mpos1, mpos2, mpos3, fs1, fs2 = string.match(str, "%<(%w+)%|MPos:(%--%d+%.%d+),(%--%d+%.%d+),(%--%d+%.%d+)%|FS:(%d+),(%d+)%>")
print(status, mpos1, mpos2, mpos3, fs1, fs2)
I use string.match, not string.gmatch here, because we don't have an arbitrary number of entries (if that is the case, you have to have a different approach). Let's break down the pattern: All captures are surrounded by parantheses () and get returned, so there are as many return values as captures. The individual captures are:
the status flag (or whatever that is): busy is a simple word, so we can use the %w character class (alphanumeric characters, maybe %a, only letters would also do). Then apply the + operator (you already know that one). The + is within the capture
the three numbers for the MPos entry each get (%--%d+%.%d+), which looks weird at first. I use % in front of any non-alphanumeric character, since it turns all magic characters (such as + into normal ones). - is a magic character, so it is required here to match a literal -, but lua allows to put that in front of any non-alphanumerical character, which I do. So the minus is optional, so the capture starts with %-- which is one or zero repetitions (- operator) of a literal - (%-). Then I just match two integers separated by a dot (%d is a digit, %. matches a literal dot). We do this three times, separated by a comma (which I don't escape since I'm sure it is not a magical character).
the last entry (FS) works practically the same as the MPos entry
all entries are separated by |, which I simply match with %|
So putting it together:
start of string: %<
status field: (%w+)
separator: %|
MPos (three numbers): MPos:(%--%d+%.%d+),(%--%d+%.%d+),(%--%d+%.%d+)
separator: %|
FS entry (two integers): FS:(%d+),(%d+)
end of string: %>
With this approach you have the data in local variables with sensible names, which you can then put into a table (for example).
If the match failes (for instance, when you use "Hello stack overFlow"), nil` is returned, which can simply be checked for (you could check any of the local variables, but it is common to check the first one.
Hello guys I want to convert my non delimited file into a delimited file
Example of the file is as follows.
Name. CIF Address line 1 State Phn Address line 2 Country Billing Address line 3
Alex. 44A. Biston NJ 25478163 4th,floor XY USA 55/2018 kenning
And so on all the data are in this format.
First three lines are metadata and then the data.
How can I make it delimited in proper format using logic.
There are two parts in the problem:
how to find the column widths
how to split each line into fields and output a new line with delimiters
I could not propose an automated solution for the first one, because (not knowing anything about the metadata format), there is no clear way to find where one column ends and the next one begins. Some of the column headings contain multiple space-separated words and space is also used as a separator between the headings (and apparently one cannot use the rule "more than one space means the end of a heading name" because there's only one space between "Address line 2" and "Country" - and they're clearly separate columns. Clearly, finding the correct column widths requires understanding English and this is not something that you can write a program for.
For the second problem, things are much easier - once you have the column positions. If you figure the column positions manually (or programmatically, if you know something about the metadata that I don't - and you have a simple method for finding what's a column heading), then a program written in AWK can do this, for example:
cols="8,15,32,40,53,66,83,105"
awk_prog='BEGIN {
nt=split(cols,tabs,",")
delim=","
ORS=""
}
{ o=1 ;
for (i in tabs) { t=tabs[i] ; f=substr($0,o,t-o); sub(" *$","",f) ; print f
delim ; o=t } ;
print substr($0, o) "\n"
}'
awk -v cols="$cols" "$awk_prog" input_file
NOTE that the above program does not deal correctly with the case when the separator character (e.g. ",") appears inside the data. If you decide to use this as-is, be sure to use a separator that is not present in the input data. It may be better to modify the code to escape any separator characters found in the input data (there are different ways to do this - depends on what you plan to feed the output file to).
Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try add commas to the expression, everything is getting stripped. I got the aboves from somewhere else and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks
Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, you know the ^ means not when inside the character class brackets [] which you are using, and then you can just add the comma to the list. The decimal needs to be escaped with a backslash because in regular expressions they are a special character that means "match anything".
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub!(/[^0-9,\.]/, '')
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/
You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.
You can use this:
rate = rate.gsub!(/[^0-9\.\,]/g,'')
Also check this out to learn more about regular expressions:
http://www.regexr.com/
gsub(/^/, "\t" * num)
What character is being substituted?
No character is being substituted, it is just inserting num tabs at the beginning so you could say that it is substituting the zero width "beginning of line" marker. Whoever wrote that would have been better off with something more like this:
tabbed = "\t" * num + original
A regular expression really isn't the right tool for simple string concatenation.
Clarification: If you're expecting your string to contain multiple lines then using:
gsub(/^/, "\t" * num)
to prefix all the lines with tabs is a reasonably thing to do and less noisy than splitting, prefixing, and re-joining. If you're only expecting to deal with a single line in your string then simple string concatenation would be the better choice.
^ means "start of line" in regex syntax, so this will insert num tab characters at the beginning of every line. Technically you could say that it replaces the empty string at the start of every line.
Could anybody help me make a proper regular expression from a bunch of text in Ruby. I tried a lot but I don't know how to handle variable length titles.
The string will be of format <sometext>title:"<actual_title>"<sometext>. I want to extract actual_title from this string.
I tried /title:"."/ but it doesnt find any matches as it expects a closing quotation after one variable from opening quotation. I couldn't figure how to make it check for variable length of string. Any help is appreciated. Thanks.
. matches any single character. Putting + after a character will match one or more of those characters. So .+ will match one or more characters of any sort. Also, you should put a question mark after it so that it matches the first closing-quotation mark it comes across. So:
/title:"(.+?)"/
The parentheses are necessary if you want to extract the title text that it matched out of there.
/title:"([^"]*)"/
The parentheses create a capturing group. Inside is first a character class. The ^ means it's negated, so it matches any character that's not a ". The * means 0 or more. You can change it to one or more by using + instead of *.
I like /title:"(.+?)"/ because of it's use of lazy matching to stop the .+ consuming all text until the last " on the line is found.
It won't work if the string wraps lines or includes escaped quotes.
In programming languages where you want to be able to include the string deliminator inside a string you usually provide an 'escape' character or sequence.
If your escape character was \ then you could write something like this...
/title:"((?:\\"|[^"])+)"/
This is a railroad diagram. Railroad diagrams show you what order things are parsed... imagine you are a train starting at the left. You consume title:" then \" if you can.. if you can't then you consume not a ". The > means this path is preferred... so you try to loop... if you can't you have to consume a '"' to finish.
I made this with https://regexper.com/#%2Ftitle%3A%22((%3F%3A%5C%5C%22%7C%5B%5E%22%5D)%2B)%22%2F
but there is now a plugin for Atom text editor too that does this.