Is there a well-defined difference between "normalizing" and "canonicalizing" data?

I understand canonicalization and normalization to mean removing any non-meaningful or ambiguous parts of the data's presentation, turning effectively identical data into actually identical data.
For example, if you want to get the hash of some input data and it's important that anyone else hashing the canonically same data gets the same hash, you don't want one file indenting with tabs and the other using spaces (and no other difference) to cause two very different hashes.
In the case of JSON (see the sketch after this list):
object properties would be placed in a standard order (perhaps alphabetically)
unnecessary white spaces would be stripped
indenting either standardized or stripped
the data may even be re-modeled in an entirely new syntax, to enforce the above
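For example, here's a rough sketch in Ruby of what I mean (the canonicalize helper is just something I made up for illustration; it recursively sorts object keys, and JSON.generate then emits no insignificant whitespace):

require 'json'
require 'digest'

# Made-up helper: recursively sort object keys so property order is fixed.
def canonicalize(value)
  case value
  when Hash  then value.sort.to_h.transform_values { |v| canonicalize(v) }
  when Array then value.map { |v| canonicalize(v) }
  else value
  end
end

a = JSON.parse('{ "b" : 1, "a" : [1, 2] }')  # extra whitespace, keys out of order
b = JSON.parse('{"a":[1,2],"b":1}')          # compact, keys already sorted

# Equal structures now serialize to identical strings, so the hashes match.
puts Digest::SHA256.hexdigest(JSON.generate(canonicalize(a)))
puts Digest::SHA256.hexdigest(JSON.generate(canonicalize(b)))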
Is my definition correct, and are the terms interchangeable? Or is there a well-defined and specific difference between canonicalization and normalization of input data?

"Canonicalize" & "normalize" (from "canonical (form)" & "normal form") are two related general mathematical terms that also have particular uses in particular contexts per some exact meaning given there. It is reasonable to label a particular process by one of those terms when the general meaning applies.
Your characterizations of those specific uses are fuzzy. The formal meanings for general & particular cases are more useful.
Sometimes, given a bunch of things, we partition them (all) into (disjoint) groups, aka equivalence classes, of things that we consider to be similar or the same in some particular sense, aka equivalent. The members of a group/class are the same/equivalent according to some particular equivalence relation.
We pick a particular member as the representative thing from each group/class & call it the canonical form for that group & its members. Two things are equivalent exactly when they are in the same equivalence class. Two things are equivalent exactly when their canonical forms are equal.
A normal form might be a canonical form or just one of several distinguished members.
To canonicalize/normalize is to find or use a canonical/normal form of a thing.
Wikipedia's "Canonical form" article puts it this way:
The distinction between "canonical" and "normal" forms varies by subfield. In most fields, a canonical form specifies a unique representation for every object, while a normal form simply specifies its form, without the requirement of uniqueness.
Applying the definition to your example: do you have a bunch of values that you are partitioning, and are you picking some member(s) from each class to stand in for the other members of that class? Well, you have JSON values, and short of re-modeling them you are partitioning them according to which same-class member they map to under a function. So you can reasonably call the resulting JSON values canonical forms of the inputs. If you treat re-modeling as applicable to all inputs, then you can also reasonably call the post-re-modeling form of those canonical values the canonical forms of the re-modeled input values. If not, people probably won't complain if you call the re-modeled values canonical forms of the input values, even though technically they wouldn't be.

Consider a set of objects, each of which can have multiple representations. From your example, that would be the set of JSON objects and the fact that each object has multiple valid representations, e.g., each with a different permutation of its members, more or less white space, etc.
Canonicalization is the process of converting any representation of a given object to one and only one representation, unique per object (a.k.a. its canonical form). To test whether two representations are of the same object, it suffices to test equality of their canonical forms; see also Wikipedia's definition.
Normalization is the process of converting any representation of a given object to one of a set of representations (a.k.a. "normal forms") that is unique per object. In that case, equality between two representations is established by "subtracting" their normal forms and comparing the result with a normal form of "zero" (typically a trivial comparison). Normalization may be a better option when canonical forms are difficult to implement consistently, e.g., because they depend on arbitrary choices (like the ordering of variables).
Section 1.2 of the "A=B" book has some really good examples of both concepts.

Is there a difference between Forth vocabularies and word lists?

As I read "Programming Forth" by Stephen Pelc, the text seems to imply that vocabularies and word lists may be separate things. I have thought that dictionary vocabulary entries have a name field, code field, etc. so having separate word lists does not make sense to me.
Are word lists just a way of talking about the name fields of Forth words or are word lists actual data structures separate from dictionary entries? (Forth language resources are a bit scarce compared to mainstream languages)
Vocabularies and wordlists are basically the same. I can think of two differences:
Wordlists are specified in the Forth standard. Vocabularies are not, but there's nearly a consensus on how they are used.
A vocabulary has a name. A wordlist just has a numeric id.
A third similar concept is a "lexicon". I haven't seen it used as often as the other two, but I think it's yet another variation or a synonym.
A wordlist is a collection of dictionary entries. It could be a linked list, a hash table, or anything else that works for looking up named entries. The dictionary may be partitioned into several wordlists, i.e. a wordlist is a subset of the dictionary.
A wordlist is the ISO '94 Forth standard's name for what is generally called a namespace; Wikipedia does a good job of explaining this concept. In Forth, multiple namespaces can be coupled to the interpreter, which means that the words in them can be used; coupling a namespace is called "adding the wordlist to the search order". Forth is extensible and has the fairly unique feature that new definitions may be added to any wordlist, even one that is not in the search order. The namespace that receives new definitions is called CURRENT, and the actual, momentary search order, consisting of one or more namespaces, is called the CONTEXT. Both can be changed anywhere in an interactive session and anywhere in a compilation.
At the time the standard was created there was no consensus on how to give a name to a wordlist, so there is only the word WORDLIST, which creates an anonymous namespace and leaves its handle on the stack. The word VOCABULARY was widely in use to create a named namespace, but because there was no consensus about it, VOCABULARY is not part of the standard and one cannot use it in portable programs (which is quite cumbersome, so everybody does it anyway and hopes for the best).
So, to summarize: a wordlist as a concept is a namespace. WORDLIST creates an anonymous wordlist and is a standard word. VOCABULARY is not a standard word; mostly it will create a wordlist and couple a name and behaviour to it.

There seem to be a lot of Ruby methods that are very similar; how do I pick which one to use?

I'm relatively new to Ruby, so this is a pretty general question. I have found through the Ruby Docs page a lot of methods that seem to do the exact same thing or very similar. For example chars vs split(' ') and each vs map vs collect. Sometimes there are small differences and other times I see no difference at all.
My question here is how do I know which is best practice, or is it just personal preference? I'm sure this varies from instance to instance, so if I can learn some of the more important ones to be cognizant of I would really appreciate that because I would like to develop good habits early.
I am a bit confused by your specific examples:
map and collect are aliases. They don't "do the exact same thing", they are the exact same thing. They are just two names for the same method. You can use whatever name you wish, or what reads best in context, or what your team has decided as a Coding Standard. The Community seems to have settled on map.
each and map/collect are completely different; there is no similarity there, apart from the general fact that they both operate on collections. map transforms a collection by mapping every element to a new element using a transformation operation. It returns a new collection (an Array, actually) with the transformed elements. each performs a side-effect for every element of the collection. Since it is only used for its side-effect, the return value is irrelevant (it might just as well return nil like Kernel#puts does; in languages like C, C++, Java, C♯, it would return void), but it is specified to always return its receiver.
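A quick illustration, using nothing beyond core Ruby:

names = ["alice", "bob"]

upcased = names.map { |n| n.upcase }  # => ["ALICE", "BOB"], a new Array; names is untouched
result  = names.each { |n| puts n }   # prints each name; returns the receiver, so result.equal?(names) is true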
split splits a String into an Array of Strings based on a delimiter that can be either a Regexp (in which case you can also influence whether or not the delimiter itself gets captured in the output or ignored) or a String, or nil (in which case the global default separator gets used). chars returns an Array with the individual characters (represented as Strings of length 1, since Ruby doesn't have a specific Character type). chars belongs together in a family with bytes and codepoints, which do the same thing for bytes and codepoints, respectively. split can only be used as a replacement for one of the methods in this family (chars), and split is much more general than that.
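For comparison (plain irb output, nothing library-specific):

"hello world".chars       # => ["h", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
"hello world".split("")   # => the same Array, but only because the delimiter happens to be the empty String
"hello world".split(" ")  # => ["hello", "world"], splitting on whitespace, which is a different operation
"a1b2c3".split(/\d/)      # => ["a", "b", "c"]; the Regexp form has no chars equivalent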
So, in the examples you gave, there really isn't much similarity at all, and I cannot imagine any situation where it would be unclear which one to choose.
In general, you have a problem and you look for the method (or combination of methods) that solve it. You don't look at a bunch of methods and look for the problem they solve.
There'll typically be only one method that fits a specific problem. Larger problems can be broken down into different subproblems in different ways, so it is indeed possible that you may end up with different combinations of methods to solve the same larger problem, but for each individual subproblem, there will generally be only one applicable method.
When the documentation states that two methods do the same thing, it's just a matter of preference. To learn the details, you should always start with the Ruby API documentation.

Is there a clear algorithm for sorting/ordering the loops in an X12 file?

Even though loops are kind of a logical concept in X12 (not directly physically represented in the text), every transaction set defines a set of loops that it can contain, including identifiers for the loops and an ordering for them. My question is, what is the rule for sorting loops, generically? Is there a concise set of rules that can be expressed in some code that should be able to take a collection of loops (with known identifiers such as 1000A, 2300BB, etc) and properly sort them?
The context of my question is that I'm working on a general-purpose library that applications will use to construct a model of an X12 document/transaction-set (and write out the text such a model represents). It has objects to represent Elements, Segments, and Loops. Ordering of Segments in a particular Loop is easy; they're dictated by the Implementation Guides. But I'm trying to get Loop ordering (within a Transaction Set) to work generically; that's what I'm asking about.
It seems that the general rule is that Loops are ordered based on their identifiers, using the numeric portion as the primary sort key and the alpha portion as the secondary sort key. Of course, hierarchical loops contained in others will be placed right after their parent and before the loops that follow the parent in that sort order (e.g. 1000A, 2000A, 2010A, 2010B, 2100, 2300, where 2010A and 2010B are children of 2000A).
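If that reading is right, the flat part of the rule is a one-line comparator; here is a rough Ruby sketch (the identifiers are just the ones from my example, and this deliberately ignores the parent/child nesting, which would have to keep children grouped under their parent):

loop_ids = ["2300", "1000A", "2010B", "2000A", "2010A", "2100"]

sorted = loop_ids.sort_by do |id|
  digits, alpha = id.match(/\A(\d+)([A-Z]*)\z/).captures
  [Integer(digits), alpha]  # numeric portion first, then the alpha suffix
end

p sorted  # => ["1000A", "2000A", "2010A", "2010B", "2100", "2300"]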
I understand that the spec and Implementation Guides contain all of this info; I'm looking for the all-encompassing rule about loop ordering (not Segment ordering). Is there any concise way to express the rule algorithmically? Is there even a hard-and-fast rule at all?
As I mentioned in my comments, the standard has a loop value. Take a look at my screenshot of the Liaison Dictionary Viewer. The CLM segment has a LOOP value of 100. The segments underneath are children of the CLM segment (extended tag). Any "order" can be defined arbitrarily by the partner, or can be in any (undefined) order provided the data is qualified. But that loop can occur 100 times max and can have repeating segments inside the loop value.
The implementation guide will give you the correct order your partner wants them in. It seems like you're writing your own syntax validation engine though.

Best practices for handling non-decimal variables [ACM KDD 2009 CUP]

For practice I decided to use a neural network to solve a classification problem (2 classes) stated by the ACM Special Interest Group on Knowledge Discovery and Data Mining at its 2009 cup. The problem I have found is that the data set contains a lot of "empty" variables and I am not sure how to handle them. Furthermore, a second question appears: how to handle other non-decimal values, like strings? What are your best practices?
Most approaches require numerical features, so the categorical ones have to be converted into counts. E.g. if a certain string is present among the attributes of an instance, its count is 1, otherwise 0. If it occurs more than once, its count increases correspondingly. From this point of view any feature that is not present (or "empty" as you put it) has a count of 0. Note that the attribute names have to be unique.
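A tiny sketch of that counting idea, in Ruby for concreteness (the attribute values here are made up):

# Each instance is just the list of categorical values it carries.
instances = [
  %w[red large],      # instance 1
  %w[blue],           # instance 2: the second attribute is "empty"
  %w[red red small],  # instance 3: "red" occurs twice
]

vocabulary = instances.flatten.uniq.sort  # the unique attribute names

# Turn each instance into a numeric count vector; absent attributes count 0.
vectors = instances.map do |attrs|
  counts = attrs.tally
  vocabulary.map { |name| counts.fetch(name, 0) }
end

p vocabulary  # => ["blue", "large", "red", "small"]
p vectors     # => [[0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 2, 1]]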

Rails - Simplifying calculation models & objects

I have asked a few questions about this recently and I am getting where I need to go, but have perhaps not been specific enough in my last questions to get all the way there. So, I am trying to put together a structure for calculating some metrics based on app data, which should be flexible to allow additional metrics to be added easily (and securely), and also relatively simple to use in my views.
The overall goal is that I will be able to have a custom helper that allows something like the following in my view:
calculate_metric(@metrics.where(:name => 'profit'), @customer, @start_date, @end_date)
This should be fairly self explanatory - the name can be substituted to any of the available metric names, and the calculation can be performed for any customer or group of customers, for any given time period.
Where the complexity arises is in how to store the formula for calculating the metric - I have shown below the current structure that I have put together for doing this:
You will note that the key models are metric, operation, operation_type and operand. This kind of structure works ok when the formula is very simple, like profit - one would only have two operands, @customer.sales.selling_price.sum and @customer.sales.cost_price.sum, with one operation of type subtraction. Since we don't need to store any intermediate values, register_target will be 1, as will return_register.
I don't think I need to write out a full example to show where it becomes more complicated, but suffice to say if I wanted to calculate the percentage of customers with email addresses for customers who opened accounts between two dates (but did not necessarily buy), this would become much more complex since the helper function would need to know how to handle the date variations.
As such, it seems like this structure is overly complicated, and would be hard to use for anything other than a simple formula - can anyone suggest a better way of approaching this problem?
EDIT: On the basis of the answer from Railsdog, I have made some slight changes to my model, and re-uploaded the diagram for clarity. Essentially, I have ensured that the reporting_category model can be used to hide intermediate operands from users, and that operands that may be used in user calculations can be presented in a categorised format. All I need now is for someone to assist me in modifying my structure to allow an operation to use either an actual operand or the result of a previous operation in a Rails-esque way.
Thanks for all of your help so far!
Oy vey. It's been years (like 15) since I did something similar to what it seems like you are attempting. My app was used to model particulate deposition rates for industrial incinerators.
In the end, all the computations boiled down to two operands and an operator (order of operations, parentheticals, etc). Operands were either constants, db values, or the result of another computation (a pointer to another computation). Any Operand (through model methods) could evaluate itself, whether that value was intrinsic, or required a child computation to evaluate itself first.
The interface wasn't particularly elegant (that's the real challenge I think), but the users were scientists, and they understood the computation decomposition.
Thinking about your issue, I'd have any individual Metric able to return its value, and create the necessary methods to arrive at that answer. After all, a single metric just needs to know how to combine its two operands using the indicated operator. If an operand is itself a metric, you just ask it what its value is.
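A bare-bones sketch of that shape in plain Ruby (the class names, the operator table, and the sample numbers are all made up; in your app the values would come from your ActiveRecord models rather than literals):

class Operand
  def initialize(value: nil, operation: nil)
    @value, @operation = value, operation
  end

  # An operand can evaluate itself: either it holds an intrinsic value
  # (constant or db-derived), or it defers to a child operation.
  def evaluate
    @value.nil? ? @operation.evaluate : @value
  end
end

class Operation
  OPERATORS = { 'add' => :+, 'subtract' => :-, 'multiply' => :*, 'divide' => :/ }

  def initialize(operator, left, right)
    @operator, @left, @right = operator, left, right
  end

  # Combine the two operands' values with the indicated operator.
  def evaluate
    @left.evaluate.public_send(OPERATORS.fetch(@operator), @right.evaluate)
  end
end

# profit = sales - costs, where costs is itself the result of another operation
costs  = Operation.new('add', Operand.new(value: 40), Operand.new(value: 10))
profit = Operation.new('subtract', Operand.new(value: 120), Operand.new(operation: costs))
puts profit.evaluate  # => 70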
