What do the numbers "total processed" and "total emitted" mean exactly for an Apex application?

An Apache Apex application reports a few metrics while it is running, such as "total processed" and "total emitted". What do these numbers mean exactly? Are they the number of records processed or emitted so far by the corresponding operator?
If that assumption is right, then in the case of, say, input operator recovery, when the failed input operator is relaunched in a new container, the "total processed" and "total emitted" numbers for the input operator do not match the number of tuples fed into it. Does the framework send some extra tuples, apart from the actual data tuples, during operator recovery?

Is it possible to deal with left recursion in LL by checking the length of the sentential form and discarding infinitely ambiguous grammars?

I understand that an LL parser (no lookahead) cannot deal with left recursive rules, because it keeps predicting the left recursive non terminal over and over again and never gets to match anything.
But what if we adopted the combination of 2 strategies:
We build a table where we assign each non terminal the minimum length of the strings its sublanguage generates.
So, suppose we have a left recursive rule as follows:
A -> As | a
Where s is a string composed of terminals and non terminals.
If s generates strings with minimum length greater than 0, then even though, when faced with the non terminal "A" in a prediction, we would be predicting "A s" over and over again, the minimum length of the predicted string would only grow. It would eventually surpass the length of the input string, at which point the prediction is discarded.
Of course this wouldn't work with grammars that are infinitely ambiguous (like the ones that have loops), leading us to the second strategy:
Each time we make a prediction, the minimum length of the whole sentential form either grows or stays the same. In the case where it stays the same, we keep a stack of the leftmost predicted non terminals, and whenever we predict the same non terminal twice, we discard the prediction.
This effectively discards infinitely many derivations, though it preserves the language accepted by the parser.
As a concrete example, suppose the following grammar:
S -> Sa | a
The minimum length of the strings generated by the non terminal S is 1.
Now suppose we're parsing the following input string: aa.
We start by predicting the start symbol, "S".
Using breadth-first search, we then have 2 predictions: "Sa" and "a".
The "a" prediction is discarded since even though it matches the first "a" in the input string it reaches EOL while there is still an "a" left in the input string.
The "Sa" prediction cannot match anything yet, since "S" is the "matching token" in the sentential form, so we make 2 more predictions: "Saa" and "aa".
The "aa" prediction matches the entire input string and thus we have found a parsing.
The "Saa" prediction has minimum length 3 and therefore cannot match the input string and is discarded.
Would this work? Am I missing something?
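The first strategy can be tried out with a minimal Ruby sketch; it is an illustration under stated assumptions, not a verified answer to the question. The grammar encoding (symbols for non terminals, one-character strings for terminals) and the names form_min_length, min_lengths and recognizes? are made up for this example. A set of already-seen sentential forms stands in for the second strategy: it is simpler than the stack described above and only catches exact repeats, so termination is not guaranteed for every grammar (for example when s can derive the empty string).

```ruby
require "set"

# Hypothetical encoding: non terminals are symbols, terminals are strings.
GRAMMAR = { S: [[:S, "a"], ["a"]] }.freeze

# Minimum length of a sentential form, given per-non-terminal minimums.
def form_min_length(form, mins)
  form.sum { |sym| sym.is_a?(Symbol) ? mins[sym] : sym.length }
end

# Fixpoint iteration: shortest string each non terminal can generate.
def min_lengths(grammar)
  mins = Hash.new(Float::INFINITY)
  loop do
    changed = false
    grammar.each do |nt, productions|
      best = productions.map { |rhs| form_min_length(rhs, mins) }.min
      next unless best < mins[nt]
      mins[nt] = best
      changed = true
    end
    break unless changed
  end
  mins
end

# Breadth-first search over sentential forms, expanding the leftmost
# non terminal and discarding forms that cannot fit the input.
def recognizes?(grammar, start, input)
  mins = min_lengths(grammar)
  queue = [[start]]
  seen = Set.new(queue)
  until queue.empty?
    form = queue.shift
    head = form.index { |sym| sym.is_a?(Symbol) }
    if head.nil?
      return true if form.join == input
      next
    end
    # Terminals left of the first non terminal must match the input prefix.
    next unless input.start_with?(form[0...head].join)
    grammar[form[head]].each do |rhs|
      candidate = form[0...head] + rhs + form[(head + 1)..-1]
      next if form_min_length(candidate, mins) > input.length
      queue << candidate if seen.add?(candidate)
    end
  end
  false
end

puts recognizes?(GRAMMAR, :S, "aa")
```

For the grammar S -> Sa | a and the input aa, the search visits the same forms as the walkthrough above: "Saa" is pruned by its minimum length of 3, and "aa" is accepted.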

Check if there are any unique tokens left of a certain character length

I have a Post model with the attribute token.
I'm using SecureRandom.urlsafe_base64(length_of_token) to create a token.
The token doesn't need to be unguessable, but must be unique.
I'm starting with tokens 1 character long, and when they are all used up (all 64 combinations), I should move on to have tokens that are 2 characters long.
How should I check if there are any token variations left for tokens that are 3 characters long?
It would be something like :
64**n - Post.where('char_length(token) = ?', n).count
The first part gives the number of combinations possible and the second the number of records with a token length of n.
But don't forget that your random generator won't necessarily generate the remaining possible combinations. As the pool for a given length fills up, collisions become more and more frequent, so each new token needs more and more retries. I'd therefore strongly discourage this kind of implementation.
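Note: The char_length function is available in MySQL and PostgreSQL, but not in every RDBMS (SQLite uses length and SQL Server uses LEN, for example), so depending on your RDBMS, you may have to adapt this part.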
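The arithmetic above can be sketched in plain Ruby without a database. Everything named here (ALPHABET, random_token, remaining_tokens, shortest_length_with_room, and the used_counts hash that stands in for the per-length query) is hypothetical. Note that SecureRandom.urlsafe_base64(n) takes n as a number of random bytes, so the resulting string is roughly 4/3 of n characters long rather than n; the sketch therefore draws characters from a 64-character alphabet directly to get an exact length.

```ruby
require "securerandom"

# The 64 characters of the URL-safe base64 alphabet.
ALPHABET = (("A".."Z").to_a + ("a".."z").to_a + ("0".."9").to_a + ["-", "_"]).freeze

# Random token with exactly `length` characters.
def random_token(length)
  Array.new(length) { ALPHABET[SecureRandom.random_number(ALPHABET.size)] }.join
end

# Number of unused combinations for tokens of a given length.
def remaining_tokens(length, used_count)
  ALPHABET.size**length - used_count
end

# Shortest token length that still has free combinations.
# used_counts maps a length to the number of tokens already taken;
# in the application it would come from the query shown above.
def shortest_length_with_room(used_counts)
  length = 1
  length += 1 while remaining_tokens(length, used_counts.fetch(length, 0)) <= 0
  length
end

puts remaining_tokens(3, 10)
puts shortest_length_with_room({ 1 => 64, 2 => 4096 })
```

This only answers the "is there room left" question; it does not remove the collision problem described above.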

Why is it best to store a telephone number as a string vs. integer?

As the question states, why is it considered best practice to store telephone numbers as strings rather than integers in the telephone_number column?
Not sure I understand the rationale for this. Please help clear this up!
Thanks!
Telephone numbers are strings of digit characters; they are not integers.
Consider for example:
Expressing a telephone number in a different base would render it meaningless
Adding or multiplying two telephone numbers together, or any math operation on a phone number, is meaningless. The result is not another telephone number (except by coincidence)
Telephone numbers are intended to be entered "as-is" into a connected device.
Telephone numbers may have leading zeroes.
Manipulations of telephone numbers, such as adding an area code, are String operations.
Storing the string version of the telephone number makes this clear and unambiguous.
History: On old pulse-encoded dial systems, the code for each digit in a telephone number was sent as the same number of pulses as the digit (or 10 pulses for "0"). That may be why we still use digits to represent the parts of a phone number. See http://en.wikipedia.org/wiki/Pulse_dialing
What Neil Slater said is correct. I would add that there are lots of edge cases where you can't express a telephone number as a number value consistently.
For example, consider these numbers:
011-123-555-1212
+11-123-555-1212
+1 (112) 355-5121 x2
These are all potentially valid phone numbers, but they mean very different things. Yet, in integer form, they are all 111235551212.
If you are going to store the number for display from input, then you must use a string.
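A short Ruby sketch of the collapse described above; the helper name digits_as_integer is made up for illustration. It strips every non-digit character and converts the rest to an integer, which is roughly what an integer column forces you to do.

```ruby
NUMBERS = [
  "011-123-555-1212",
  "+11-123-555-1212",
  "+1 (112) 355-5121 x2"
].freeze

# Drop everything that is not a digit, then convert to an integer.
# The leading zero of the first number is lost in the conversion.
def digits_as_integer(number)
  number.gsub(/\D/, "").to_i
end

NUMBERS.each { |number| puts "#{number} -> #{digits_as_integer(number)}" }
```

All three inputs end up as the same integer, so the original formatting cannot be recovered from it.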
However, while it is true that no meaningful mathematical operations can be performed on a phone number, using a number in hash sets and for indexing is quicker than using a string. So provided you can guarantee or homogenise your set of numbers so that they are all consistent, you may see better performance operating on a number.
For example, in the Telco world, rating calls for a given customer involves a lot of searching on their CLI, and in this situation it is faster and cheaper to search by integer. Generally, though, strings will be fine performance-wise; integers only pay off where performance matters and you have many searches to perform over a huge range of numbers, e.g. rating 250 million calls across 2 million lines and 2000 tariffs. In-memory rating also gets expensive, so being able to use a 64-bit int or uint is cheaper when dealing with these volumes.
Consider these phone numbers, for example:
099-1234-56789 or +91-8907-687665.
In this case, if the phone_number attribute is of type integer, it can't accept these values. It has to be a string to hold values like these, so a string is always preferred over an integer.
There are several reasons for this:
Phone numbers often start with a "0": an integer will drop all leading "0"s
Phone numbers can contain special characters: +, (, -, etc. (for example: +33 (0)6 12 23 34)
You cannot perform operations on phone numbers: adding them, for instance, would be meaningless
Phone numbers may be internationalised, i.e. formatted differently for different people, which is not possible with integers
There might be other reasons, but I guess that's already a fair amount of those :)

How many chars can numeric EDIFACT data elements be long?

In EDIFACT there are numeric data elements, specified e.g. as format n..5 -- we want to store those fields in a database table (with alphanumeric fields, so we can check them). How long must the db-fields be, so we can for sure store every possible valid value? I know it's at least two additional chars (for decimal point (or comma or whatever) and possibly a leading minus sign).
We are building our tables after the UN/EDIFACT standard we use in our message, not the specific guide involved, so we want to be able to store everything matching that standard. But documentation on the numeric data elements isn't really straightforward (or at least I could not find that part).
Thanks for any help
I finally found the information on the UNECE web site in the documentation on UN/EDIFACT rules Part 4. UN/EDIFACT rules Chapter 2.2 Syntax Rules. They don't say it directly, but when you put all the parts together, you get it. See TOC-entry 10: REPRESENTATION OF NUMERIC DATA ELEMENT VALUES.
Here's what it basically says:
10.1: Decimal Mark
The decimal mark must be transmitted (if needed) as specified in UNA (comma or point, but always one character). It shall not be counted as a character of the value when computing the maximum field length of a data element.
10.2: Triad Separator
Triad separators shall not be used in interchange.
10.3: Sign
[...] If a value is to be indicated to be negative, it shall in transmission be immediately preceded by a minus sign e.g. -112. The minus sign shall not be counted as a character of the value when computing the maximum field length of a data element. However, allowance has to be made for the character in transmission and reception.
To put it together:
Other than the digits themselves there are only two (optional) chars allowed in a numeric field: the decimal separator and a minus sign (no blanks are permitted in between any of the characters). These two extra chars are not counted against the maximum length of the value in the field.
So the maximum number of characters in a numeric field is the maximal length of the numeric field plus 2. If you want your database to be able to store every syntactically correct value transmitted in a field specified as n..17, your column would have to be 19 chars long (something like varchar(19)). Every EDIFACT-message that has a value longer than 19 chars in a field specified as n..17 does not need to be stored in the DB for semantic checking, because it is already syntactically wrong and can be rejected.
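The "maximum length plus 2" rule can be written down as a small Ruby helper. The function name numeric_column_width is hypothetical, and the sketch assumes the format is given either as n..N (variable length, as above) or as nN (fixed length); other representations are rejected.

```ruby
# Column width needed to store any syntactically valid value of a
# numeric EDIFACT data element: the declared maximum number of digits
# plus one character for the decimal mark and one for the minus sign.
def numeric_column_width(format)
  match = /\An(\.\.)?(\d+)\z/.match(format)
  raise ArgumentError, "not a numeric format: #{format}" unless match
  match[2].to_i + 2
end

puts numeric_column_width("n..17")
puts numeric_column_width("n..5")
```

For n..17 this gives the 19 characters mentioned above.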
I used EDI Notepad from Liaison to solve a similar challenge. https://liaison.com/products/integrate/edi/edi-notepad
I recommend anyone looking at EDI to at least get their free (express) version of EDI Notepad.
The "high end" version of their product (EDI Notepad Productivity Suite) comes with a "Dictionary Viewer" tool from which you can export the min / max lengths of the elements, as well as their types. You can export the document to HTML from the Viewer tool. It also handles ANSI X12.

How should I present a cost field to the user, and store it in the database?

Right now I have two fields for cost. One for dollars and one for cents. This works, but it is a bit ugly. It also doesn't allow the user to enter the term "free" or "no cost" if they want. But if I only have one field, I might have to make my parser a bit smarter. What do you think?
On the server side, I combine dollars and cents to store them as decimals in my database. Mainly so that I can gather statistics (cost averages, etc.) quickly.
Do you think it is better to store the cost as a string? Then whenever I actually use the cost for stats or other purposes, I would convert it to a decimal at that point. Or am I on the right track?
There is a rule in database design that states that "atomic data" should not be split. By this rule, a price or cost is an example of atomic data and should therefore never be split among multiple columns, just as you shouldn't split a phone number among multiple columns (unless you have a very good reason for it, which is very rare).
Use a DECIMAL data type. Something like DECIMAL(8,3) should work and it's supported by all ANSI SQL compliant database products!
You can consult Joe Celko's "Thinking In Sets" book for a discussion of this topic. See section 1.6.2, pages 21-22.
EDIT -
It seems from your question that you are also concerned with how to accept user's input in a form that resembles the price (xxxx.xx) - hence the two input boxes, for the whole dollars, and the pennies.
I recommend using a single input box and then validating the input with a regular expression that matches your format (something like ^[0-9]+(\.[0-9]{1,3})?$ would probably work but could be improved; note that the dot has to be escaped, otherwise it matches any character). You could then parse the validated string to a Decimal type in your language, or just pass it as a string into your database - SQL will know how to cast it to a DECIMAL type.
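As a rough Ruby sketch of that validate-then-parse step: the names PRICE_PATTERN and parse_price are assumptions for illustration, and the pattern is the one suggested above with the dot escaped and the whole string anchored.

```ruby
require "bigdecimal"

# Whole units, optionally followed by a dot and 1 to 3 decimal digits.
PRICE_PATTERN = /\A[0-9]+(\.[0-9]{1,3})?\z/

# Returns a BigDecimal for well-formed input, nil otherwise.
def parse_price(text)
  cleaned = text.to_s.strip
  return nil unless PRICE_PATTERN.match?(cleaned)
  BigDecimal(cleaned)
end

puts parse_price("4.50").to_s("F")
puts parse_price("4x50").inspect
```

Returning nil for bad input leaves it to the caller to decide how to report the error to the user.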
Keep the whole cost as decimal. If it's free, then keep the cost as 0. In presentation if cost is zero - write "free" instead of 0.
I generally store the cost as the lowest unit (pennies) and then convert it to whole dollars later.
So a cost of $4.50 gets stored as 450. Free items would be -1 pennies. You could store free things as 0 pennies as well, which gives you the flexibility to use 0 and -1 to mean two slightly different things (free vs no sale?).
It also makes it easier to support countries that don't use cents if you choose to go that route.
As for presenting the data entry field, I personally don't like it when I have to keep switching fields for tiny things (like when they break up phone numbers into 3 fields, or IP addresses into 4). I'd present one field, and let the users type the decimal point in themselves. That way, your users don't have to tab (or click, if they are unfamiliar with tab) to the next field.
Use cents: store 450 for $4.50. This will save you from problems that arise very often because floating point operations are not safe. Just try the following expression in irb:
0.4 - 0.3 == 0.1
It returns false, all because of floating point representation inaccuracies.
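In my models I'm always using:
# price is an integer column holding the amount in cents
def price_with_cents
  price / 100.0
end
def price_with_cents=(num)
  # round instead of truncating: a float product such as 4.35 * 100
  # can come out just below 435, and to_i would then store 434
  self.price = (num.to_f * 100).round
end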
The column is simply named price and has an integer type.
I don't have much experience with decimal columns and their representation in Ruby (which can be a float, and that is problematic, as I've shown at the beginning).
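For comparison, here is a small Ruby sketch that runs the same subtraction three ways: with floats, with integer cents, and with BigDecimal from the standard library (the type Rails typically uses for DECIMAL columns). The constant names are made up for this example.

```ruby
require "bigdecimal"

# Binary floats cannot represent these decimal fractions exactly,
# so the comparison fails.
FLOAT_MATCHES = (0.4 - 0.3 == 0.1)

# Integer cents: plain integer arithmetic, always exact.
CENTS_MATCH = (40 - 30 == 10)

# BigDecimal does decimal arithmetic, so the subtraction is exact.
DECIMAL_MATCHES = (BigDecimal("0.4") - BigDecimal("0.3") == BigDecimal("0.1"))

puts "float:   #{FLOAT_MATCHES}"
puts "cents:   #{CENTS_MATCH}"
puts "decimal: #{DECIMAL_MATCHES}"
```

Only the float comparison is false, so both integer cents and a decimal type avoid the problem shown in irb.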
Don't allow garbage to make it to your database. If you're expecting a dollar amount in a field, then make sure it's valid before it gets in there. This will allow you to report better on the data and allow simpler formatting on output.
I suggest making this a single field with validation on update or insert.
if field != SpecialFreeTag then
  try to convert to decimal
  if fail then report to user
  otherwise accept value
Use try parse or regular expressions to help with the validation.
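The pseudocode above could look roughly like this in Ruby. This is a sketch under assumptions: FREE_TAGS and parse_cost are hypothetical names, storing a free item as 0 is one possible choice, and the "try to convert" step relies on BigDecimal() raising ArgumentError for malformed input, as it does in current Ruby versions.

```ruby
require "bigdecimal"

# Hypothetical special values a user may type instead of an amount.
FREE_TAGS = ["free", "no cost"].freeze

# Returns a BigDecimal for a valid cost (0 for the free tags),
# or nil when the input should be reported back to the user.
def parse_cost(field)
  text = field.to_s.strip.downcase
  return BigDecimal("0") if FREE_TAGS.include?(text)
  value = BigDecimal(text)
  value.finite? ? value : nil
rescue ArgumentError
  nil
end

puts parse_cost("Free").to_s("F")
puts parse_cost("4.50").to_s("F")
puts parse_cost("abc").inspect
```

A nil result is the "report to user" branch; anything else can be written to the decimal column.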
I would store the cost as decimal with the scale being no less than 2 and maybe even 3-5. If something is bought in bulk the unit cost could easily include fractions of a cent. Free items have a cost of 0. If the cost is unknown then allow null values also.
