Limit the length of a token - flex-lexer

How to specify an upper limit of a token's length in flex?
For example, an identifier that consists of letters and digits and must be at most 1024 characters long.

You can do it with
[a-zA-Z0-9]{1,1024}
according to what you said. However it's more likely to be
[a-zA-Z][a-zA-Z0-9]{0,1023}
as you normally need identifiers to start with a letter.
But you may find it better to simply enforce the rule in the action. Otherwise the scanner will simply chop, say, a 2048-character identifier in half and return both halves as tokens, which isn't really what you want.
1024 is ridiculously high BTW.
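Here is a minimal sketch of the difference, written in Python rather than flex because the regex behaves the same way for this pattern. The names BOUNDED, UNBOUNDED, tokens and checked_identifier are made up for illustration. In a real flex scanner the length check would sit in the rule's action, for example by testing yyleng.

```python
import re

MAX_LEN = 1024

# Bounded pattern: the length limit lives in the regex itself.
BOUNDED = re.compile(r"[a-zA-Z][a-zA-Z0-9]{0,1023}")
# Unbounded pattern: match the whole identifier, check its length afterwards.
UNBOUNDED = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def tokens(pattern, text):
    """Return the consecutive matches of pattern in text, scanner style."""
    return [m.group(0) for m in pattern.finditer(text)]


def checked_identifier(text):
    """Match one identifier at the start of text and enforce MAX_LEN."""
    m = UNBOUNDED.match(text)
    if m is None:
        raise ValueError("not an identifier")
    if len(m.group(0)) > MAX_LEN:
        raise ValueError("identifier longer than %d characters" % MAX_LEN)
    return m.group(0)


long_name = "a" * 2048

# The bounded regex silently splits the over-long identifier into two tokens.
print([len(t) for t in tokens(BOUNDED, long_name)])

# The unbounded regex sees one token, so the action can reject it cleanly.
try:
    checked_identifier(long_name)
except ValueError as err:
    print("rejected:", err)
```

With the check in the action, the scanner can report one clear error for the over-long identifier instead of handing the parser two valid-looking tokens.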

Related

Check if there are any unique tokens left of a certain character length

I have a Post model with the attribute token.
I'm using SecureRandom.urlsafe_base64(length_of_token) to create a token.
The token doesn't need to be unguessable, but must be unique.
I'm starting with tokens 1 character long, and when they are all used up (all 64 combinations), I should move on to have tokens that are 2 characters long.
How should I check if there are any token variations left for tokens that are 3 characters long?
It would be something like:
64**n - Post.where('char_length(token) = ?', n).count
The first part gives the number of combinations possible and the second the number of records with a token length of n.
But don't forget that your random generator won't necessarily generate the remaining possible combinations. As the pool of unused tokens shrinks there will be more and more collisions, so I'd strongly discourage this kind of implementation.
Note: char_length works in MySQL and PostgreSQL, but other RDBMSs name the function differently (for example length in SQLite or LEN in SQL Server), so depending on your RDBMS you may have to adapt this part.
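To put rough numbers on the collision problem, here is a small Python sketch. It assumes tokens are drawn uniformly at random from a 64-character alphabet. The function names remaining_tokens and expected_draws are hypothetical, and the used count would come from a database query like the one above.

```python
ALPHABET_SIZE = 64  # size of the URL-safe base64 alphabet


def remaining_tokens(length, used):
    """Number of tokens of the given length that are still unused."""
    total = ALPHABET_SIZE ** length
    if not 0 <= used <= total:
        raise ValueError("used must be between 0 and %d" % total)
    return total - used


def expected_draws(length, used):
    """Average number of random draws needed to hit an unused token.

    Each draw succeeds with probability free/total, so the average
    number of attempts is total/free (geometric distribution).
    """
    total = ALPHABET_SIZE ** length
    free = remaining_tokens(length, used)
    if free == 0:
        return float("inf")  # nothing left: every draw collides
    return total / free


for used in (0, 32, 60, 63):
    print(used, remaining_tokens(1, used), expected_draws(1, used))
```

The last free tokens of each length are the expensive ones: with 63 of the 64 one-character tokens taken, a random draw finds the free one only once in 64 attempts on average.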

Why is it best to store a telephone number as a string vs. integer?

As the question states, why is it considered best practice to store telephone numbers as strings rather than integers in the telephone_number column?
Not sure I understand the rationale for this. Please help clear this up!
Thanks!
Telephone numbers are strings of digit characters; they are not integers.
Consider for example:
Expressing a telephone number in a different base would render it meaningless.
Adding or multiplying two telephone numbers together, or performing any other math operation on a phone number, is meaningless. The result is not another telephone number (except by coincidence).
Telephone numbers are intended to be entered "as-is" into a connected device.
Telephone numbers may have leading zeroes.
Manipulations of telephone numbers, such as adding an area code, are string operations.
Storing the string version of the telephone number makes this clear and unambiguous.
History: On old pulse-encoded dial systems, the code for each digit in a telephone number was sent as the same number of pulses as the digit (or 10 pulses for "0"). That may be why we still use digits to represent the parts of a phone number. See http://en.wikipedia.org/wiki/Pulse_dialing
What Neil Slater said is correct. I would add that there are lots of edge cases where you can't express a telephone number as a number value consistently.
For example, consider these numbers:
011-123-555-1212
+11-123-555-1212
+1 (112) 355-5121 x2
These are all potentially valid phone numbers, but they mean very different things. Yet, in integer form, they are all 111235551212.
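The collapse is easy to reproduce. The sketch below is in Python, and as_integer is a hypothetical helper that mimics what a naive "store it as an integer" conversion would do: drop every non-digit and parse the rest.

```python
import re


def as_integer(phone):
    """Naive conversion: strip everything except digits, then parse as int."""
    return int(re.sub(r"\D", "", phone))


numbers = [
    "011-123-555-1212",
    "+11-123-555-1212",
    "+1 (112) 355-5121 x2",
]

for n in numbers:
    print(repr(n), "->", as_integer(n))

# All three distinct inputs end up as the same stored value.
print(len({as_integer(n) for n in numbers}))
```

The leading zero, the "+" prefix, the grouping and the extension marker are all lost, and they cannot be recovered from the integer.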
If you are going to store the number for display from input, then you must use a string.
However, while it is true that no meaningful mathematical operations can be performed on a phone number, using a number in hash sets and for indexing is quicker than using a string. So provided you can guarantee or homogenise your set of numbers so that they are all consistent, you may see better performance operating on a number.
For example, in the telco world, rating calls for a given customer involves a lot of searching on their CLI (calling line identification), and in this situation it is faster and cheaper to search by integer. Generally, though, strings will be fine performance-wise; integers only pay off where performance matters and you have many searches to perform over a huge range of numbers, e.g. rating 250 million calls across 2 million lines and 2000 tariffs. In-memory rating also gets expensive, so being able to use a 64-bit int or uint is cheaper when dealing with these volumes.
Consider these phone numbers, for example:
099-1234-56789 or +91-8907-687665.
In this case, if the phone_number attribute is of type integer, it can't accept these values. It has to be a string to hold values like these, so a string is always preferred over an integer.
There are several reasons for this:
Phone numbers often start with a "0": an integer will drop all leading "0"s.
Phone numbers can contain special characters: +, (, -, etc. (for example: +33 (0)6 12 23 34).
You cannot perform operations on phone numbers: adding them, for instance, would be meaningless.
Phone numbers may be internationalised, i.e. formatted differently for different people, which is not possible with integers.
There might be other reasons, but I guess that's already a fair amount of those :)

How much space does a tab take?

I want to quantify the saving of space I can get by changing the format of a file.
I have a sparse matrix stored in a text file (30% sparsity). Columns are separated by tabs.
Following an idea in an SO answer, I will change the format to row_id, col_id for the non-zero terms only. I know how much space a float takes, but my question is: how much space does a tab take?
CouchDeveloper in his comment is correct. It's impossible to tell from the data you provide.
In a single byte character set encoding you'd save 1 byte per separator from the current ", ".
In a multibyte encoding it'd depend on the way each of those characters is encoded; you could theoretically even lose space. Say a tab is encoded as 4 bytes and a comma and a space as 1 each: you'd end up taking 2 more bytes per separator.
Unless you have many separators and relatively very little data, I'd not worry one way or another; it'd be micro-optimisation.
If you do, a binary encoding scheme might be more relevant.
1 byte, but significantly less if you're using compression (based on how common they will be, less than a bit on average). Use compression.
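If you know the file's encoding, the size of a separator can simply be measured. The Python sketch below uses a hypothetical helper, encoded_size, and an arbitrary sample of encodings; substitute whatever your file actually uses.

```python
def encoded_size(text, encoding):
    """Number of bytes text occupies when written in the given encoding."""
    return len(text.encode(encoding))


# The "-le" variants are used so that no byte-order mark is counted.
for encoding in ("ascii", "utf-8", "utf-16-le", "utf-32-le"):
    tab = encoded_size("\t", encoding)
    comma_space = encoded_size(", ", encoding)
    print(encoding, "tab:", tab, "comma+space:", comma_space)
```

In ASCII and UTF-8 a tab is a single byte. In UTF-16 and UTF-32 every one of these characters takes 2 or 4 bytes respectively, so a tab still costs half as much as a comma followed by a space.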

Regular expression for Objective-C to check for consecutive incremental numbers

I want to check whether the numeric PIN a user entered is too simple. Three cases should fail: repeating digits, like "1111"; increasing ones, like "1234"; and decreasing ones, like "4321". Is there a regex that could check these restrictions?
A regex can match a specific text pattern, but it can't understand its context. Yes, you want to check for increasing numbers, but there's no such thing as an "increasing" pattern in regex.
You can check for repeated digits using ^(\d)\1+$
But to check for increasing or decreasing digits, you would have to parse the string into integers and check manually whether they are in increasing or decreasing order, for example using the % and / operations.
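The manual check is short. Here is a sketch in Python rather than Objective-C; the function name is_too_simple is hypothetical, and it assumes "increasing" and "decreasing" mean every digit differs from the previous one by exactly +1 or -1, with no wrap-around from 9 to 0.

```python
def is_too_simple(pin):
    """Return True for PINs like "1111", "1234" or "4321"."""
    if len(pin) < 2 or any(c not in "0123456789" for c in pin):
        raise ValueError("PIN must be at least two ASCII digits")
    digits = [int(c) for c in pin]
    # Differences between neighbouring digits: all 0, all +1, or all -1
    # means repeated, increasing, or decreasing respectively.
    steps = {b - a for a, b in zip(digits, digits[1:])}
    return steps in ({0}, {1}, {-1})


for pin in ("1111", "1234", "4321", "1357", "1243"):
    print(pin, is_too_simple(pin))
```

Comparing the set of neighbour differences covers all three cases in one pass, so no separate regex for the repeated-digit case is needed.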

How many characters long can numeric EDIFACT data elements be?

In EDIFACT there are numeric data elements, specified e.g. as format n..5 -- we want to store those fields in a database table (with alphanumeric fields, so we can check them). How long must the db-fields be, so we can for sure store every possible valid value? I know it's at least two additional chars (for decimal point (or comma or whatever) and possibly a leading minus sign).
We are building our tables after the UN/EDIFACT standard we use in our message, not the specific guide involved, so we want to be able to store everything matching that standard. But documentation on the numeric data elements isn't really straightforward (or at least I could not find that part).
Thanks for any help
I finally found the information on the UNECE web site in the documentation on UN/EDIFACT rules Part 4. UN/EDIFACT rules Chapter 2.2 Syntax Rules. They don't say it directly, but when you put all the parts together, you get it. See TOC-entry 10: REPRESENTATION OF NUMERIC DATA ELEMENT VALUES.
Here's what it basically says:
10.1: Decimal Mark
Decimal mark must be transmitted (if needed) as specified in UNA (comma or point, but always one character). It shall not be counted as a character of the value when computing the maximum field length of a data element.
10.2: Triad Separator
Triad separators shall not be used in interchange.
10.3: Sign
[...] If a value is to be indicated to be negative, it shall in transmission be immediately preceded by a minus sign e.g. -112. The minus sign shall not be counted as a character of the value when computing the maximum field length of a data element. However, allowance has to be made for the character in transmission and reception.
To put it together:
Other than the digits themselves there are only two (optional) chars allowed in a numeric field: the decimal separator and a minus sign (no blanks are permitted in between any of the characters). These two extra chars are not counted against the maximum length of the value in the field.
So the maximum number of characters in a numeric field is the maximal length of the numeric field plus 2. If you want your database to be able to store every syntactically correct value transmitted in a field specified as n..17, your column would have to be 19 chars long (something like varchar(19)). Every EDIFACT-message that has a value longer than 19 chars in a field specified as n..17 does not need to be stored in the DB for semantic checking, because it is already syntactically wrong and can be rejected.
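The "plus 2" rule can be sketched as follows. This is Python, and it only handles variable-length formats written as n..N. The helpers column_width and fits_format are hypothetical, and the validity check is deliberately simplified to the points quoted above: at most one leading minus sign and one decimal mark, neither of which counts toward the length.

```python
import re

NUMERIC_FORMAT = re.compile(r"n\.\.(\d+)")


def max_digits(fmt):
    """Extract N from a format string such as "n..17"."""
    m = NUMERIC_FORMAT.fullmatch(fmt)
    if m is None:
        raise ValueError("unsupported format: %r" % fmt)
    return int(m.group(1))


def column_width(fmt):
    """Characters needed to store any syntactically valid value."""
    return max_digits(fmt) + 2  # room for minus sign and decimal mark


def fits_format(value, fmt, decimal_mark="."):
    """Simplified check: count digits only, ignoring sign and decimal mark."""
    body = value[1:] if value.startswith("-") else value
    if body.count(decimal_mark) > 1:
        return False
    digits = body.replace(decimal_mark, "", 1)
    if not digits or any(c not in "0123456789" for c in digits):
        return False
    return len(digits) <= max_digits(fmt)


print(column_width("n..17"))
print(fits_format("-123.45", "n..5"), fits_format("-123456", "n..5"))
```

Here decimal_mark stands in for whatever the interchange's UNA segment specifies. A value that passes fits_format never needs more than column_width characters.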
I used EDI Notepad from Liaison to solve a similar challenge. https://liaison.com/products/integrate/edi/edi-notepad
I recommend anyone looking at EDI to at least get their free (express) version of EDI Notepad.
The "high end" version of their product (EDI Notepad Productivity Suite) comes with a "Dictionary Viewer" tool from which you can export the min/max lengths of the elements, as well as their types. You can export the document to HTML from the Viewer tool. It also handles ANSI X12.
