I've seen this in a migration:
enable_extension 'uuid-ossp'
As far as I know, a UUID is a long unique string based on some RFCs, and this enables the db (in this case pg) to have a column of type uuid.
My question is: why is this type of column needed rather than just a string column?
Is it meant to replace the regular integer id column, so you have a UUID as the id instead?
Is there any advantage to using a uuid column as the id instead of just having a string-type column contain a UUID?
I was hoping to see more people chime in here, but I think the idea of the UUID is to replace the id column with a globally unique id, which is especially useful when you've got a distributed database or are dealing with replication.
Pros:
Easier to merge data
Better scaling when/if you have to move to a distributed system
Avoids Postgres sequence problems which often occur when merging or copying data
You can generate them from other platforms (other than just the database, if you need)
Lets you obfuscate your records - accessing users/1 (the id) might prompt a curious user to try users/2 to see if they can access someone else's information, since the sequential nature of the parameter is obvious. Obviously there are other ways of dealing with this particular issue, however
Cons:
Requires a larger key length than a typical id
Is usually non-sequential (which can lead to strange behavior if you're ordering on it, which you probably shouldn't be doing generally anyhow)
Harder to reference when troubleshooting (finding by a long UUID rather than a simple integer id)
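If you do go this route in Rails with PostgreSQL, a rough migration sketch might look like the following (Rails 4-era syntax; the table names are invented purely for illustration):
# Sketch only: UUID primary keys backed by the uuid-ossp extension from the question.
class CreateWidgets < ActiveRecord::Migration
  def change
    enable_extension 'uuid-ossp'

    # With the extension enabled, id: :uuid gives a uuid primary key that
    # defaults to uuid_generate_v4() on PostgreSQL.
    create_table :widgets, id: :uuid do |t|
      t.string :name
      t.timestamps
    end

    # Tables referencing a UUID-keyed table need uuid foreign key columns too.
    create_table :gadgets, id: :uuid do |t|
      t.uuid :widget_id
      t.timestamps
    end
  end
end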
Here are some more resources which I found valuable:
Peter van Hardenberg's (of Heroku) argument for UUIDs (among other things; this is an amazing presentation and you should watch all of it). Here's the part on using UUIDs rather than ids: http://vimeo.com/61044807#t=15m04s
Jeff Atwood's (formerly of StackOverflow) argument for GUIDs: http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
http://rny.io/rails/postgresql/2013/07/27/use-uuids-in-rails-4-with-postgresql.html
http://blog.crowdint.com/2013/10/09/using-postgres-uuids-as-primary-keys-on-rails.html
It is not necessary to install that extension to use the uuid type. There are two advantages to using the uuid type instead of a text type. The first is automatic input validation:
select 'a'::uuid;
ERROR: invalid input syntax for uuid: "a"
The second is storage space: a uuid only uses 16 bytes, while the hex representation stored as text takes 33:
select
pg_column_size('0123456789abcdef0123456789abcdef'),
pg_column_size('0123456789abcdef0123456789abcdef'::uuid)
;
pg_column_size | pg_column_size
----------------+----------------
33 | 16
The uuid-ossp extension just adds functions to generate UUIDs.
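For example, the application can also generate the value itself and insert it into a uuid column; a small Ruby illustration (SecureRandom here is just one option, not something required by the uuid type):
require 'securerandom'
# Generate a random (version 4) UUID in application code rather than in the database.
uuid = SecureRandom.uuid
# => e.g. "f2f1e8a0-1b3c-4c1d-9e2f-0a1b2c3d4e5f"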
Within the scripting language I am implementing, valid IDs can consist of a sequence of numbers, which means I have an ambiguous situation where "345" could be an integer, or could be an ID, and that's not known until runtime. Up until now, I've been handling every case as an ID and planning to handle the check for whether a variable has been declared under that name at runtime, but when I was improving my implementation of a particular bit of code, I found that there was a situation where an integer is valid, but any other sort of ID would not be.

It seems like it would make sense to handle this particular case as a parsing error so that, e.g., the following bit of code that activates all picks with a spell level tag greater than 5 would be considered valid:
foreach pick in hero where spell.level? > 5
pick.activate[]
nexteach
but the following which instead compares against an ID that can't be mistaken for an integer constant would be flagged as an error during parsing:
foreach pick in hero where spell.level? > threshold
pick.activate[]
nexteach
I've considered separate tokens, ID and ID_OR_INTEGER, but that means having to handle that ambiguity everywhere I'm currently using an ID, which is a lot of places, including variable declarations, expressions, looping structures, and procedure calls.
Is there a better way to indicate a parsing error than to just print to the error log, and maybe set a flag?
I would think about it differently. If an ID is "just a number" and plain numbers are also needed, I would say any string of digits is a number, and a number might designate an ID in some circumstances.
For bare integer literals (like 345), I would have the tokenizer return, say, a NUMBER token, indicating it found an integer. In the parser, wherever you currently accept ID, accept NUMBER too, and call a lookup function to verify that the NUMBER is a valid ID.
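A rough sketch of that idea, in Ruby purely for illustration (the question's implementation language isn't stated, and the token and helper names here are made up):
# Treat every run of digits as a NUMBER token and every word as a NAME token,
# then let the parser consult the symbol table to decide what a NUMBER means.
Token = Struct.new(:type, :text)

def tokenize(source)
  source.scan(/\d+|[A-Za-z_][A-Za-z0-9_.?]*|\S/).map do |lexeme|
    case lexeme
    when /\A\d+\z/     then Token.new(:NUMBER, lexeme)
    when /\A[A-Za-z_]/ then Token.new(:NAME, lexeme)
    else                    Token.new(:SYMBOL, lexeme)
    end
  end
end

# Where the grammar expects an ID, check whether the token names a declared ID;
# a NUMBER that is not a declared ID falls back to being an integer literal.
def resolve(token, declared_ids)
  return token.text if declared_ids.include?(token.text)
  return Integer(token.text) if token.type == :NUMBER
  raise "undeclared identifier: #{token.text}" # or record a parse error here
end

tokens = tokenize('spell.level? > 345')
p resolve(tokens.last, ['345']) # => "345" (treated as an ID)
p resolve(tokens.last, [])      # => 345   (treated as an integer literal)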
I might have misunderstood your question. You start by talking about "345", but your second example has no integer strings.
I really like the Freebase and World Bank type providers, and I would like to learn more about type providers by writing one of my own. The European Union has an open data program where you can access data through SPARQL/linked data. Would it be possible to wrap access to open EU data in a type provider, or would it be a waste of time trying to figure out how to do it?
Access to EU data is described here: http://open-data.europa.eu/en/linked-data
I think it is certainly possible - I talked with some people who are actually interested in this (and are working on it, but I'm not sure what the current status is). Anyway - I definitely think this is such a broad area that an additional effort would not be a waste of time.
The key problem with writing a type provider for RDF-like data is deciding what to treat as types (what should become the name of a type or a property name) and what should be left as values (returned as a list or key-value pairs). This is quite obvious for WorldBank - names of countries & properties become types (property names) and values become data. But for a triple-based data set, this is less obvious.
So far, I think there are two approaches:
Additional ontology - require that the data source comes with some additional ontology that specifies what the keys for navigation are. There is something called a "facet ontology", which is used on http://mspace.fm, and that might be quite interesting.
Parameterization - parameterize the type provider (in some way) and give it a list of relations that should become available at the type level (and you would probably also need to provide some root where to start).
There are definitely other possibilities - and I think having provider for linked data would be really interesting. If you wanted to do this for F# Data, there is a useful page on contributing :-).
Simple question. I have several different models stored in SQL databases: a table of image records with byte data, a large multi-field user data table, etc. All of these models require primary keys. Most beginner tutorials show usage of int for ids. Is this standard and professional? I find it odd to use int, since it is variable in length and starts at 1 :S
Sorry for the amateur question, but I couldn't find any adequate materials on the subject via google.
There's nothing inherently unprofessional about the use of INT or any other integral data type as a primary key or identity column. It just depends on your business needs. In the interest of keeping programming as simple as possible, the INT data type is a fine choice for many business needs:
It is capable of representing about 2.1 billion unique records.
It consumes minimal hard disk space in modern hardware.
It is fast for SELECTs and UPDATEs.
It is fairly easy to refactor up to a larger integral type if the number of records threatens to exceed the limits. BIGINT can address more records than you can put in your database. Seriously.
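For concreteness, the arithmetic behind those two claims (assuming signed 32-bit INT and signed 64-bit BIGINT; shown here as Ruby expressions):
# Upper limits of signed 32-bit and 64-bit integers.
2**31 - 1 # => 2147483647 (~2.1 billion rows for INT)
2**63 - 1 # => 9223372036854775807 (~9.2 quintillion rows for BIGINT)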
One reason you might not want to use an integral primary key:
You might use a Guid (or UNIQUEIDENTIFIER in SQL Server terms) for global uniqueness. In other words, if you migrate your data to another location, the primary key should still be unique.
Yes, int is industry standard.
Even beyond databases, I rarely see C# code with uint or any of the other variants for representing whole numbers. Occasionally byte is used in arrays. long is used when int may not be big enough to cover all possibilities.
One advantage of always using int is that you can pass id variables around without having to worry about casting between the different integer types.
This is 100% OK, and widely used. Some use longs for primary keys, since their max value is bigger, though that's not necessary on most occasions.
The Guid type is also sometimes used as an ID. It has some benefits, like fixed length, global uniqueness and unpredictability, and some issues, like lower search performance and being hard to remember.
We are in need of developing a back end application that can parse a full name into
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
Challenge here is that it has to support names of multiple countries and languages. One assumption that we have is we will always get a country and language along with the full name as input.
The full name may come in any format. For the same country / language combination, it may come as first name then last name, or the reverse. A comma will not be part of the full name.
Is this feasible? We are also open to any commercially available software.
I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.
Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" appears to be a commercial solution satisfying the original request for the parsing of a [free-form unstructured] personal name [full name]; apparently with a degree of certainty in regards to resolving some of the name ambiguity issues alluded to in other responses.Note: I have no personal experience nor association with the product, I had merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question, ameliorating some of the previously alluded-to difficulties in accomplishing the task: if I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)", which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize, and match names."
FWIW, I also ran across a "Name Parser" which uses a database of ~140K names, as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm
The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.
Ask yourself: do you really need the different parts of a name? Parsing names is inherently undoable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic", non-splittable entity.
Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a JavaScript library in the Node package manager (npm):
https://npmjs.org/package/name-parser
I wrote a simple human name parser in javascript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III'
, attrs = human.parseName(fullName);
console.log(attrs);
//produces the following output
{ salutation: 'Mr.',
firstName: 'William',
suffix: 'III',
lastName: 'Jenkins',
middleName: 'R.',
fullName: 'Mr. William R. Jenkins, III' }
A basic algorithm could do the following:
First see if the incoming string starts with a title such as Mrs, checking against a fixed list of titles, and remove it if it does.
If the remainder contains exactly one space, assume the first word is the first name and the second word is the surname (which will be incorrect at times).
To go beyond that would be lots of work; see How to parse full names to identify avenues for improvement, and see these involved IBM docs for further implementation clues.
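A rough Ruby sketch of that basic algorithm (the title list and the fallback behaviour are simplified assumptions, not a complete solution):
# Naive splitter: strip a known title, then treat "First Last" as
# first name + surname; anything more complicated is left unparsed.
TITLES = ['Mr', 'Mrs', 'Ms', 'Miss', 'Dr', 'Prof'].freeze

def naive_parse(full_name)
  parts = full_name.strip.split(/\s+/)
  result = {}
  result[:title] = parts.shift if parts.first && TITLES.include?(parts.first.delete('.'))
  if parts.length == 2
    result[:first_name], result[:last_name] = parts
  else
    result[:unparsed] = parts.join(' ') # punt on anything more complicated
  end
  result
end

naive_parse('Dr. Jane Smith')
# => {:title=>"Dr.", :first_name=>"Jane", :last_name=>"Smith"}
naive_parse('Ralph Vaughan Williams')
# => {:unparsed=>"Ralph Vaughan Williams"} (exactly the hard case noted earlier)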
"Ashton Jordan" "Jordan Ashton" -- u can't tell which is the surname and which is the give name.
Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names) then maybe you can use that to identify other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...) And if there is ambiguity your name-parser could ask for human input.
As others have explained, the problem is not solvable. The best approach I can think of for storing names is to store the full name, followed by the start (and potentially also ending) offsets of a "primary collating subfield", which the person entering the name could have indicated by highlighting it or such. For example
John Robert *Miller*, Jr.
where the emphasis (boldface in the original) indicates what was marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
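A rough sketch of turning such a stored offset into a collating key (the field names and the offset convention are my own invention here, not an established scheme):
# Hypothetical record: the full name plus the start/length of the part the
# user highlighted as the "primary collating subfield".
FullName = Struct.new(:text, :collate_start, :collate_length) do
  # Move the marked subfield to the front when building the sort key.
  def collation_key
    marked = text[collate_start, collate_length]
    rest = text[0...collate_start] + text[(collate_start + collate_length)..-1]
    "#{marked} #{rest.strip}"
  end
end

name = FullName.new('John Robert Miller, Jr.', 12, 6) # "Miller" marked
name.collation_key # => "Miller John Robert , Jr."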
Of course this approach alone may not be sufficient if you also want to support titles (and ignoring them for collation purposes)...
I'm looking for a Rails way to create a very secure UID that will act as an authentication token.
I had been using UUIDs but was told they are not secure. I'd like to learn: what is the method of choice these days in Ruby/Rails 3?
This question is in no way Rails specific.
A UUID is not secure, for the simple fact that it is a unique identifier and it can contain 'constant' parts for a given machine (e.g. it might use the machine's MAC address), which makes it easier to guess.
If you want 100k+ strings without someone guessing one, you need to be able to distribute your keys across a large key-space. Let me explain:
If you only need 1 key (let's say), you might pick 'A'. In a key-space of A-Z you have a 1:26 chance of guessing it. Now, if you extend your key-space to A-Za-z, you have a 1:52 chance of guessing it.
Need more security still? Use a longer key: 'AA' gives a 1:2704 chance.
Now, if you want to have 2000 keys and use a key length of 2 (e.g. 'AA'), there's a 2000:2704 => 1:1.352 chance that someone might guess one. Pretty bad.
So, the key here is to pick a very long key. With Digest::SHA1 you get 40-character keys (using hex, with 16 different values per character). That's 1.46150164e48 unique values. Your 100k values should be safe enough in that space.
Edit:
With 40-digit hex SHA1 values you have roughly a 1:1.46e48 chance of guessing one. That takes ages.
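For what it's worth, the approach I'd reach for these days (an assumption on my part, not something mandated by Rails 3 itself) is Ruby's SecureRandom, which draws its bytes from a cryptographically secure source:
require 'securerandom'

# Each call produces a token drawn from a key-space vastly larger than the
# 100k+ tokens needed, so collisions and guesses are both extremely unlikely.
SecureRandom.hex(20)            # => 40 hex characters (160 bits of randomness)
SecureRandom.urlsafe_base64(20) # => ~27 URL-safe characters (same 160 bits)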