Suppose I have a URL like https://example.com?page=1 or https://example.com?text=1. What do ?page=1 and ?text=1 mean here? On some websites, like YouTube, I see URLs such as https://youtube.com?watch=zcDchec. What does that mean?
Could anyone please explain? I need to know this.
That’s the query component (indicated by the first ?).
It’s a common convention to use key=value pairs, separated by &, in the query component. But it’s up to the author what to use. And it’s up to the author to decide what it should mean.
In practice, the query component ?page=1 will likely point to page 1 of something, while ?page=2 will point to page 2, etc. But there is nothing special about this. The author could as well have used the path component, e.g. /page/1 and /page/2.
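To make the key=value convention concrete, here is a minimal sketch that reads the query component with the standard URL and URLSearchParams classes (available in browsers and in Node.js). The helper name getQueryValue and the extra sort=asc parameter are only assumptions for illustration; they are not part of the question.

```javascript
// Hypothetical helper: return the value stored under `key` in the
// query component of `urlString`, or null if the key is absent.
function getQueryValue(urlString, key) {
  const url = new URL(urlString);
  // searchParams exposes the key=value pairs that follow the first "?".
  return url.searchParams.get(key);
}

console.log(getQueryValue("https://example.com?page=1", "page"));          // "1"
console.log(getQueryValue("https://example.com?page=2&sort=asc", "sort")); // "asc"
console.log(getQueryValue("https://example.com?page=1", "text"));          // null
```

Note that the values come back as strings; what "1" means is still entirely up to the site's author.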
We need to develop a back-end application that can parse a full name into:
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
The challenge here is that it has to support names from multiple countries and languages. One assumption we have is that we will always get a country and language along with the full name as input.
The full name may come in any format. For the same country / language combination, it may come as first name then last name, or the reverse. A comma will not be part of the full name.
Is this feasible? We are also open to any commercially available software.
I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.
Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" product appears to be a commercial solution satisfying the original request to parse a [free-form, unstructured] personal name [full name], apparently with some degree of certainty in resolving the name ambiguity issues alluded to in other responses.
Note: I have no personal experience with nor association with the product. I merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question and ameliorates some of the previously mentioned difficulties in accomplishing the task: if I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)", which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize, and match names."
FWiW I also ran across a "Name Parser" which uses a database of ~140K names as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm
The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.
Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic" not-splittable entity.
Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a JavaScript library on npm (the Node package manager):
https://npmjs.org/package/name-parser
I wrote a simple human name parser in javascript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');

// Parse a full name into its component parts.
var fullName = 'Mr. William R. Jenkins, III';
var attrs = human.parseName(fullName);

console.log(attrs);
// produces the following output:
// { saluation: 'Mr.',
//   firstName: 'William',
//   suffix: 'III',
//   lastName: 'Jenkins',
//   middleName: 'R.',
//   fullName: 'Mr. William R. Jenkins, III' }
A basic algorithm could do the following:
First, check whether the incoming string starts with a title such as Mrs (comparing against a fixed list of titles) and remove it if it does.
If exactly one space remains, assume the first word is the first name and the second word is the surname (which will be incorrect at times).
Going beyond that would be a lot of work; see How to parse full names to identify avenues for improvement, and see these involved IBM docs for further implementation clues.
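The two steps above can be sketched roughly as follows. This is only an illustration: the TITLES list, the function name parseSimpleName, and the shape of the returned object are assumptions of mine, and anything that is not "optional title plus exactly two words" is deliberately left unparsed rather than guessed.

```javascript
// Hypothetical fixed list of titles; a real list would be much longer.
const TITLES = ["mr", "mrs", "ms", "miss", "dr", "prof"];

function parseSimpleName(fullName) {
  const words = fullName.trim().split(/\s+/);
  let title = null;

  // Step 1: strip a leading title (case-insensitive, trailing dot ignored).
  if (words.length > 1) {
    const candidate = words[0].replace(/\.$/, "").toLowerCase();
    if (TITLES.includes(candidate)) {
      title = words.shift();
    }
  }

  // Step 2: exactly two words left (one space), so guess "first last".
  // This guess will be wrong at times, e.g. for family-name-first input.
  if (words.length === 2) {
    return { title, firstName: words[0], lastName: words[1], unparsed: null };
  }

  // Anything else is returned as-is instead of being guessed.
  return { title, firstName: null, lastName: null, unparsed: words.join(" ") };
}

console.log(parseSimpleName("Mrs Jane Smith"));
console.log(parseSimpleName("Ralph Vaughan Williams"));
```

The second call shows the limit of the approach: with three words left, the sketch gives up, which is arguably safer than splitting "Vaughan Williams" incorrectly.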
"Ashton Jordan" "Jordan Ashton" -- you can't tell which is the surname and which is the given name.
Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names); then maybe you can use that to identify the other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...). And if there is ambiguity, your name parser could ask for human input.
As others have explained, the problem is not solvable. The best approach I can think of to storing names is storing the full name, followed by the start (and potentially also ending) offsets into a "primary collating subfield" which the person entering the name could have indicated by highlighting it or such. For example
John Robert Miller, Jr.
where the boldface is indicating what was marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
Of course, this approach alone may not be sufficient if you also want to support titles (and ignore them for collation purposes)...
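A minimal sketch of the offset idea, assuming the stored offsets are a start index and an exclusive end index into the full name. The function name collatingKey is hypothetical, and I am assuming "Miller" is the range that was marked in the example above.

```javascript
// Build a collating key by moving the marked range to the front.
// `start` and `end` are offsets into `fullName` (end is exclusive).
function collatingKey(fullName, start, end) {
  const primary = fullName.slice(start, end);
  // Everything outside the marked range, with whitespace tidied up.
  const rest = (fullName.slice(0, start) + fullName.slice(end))
    .replace(/\s+/g, " ")
    .trim();
  return rest ? primary + " " + rest : primary;
}

// "Miller" occupies offsets 12 to 18 in this string (assumed marking).
console.log(collatingKey("John Robert Miller, Jr.", 12, 18));
// "Miller John Robert , Jr."
```

The full name is stored untouched; only the key used for sorting is rearranged, so no guess about first, middle, or last name is ever needed.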
Why use - instead of _ in URLs?
A URL containing '_' seems to have no bad effects.
Underscores are not allowed in a host name. Thus some_place.com is not a valid URL because the host name is not valid. Underscores are permissible elsewhere in URLs. Thus some-place.com/which_place/ is perfectly legitimate, other concerns aside.
From RFC 1738:
host
[...] Fully qualified domain names take the form as described
in Section 3.5 of RFC 1034 [13] and Section 2.1 of RFC 1123
[5]: a sequence of domain labels separated by ".", each domain
label starting and ending with an alphanumerical character and
possibly also containing "-" characters. The rightmost domain
label will never start with a digit, though, which
syntactically distinguishes all domain names from the IP
addresses.
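The rule quoted above can be expressed as a short check. This is a sketch of just the quoted text (labels of alphanumerics and "-", starting and ending with an alphanumeric, rightmost label not starting with a digit); it does not check length limits or anything else from RFC 1034 / RFC 1123, and the name isValidHostname is my own.

```javascript
// One domain label: starts and ends with an alphanumeric character,
// with alphanumerics or "-" allowed in between.
const LABEL = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/i;

function isValidHostname(host) {
  const labels = host.split(".");
  if (labels.some((label) => !LABEL.test(label))) {
    return false;
  }
  // The rightmost label never starts with a digit, which is what
  // distinguishes domain names from IP addresses.
  return !/^[0-9]/.test(labels[labels.length - 1]);
}

console.log(isValidHostname("some-place.com")); // true
console.log(isValidHostname("some_place.com")); // false
console.log(isValidHostname("192.168.0.1"));    // false
```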
When you read a_long_sentence_with_many_underscores, because you are reading it by letter or word recognition, your eye tracks along the middle of the line, but when you reach an underscore, your eye is more likely to track down a bit and back up for the next word.
When you read a-long-sentence-with-many-dashes, your eye keeps tracking along the same horizon, and by sight, it is easier for your brain to try and ignore them.
Another good reason is that Google and other search engines rank urls that match to search terms higher when the word separator is a dash.
One main reason is that most anchor tags have text-decoration:underline which effectively hides your underscore.
And a non-tech-savvy user won't automatically assume that there is an underscore :)
By the way... it seems several Java network libraries will not be able to interpret a URL correctly when using underscore:
// Requires: import java.net.URI;
URI withHyphen = URI.create("http://www.google-plus.com/");
System.out.println(withHyphen.getHost()); // prints www.google-plus.com
URI withUnderscore = URI.create("http://www.google_plus.com/");
System.out.println(withUnderscore.getHost()); // prints null
It's easier to type (at least on my German keyboard) and see.
I just read this in a comment on another question, titled Effective Googling for short names:
C# isn't bad to Google for at all. It would be a lot harder if it were called M#, by the way.
Why? What am I missing?
It turns out I was somewhat wrong. I had thought that C# just happened to benefit from an understanding of musical keys - a search for "G#" finds plenty of results about the musical key of G#. (This is shown by experimentation, by the way - despite working at Google I don't know anything about the search engine. At least, not on this front.)
However, in this case not only does C# benefit from the musical key side of things, but Google's own help pages explain that C# and other programming languages are special-cased:
Punctuation that is not ignored
Punctuation in popular terms that have particular meanings, like [ C++ ] or [ C# ] (both are names of programming languages), are not ignored.
The dollar sign ($) is used to indicate prices. [ nikon 400 ] and [ nikon $400 ] will give different results.
The hyphen - is sometimes used as a signal that the two words around it are very strongly connected. (Unless there is no space after the - and a space before it, in which case it is a negative sign.)
The underscore symbol _ is not ignored when it connects two words, e.g. [ quick_sort ].
It would be interesting to know how long it would take a theoretical language "M#" to become searchable... but I'm not going to start speculating on that in a public forum :)
(Note that the Spec# home page comes up as the second link when you search Google for Spec#. At least it's there and pretty prominent though.)
I'll put up my opinion extrapolated from my comment.
As others have suggested, special chars are ignored by Google. But C# may have had a head start in not being ignored (or at least not being turned into "C") because of the musical note C#, which was probably allowed for searches like "Some piece of music in C#". M# would not have benefited in the same way.
Is it better convention to use hyphens or underscores in your URLs?
Should it be /about_us or /about-us?
From a usability point of view, I personally think /about-us is much better for the end user, yet Google and most other websites (and JavaScript frameworks) use the underscore naming pattern. Is it just a matter of style? Are there any compatibility issues with dashes?
From Google Webmaster Central
Consider using punctuation in your URLs. The URL http://www.example.com/green-dress.html is much more useful to us than http://www.example.com/greendress.html. We recommend that you use hyphens (-) instead of underscores (_) in your URLs.
Here are a few points in favor of the dashes:
Dashes are recommended by Google over underscores (source).
Dashes are more familiar to the end user.
Dashes are easier to write on a standard keyboard (no need to Shift).
Dashes don't hide behind underlines.
Dashes feel more native in the context of URLs as they are allowed in domain names.
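If you go with dashes, generating such URL segments from page titles is straightforward. A rough sketch (the function name slugify is my own, and it only handles ASCII letters and digits; any other character is treated as a separator):

```javascript
// Turn a human-readable title into a dash-separated URL segment.
function slugify(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // every run of other characters becomes one dash
    .replace(/^-+|-+$/g, "");    // no leading or trailing dashes
}

console.log("/" + slugify("About Us"));     // "/about-us"
console.log("/" + slugify("Green Dress!")); // "/green-dress"
```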
It's not just dash vs. underscore:
text with spaces
textwithoutspaces
encoded%20spaces%20in%20URL
underscore_means_space
dash-means-space
plus+means+space
camelCase
PascalCase
"quoted text with spaces" (and single quote vs. double quote)
slash/means/space
dot.means.space
Google did not treat underscore as a word separator in the past, which I thought was pretty crazy, but apparently it does now. Because of this history, dashes are preferred. Even though underscores are now permissible from an SEO point of view, I still think that dashes are best.
One benefit is that your average semi-computer-illiterate web surfer is much more likely to be able to type a dash on the keyboard; they may not even know what the underscore is.
This is just a guess, but it seems they picked the one that people most probably wouldn't use in a name. This way you can have a name that includes a hyphenated word, and still use the underbar as a word delimiter, e.g. UseTwo-wayLinks could be converted to use_two-way_links.
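The conversion in that example can be sketched like this; toUnderscored is a hypothetical name, and the rule (insert an underscore wherever a lowercase letter or digit is followed by an uppercase letter, then lowercase everything) is just my reading of the example, not something any framework is confirmed to do.

```javascript
// Convert a CamelCase name to underscore-separated words,
// leaving existing hyphens inside words untouched.
function toUnderscored(name) {
  return name.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

console.log(toUnderscored("UseTwo-wayLinks")); // "use_two-way_links"
console.log(toUnderscored("AboutUs"));         // "about_us"
```

Because the hyphen in "Two-way" survives, the word boundary and the in-word hyphen stay distinguishable, which is the point of the guess above.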
In your example, /about-us would be a directory named the hyphenated word "about-us" (if such a word existed), and /about_us would be a directory named the two-word phrase "about us" converted to a single string of non-white characters.
I used to use underscores all the time; now I only use them for parts of a web site that I don't want anyone to link to directly: JS files, CSS, etc.
From an SEO point of view, dashes seem to be the preferred way of handling it. For a detailed explanation from the horse's mouth, see http://www.mattcutts.com/blog/dashes-vs-underscores/.
The other problem that seems to occur, more with the general public than programmers, is that when a hyperlink with underscores is underlined, you can't see the underscore. Advanced users will work it out, but Joe Public probably won't.
Still use underscores in code in preference to dashes though - programmers understand them, most other people don't.
Jeff has some thoughts on this: https://blog.codinghorror.com/of-spaces-underscores-and-dashes/
There are drawbacks to both. I would suggest that you pick one and be consistent.
I'm more comfortable with underscores. First of all, they match in with my regular programming experience of variable_names_are_not-subtraction, second of all, and I believe this was mentioned already, words can have hyphens, but they do not ever have underscores. To pick a really stupid example, "Nation-state country" is different from "nation state country". The former translates something like "the land of nation-states" (think "this here is gun country! Best move along, y'hear?"), whereas the latter looks like a list of sometime-synonyms. http://example.com/nation-state-country/ doesn't appear to mean the same as http://example.com/nation-state_country/, and yet, if hyphens are delimiters/"space"s in addition to characters in words, it can. The latter seems more clear as to the actual purpose, whereas the former looks more like that list, if anything.
The SEO guru Jim Westergren tested this back in 2005 from a strict SEO perspective and came to the conclusion that + (plus) was actually the best word delimiter. However, this doesn't seem reasonable and may be due to a bug in the search engines' algorithms. He recommends - (dash) for both readability and SEO.
Underscores replace spaces where whitespace is not allowed. Dashes (hyphens) can be part of a word, thus joining words with hyphens that already include hyphens is ugly/confusing.
Bad:
/low-budget-movies
Good:
/low-budget_movies
I think dash is better from a user perspective and it will not interfere with SEO.
Not sure where or why the underscore convention started.
A little more knowledgeable debate
I prefer dashes on the basis that an underscore might be obscured to an extent by a link underline. Textual URLs are primarily for being recognised at a glance rather than being grammatically correct so the argument for preserving dashes for use in hyphenated words is limited.
Where the accuracy of a textual URL is important is when reading it out to someone, in which case you don't want to confuse an underscore for a space (or vice-versa).
I also find dashes more aesthetically pleasing, if that counts for anything.
From an end-user's point of view, I prefer "about-us" or "about us", not "about_us".
Personally, I'd avoid using about-us or about_us, and just use about.
Some older web hosting and DNS servers actually have problems parsing underscores for URLs, so that may play a part in conventions like these.
I personally would avoid all dashes and underscores and opt for camelCase or PascalCase if it's in code.
The Wikipedia article on camelCase explains a bit of the reasoning behind its origins. They amount to:
Lazy programmers who didn't like reaching for the _ key
Potential confusion about readability
The "Alto" keyboard at Xerox PARC that had no underscore key
If the user is to see the string, then I'd do none of the above and use "About us.", or "AboutUs" if I had to, as camelCase has spread to common usage in some areas such as product names, e.g. ThinkPad, TiVo.
You can write spaces in a link, so you can just use "/about us" (although that will be encoded to "/about%20us"). But to be honest, this will always be personal preference, so there is no real answer to be given here.
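For what it's worth, the encoding mentioned above is easy to see with the built-in encodeURI function and the URL class. The helper name toLinkPath and the example.com base are only there for illustration.

```javascript
// Percent-encode a path the way it would appear in a link.
function toLinkPath(path) {
  // encodeURI leaves "/" alone but turns each space into %20.
  return encodeURI(path);
}

console.log(toLinkPath("/about us")); // "/about%20us"

// The URL class applies the same encoding when it parses the path.
console.log(new URL("/about us", "https://example.com").pathname); // "/about%20us"

// decodeURI reverses it for display purposes.
console.log(decodeURI("/about%20us")); // "/about us"
```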
I would go with the convention that dashes can appear in words, so spaces should be converted to underscores.
It is better to use . - / as separators, because _ does not seem to be treated as a separator.
http://www.sistrix.com/blog/832-how-long-may-a-linktext-be.html