In a now-migrated question about human-readable URLs, I allowed myself to elaborate on a little hobby-horse of mine:
When I encounter URLs like http://www.example.com/product/123/subpage/456.html I always think that this is an attempt at creating meaningful, hierarchical URLs which, however, is not entirely hierarchical. What I mean is, you should be able to slice off one level at a time. The above URL violates this principle in two places:
/product/123 is one piece of information represented as two levels. It would be more correctly represented as /product:123 (or with whatever delimiter you like).
/subpage is very likely not an entity in itself; i.e., you cannot go up one level from 456.html because http://www.example.com/product/123/subpage is "nothing".
Therefore, I find the following more correct:
http://www.example.com/product:123/456.html
Here, you can always navigate up one level at a time:
http://www.example.com/product:123/456.html — The subpage
http://www.example.com/product:123 — The product page
http://www.example.com/ — The root
Following the same philosophy, the following would make sense [and provide an additional link to the products listing]:
http://www.example.com/products/123/456.html
Where:
http://www.example.com/products/123/456.html — The subpage
http://www.example.com/products/123 — The product page
http://www.example.com/products — The list of products
http://www.example.com/ — The root
My primary motivation for this approach is that if every "path element" (delimited by /) is self-contained¹, you will always be able to navigate to the "parent" by simply removing the last element of the URL. This is what I (sometimes) do in my file explorer when I want to go to the parent directory. By the same logic, a user (or a search engine / crawler) can do the same. Pretty smart, I think.
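The "slice off one level" operation is trivial to implement mechanically. A minimal Python sketch (the function name and sample URLs are mine) of what a user, or a crawler, effectively does when amputating a URL:

```python
from urllib.parse import urlsplit, urlunsplit

def parent_url(url):
    """Return the URL one hierarchical level up, or the root if already there."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Drop the last non-empty path segment; discard query and fragment.
    segments = [s for s in path.split("/") if s]
    parent_path = "/" + "/".join(segments[:-1])
    return urlunsplit((scheme, netloc, parent_path, "", ""))

parent_url("http://www.example.com/product:123/456.html")
# -> "http://www.example.com/product:123"
parent_url("http://www.example.com/product:123")
# -> "http://www.example.com/"
```

If every segment is self-contained, every intermediate result of repeatedly applying this function is itself a valid page.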
On the other hand (and this is the important bit of the question): while I can never prevent a user from trying to access a URL he himself has amputated, am I wrong to assume (and honour the assumption) that a search engine might do the same? I.e., is it reasonable to expect that no search engine (or really: Google) would try to access http://www.example.com/product/123/subpage (point 2, above)? Or am I really only taking the human factor into account here?
This is not a question about personal preference. It's a technical question about what I can expect of a crawler / indexer and to what extent I should take non-human URL manipulation into account when designing URLs.
Also, the structural "depth" of http://www.example.com/product/123/subpage/456.html is 4, whereas http://www.example.com/products/123/456.html is only 3. Rumour has it that this depth influences search engine ranking; at least, so I was told. (It is now evident that SEO is not what I know most about.) Is this (still?) true: does the hierarchical depth (number of directories) influence search ranking?
So, is my "hunch" technically sound or should I spend my time on something else?
Example: Doing it (almost) right
Good ol' SO gets this almost right. Case in point: profiles, e.g., http://stackoverflow.com/users/52162:
http://stackoverflow.com/users/52162 — Single profile
http://stackoverflow.com/users — List of users
http://stackoverflow.com/ — Root
However, the canonical URL for the profile is actually http://stackoverflow.com/users/52162/jensgram which seems redundant (the same end-point represented on two hierarchical levels). Alternative: http://stackoverflow.com/users/52162-jensgram (or any other delimiter consistently used).
¹) Carries a complete piece of information not dependent on "deeper" elements.
Hierarchical URLs of this kind, "http://www.example.com/product:123/456.html", are as useless as "http://www.example.com/product/123/subpage", because when users see your URLs they don't care about identifiers from your database; they want meaningful paths. This is why Stack Overflow puts question titles into URLs: "http://stackoverflow.com/questions/4017365/human-readable-urls-preferably-hierarchical-too".
Google advises against replacing the usual query strings like "http://www.example.com/?product=123&page=456", because when every site develops its own scheme, the crawler doesn't know what each part means or whether it's important. Google has invented sophisticated mechanisms to find the important parameters and ignore the unimportant ones, which means more of your pages get into the index and there are fewer duplicates. But these algorithms often fail when web developers invent their own schemes.
If you care about both users and crawlers, you should use URLs like this instead:
http://www.example.com/products/greatest-keyboard/benefits — the subpage
http://www.example.com/products/greatest-keyboard — the product page
http://www.example.com/products — the list of products
http://www.example.com/ — the root
Also, search engines give a higher rating to pages with keywords in the URL.
Related
I am following a course titled Natural Language Processing on Coursera, and while the course is informative, I wonder if its contents cater to what I am looking for. Basically, I want to implement a textual version of Cortana or Siri as a project, i.e., the user can enter commands for the computer in natural language, and they will be processed and translated into appropriate OS commands. My questions are:
What is the general sequence of steps for the above applications, after processing the speech? Do they tag the text and then parse it, or do they take some other approach?
Under which application of NLP does this fall? Can someone cite some good resources for it? My only doubt is whether what I am following now will serve any important part of my goal or not.
What you want to create can be thought of as a carefully constrained chat-bot, except you are not attempting to hold a general conversation with the user, but to process specific natural language input and map it to specific commands or actions.
In essence, you need a tool that can pattern match various user input, with the extraction or at least recognition of various important topic or subject elements, and then decide what to do with that data.
Rather than get into an abstract discussion of natural language processing, I'm going to make a recommendation instead. Use ChatScript. It is a free, open-source tool for creating chat-bots that recently took first place in the Loebner chat-bot competition, as it has several times in the past:
http://chatscript.sourceforge.net/
The tool is written in C++, but you don't need to touch the source code to create NLP apps; just use the scripting language provided by the tool. Although initially written for chat-bots, it has expanded into an extremely programmer friendly tool for doing any kind of NLP app.
Most importantly, you are not boxed in by the philosophy of the tool or limited by the framework it provides. It has all the power of most scripting languages, so you won't find yourself going most of the distance towards completing your app only to hit some crushing limitation during the last mile that defeats your app, or at least cripples it severely.
It also includes a large number of ontologies that can significantly jump-start your development efforts, and its built-in pre-processor does part-of-speech parsing, input conformance, and many other tasks crucial to writing script that can easily be generalized to handle large variations in user input. It also has a full interface to the WordNet synset database. There are many other important features in ChatScript that make NLP development much easier, too many to list here. It can run on Linux or Windows as a server that can be accessed over a TCP/IP socket connection.
Here's a tiny and overly simplistic example of some ChatScript script code:
# Define the list of available devices in the user's household.
concept: ~available_devices( green_kitchen_lamp stove radio )
#! Turn on the green kitchen lamp.
#! Turn off that damn radio!
u: ( turn _[ on off ] *~2 _~available_devices )
    # Save off the desired action found in the user's input: ON or OFF.
    $action = _0
    # Save off the name of the device the user wants to turn on or off.
    $target_device = _1
    # Launch the utility that turns devices on and off.
    ^system( devicemanager $action $target_device )
Above is a typical ChatScript rule. Your app will have many such rules. This rule is looking for commands from the user to turn various devices in the house on and off. The # character indicates a line is a comment. Here's a breakdown of the rule's head:
It consists of the prefix u:. This tells ChatScript that the rule accepts user input in either statement or question format.
It consists of the match pattern, which is the content between the parentheses. This match pattern looks for the word turn anywhere in the sentence. Next it looks for the desired user action. The square brackets tell ChatScript to match either the word on or the word off. The underscore preceding the square brackets tells ChatScript to capture the matched text, the same way parentheses do in a regular expression. The *~2 token is a range-restricted wildcard. It tells ChatScript to allow up to 2 intervening words between the word turn and the concept set named ~available_devices.
~available_devices is a concept set. It is defined above the rule and contains the set of known devices the user can turn on and off. The underscore preceding the concept set name tells ChatScript to capture the name of the device the user specified in their input.
If the rule pattern matches the current user input, it "fires" and the rule's body executes. The contents of this rule's body are fairly obvious, and the comments above each line should help you understand what the rule does if fired. It saves off the desired action and the desired target device, captured from the user's input, to variables. (ChatScript variable names are preceded by a single or double dollar sign.) Then it shells out to the operating system to execute a program named devicemanager that will actually turn the desired device on or off.
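For readers who know regular expressions better than ChatScript, here is a loose Python analogue of that rule's pattern. This is purely illustrative (the device list is from the example above; ChatScript does not work like this internally, and this regex is far cruder than a real ChatScript pattern):

```python
import re

# Known devices, mirroring the ~available_devices concept set above.
DEVICES = ["green kitchen lamp", "stove", "radio"]

# "turn", a captured on/off, up to two intervening words, then a known device.
pattern = re.compile(
    r"\bturn\s+(on|off)\b(?:\s+\w+){0,2}?\s+("
    + "|".join(map(re.escape, DEVICES))
    + r")\b",
    re.IGNORECASE,
)

m = pattern.search("Please turn off that damn radio!")
if m:
    action, target_device = m.group(1), m.group(2)  # "off", "radio"
```

The two captured groups correspond roughly to ChatScript's _0 and _1 match variables.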
I wanted to point out one of ChatScript's many features that make it a robust and industrial-strength NLP tool. If you look above the rule you will see two sentences prefixed by a string consisting of the characters #!. These are not comments but validation sentences. You can run ChatScript in verify mode. In verify mode it will find all the validation sentences in your scripts. It will then apply each validation sentence to the rule immediately following it. If the rule pattern does not match the validation sentence, an error message will be written to a log file. This makes each validation sentence a tiny, easy-to-implement unit test. So later, when you make changes to your script, you can run ChatScript in verify mode and see if you broke anything.
I discovered a use case for matching Request URLs using an expression that ignores part of the URL path (the path, as opposed to ignoreSearch, which ignores the query string).
The use case is for an image processing service used in a responsive design where the dimensions of the image are encoded in the url path. This is sort of common among these kinds of services (Cloudinary, Firesize, even Lorempixel).
I noticed that every once in a while, one of the dimension components of a request will be off by one pixel. The required dimensions are calculated on the client, which is where the rounding error originates, but the service worker cache could be an elegant solution for this variation.
However, this rounding problem results in a cache miss because I can't specify that part of the url path can be ignored.
Will url expression matching ever become part of the spec?
In general, is it OK for the 'fetch with URL A, cache put/match with URL B' pattern to grow?
I get that the workaround for this is the same as the current workaround for ignoreSearch (until its implementation), wherein you fetch with one URL but cache with another. I'm just wondering if URL path expression matching will ever become part of the spec, or if a URL expression matching use case has been considered. I don't see any evidence of this in the authoritative spec.
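The dimension-bucketing at the heart of that workaround can be sketched language-neutrally; here it is in Python for brevity (the w_...,h_... path scheme is made up, and in a real service worker this logic would run in the fetch handler to derive the cache key):

```python
import re

def normalize_dimensions(path, bucket=10):
    """Round w_/h_ dimension segments to the nearest multiple of `bucket`,
    so off-by-one requests collapse onto a single cache key.
    The w_...,h_... path scheme here is hypothetical."""
    def round_dim(m):
        return f"{m.group(1)}{bucket * round(int(m.group(2)) / bucket)}"
    return re.sub(r"([wh]_)(\d+)", round_dim, path)

# Both the intended and the off-by-one request map to the same key:
normalize_dimensions("/w_300,h_200/photo.jpg")  # "/w_300,h_200/photo.jpg"
normalize_dimensions("/w_301,h_199/photo.jpg")  # "/w_300,h_200/photo.jpg"
```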
Thanks in advance for any words of insight.
It might be considered at some point in the future if it becomes a dominant pattern in many applications. Usually, if something is fairly common, it will eventually be included in the standard so it can be made faster and more reliable. I wouldn't count on it happening anytime soon, though, and not before many libraries have support for it.
I know that PO / MO files are meant to be used for small strings like button names, labels, etc., not long text like an About page.
But lately I am encountering a lot of situations that are in the middle. For example, a two sentence call to action. Or a short paragraph.
Is there best practice or "rule of thumb" for when a string is too long to put in a PO file?
update
For "long" text I use partials and include the correct language version. My question is when it is optimal to use one vs. the other. I've heard that PO files are "inefficient" for "long" pieces of text. But what does that mean, and when is a text too "long"? Or is this not a concern?
Use one entry for a self-contained chunk of text; e.g. a sentence as you say.
Two sentences that belong together and don't make sense without each other should be one entry. Why? Because otherwise the translator wouldn't have the context necessary to translate it well. Same goes for a short paragraph, e.g. explaining a setting: if it's inseparable in the code, it should be one entry.
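For example, a two-sentence call to action of the kind described would be a single entry; a sketch of what that looks like in a PO file (the msgid and source reference are invented for illustration):

```po
#: templates/backup_settings.html:12
msgid "Enable automatic backups. Your data will be copied to our servers every night."
msgstr ""
```

The translator sees both sentences together and can restructure them freely in the target language.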
If you encounter a situation where you have lots of long texts regularly (e.g. entire pages or paragraphs of pages), that's usually a sign that you are using an ill-fitting tool. Some people do it, using Gettext for entire articles, but you're better off having separate documents in such cases. But that doesn't seem to be the case here.
I am looking to write a basic profanity filter in a Rails-based application. This will use a simple search-and-replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written one of these before: is there a CSV file or some database out there from which a list of profanity words can be imported into my database? We are supplying the words that we will replace the profanities with on our own. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone who is looking to tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes over the String. The more passes you make, the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle them. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this).
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ASCII art and Unicode to replace characters (\/ = v; those are slashes). There are a lot of Unicode characters that look like English characters, and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
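To make the whitespace, punctuation, and leet-speak points above concrete, here is a minimal normalization pass as a sketch (the substitution table is illustrative; a real filter needs far more than this, including Unicode confusables, whitelisting and language rules):

```python
import re

# Map a few common leet substitutions back to letters (illustrative only).
LEET = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "7": "t"})

def normalize(text):
    text = text.lower().translate(LEET)
    # Strip punctuation used to break words apart: l.i.k.e th---is
    text = re.sub(r"[^\w\s]", "", text)
    # Re-join single letters separated by spaces: b e c a u s e
    text = re.sub(r"\b(?:\w )+\w\b", lambda m: m.group(0).replace(" ", ""), text)
    return text

normalize("l.i.k.e th---is")  # "like this"
normalize("b 4 d")            # "bad"
```

The filter would then run its word matching against the normalized text rather than the raw input.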
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context and profane in another, so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names, nouns, countries, terms of endearment, etc. And there are myriad ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long it would be if, in addition to the correct letter, every a could also be 4, every o could be 0 or p, every e could be 3, and every s could be 5. And that's a very, very short example.
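The combinatorial growth being described is easy to quantify; a quick sketch, using only the four substitutions mentioned above (the function and example word are mine):

```python
# Per-letter substitution counts from the example above:
# a -> {a, 4}; o -> {o, 0, p}; e -> {e, 3}; s -> {s, 5}
CHOICES = {"a": 2, "o": 3, "e": 2, "s": 2}

def variant_count(word):
    """Number of distinct spellings of `word` under the substitutions above."""
    n = 1
    for ch in word:
        n *= CHOICES.get(ch, 1)
    return n

variant_count("seesaws")  # 64 -- and that's with only four substitution rules
```

Multiply that across a list of thousands of words, and with a more realistic substitution table, and you quickly reach the million-plus entries described below.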
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regexp::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
I recommend you have a long talk with whoever requested this, and ask if they understand the programming issues involved, the low likelihood of accuracy and success (especially over the long term), and the possible customer backlash when users realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb
Let's say I want to refer to a restaurant page. I could use one of these two URLs, for example:
/restaurants/123
/restaurants/Pizzeria-Mamma
URL 1 has the advantage of being a quick match because of the ID, but it is not as descriptive as URL 2.
Do URLs matter to search engines? I read somewhere that it is good to put keywords in the URL too, so URL 2 would be the way to go. Can someone confirm or deny this?
Ultimately, search engine algorithms are designed to reward good usability (although obviously in practice that is not always the case). As a user, it would semantically make more sense to have the pizzeria name in the URL, and you have the added advantage of it being easier to remember. Whilst kbrimington's comment is correct that page content is probably more important, SEOmoz, a search engine algorithm authority, puts keywords in the URL at somewhere between the 9th and 11th most important ranking factor, depending on where they appear in the URL:
http://www.seomoz.org/article/search-ranking-factors#ranking-factors
At number 5 in the ranking factors, however, is anchor text. It's only an opinion, but I'm inclined to say that having the word "pizzeria" in the URL is more likely to encourage people to put "pizzeria" in the anchor text when they link to your site.