Custom URL format for news in Expression Engine - url

Our site is migrating from MovableType to ExpressionEngine, and there is one small issue we are having. MT uses a date based URL structure, e.g. www.site.com/2012/03/post-title.html, while EE uses a category based structure, e.g. www.site.com/index.php/news/comments/post-title. The issue is that our MT page used Disqus for comments, and as such comments are tied to a specific URL, meaning that we'd lose all of our comments if we were to migrate. I am wondering if there's a way to change the URL structure in EE to match MT's, thus allowing us to keep the comments. Thanks in advance.

Correction: EE uses a Template Group/Template based structure for URLs, not categories - just to clarify.
You've got a couple of options here.
One is to create an .htaccess rule which internally redirects all requests matching YYYY/MM/ to your EE template which displays your posts (say, /news/entry/). I don't know exactly what those rewrite rules would look like off the top of my head, my mod_rewrite-fu is pretty shallow. But it could definitely work.
Another is to export all of your comments from Disqus via their XML export tool, then do a grep-based find and replace using something like BBEdit, replacing all /YYYY/MM/ strings in that file with /news/entry/; delete all of your existing comments on Disqus; then import your newly-modifed XML file.

Related

URL Layout for multi-lingual CMS

So I'm trying to add a new, shiny REST webservice to our CMS. It should follow the REST "roules" pretty closely, so it shall use GET/POST/PUT/DELETE and proper, logical URLs. My design is heavily inspired by the Apigee Best Practices.
The CMS manages a tree of categories, where each category can contain a number of articles. For multi-lingual projects, the categories are duplicated for each language (so any article is uniquely identifier by its ID and the language ID). The structure is the same across all languages, only the positions can vary (so category X contains the categories Y and Z in every language, but Y can be before Z in language 1 and the other way around in language 2). Creating a new article or category always also creates copies in all languages. Deleting works the same way, so an article is always deleted in all languages.
FYI: Languages are identified by their numeric ID or their locale (like en_US).
(Please image a leading /v1 in front of all URIs; I've skipped it because it adds nothing to this question.)`
My current approach is having an URL scheme like this:
GET /articles :( returns all articles in all languages
GET /articles/:id-:langid :) returns a single article in a given language
POST /articles :) creates a new article
PUT /articles/:id-:langid :) update a single article in a given language
DELETE /articles/:id :) delete an article in all languages
But...
How to identify languages?
Currently, the locale each language is assigned does not need to be unique, so two languages can in rare cases have the same locale. Using the ID is guaranteed to be unique.
Using the locale (most likely forced to lowercase) would be nice, cause the URLs are more readable. But this would
maybe lead to projects with overlapping locales.
require clients to first fetch the locales (but I guess a client would otherwise have to fetch the language IDs anyway, so that's kind of okay).
I would like to end up with exactly one solution to this and tend towards using the locale. But is it worth risking overlapping locales?
(You can image langid=X to be interchangeable with locale=xx_xx in the following.)
Should the language be a hierarchy element?
In most cases, clients of the API will want to fetch the articles for a given language instead of all. I could solve this by allowing a GET parameter and offer GET /articles?langid=42. But since is such a common usecase, I would like to avoid the optional parameter and make it explicit.
This would lead to GET /articles/:langid, but this introduces the concept that the language is a hierarchy level inside the API. If I'm going to do so, I would like to make it consistent for the other verbs. The new URL layout would look like this:
GET /articles/:langid :) returns all articles in a given language
GET /articles/:langid/:id :) returns a single article in a given language
POST /articles :( creates a new article in all languages
PUT /articles/:langid/:id :) update a single article in a given language
DELETE /articles/:langid/:id :( delete an article in all languages
or
GET /:langid/articles :) returns all articles in a given language
GET /:langid/articles/:id :) returns a single article in a given language
POST /articles :( creates a new article in all languages
PUT /:langid/articles/:id :) update a single article in a given language
DELETE /:langid/articles/:id :( delete an article in all languages
This leads to...
What to do with global actions?
The above layout has the problem that POST and DELETE work on all languages, but the URL suggests that they work only on one language. So I could modify the layout:
GET /articles/:langid :) returns all articles in a given language
GET /articles/:langid/:id :) returns a single article in a given language
POST /articles :) creates a new article in all languages
PUT /articles/:langid/:id :) update a single article in a given language
DELETE /articles/:id :) delete an article in all languages
This makes for a somewhat confusing layout, as the language level is only sometimes present and one needs to know a lot about the system's interna. On the bright side, this matches the system very well and is very similar to the internal workings.
Sacrifices?
So where do I make sacrifies? Should I introduce alias URLs to have nice URL layouts but do maybe unintended stuff in the backgroud?
I had the exact same problem some time ago and I decided to put local in front of everything. While consuming content it makes much more sense to read URLs like:
/en/articles/1/
/en/authors/2/
/gr/articles/1/
/gr/authors/2/
than
/articles/en/1/
/authors/en/2/
/articles/gr/1/
/authors/gr/2/
In order to solve the "no specific locale" issue I would use a keyword referring to all available locales, or a default locale. So:
/global/articles/
or
/all-locales/articles/
or
/all/articles/
to be honest I like global and all because they make sense reading them.
DELETE /global/articles/:id :) delete an article in all languages
hope I helped
My initial approach would have been to have language as an optional prefix in the URL, e.g. /en_us/articles, /en_us/articles/:id, but if you don't want to 'pollute' your URLs with language identifiers you can use the Accept-language header instead, in the same way that content negotiation is defined in RFC 2616. Microsoft's WebAPI is using this approach for format negotiation.

Efficient design of crawler4J to get data

I am trying to get the data from various websites.After searcing in stack overflow, am using crawler4j as many suggested this. Below is my understanding/design:
1. Get sitemap.xml from robots.txt.
2. If sitemap.xml is not available in robots.txt, look for sitemap.xml directly.
3. Now, get the list of all URL's from sitemap.xml
4. Now, fetch the content for all above URL's
5. If sitemap.xml is also not available, then scan entire website.
Now, can you please please let me know, is crawler4J able to do steps 1, 2 and 3 ???
Please suggest any more good design is available (Assuming no feeds are available)
If so can you please guide me how to do.
Thanks
Venkat
Crawler4J is not able to perform steps 1,2 and 3, however it performs quite well for steps 4 and 5. My advice would be to use a Java HTTP Client such as the one from Http Components
to get the sitemap. Parse the XML using any Java XML parser and add the urls into a collection. Then populate your crawler4j seeds with the list :
for(String url : sitemapsUrl){
controller.addSeed(url);
}
controller.start(YourCrawler, nbthreads);
I have never used crawler4j, so take my opinion with a grain of salt:
I think that it can be done by the crawler, but it looks like you have to modify some code. Specifically, you can take a look at the RobotstxtParser.java and HostDirectives.java. You would have to modify the parser to extract the sitemap and create a new field in the directives to return the sitemap.xml. Step 3 can be done in the fetcher if no directives were returned from sitemap.txt.
However, I'm not sure exactly what you gain by checking the sitemap.txt: it seems to be a useless thing to do unless you're looking for something specific.

how to remove sitecore folder name in the url?

I created a sitecore year/month/day folder structure in the content tree, when i view each article under the folder node, the url could be http://local/landing/year/month/day/article1.aspx, how could I make the url like this: http://local/landing/article1.aspx?
just remove the year/month/day structure in the url.
Is there some function in sitecore like remove or hide special templates in the frontend url ?
Any help , Thanks .
You can do it in 2 ways:
Use IIS 7 Url rewrite module to change the url. This way the url will be rewritten before it gets to sitecore and you don't need to change any code. You can find more info at the iis website
You can create a custom Item resolver and add it to the RequestBegin sitecore pipeline. Alex Shyba wrote about it here.
It sounds like you may have thousands of these items, but even so, you may want to use the built in functionality of Sitecore and consider creating aliases for each of these items. Programmatically creating an the alias on an ItemSaved event or ItemCreated is probably easiest.
As #marto and #seth have said, you can use URL rewriting or aliases to solve this.
There is, however, a drawback to doing this, irrespective of how you choose to do it.
If you have very many items (your structure makes it sound like you may do) then either method will require that the URL is unique. Removing the date structure from the URL means that all items in your landing section will require unique URLs (whether inherited from their item names or by some other means). This can impact on SEO for your site, as authors may have difficulty finding an unused name that is also human readable and good for SEO. It's unlikely you want to use ugly GUIDs in your URLs.
2 options
Change Bucket configuration and the set the required folder structure, bucket configuration can be found in Sitecore.Buckets.config file
Extend GetFromRouteValue Item Resolver and overwrite the ResolveItem() method to get the bucket item.
The default GetFromRouteValue class reference can be found in Sitecore.MVC.config file and replace this with your own customized implementation.
We have implemented with customized routing and getting the exact item if the route path matches.
Thanks,
Jisha

ColdFusion - What's the best URL naming convention to use?

I am using ColdFusion 9.
I am creating a brand new site that uses three templates. The first template is the home page, where users are prompted to select a brand or a specific model. The second template is where the user can view all of the models of the selected brand. The third template shows all of the specific information on a specific model.
A long time ago... I would make the URLs like this:
.com/Index.cfm // home page
.com/Brands.cfm?BrandID=123 // specific brand page
.com/Models.cfm?ModelID=123 // specific model page
Now, for SEO purposes and for easy reading, I might want my URLs to look like this:
.com/? // home page
.com/?Brand=Worthington
.com/?Model=Worthington&Model=TX193A
Or, I might want my URLs to look like this:
.com/? // home
.com/?Worthington // specific brand
.com/?Worthington/TX193A // specific model
My question is, are there really any SEO benefits or easy reading or security benefits to either naming convention?
Is there a best URL naming convention to use?
Is there a real benefit to having a URL like this?
http://stackoverflow.com/questions/7113295/sql-should-i-use-a-junction-table-or-not
Use URLs that make sense for your users. If you use sensible URLs which humans understand, it'll work with search engines too.
i.e. Don't do SEO, do HO. Human Optimisation. Optimise your pages for the users of your page and in doing so you'll make Google (and others) happy.
Do NOT stuff keywords into URLs unless it helps the people your site is for.
To decide what your URL should look like, you need to understand what the parts of a URL are for.
So, given this URL: http://domain.com/whatever/you/like/here?q=search_terms#page-frament.
It breaks down like this:
http
what protocol is used to deliver the page
:
divides protocol from rest of url
//domain.com
indicates what server to load
/whatever/you/like/here
Between the domain and the ? should indicate which page to load.
?
divides query string from rest of url
q=search_terms
Between the ? and the # can be used for a dynamic search query or setting.
#
divides page fragment from rest of the url
page-frament
Between the # and the end of line indicates which part of the page to focus on.
If your system setup lets you, a system like this is probably the most human friendly:
domain.com
domain.com/Worthington
domain.com/Worthington/TX193A
However, sometimes a unique ID is needed to ensure there is no ambiguity (with SO, there might be multiple questions with the same title, thus why ID is included, whilst the question is included because it's easier for humans that way).
Since all models must belong to a brand, you don't need both ID numbers though, so you can use something like this:
domain.com
domain.com/123/Worthington
domain.com/456/Worthington/TX193A
(where 123 is the brand number, and 456 is the model number)
You only need extra things (like /questions/ or /index.cfm or /brand.cfm or whatever) if you are unable to disambiguate different pages without them.
Remember: this part of the URL identifies the page - it needs to be possible to identify a single page with a single URL - to put it another way, every page should have a unique URL, and every unique URL should be a different page. (Excluding the query string and page fragment parts.)
Again, using the SO example - there are more than just questions here, there are users and tags and so on too. so they couldn't just do stackoverflow.com/7275745/question-title because it's not clearly distinct from stackoverflow.com/651924/evik-james - which they solve by inserting /questions and /users into each of those to make it obvious what each one is.
Ultimately, the best URL system to use depends on what pages your site has and who the people using your site are - you need to consider these and come up with a suitable solution. Simpler URLs are better, but too much simplicity may cause confusion.
Hopefully this all makes sense?
Here is an answer based on what I know about SEO and what we have implemented:
The first thing that get searched and considered is your domain name, and thus picking something related to your domain name is very important
URL with query string has lower priority than the one that doesn't. The reason is that query string is associated with dynamic content that could change over time. The search engine might also deprioritize those with query string fearing that it might be used for SPAM and diluting the result of SEO itself
As for using the URL such as
http://stackoverflow.com/questions/7113295/sql-should-i-use-a-junction-table-or-not
As the search engine looks at both the domain and the path, having the question in the path will help the Search Engine and elevate the question as a more relevant page when someone typing part of the question in the search engine.
I am not an SEO expert, but the company I work for has a dedicated dept to managing the SEO of our site. They much prefer the params to be in the URI, rather than in the query string, and I'm sure they prefer this for a reason (not simply to make the web team's job slightly trickier... all though there could be an element of that ;-)
That said, the bulk of what they concern themselves with is the content within and composition of the page. The domain name and URL are insignificant compared to having good, relevant content in a well defined structure.

dynamic seo title for news articles

I have a news section where the pages resolve to urls like
newsArticle.php?id=210
What I would like to do is use the title from the database to create seo friendly titles like
newsArticle/joe-goes-to-town
Any ideas how I can achieve this?
Thanks,
R.
I suggest you actually include the ID in the URL, before the title part, and ignore the title itself when routing. So your URL might become
/news/210/joe-goes-to-town
That's exactly what Stack Overflow does, and it works well. It means that the title can change without links breaking.
Obviously the exact details will depend on what platform you're using - you haven't specified - but the basic steps will be:
When generating a link, take the article title and convert it into something URL-friendly; you probably want to remove all punctuation, and you should consider accented characters etc. Bear in mind that the title won't need to be unique, because you've got the ID as well
When handling a request to anything starting with /news, take the next part of the path, parse it as an integer and load the appropriate article.
Assuming you are using PHP and can alter your source code (this is quite mandatory to get the article's title), I'd do the following:
First, you'll need to have a function (or maybe a method in an object-oriented architecture) to generate the URLs for you in your code. You'd supply the function with the article object or the article ID and it returns the friendly URL with the ID and the friendly title.
Basically function url(Article $article) => URL.
You will also need some URL rewriting rules to remove the PHP script from the URL. For Apache, refer to the mod_rewrite documentation for details (RewriteEngine, RewriteRule, RewriteCond).

Resources