Find out various layouts used in website - html-parsing

Is it possible to find out the total number of layouts (templates) used within a website?
For example, suppose I want to know how many types of layout www.flipkart.com uses.
The answer would be something like:
Landing page or home page
Category Page e.g http://www.flipkart.com/mobiles?_l=GIuT6NCRsZbfL9ID9ZKHNQ--&_r=hCno5y6eFUI8C0iWzaQbAg--&ref=cef19a11-4ebc-4f8e-a0dc-401c2d55de3e&_pop=brdcrumb
This is a category page. All such pages have the same layout; only the inner content differs.
Product pages, e.g. http://www.flipkart.com/htc-sensation-mobile-phone/p/itmczbrsnwphgbnw?pid=MOBCYW9HXBUDYJPH&_l=sXQjsX87GxqrvKzhjuOrkw--&_r=n_2yuAC4xgh0SZTuulvAtw--&ref=9305103f-6fc1-497c-807a-8f30ee30c13c
All product pages have the same layout: a "buy now" option, multiple images, and so on. Is there an existing tool that can find this out?
I hope my question is clear: I just want to classify a site's pages into a few buckets.

As far as I know, no existing tool or algorithm does this, but you can write one yourself. Pick a set of attributes that characterize each type of page and record them as benchmarks. Then, whenever you encounter a URL and want to identify its category, extract the same attributes from that page and compare them against the benchmarks.
It's not generic, but it will work for specific websites :)
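
The benchmark idea above could be sketched roughly as follows. This is only an illustration, not an existing tool: it assumes that "attributes" means the set of tag, id, and class names on a page, and that similarity is measured with a Jaccard score. The names StructureExtractor, page_features, classify, the 0.5 threshold, and the sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collects 'tag', 'tag#id' and 'tag.class' features that describe a page's skeleton."""

    def __init__(self):
        super().__init__()
        self.features = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.features.add(tag)
        if attrs.get("id"):
            self.features.add(f"{tag}#{attrs['id']}")
        for cls in (attrs.get("class") or "").split():
            self.features.add(f"{tag}.{cls}")


def page_features(html):
    """Return the structural feature set of one HTML document."""
    parser = StructureExtractor()
    parser.feed(html)
    return parser.features


def jaccard(a, b):
    """Similarity of two feature sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(html, benchmarks, threshold=0.5):
    """Return the best-matching benchmark label, or None if nothing is similar enough."""
    features = page_features(html)
    best_label, best_score = None, 0.0
    for label, bench in benchmarks.items():
        score = jaccard(features, bench)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Hypothetical hand-picked example pages, one per layout (the "benchmarks").
benchmarks = {
    "product": page_features(
        '<div id="gallery"><img class="thumb"></div><button class="buy-now">Buy</button>'
    ),
    "category": page_features(
        '<ul class="product-list"><li class="item"><a class="item-link">x</a></li></ul>'
    ),
}

new_page = (
    '<div id="gallery"><img class="thumb"><img class="thumb"></div>'
    '<button class="buy-now">Buy now</button>'
)
print(classify(new_page, benchmarks))
```

Because only the structure is compared, two product pages with different text and a different number of images still land in the same bucket. On a real site the threshold and the choice of features would need tuning per website.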

Related

Pros and Cons of using hierarchical URLs versus flat?

I'm building a large news site with thousands of articles; so far we have over 20,000. We plan to have a main menu whose links display articles by category. For example, clicking "baking" will show all articles related to baking, and "baking/cakes" will show everything related to cakes.
Right now, we're weighing whether or not to use hierarchical URLs for each article. If I'm on the "baking/cakes" page, and I click an article that says "Chocolate Raspberry Cake", would it be best to put that article at a specific, hierarchical URL like this:
website.com/baking/cakes/chocolate-raspberry-cake
or a generic, flat one like this:
website.com/articles/chocolate-raspberry-cake
What are the pros and cons of doing each? I can think of cases for each approach, but I'm wondering what you think.
Thanks!
It really depends on the structure of your site. There's no one correct answer for every site.
That being said, here's my recommendation for a news site: instead of embedding the category in the URL, embed the date. For example: website.com/article/2016/11/18/chocolate-raspberry-cake or even website.com/2016/11/18/chocolate-raspberry-cake. This allows you to write about Chocolate Raspberry Cake more than once, as long as you don't do it on the same day. When I'm browsing news I find it helpful to identify the date an article was written as quickly as possible; embedding it in the URL is very helpful.
Hierarchical URLs based on categories lock you into a single category for each article, which may be too limiting. There may be articles which fit multiple categories. If you've set up your site to require each article to have a single primary category, then this may not be an issue for you.
Hierarchical URLs based on categories can also be problematic if any of the categories ever change, for example because of typos, changes to pluralization, a new term coming into vogue and replacing an existing term, or even just a change in wording (e.g. "baking" could become "baked goods"). The terms as they existed when you created the article will be forever immortalized in your URL structure, unless you retroactively change them all (which invalidates old links, so make sure to use Drupal's Redirect module).
If embedding the date in the URL is not an option, then my second choice would be the flat URL structure because it will give you URLs which are shorter and easier to remember. I would recommend using "article" instead of "articles" in the URL because it saves you a character.
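
As a rough sketch of the date-based scheme recommended above, the path could be derived from the title and publication date like this. The helper names slugify and article_path, and the default "article" prefix, are illustrative assumptions, not part of any particular CMS.

```python
import re
from datetime import date


def slugify(title):
    """Lowercase the title and replace runs of non-alphanumerics with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def article_path(title, published, prefix="article"):
    """Build a date-based path such as /article/2016/11/18/chocolate-raspberry-cake."""
    return f"/{prefix}/{published:%Y/%m/%d}/{slugify(title)}"


print(article_path("Chocolate Raspberry Cake", date(2016, 11, 18)))
# prints: /article/2016/11/18/chocolate-raspberry-cake
```

Note that the category never appears in the path, so renaming or reassigning categories later leaves existing URLs untouched.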

Making a website without hyperlinks

I am making a simple CMS to use in my own blog and I have been using the following code to display articles.
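// Runs every time the request changes state; xmlhttp is assumed to be an XMLHttpRequest created earlier.
xmlhttp.onreadystatechange = function () {
    // readyState 4 means the response is complete; status 200 means the request succeeded.
    if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
        document.getElementById("maincontent").innerHTML = xmlhttp.responseText;
    }
};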
It sends a request for the content associated with the article that was clicked on and writes the response into the main viewing area with ".innerHTML".
As a result, I don't have actual links to other articles. I know that I can use PHP to output HTML that forms a link like:
<a href="getcontent.php?q=article+title">Article Title</a>
But being slightly OCD, I wanted my output to be as neat as possible. Search engine visibility is not a concern for my personal blog, but I intend to adapt this to a few other sites where search engine optimization is a priority.
From what I understand, search engine robots basically follow links to index websites.
My question is:
Does this practice have any negative implications for search engine visibility? Also, are there other reasons to prefer one approach over the other? I see that almost every site uses the 'link' method.
The link you've written will cause a page reload. In order to leverage the standard AJAX stuff you've got at the top, you need to write the links as something along the lines of
<a href="#" onclick="ajaxGet('article+title'); return false;">Article Title</a>
This assumes you have a javascript function called ajaxGet that takes an argument of the identifier for the article you're searching for.
If you were to write your entire site that way, search engines might not be able to crawl it, since many crawlers do not execute JavaScript or do so only partially. They may then be unable to reach anything beyond the front page. Also, even if they could follow the links, they would have no way of referencing the page they reached, since it doesn't have a unique URL. This is also annoying for users, since they can't get a link to an exact story to bookmark, send to a friend, etc.

Reg. Search engine optimization for my blog

I'm creating a blog with the ASP.NET MVC framework. All the articles I'm going to publish have the same layout; only the main content differs. So I created a common view that dynamically loads the content into a section from a physical file (which contains only that article's markup). As a result, every URL request sent by users points to the same physical file, which loads a particular section depending on the article. Is this the right approach? Will it create problems for SEO? I'm eager to hear from you. [UPDATED] The URLs of the articles look like http://myblog.com/blog/archives/2011/1/using_asp_mvc. All requests of this kind are received by a single page that loads the article's content from another physical file.
I think you need to rewrite the URL, as in
http://www.cricandcric.com/Cricket-News/4182/Cricket-West-Indies-:-I-am-sure-that-West-Indies-will-bounce-back-,-said-Hooper.html
If you look at this, you can see that I have used URL rewriting, which makes the URL friendlier to users and search engines.
You can find good references at the following URLs:
http://www.webconfs.com/url-rewriting-tool.php
http://corz.org/serv/tricks/htaccess2.php
To illustrate the power of URL rewriting: if you search Google for "I am sure that West Indies will bounce back , said Hooper", you can see our cricandcric.com on the first page.
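
To make the idea of URL rewriting concrete, here is a minimal sketch that maps the friendly URL shape quoted in the question onto an internal handler. It is written in Python purely for illustration; the internal path /article.aspx and its parameter names are hypothetical, and in ASP.NET MVC the framework's own routing would normally do this mapping for you.

```python
import re

# Pattern for the article URL shape quoted in the question:
# /blog/archives/<year>/<month>/<slug>
ARTICLE_URL = re.compile(
    r"/blog/archives/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<slug>[\w-]+)"
)


def rewrite(path):
    """Map a friendly path to a hypothetical internal path, or return None if it does not match."""
    match = ARTICLE_URL.fullmatch(path)
    if match is None:
        return None
    return "/article.aspx?year={year}&month={month}&slug={slug}".format(**match.groupdict())


print(rewrite("/blog/archives/2011/1/using_asp_mvc"))
# prints: /article.aspx?year=2011&month=1&slug=using_asp_mvc
```

Visitors and crawlers only ever see the friendly form; the fact that a single page serves every article stays an implementation detail.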

Rails Sanitize: Safety + Allowing Embeds

We're building a user-generated content site where we want to allow users to embed things like videos, slideshares, etc. Can anyone recommend a generally accepted list of tags and attributes to allow in Rails sanitize that will give us pretty good security while still allowing a good amount of embeddable content and HTML formatting?
As long as is turned off, you should be able to allow objects. You might even be able to define the acceptable parameters of object tags, so that only whitelisted objects are allowed and arbitrary objects cannot be included.
However, it may be better to provide some UI support for embedding. For example, I prompt the user for a YouTube URL and then derive the embed code for the video from that.
Several benefits:
- the default YouTube code is not standards compliant so I can construct my own Object code
- I have complete control over the way embedded elements are included in the output page
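
A minimal sketch of the "ask for a URL, derive the embed code" approach might look like this. It is written in Python for illustration rather than Rails, it emits an iframe instead of the Object markup mentioned above, and it assumes that video IDs are 11 characters drawn from letters, digits, "_" and "-". The function names are made up for the example.

```python
import re
from html import escape
from urllib.parse import urlparse, parse_qs

# Assumption: YouTube video IDs are 11 characters from this alphabet.
VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def youtube_video_id(url):
    """Extract the video ID from a YouTube watch or short URL, or return None."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        candidate = parse_qs(parts.query).get("v", [""])[0]
    elif host == "youtu.be":
        candidate = parts.path.lstrip("/")
    else:
        return None
    return candidate if VIDEO_ID.fullmatch(candidate) else None


def youtube_embed(url, width=560, height=315):
    """Return embed markup built only from the validated ID, never from user-supplied HTML."""
    video_id = youtube_video_id(url)
    if video_id is None:
        return None
    src = f"https://www.youtube.com/embed/{video_id}"
    return (
        f'<iframe width="{width}" height="{height}" '
        f'src="{escape(src)}" allowfullscreen></iframe>'
    )


print(youtube_embed("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```

Since the user never supplies markup, there is nothing to sanitize for this kind of embed: anything that is not a recognizable video URL is simply rejected.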
Honestly, letting users work with WYSIWYG HTML editors might sound good, but in practice it doesn't work well for either users or developers. The reasons are:
- Behavior still differs too much between browsers.
- A whitelist can secure the site, but users will end up calling to ask you to allow another parameter for the OBJECT tag or similar. Blacklists are simply not secure.
- Not many users know what an HTML tag is.
- It is hard for users to format text properly (how do you tell them to use a HEADER instead of BOLD+FONT-SIZE?).
- In general it is pretty painful, and you cannot really change the site design later because users did not use HTML properly.
If I were building a CMS-like system now, I would probably go with semantic markup.
Most users get used to it quickly, and it is just plain text (as here at SO).
It also lets YOU generate proper HTML and support exactly the tags you need.
For example, if you need to embed a picture, you might write something like:
My Face:image-http://here.there/bla.gif
Which would generate HTML for you like this:
<a class='image-link' title='My Face' href='http://here.there/bla.gif'>
<img alt='My Face' src='http://here.there/bla.gif' />
</a>
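
The translation from that line of markup to HTML could be sketched as follows. This is a toy parser for the one illustrative syntax shown above, not a real markup language; IMAGE_LINE and render_line are hypothetical names, and it is written in Python only for illustration.

```python
import re
from html import escape

# Matches the illustrative syntax "Title:image-URL" on a line of its own.
IMAGE_LINE = re.compile(r"(?P<title>[^:]+):image-(?P<url>https?://\S+)")


def render_line(line):
    """Turn one line of the toy markup into HTML; anything else becomes an escaped paragraph."""
    match = IMAGE_LINE.fullmatch(line.strip())
    if match is None:
        return f"<p>{escape(line)}</p>"
    # escape(..., quote=True) also escapes single quotes, which delimit the attributes below.
    title = escape(match["title"].strip(), quote=True)
    url = escape(match["url"], quote=True)
    return (
        f"<a class='image-link' title='{title}' href='{url}'>"
        f"<img alt='{title}' src='{url}' /></a>"
    )


print(render_line("My Face:image-http://here.there/bla.gif"))
```

Because the HTML is generated by the site rather than typed by the user, every value is escaped in one place and the output markup can change later without touching stored content.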
There are plenty of markup languages around, so just pick the one that suits you best and add your own modifications.
For example, GitHub uses a modified Markdown, and the code to parse it is just a couple of lines.
One disadvantage is that users need to learn the language, and it is NOT WYSIWYG.
Regards,
Dmitriy.
There's a great project for this. It even has embed analysis, so you can allow only YouTube embeds, for example:
https://github.com/rgrove/sanitize

the best way to "link" two web pages

I am planning to make two web pages (on different domains) that deal with a similar subject. Articles would be published on the first page, and I would like to show those articles on the other page as well (displaying, for example, only the last 10 articles). What is the best way to do this?
EDIT: I use PHP/MySQL
You should store your articles in a database which is available from both pages (are they on the same webserver?)
Then on one page you could do this:
SELECT title, summary FROM articles ORDER BY date DESC
and on the other:
-- FULLTEXT is a reserved word in MySQL, so the column name must be quoted with backticks
SELECT title, `fulltext` FROM articles ORDER BY date DESC LIMIT 10
You can serve both web pages from the same webserver even if the domain names are different.
It sounds like you're not "linking" the two pages together; you're presenting two different views of the same data. The first page shows the full articles, and the second page shows perhaps only the titles of the last 10 articles.
If the two sites don't have access to the same database, you have to provide some kind of API on the first site that exports the last 10 articles as XML, JSON, or whatever, and consume that on the second site.
If you can't use the same database from the two sites, you could also create an RSS feed (or similar) of the last 10 articles and use that to display them on the other site.
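
The export idea could be sketched like this. It is only an illustration: it uses Python with an in-memory SQLite database standing in for the first site's MySQL database, and the table layout (title, summary, published) and the function name latest_articles_json are assumptions made for the example.

```python
import json
import sqlite3


def latest_articles_json(conn, limit=10):
    """Serialize the newest articles as JSON for the second site to fetch."""
    rows = conn.execute(
        "SELECT title, summary, published FROM articles "
        "ORDER BY published DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return json.dumps(
        [
            {"title": title, "summary": summary, "published": published}
            for title, summary, published in rows
        ]
    )


# Demo with an in-memory database holding 12 made-up articles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, summary TEXT, published TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(f"Article {i}", f"Summary {i}", f"2024-01-{i:02d}") for i in range(1, 13)],
)

feed = json.loads(latest_articles_json(conn))
print(len(feed), feed[0]["title"])
# prints: 10 Article 12
```

The first site would serve this JSON at some URL, and the second site would fetch and render it, so the second site never needs database credentials for the first.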

Resources