As you know, if we use a form helper such as select_tag, all of the generated HTML ends up flush against the left margin.
The generated HTML source then looks ugly.
How do you handle this problem?
And does it matter for SEO?
As long as it's valid HTML, search engines won't care about your whitespace/layout. In fact, they can handle even invalid markup, to be honest. You don't get bonus points for having a clean HTML layout though, I'm afraid.
Nice layouts make development/debugging easier and you should at least try to make it look pretty. But when you're using code generators like this, that's not always possible.
Focus on getting it working well and worry about prettying it up later if you have time ;)
I'm sure there is a way to modify the output if Rails is still as awesome as it used to be last time I had a play with it.
I use the erector gem to create my views. Erector is basically an HTML DSL in plain Ruby which, among other things, lets you pretty-print your HTML. This is usually only done during development, as the additional whitespace would just increase the page size, burn bandwidth and slightly slow down the browser.
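For illustration, a minimal erector widget looks roughly like the sketch below (the pretty-printing call is from memory and may differ between erector versions):

    require 'erector'

    class Greeting < Erector::Widget
      def content
        div :class => 'greeting' do
          h1 'Hello'
          p  'The markup is generated by plain Ruby methods.'
        end
      end
    end

    puts Greeting.new.to_html    # compact output, fine for production
    puts Greeting.new.to_pretty  # indented output; method name assumed, check your erector version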
For this reason I cannot imagine that search engines care at all. They might have to do a little more work, so they might rank it a little lower, but I doubt even that. They might care about valid HTML, which is another advantage of erector (and of other templating languages such as Haml and the like).
When I ask questions about achieving some particular layout in LaTeX, I get answers that suggest I should use constructs that don't make sense semantically. For example, I wanted to indent a single paragraph, and I was told to make it a list with no bullets. It works, but that isn't the semantic meaning of a list, so why is it acceptable to abuse it like that?
We stopped doing it in HTML over a decade ago. Why are we still doing the equivalent of table layout in supposedly the best typesetting system there is?
Maybe I'm not getting it, but isn't this a little inelegant? Everyone says LaTeX is elegant and that you don't need to worry about layout, but then I find myself contorting tables, lists and other semantic markup to put stuff where I want it. Does the emperor have no clothes, or am I just not getting it?
When a problem like this comes along, and the answer is to use something that doesn't really make semantic sense, what you should do is create a new environment or command that wraps the functionality in a way that makes semantic sense.
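For example, the "list with no bullets" trick for indenting a paragraph can be hidden behind a semantically named environment (the name indentpara below is purely illustrative):

    % A thin semantic wrapper around the usual no-bullet-list trick.
    \newenvironment{indentpara}
      {\begin{list}{}{\setlength{\leftmargin}{2em}}\item[]}
      {\end{list}}

    % In the document body, the markup now says what it means:
    \begin{indentpara}
      This paragraph is indented, and the source records why.
    \end{indentpara}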
Every layout language has this problem -- somewhere along the line, you need to get down to a physical, non-semantic solution. In HTML, the non-semantic parts of the solution are now pretty well covered by CSS and JavaScript (which are different languages from HTML). You create <div>s and <span>s that capture the semantics, and then you use CSS and JavaScript to define the physical layout for those semantics.
In LaTeX, you simply wind up using the exact same language for this purpose: LaTeX (or plain TeX, which is often hard to differentiate from LaTeX).
I'd say it's all a matter of knowing or finding the right semantics. You give a single example, but you don't describe its semantics; you only describe how you want it laid out. So depending on what it is that you want indented, there might be better fits, e.g. a quote, a formula, etc.
Is there any tangible value in unit testing your own HTML helpers? Many of these things just spit out a bunch of HTML markup; there's little if any logic. So do you just compare one big HTML string to another? I mean, some of these things require you to look at the generated markup in a browser to verify it's the output you want.
Seems a little pointless.
Yes.
While there may be little to no logic now, that doesn't mean that there isn't going to be more logic added down the road. When that logic is added, you want to be sure that it doesn't break the existing functionality.
That's one of the reasons that Unit Tests are written.
If you're following Test-Driven Development, you write the test first and then write the code to satisfy the test.
That's another reason.
You also want to make sure you identify and test any possible edge cases with your Helper (like un-escaped HTML literals, un-encoded special characters, etc).
I guess it depends on how many people will be using/modifying it. I typically create a unit test for an html helper if I know a lot of people could get their hands on it, or if the logic is complex. If I'm going to be the only one using it though, I'm not going to waste my time (or my employer's money).
I can understand you not wanting to write the tests, though ... it can be rather annoying to write a few lines of HTML generation code that requires 5x that amount of test code.
An HTML helper takes a simple input and produces a simple output, which makes it a good candidate for TDD. Think of the time you would otherwise spend on build -> start the site -> fix that silly issue -> start again -> oops, missed this other tiny thing -> start again ... done, happy :). Then dev 2 comes along and makes a small change to "fix" something that wasn't working for them, goes through the same cycle, and doesn't notice at the time that it broke your other scenarios.
Instead, you very quickly write a very simple test asserting that the simple input gives you the simple output you were expecting, with all the closing tags and quotes in place.
Having written HTML Helpers for sitemap menus, for example, or buttons for a wizard framework, I can assure you that some Helpers have plenty of logic that needs testing to be reliable, especially if intended to be used by others.
So it depends what you do with them really. And only you know the answer to that.
The general answer is that HTML helpers can be arbitrarily complex (or simple), depending on what you are doing. So the no-brainer, as with anything else, is to test when you need to.
Yes, there's value. How much value is to be determined. ;-)
You might start with basic "returns SOMEthing" tests, and not really care WHAT. Basically just quick sanity tests, in case something fundamental breaks. Then as problems crop up, add more details.
Also consider having your tests parse the HTML into DOMs, which are much easier to test against than strings, particularly if you are looking for just some specific bit.
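Whatever the framework, the idea looks roughly like this sketch in Ruby with Nokogiri (the helper and its markup are invented for illustration):

    require 'minitest/autorun'
    require 'nokogiri'

    # Hypothetical helper under test.
    def menu_item(text, url)
      %(<li class="menu-item"><a href="#{url}">#{text}</a></li>)
    end

    class MenuItemTest < Minitest::Test
      def test_renders_link_inside_list_item
        doc  = Nokogiri::HTML::DocumentFragment.parse(menu_item('Home', '/'))
        link = doc.at_css('li.menu-item > a')

        refute_nil link                    # the structure is there...
        assert_equal '/',    link['href']  # ...and the parts we care about are right,
        assert_equal 'Home', link.text     # without pinning the entire output string.
      end
    end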
Or... if you have automated tests against the webapp itself, ensure there are tests that look specifically for the output of your helpers.
Yes, it should be tested. Basic rule of thumb: if it is not worth testing, it is not worth writing.
However, you need to be a bit careful here when you write your tests. There is a danger that they can be very "brittle".
If you write your tests so that they expect a specific string, and you have helpers that call other helpers, then a change in one of the core helpers could cause a great many tests to fail.
So it may be better to test that you get back a non-null value, or that specific text is contained somewhere in the return value, rather than testing for an exact string.
I'm not talking about HTML tags, but tags used to describe blog posts, YouTube videos, or questions on this site.
If I was crawling just a single website, I'd just use an xpath to extract the tag out, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics. There is sufficient difference between the text content of the pages and the surrounding ads/menus, etc. Other examples include tools that scrape emails or addresses; there, there are patterns that can be detected and locations that can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text; it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
Some blogs, like Tumblr, do have tags whose URLs contain the word "tagged", which you could use. WordPress similarly has ".../tag/..."-style URLs for tags. Solutions like this would work for a large number of blogs independent of their individual page layout, but they won't work everywhere.
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
Another option is to parse each web page and look for tags formatted according to the rel="tag" microformat.
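A rough sketch of both ideas in Ruby with Nokogiri (the URLs are placeholders, and real pages will need more error handling):

    require 'nokogiri'
    require 'open-uri'

    # 1. Tags from a feed: RSS uses <category>text</category>,
    #    Atom uses <category term="...">.
    feed = Nokogiri::XML(URI.open('https://example.com/feed'))
    feed.remove_namespaces!
    feed_tags = feed.xpath('//item/category').map(&:text) +
                feed.xpath('//entry/category/@term').map(&:value)

    # 2. Tags from the page itself via the rel="tag" microformat.
    page = Nokogiri::HTML(URI.open('https://example.com/some-post'))
    page_tags = page.css('a[rel~="tag"]').map { |a| a.text.strip }

    puts (feed_tags + page_tags).uniq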
Damn, I was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for WordPress, then look at its link structure, and do the same again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity-extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes it a lot easier than writing XPaths, by using simple visual constraints to identify pieces of a web page.
This is impossible because there isn't a well-known, widely followed specification. Even different versions of the same engine could create different output; hey, using WordPress a user can create their own markup.
If you're really interested in doing something like this, you should know it's going to be a really time-consuming and ongoing project: you're going to create a library that detects which "engine" is being used on a page and parses it. If you can't detect a page for some reason, you create new rules to parse it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this, since it's a complete scraping framework: well documented and really extensible.
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful markup [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily. But presumably they have either developed generic rules, such as the tag/text ratios #dunelmtech suggested, which can work for article detection, or they might be using a combination of text-segmentation algorithms (from the Natural Language Processing field) such as TextTiling and C99, which could be quite useful for article detection -- see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and Google for more info on both [published in the academic literature -- Google Scholar].
It seems, however, that detecting "tags" as you require is a difficult problem (for the reasons already mentioned in the comments above). One approach I would try is to use one of the text-segmentation algorithms (C99 or TextTiling) to detect the article start/end, and then look for DIVs / SPANs / ULs with CLASS and ID attributes containing "tag"; since, in terms of page layout, tags tend to sit just underneath the article and just above the comment feed, this might work surprisingly well.
Anyway, it would be interesting to see whether you get somewhere with the tag detection.
Martin
EDIT: I just found something that might really be helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html], which stands for Vision-based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements, and decoration, can easily be removed because it is usually placed in certain positions on a page. This could help you detect the tag block quite accurately!
There is a term extractor module for Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.
It seems ridiculous (and a violation of DRY) to have to type the h method all over the place in your view code to make it safe.
Has anyone come up with a clever workaround for this?
DHH (creator of Rails) agrees with you. Rails 3 will escape output by default.
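In Rails 3+ views, escaping is the default and you opt out explicitly:

    <%= @comment.body %>            <%# escaped automatically %>
    <%= raw @comment.body %>        <%# opt out for trusted markup %>
    <%= @comment.body.html_safe %>  <%# or mark a string as already safe %>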
You could use Erubis as your ERB engine, which does offer auto-escaping. Their benchmarks show it as 3x as fast as ERB.
http://www.kuwata-lab.com/erubis/
The only problem is that it's only for ERB, so if you're on Haml or some other templating language (like us) then you're SOL. I have used Erubis in the past and had no problems with it, before we switched to (the slower) Haml.
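A minimal sketch of Erubis with escaping turned on (class name and tag behaviour are from memory; check the Erubis docs for your version):

    require 'erubis'

    template = <<-ERB
      <p><%= comment %></p>
      <p><%== trusted_html %></p>
    ERB

    # EscapedEruby escapes <%= %> output by default; <%== %> emits it raw.
    eruby        = Erubis::EscapedEruby.new(template)
    comment      = '<script>alert("xss")</script>'
    trusted_html = '<em>fine</em>'
    puts eruby.result(binding)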
You could use xss_terminate, which filters data going into your app (on save) instead of trying to catch it at the last possible second with h().
Theoretically, this should be sufficient and you shouldn't need to do anything else.
If you want to be paranoid (which in the context of security is not a bad thing), you should do both.
The Rails 3 approach is definitely the best on the view side because it explicitly keeps track of the safety of each string, which is ultimately what you need (taint mode) for a robust solution.
However, there's another approach, which is what ActsAsTextiled does: redefine the attribute accessor to sanitize and cache the result, so that you always get sanitized output by default. What I like about this, as opposed to the xss_terminate approach, is that it doesn't touch the user input at all, so you get fewer complaints from users, data won't be accidentally clobbered, and you can go back and change the rules later if you overlooked something.
I liked the approach so much that I wrote a plugin using the Sanitize gem, ActsAsSanitiled. It doesn't give you blanket protection out of the box the way xss_terminate can, but it also avoids unwanted side effects. In my case comparatively few of the text fields are actually edited by users directly, so I prefer to audit them and declare them explicitly.
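A hand-rolled sketch of that "sanitize on read, cache the result" idea using the Sanitize gem (module, method and model names below are made up; older Sanitize versions call it Sanitize.clean rather than Sanitize.fragment):

    require 'sanitize'

    module SanitizesOnRead
      # Redefine the attribute readers to return sanitized, cached HTML.
      def sanitizes(*attrs)
        attrs.each do |attr|
          define_method(attr) do
            @sanitized_cache ||= {}
            @sanitized_cache[attr] ||=
              Sanitize.fragment(read_attribute(attr).to_s, Sanitize::Config::BASIC)
          end
        end
      end
    end

    class Comment < ActiveRecord::Base
      extend SanitizesOnRead
      sanitizes :body   # the raw user input in the database is left untouched
    end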
In keeping with the DRY principle, I try to use partials as soon as I am repeating a particular pattern more than once or twice. As a result, some of my views consist of ten or more different partials. I am worried that this might have a negative effect on overall performance. Some programming books compare the use of partials to the use of methods. So should I use the same rationale to determine when to use them?
What is the best practice regarding size and quantity of partials in a Rails project?
I like your practice already: Once you've repeated view code twice, refactor it out to a partial. Tim's right that you can speed it up as necessary after it's been profiled and after it's been proven necessary.
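For example, a product listing that shows up in several views might be pulled into a partial and rendered once per item (the file and variable names here are just for illustration):

    <%# app/views/products/_product.html.erb %>
    <div class="product">
      <h2><%= product.name %></h2>
      <p><%= number_to_currency(product.price) %></p>
    </div>

    <%# wherever the listing is needed %>
    <%= render partial: "products/product", collection: @products %>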
Here is my one caveat: If you work with professional designers who handle the views, it may be easier over the long term to have rather repetitive view code. Some people have a hard time searching through partials and "seeing" how they all fit together. I've found it easier with those people to let them manage the whole shebang and update more than one file if they need to. Optimal? Not to us as programmers, but designers are more often used to seeing most of the HTML in one or three files rather than 20. :)
Remember the rules of optimization!
If, once your application is complete, your views are too slow, use something like New Relic to find out where the slowdown is occurring. There are a lot of places that might be, but it's unlikely to be in your partials.