We have a requirement to have comments on an application auto-moderated for obscenities etc. There will still be a real world moderating looking at comments, but they want to filter out the really bad stuff automatically.
What would be the best way to do this?
You'll want to use something like Akismet. Last I checked, it's the thing WordPress uses to prevent comment spam. It's pretty straightforward, too.
Here are a couple links:
Ryan Bates over at Railscasts
(always excellent, but a little old
so may not current with Rails 3
though)
Another tutorial on
using Akismet in Rails
Here's a gem/plugin: Rakismet
As for profanity instead of general spam, you might want something like WebPurify (first one I found). You can hack your own together by blacklisting profane words and using Ruby's elegant string handling to replace/filter them, but that would probably be a never ending battle against 1337 speak and stuff like that. Plus it's probably better practice to outsource something like that, if possible.
Auto-moderation for obcenities is always going to give problems, as shown by Jeff Atwoods Post about it: Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?.
Now, on the other hand, actively filtering Spam using an automatic (third party) text analysis tools is certainly an viable option ( Mollom for example, which also have a ruby plugin )
You can try fu-fu, not perfect but will filter out a lot.
http://github.com/adambair/fu-fu
Related
I'm building an app to write wine tasting notes, and I have to translate this tasting framework (only the first page) into a model.
It's a lot of a data and I'm not sure about how to proceed. I tried to sketch a possible solution in this spreadsheet.
What would you suggest to do? Should I create only one model (Wine) with a column for each wine characteristic?
Thanks!
P.S. I'm learning web development, sorry if my question sounds trivial.
Perhaps not a direct answer to your question but here's a few notes nevertheless:
I'd say for something like wine tasting, I would go for a combination of selects, tags and simple free-text.
This RailsCast should give you a good introduction to the acts-as-taggable-on gem for tagging.
One thing that may be handy when modelling your DB is to look at several wine tasting 'forms' that are already filled out and see if you can see a pattern. Say for Palate/Acidity I would expect the value to be one of light/medium/high, whereas for the Conclusion/Identity it could be pretty much anything.
You'll also need to find the right balance between restricting and allowing the user input. I'd expect your users to be happier with free-text and feeling more restricted with select/radio boxes. On the other hand, it is always easier (for you) to change from a select box to free text, rather than the other way around. Not to mention searching is much easier by selects or tags.
I don't think your question is that trivial and I think you should to simply try it and see. And design with the ability to change in mind.
I'm not talking about HTML tags, but tags used to describe blog posts, or youtube videos or questions on this site.
If I was crawling just a single website, I'd just use an xpath to extract the tag out, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like the tag/text ratios and other heuristics. There is sufficent difference between the text content of the pages and the surrounding ads/menus etc. Other examples include tools that scrape emails or addresses. Here there are patterns that can be detected, locations that can be recognized. In the case of tags though you don't have much to help you uniqely distinguish a tag from normal text, its just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
Some blogs like tumblr do have tags whose urls have the word "tagged" in them that you could use. Wordpress similarly has ".../tag/..." type urls for tags. Solutions like this would work for a large number of blogs independent of their individual page layout but they won't work everywhere.
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
Another option is to parse each web page and look for for tags formatted according to the rel=tag microformat.
Damn, was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for Wordpress, then see their link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier though you could look at AlchemyApi. They have simlar entity extraction capabilities as OpenCalais but they also have a "Structured Content Scraping" product which makes it a lot easier than writing xpaths by using simple visual constraints to identify pieces of a web page.
This is impossible because there isn't a well know, followed specification. Even different versions of the same engine could create different outputs - hey, using Wordpress a user can create his own markup.
If you're really interested in doing something like this, you should know it's going to be a real time consuming and ongoing project: you're going to create a lib that detects which "engine" is being used in a page, and parse it. If you can't detect a page for some reason, you create new rules to parse and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this since it's a complete framework for scraping: it's complete, well documented and really extensible.
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90 it seems they are also asking publishers to use semantically meaningful mark-up [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily, but presumably they must either have developed a generic rules such as #dunelmtech suggested tag/text ratios, which can work with article detection, or they might be using with a combination of some text-segmentation algorithms (from Natural Language Processing field) such as TextTiler and C99 which could be quite usefull for article detection - see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and google for more info on both [published in academic literature - google scholar].
It seems that, however, to detect "tags" as you required is a difficult problem (for already mentioned reasons in comments above). One approach I would try out would be to use one of the text-segmentation (C99 or TextTiler) algorithms to detect article start/end and then look for DIV's / SPAN's / ULs with CLASS & ID attributes containing ..tag.. in them, since in terms of page-layout's tags tend to be generally underneath the article and just above the comment feed this might work surprisingly well.
Anyway, would be interesting to see whether you got somewhere with the tag detection.
Martin
EDIT: I just found something that might really be helpfull. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html] and stands for Vision Based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisement, and decoration can be easily removed because they are often placed in certain positions of a page. This could help you detect the tag block quite accurately!
there is a term extractor module in Drupal. (http://drupal.org/project/extractor) but it's only for Drupal 6.
I have a decent understanding of RESTful urls and all the theory behind not nesting urls, but I'm still not quite sure how this looks in an enterprise application, like something like Amazon, StackOverflow, or Google...
Google has urls like this:
http://code.google.com/apis/ajax/
http://code.google.com/apis/maps/documentation/staticmaps/
https://www.google.com/calendar/render?tab=mc
Amazon like this:
http://www.amazon.com/books-used-books-textbooks/b/ref=sa_menu_bo0?ie=UTF8&node=283155&pf_rd_p=328655101&pf_rd_s=left-nav-1&pf_rd_t=101&pf_rd_i=507846&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=1PK4ZKN4YWJJ9B86ANC9
http://www.amazon.com/Ruby-Programming-Language-David-Flanagan/dp/0596516177/ref=sr_1_1?ie=UTF8&s=books&qid=1258755625&sr=1-1
And StackOverflow like this:
https://stackoverflow.com/users/169992/viatropos
https://stackoverflow.com/questions/tagged/html
https://stackoverflow.com/questions/tagged?tagnames=html&sort=newest&pagesize=15
So my question is, what is best practice in terms of creating urls for systems like these? When do you start storing parameters in the url, when don't you? These big companies don't seem to be following the rules so hotly debated in the ruby community (that you should almost never nest URLs for example), so I'm wondering how you go about implementing your own urls in larger scale projects because it seems like the idea of not nesting urls breaks down at anything larger than a blog.
Any tips?
Don't get too bogged down by the "rules" in the Ruby community. The idea is that you shouldn't go overboard when nesting URLs, but when they're appropriate they're built into the Rails framework for a reason: use them.
If a resource always falls within another resource, nest it. Nothing wrong with that. Going deeper than one can sometimes be a bit of a pain because your route paths will be very long and can get a bit confusing.
Also, don't confuse nesting with namespacing. Just because you see example.com/admin/products/1234/edit does not mean that there's any nesting happening. Routing can make things look nested when they're actually not at the code level.
I'm personally a big fan of nesting and use it often (just one level -- occasionally two) in my applications. Also, adding permalink style URLs that use words rather than just IDs are more visually appealing and they can help with SEO, whether or not they're nested.
I believe the argument for or against REST and/or nesting in your routes has much to do with how you want to expose your API. If you do not care to ever expose an API for your app publicly, there is an argument to be made that strict adherence to RESTful design is a waste of time, particularly if you disagree with the reasoning behind REST. Your choice.
I find that thinking about how a client (other than a browser) might access information from you app helps in the design process. One of the greatest advantages of thinking about your app's design from an API perspective is that you tend to eliminate unnecessary complexity. To me this is the root of the cautions you hear in the Rails community surrounding nested routes. I would take it as an indication that things are getting a bit complicated and it might be time to step back and rethink the approach. Systems "larger than a blog" do not have to be inherently complex. Some parts might be but you also may be surprised when you approach the design from a different perspective.
In short, consider some of the dogma you might hear from certain parts of the community as guides and signals that you may want to think more deeply about your design. Strict REST is simply another way to think about how you are structuring your application.
I'm attempting to learn Ruby on Rails. I'm pretty confident with the basics and writing my own models, controllers and views, although I only know the basics.
Lately I've found that, when I start a new application, most of my models nicely fit into the REST philosophy, and I end up just writing most of the same scaffold-generated code by hand anyways. In a case like this, do you think it would be acceptable to start from using script/generate scaffold for each of my required models, and then modifying code as necessary? The prevailing opinion that I've seen seems to be that the scaffolding is a "newbie trick" and real developers don't use it, but for most applications it seems to create a fair chunk of usable code (as opposed to bad code).
What are your thoughts?
I'm not sure if the culture is really against scaffolding or not, but I, for one, love it.
Now, what I do know is that there was sort of a small backlash against scaffolding for a while. This was because basically every Rails tutorial was basically 'whoa, just type
ruby script/generate scaffolding Post title:string body:text
and you have a blog! Done!'
This not actually being the case, the community started to pull back on using scaffolding as that kind of example, because when you do this, you're not done.
The real power of scaffolding, and the reason that I love it, is for its rapid prototyping abilities. You can generate half of your web site, start coding the back end, and still have a usable interface to actually play around with how everything works without worrying about writing interface code.
I think the idea is that you use it to generate 'generic' code, and then rewrite/refactor it to your specific requirements.
I think there is no problem using the code out of the box generator if it does what you want - as long as you remember what it all does from a security point of view (e.g. don't leave in edit/update/destroy actions if you don't want your interface to allow this)
One thing I like, to avoid lots of duplicate generated code, is the make_resourceful plug in - it abstracts the duplicate default code and lets you override it as appropriate.
Rails' fanatical devotion to choosing smart defaults is exactly why you observe that when you hand write code it ends up looking like the code generated by scaffolding. Personally I really enjoy using scaffolds because there's only a couple of tweaks needed at the end (layouts, CSS, validations, etc etc) for those really basic CRUD type operations.
Even when you get to things that are more complex, it's easy to start from a scaffold and move towards what you want. Definitely not just a newbie's tool.
It is a myth that scaffold is meant only for newbies. It is a great tool to kick start you application really quickly. Of course you will need to modify the generated code to suit your requirements. Having said that, it never hurts to have a ready -made code, most of which will be used as is.
When you start working on an existing Rails project what are the steps that you take to understand the code? Where do you start? What do you use to get a high level view before drilling down into the controllers, models, helpers and views? Do you have any specific techniques, tricks or tools that help speed up the process?
Please don't respond with "Learn Rails & Ruby" (like one of the responses the last guy who asked this got - he also didn't get much response to his question so I thought I would ask again and prompt a bit more). I'm pretty comfortable with my own code. It's sorting other people's that does my head in and takes me a long time to grok.
Look at the models. If the app was written well, this should give you a picture of its domain model, which is where the interesting logic should live. I also look at the tests for the models.
The way that the controllers/views were implemented should be apparent just by using the Rails app and observing the URLs.
Unfortunately, there are many occasions where too much logic lives in controllers and even views. That means you'll have to take a look into those directories too. Doubley-unfortunate, tests for these layers tend to be much less clear.
First I use the app, noting the interesting controller and action names.
Then I start reading the code for these controllers, and for the relevant models when necessary. Views are usually less important.
Unlike a lot of the people so far, I actually don't think tests are the place to start. I think they're too narrow, too focused. It'd be like trying to understand basic physics/mechanics by first zooming into intra-molecular forces and quantum mechanics. I also think you're relying too much on well-written tests, and in my experience, a lot of people don't write sufficient tests or write poor tests (which don't give an accurate sense of what the code should actually do).
1) I think the first thing to do is to understand what the hell the app actually does. Use it, at least long enough to build an idea of what its main purpose is and what the different types of data might be and which actions you can perform, and most importantly, why.
2) You need to step back and see the big picture. I think the best way to do that is by starting with schema.rb. This tells you a few really important things:
What is the vocabulary/concepts of this project. What does "User" actually mean in this app? Why does the app have both "User" and "Account" models and how are they different/related?
You could learn what models there are by looking in app/models but this will actually tell you what data each model holds.
Thanks to *_id fields, you'll learn the associations between the models, which helps you understand how it all fits together.
I'd follow this up by looking at each model's *.rb file for (hopefully) comments, validations, associations, and any additional logic relevant to each. Keep an eye out for regular ol' Ruby classes that might live in lib/.
3) I, personally, would then take a brief glance at routes.rb as it will tell you two key things: a brief survey of all of the actions in the app, and, if the routes and controllers/actions are well named and thought out, a quick sense of where different functionality might live.
At this point you're probably ready to dig into a specific thing you need to learn. Find the controller for the feature you're most interested in and crack it open. Start reading through the relevant actions, see which models are involved, and now maybe start cracking open tests if you want to.
Don't forget to use the rest of your tools: Ruby/Rails debuggers, browser dev tools, logs, etc.
I would say take a look at the tests (or specs if the project uses RSpec) to get an idea at the high-level of what the application is supposed to do. Once you understand from the top level how the models/views/controllers are expected to behave, you can drill into the implementations.
If the Rails project is in a somewhat stable state than I have always been a big fan of using the debugger to help navigate the code base. I'll fire up the browser and begin interacting with the app then target some piece of functionality and set a breakpoint at the beginning of the associated function. With that in place I just study the parameters going into the function and the value being returned to get a better understanding of what's going on. Once you get comfortable you can modify the functionality a little bit to ensure you understand what's going on. Just performing some static analysis on the code can be cumbersome! Good luck!
I can think of two reasons to be looking at an existing app with which I have no previous involvement: I need to make a change or I want to understand one or more aspects because I'm considering using them as input to changes I'm considering making to another app. I include reading-for-education/enlightenment in that second case.
A real benefit of the MVC pattern in particular, and many web apps in general is that they are fairly easily split into request/response pairs, which can to some extent be comprehended in isolation. So you can start with a single interaction and grow your understanding out from that.
When needing to modify or extend existing code, I should have a good idea of what the first change will be - if not then I probably shouldn't be fooling with the code yet! In a Rails app, the change is most likely to involve view, model or a combination of both and I should be able to identify the relevant items fairly quickly. If there are tests, I check that they run, then attempt to write a test that exposes the missing functionality and away we go. If there are no tests then it's a bit trickier - I'm going to worry that I might inadvertently break something: I'd consider adding tests to give myself a more confidence, which will in turn start to build some understanding of the area under study. I should fairly quickly be able to get into a red-green-refactor loop, picking up speed as I learn my way around.
Run the tests. :-)
If you're lucky it'll have been built on RSpec, and that'll describe the behavior regardless of the implementation.
I run rake test in a terminal
If the environment does not load, I take a look at the stack trace to figure out what's going on, and then I fix it so that the environment loads and run the tests again
I boot the server and open the app in a browser. Clicking around.
Start working with the tasks at hand.
If the code rocks, I'm happy. If the code sucks, I hurt it for fun and profit.
Aside from the already posted tips of running specs, and decomposing the MVC, I also like:
rake routes
as another way to get a high-level view of all the routes into the app
./script/console
The rails irb console is still my favorite way to inspect models and model methods. Grab a few records and work with them in irb. I know it helps me during development and test.
Look at the documentation, there is pretty good documentation on some projects.
It's a little bit hard to understand other's code, but try it...Read the code ;-)