Good Language for Spider and Indexer - parsing

I love Ruby and its framework, but I don't think that Ruby on Rails is the best choice for developing a feed parser and indexer.
Maybe Python or Java are better choices. What language do you suggest?

I think Ruby is just fine for any of these kinds of tasks:
http://rubyrss.com/
http://www.ruby-doc.org/stdlib/libdoc/rss/rdoc/index.html
http://railscasts.com/episodes/173-screen-scraping-with-scrapi
If you are comfortable with Ruby, I see no reason to shell out to Java, Python, et al. for most tasks. Keep in mind that many Ruby libraries sit on top of native implementations.

A feed (RSS?) is usually pretty well structured (compared to a regular web page, at least). Check out Web Harvest, a Java / BeanShell-based DOM parser (among other things). You can use this to automate grabbing data off the internet. There is a domain-specific language (defined in XML) that you'll have to learn. Its learning curve might be a bit steep, but I felt that it was well worth the effort.

I am not very familiar with Java, but I can say Python is very well suited for the job.
There is a forgiving XML parser called BeautifulStoneSoup, which you can use. It is part of the BeautifulSoup library. And if you're only looking for a simple indexer, Python ships with a built-in SQLite module (sqlite3), which is lightweight and very fast.
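As a rough illustration of the "parse the feed, then index it in sqlite" idea, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree in place of BeautifulStoneSoup so that it runs without extra installs. The sample feed, the table layout, and the function names are all made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real spider would download this over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Ruby for crawlers</title>
      <link>http://example.com/ruby</link>
      <description>Notes on writing a spider in Ruby.</description>
    </item>
    <item>
      <title>Python for indexers</title>
      <link>http://example.com/python</link>
      <description>Using sqlite as a small index.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return (title, link, description) tuples for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [
        (
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        )
        for item in root.iter("item")
    ]


def build_index(items, conn):
    """Store the parsed items in a single table keyed by link (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(link TEXT PRIMARY KEY, title TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO entries (link, title, description) VALUES (?, ?, ?)",
        [(link, title, description) for title, link, description in items],
    )
    conn.commit()


def search(conn, term):
    """Naive substring search over title and description."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT title, link FROM entries "
        "WHERE title LIKE ? OR description LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
build_index(parse_items(SAMPLE_RSS), conn)
print(search(conn, "sqlite"))
```

A LIKE query is only a stand-in for a real index. For anything beyond a toy, a proper full-text search layer would be the next thing to look at.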

Related

Is Erlang the right choice for a webcrawler?

I am planning to write a web crawler for an NLP project that reads the thread structure of a forum at a fixed interval and parses each thread that has new content. The author, the date and the content of new posts are extracted via regular expressions. The result is then stored in a database.
The language and platform used for the crawler have to match the following criteria:
easily scalable on multiple cores and cpus
suited for high I/O loads
fast regular expression matching
easy to maintain / low operational overhead
After some research I think Erlang might be a fitting candidate, but I read it's not very good at string processing (and therefore at regular expression matching). I also have no experience with how easy it is to maintain.
Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?
I am also evaluating Erlang for use as a web crawler and it looks good so far.
There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.
And other people are interested in the same use case, so you can learn from them.
However, if this is just a one-off project, I recommend Python / Ruby / Perl because it will be easier to get started with.
If you're familiar and comfortable with Erlang then I'd stick with it if I were you, although I'm not familiar with Erlang myself. With that noted, I'll give you some pointers:
Don't use regular expressions to parse HTML; use XPath instead.
HTML, while structured, is still quite difficult to parse in the wild, and regular expressions are fairly slow and unreliable for parsing HTML.
Determine what your crawler architecture is going to be and what your revisit policy is.
Find the best selection policy for you and implement it.
A web crawler is a fairly complex system to build, and you have to be concerned about speed, performance, scalability and concurrency. Some of the most notable crawlers are written in C++ and Java, but I have not heard of any crawlers written in Erlang.
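To make the "XPath instead of regular expressions" pointer concrete, here is a small sketch in Python (not Erlang) using the limited XPath subset of the standard library's xml.etree.ElementTree. The forum markup and its class names are invented for the example, and the snippet is well-formed on purpose. Real-world HTML usually is not, so a tolerant HTML parser would be needed in front of this step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed forum markup; the class names are assumptions.
SAMPLE_THREAD = """
<html><body>
  <div class="post">
    <span class="author">alice</span>
    <span class="date">2010-03-01</span>
    <p class="content">Has anyone tried Erlang for crawling?</p>
  </div>
  <div class="post">
    <span class="author">bob</span>
    <span class="date">2010-03-02</span>
    <p class="content">Yes, with binaries instead of strings.</p>
  </div>
</body></html>
"""


def extract_posts(markup):
    """Pull author, date and content out of each post using path expressions."""
    root = ET.fromstring(markup.strip())
    posts = []
    # ".//div[@class='post']" selects every post container at any depth.
    for post in root.findall(".//div[@class='post']"):
        posts.append({
            "author": post.findtext("span[@class='author']"),
            "date": post.findtext("span[@class='date']"),
            "content": post.findtext("p[@class='content']"),
        })
    return posts


for post in extract_posts(SAMPLE_THREAD):
    print(post["date"], post["author"], post["content"])
```

The selectors follow the document structure, so extra attributes or whitespace inside a tag do not break the extraction the way they tend to with a regular expression.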
Erlang is fine for this. Its regex library delegates (nearly all) work to PCRE, which should be fast enough. But avoid strings and use binaries instead! Binaries use a lot less memory and are also faster to translate to C strings.

case class Rails extends Scala with Drupal (exists: Boolean)

I like aspects of Scala, Rails, and Drupal, and would be interested in Scrupal
Scala because it's strongly typed, concise, fast, both functional & OO, and massively scalable (not to mention being the [TBD] successor to Java)
Rails because it brings together a number of sensible conventions under one roof that make web development creative and enjoyable. In a bare nutshell, MVC + Routing + ORM + Validation + REST (much more obviously, including Ruby, but in terms of the essential components)
Drupal for providing a flexible drag & drop interface that unifies content management and content display.
Now, having just read Odersky et al.'s Programming in Scala, and discovering the ScalaQuery ORM (a JVM LINQ) and its totally SIQ future with Typesafe, the language and ORM components of Scrupal are taken care of (why not .NET/C#/LINQ? Valid point. I prefer Linux, and Mono is always a step behind and/or restricted by what they can replicate in terms of M$ latest & greatest)
As for Scala Rails (Scails), Scalatra provides routing, REST and the V(iew) with Scalate, but is missing the MC as well as the ORM. Love the lightweight simplicity but Scalatra is Sinatra, not Rails. Play 2.0 Scala version is apparently Rails inspired, but running through the documentation does not bring a sigh of relief by any means, particularly the non-ORM, ANORM. For the time being maybe Play 2.0 is the only viable Scala Rails option, will have to dig a little deeper, wish it was more DRY and concise like Scala itself.
So, assuming Scala Rails already exists, the missing link is Scala Drupal, arguably the hardest part given that it does not exist ;-) Perhaps it's a pipe dream combining Drupal's data driven Content Management Framework (CMF) with a compiled language like Scala. Are the two mutually exclusive or is it possible to create a performant data driven Scrupal? If so, how do you envision the full stack?
Play is actually very neat, and by the way it does not tie you to Anorm. I have used Hibernate/JPA and Siena, both of which work pretty well. It would be interesting to hear what you found missing in the framework :-)
Implementing something like Drupal is very possible in whatever stack you choose! I would say go with the Typesafe stack.
Have you looked at http://liftweb.net/ ? It covers REST, ORM and validation. You can use MVC, but the more common method is view first. It also brings great Comet/Ajax support, security you don't have to worry about, and a few other things.

erlang template engine. sgte, google-cTemplate, or erlydtl

I am planning to add a template engine to my Erlang project, and the most important thing is performance. Currently I have a lot of Velocity (Java) templates, and I want to migrate the existing templates to Erlang.
I googled it, and found things like:
http://www.ivan.fomichev.name/2008/05/erlang-template-engine-prototype.html
erlydtl
google-ctemplate
sgte
A pure Erlang implementation would be best, but if a C/C++-based template engine such as google-ctemplate performs better, I would use it through an Erlang linked-in driver.
I have no experience in this matter yet, so anyone's suggestions would be super great.
thanks
My personal favourite is erlydtl. It compiles the template to an erlang module, so there's no filesystem access or parsing time consumed when you call 'render'.
I think rebar has erlydtl support these days, so getting your templates compiled is a lot less hassle than it used to be. Just name them *.dtl and they'll get built when you run rebar compile.
It should also be fairly competitive speed wise as it's in-process (skip the IPC cost of a port program), compiled (and could be compiled to native code if you wanted to), and generates iolists which are pretty efficient.
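The "compile once, render many times" idea can be sketched outside Erlang too. The following Python snippet is only an analogy and not how erlydtl is implemented. The `{{ name }}` placeholder syntax and the `compile_template` helper are assumptions for the example. The template is parsed a single time, and each render call returns a list of chunks, loosely comparable to an iolist, that is joined only when a flat string is actually needed.

```python
import re

# Matches placeholders such as {{ name }}; the syntax is an assumption.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(source):
    """Parse the template once and return a render(context) function."""
    # With one capture group, split() yields: literal, name, literal, name, ..., literal
    parts = _PLACEHOLDER.split(source)
    literals = parts[0::2]
    names = parts[1::2]

    def render(context):
        # No parsing happens here; we only interleave precomputed pieces.
        chunks = [literals[0]]
        for name, literal in zip(names, literals[1:]):
            chunks.append(str(context[name]))
            chunks.append(literal)
        return chunks

    return render


render_greeting = compile_template(
    "Hello, {{ name }}! You have {{ count }} new messages."
)
print("".join(render_greeting({"name": "Ann", "count": 3})))
```

The parsing cost is paid once in `compile_template`, so repeated calls to `render_greeting` only do lookups and list appends.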

Do ruby on rails programmers refactor?

I'm a Java programmer who started programming Ruby on Rails one year ago. I like the language, rails itself and the principles behind them. But something that bothers me is that Ruby programmers don't seem to refactor.
I noticed that there is a big lack of tools for refactoring in Ruby / Rails. Some IDEs, like Aptana and RubyMine, seem to offer some very basic refactoring, but nothing really big compared to Eclipse's Java refactorings.
Then there is another fact: most railers (even the pros) prefer some lightweight editors, like VIM or TextMate, instead of IDEs. Well, with these tools you just get zero refactoring (only regex with find/replace).
This leaves me with the impression that Rails programmers don't refactor. It might be just a false impression, of course, but I would like to hear the opinion of people who work professionally with Ruby on Rails.
Do you refactor? If you do, how do you do it, and with which tools? If not, why not?
Definitely yes, there is a different reason for the tool disparity
An IDE is more practical to construct for Java
Java's strict typing and documented grammar make it possible to write language-parsing IDE tools
Ruby's duck typing and documented-by-the-Yacc-source grammar make it quite difficult to do so.
An IDE is more needed for Java
Java's verbosity makes code-writing and code-rewriting tools desirable.
Ruby's extremely terse nature combined with the typically-no-type-declarations (of course they do appear inline with Type.new) make such things optional.
Combining the two...
So the combination of really hard to write coupled with not actually needed results in the balance tipping in favor of people's favorite editors.
Giving up vi(1) for an IDE is something I would rather not do, but I do with Java because I need the IDE to write my interface implementations and such, and the fact that it parses Java makes it useful in code completion. Since with Ruby it can't and I don't need it anyway, I stick with vi(1) and TextMate.
Summary
Since you aren't buried in code, it's possible to refactor with a few reasonable edits. But while on the subject of "other Ruby developers", my Ruby question is: why does everyone (except me it seems) use function parens? Because in a few % of situations they are needed, and so the "inconsistency" is disturbing?
Yes.
Most Rails programmers try and follow a Test first, write code to pass the test, then refactor the code BEFORE they go onto the next test.
Do ALL rails/ruby programmers... probably not, but as far as a 'vibe' or 'feel' in this community, I'd say it's something that is preached and practiced enough that it happens more times than not.
There is no need for IDEs imo. Vim, Emacs and/or TextMate are enough for Ruby and most Rails programmers. I guess Java needed more compiling or something, but what do I know about that, as I've only programmed in Ruby. Why do all Java programmers use IDEs (since I'm generalizing)?
RoR developers do refactor a lot. But most importantly they do it because they can do it easily.
If you keep to the main principle of RoR - Don't Repeat Yourself - and spend some time on code design (which means you haven't ended up with a huge chunk of monolithic code), nothing can stop you from rewriting a piece of code for whatever reason you have in mind (generalization, speedup, improving readability, etc.). The built-in testing/benchmarking/profiling functionality of Rails is at your service to check whether you achieved your goal without sacrificing already existing and working functionality.
The editor is totally independent of the code, therefore you could even use Notepad for coding (I'm not a command-line fanatic, I prefer a bit more 'graphical' editors like Gedit).
Having spent a lot of time in both Java and Ruby (with a good bit of back-and-forth of late, from Eclipse to/from Textmate) I agree that certain kinds of refactorings are harder in Ruby. This is less a consequence of poorer IDEs for Ruby than it is the fact of static typing vs. dynamic typing and the difficulty of writing refactoring tools for a dynamic language. To a large degree manual/regex driven refactoring is easier in Ruby than it would be in Java because of the terseness of Ruby code -- there's just less of it --, but nevertheless something as simple as renaming a method is not as straightforward in Ruby as it is in Java. The benefits of Ruby vs. Java are (imo) greater by far (and you'll just have to use Ruby in production for a few months to get a real feel for just how much you'll love it), but one drawback is the lack of the same robust refactoring that you're used to in tools like IDEA and Eclipse.
EDIT: And just to be clear -- I don't do any less refactoring in Ruby per se than Java, but it seems I need it less for Ruby. But when I do I rely on unit tests rather than the compiler as I would in Java.
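The "rely on unit tests rather than the compiler" workflow looks roughly like this. The sketch is in Python with the standard unittest module, but the same idea applies with Ruby's test frameworks. The `order_total` function and its helper are hypothetical. The point is only that the tests pin down behaviour, so an extraction or rename can be checked by re-running them.

```python
import unittest


def _subtotal(prices):
    """Helper extracted during the refactoring; previously inlined in order_total."""
    return sum(prices)


def order_total(prices, tax_rate):
    """Total price including tax, rounded to two decimal places."""
    return round(_subtotal(prices) * (1 + tax_rate), 2)


class OrderTotalTest(unittest.TestCase):
    # These tests existed before the refactoring and must keep passing after it.
    def test_total_includes_tax(self):
        self.assertEqual(order_total([10.0, 5.0], 0.2), 18.0)

    def test_empty_order_is_zero(self):
        self.assertEqual(order_total([], 0.2), 0)


# Run the suite without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderTotalTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If a rename misses a call site, a dynamic language only fails when that line actually runs, which is why the test suite has to exercise the code that was changed.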

Why do so many insist on dragging the JVM into new applications?

For example, I'm running into developers and architects who are scared to death of Rails apps, but love the idea of writing new Grails apps.
From what I've seen, there is a LOT of resource overhead that goes into using the JVM to support languages such as Groovy, JRuby and Jython instead of straight Ruby or Python.
Ruby and Python can both be interpreted on just about any OS, so I don't see any "write once run anywhere" advantage... why bring the hulking JVM along with you?
Java is a much, much more mature platform, with a lot of existing class libraries that could be "dropped in" and used, than, say, Ruby or Python (or even Perl, for that matter). So for people who like using existing code, rather than writing everything themselves, Java is a huge win.
For example, recently I've been looking for something like JAXB for Python or Ruby. In the end, I ended up using JRuby just because I haven't found any mature, widely-used XML-binding libraries.
The huge advantage of writing code (in any language) for the JVM is that it's usually very easy to tap into the enormous wealth of mature Java libraries out there, if necessary.
And I don't know where you got this idea of a "hulking" JVM with a huge resource overhead. The JIT tends to produce code that is quite fast, and the core JVM is anything but huge by today's standards. It does tend to have a huge memory footprint when running, but that's because modern machines have a lot of RAM and the GC works best when it has a lot of RAM to play with. If desired, the GC can be fine-tuned to hell and back to be more conservative.
As someone else put it: "The best thing about Groovy is that I don't have to use Java. The second best thing about Groovy is that I can use Java".
An assumption that seems to be built into the question is that new projects are greenfield projects. Many organizations have made a huge investment in Java over the last decade+ and require any new project to work within the existing (internal) code ecosystem. As pointed out, there's a huge bonus in all the publicly available Java libraries (whether free/OSS or commercial), but the need to work with existing code and even as a component within an existing system is at least as important (if not more so) to large organizations.
A lot also comes down to the maturity and capability of the platform, which is to say the JVM and everything that comes with it (the entire Java ecosystem). A few examples off the top of my head:
You can plug a remote debugger into a running JVM and get all kinds of information about a running application that is simply impossible with Python, Ruby, etc. Going a step further, there's JMX, a standard way to write code so that objects can be monitored and even tweaked in a live application. Take a look at JConsole and see if you don't drool just a little (despite the ugliness of the interface).
Going even further in this direction, there's OSGi, a standard for writing highly modular code that can be deployed, started, stopped, and even upgraded in a live application. With OSGi you break a large application into many smaller "bundles" which can then be maintained (deployed, started/stopped, upgraded) separately. This is a really big deal in large applications, or any applications that need to remain running at all times.
The platform has very good support for asynchronous, reliable messaging. You get JMS as a baseline, and many excellent and powerful libraries built on it for doing complicated things with very little code (cf. Apache Camel, ServiceMix, Mule, and many others). This is another feature that's extremely valuable in larger applications or those which must run within a larger code universe.
The JVM has real (OS-level) threading, while Python et al. are very limited in this regard (notoriously so). (That being said, shared state concurrency -- threading -- is the wrong approach; cf. Erlang, Alice, Mozart/Oz, etc.)
There are numerous JVM choices beyond the standard Sun implementations, like JRockit, IBM's JVM, etc. This is a developing area with other languages -- Python has Jython, IronPython, even PyPy and Stackless; Ruby has JRuby, Rubinius, and others -- but as good as these are, they can't match the maturity found in the various JVM offerings.
All that being said, I really don't like Java the language and avoid it as much as possible. These days with all the excellent alternative languages for the JVM I don't have to. Groovy gets my vote for its accessibility and tight integration with the platform (and even the language), and because of Grails, which I sometimes like to call "Rails for grownups". I like other JVM languages better, particularly Clojure and Scala, but these aren't as accessible to the average programmer. Scala is popping up a lot lately, though, especially thanks to its high profile use at Twitter, so there's hope for interesting and truly excellent languages making it in larger environments. But that's another topic.
why bring the hulking JVM along with you?
The JVM isn't bloated, nor is it slow. On the contrary, it's a lean, fast, deeply optimized VM. Unfortunately, it's optimized for static OOP languages.
Still, good compilers targeting the JVM do create well-performing programs. I don't know about JRuby, but Jython's goal is to be all-around faster than regular CPython, and they're getting close (it's already faster in several important use cases).
Remember that a good JIT (like those for the JVM) can apply some optimizations unavailable to static C compilers, so getting faster code from it isn't a pipe dream. Of course, a VM optimized for your language should be faster than a 'not-really-generic' VM like the JVM, but there's the maturity issue: the JVM has had a lot of work done there, while JITs for Ruby and Python aren't anywhere near.
Unfortunately, there doesn't seem to be any better generic bytecode VM. Microsoft's CLI suffers from similar limitations as the JVM (IronPython is much slower and heavier than Jython). The best candidate seems to be LLVM. Does anybody know why there aren't more dynamic languages on LLVM? I've seen a couple of Scheme compilers, but they seem to have several problems.
Groovy is NOT an interpreted language, it is a dynamic language. The groovy compiler produces JVM bytecode that runs inside the JVM just like any other java class. In this sense, groovy is just like java, simply adding syntax to the java language that is meaningful only to developers and not to the JVM.
Developer productivity, ease and flexibility of syntax make groovy attractive to the java ecosystem - ruby or python would be as attractive if they resulted in java bytecode (see jython).
Java developers are not really scared of Ruby; as a matter of fact, many quickly embrace Groovy or Jython, both of which are close to Ruby and Python. What they don't care for is leaving such an amazing platform (Java) for a less performant, less scalable, even less used language such as Ruby (for all its merits).
The big knock on RoR is that it isn't scalable and is hard to deploy. By using the Java platform, you can leverage your existing infrastructure.
grails war
Produces a war file that is easily deployed on Glassfish, Jboss, etc.
Ruby and Python can both be interpreted on just about any OS, so I don't see any "write once run anywhere" advantage... why bring the hulking JVM along with you?
Mostly because you want to take advantage of the HUGE existing ecosystem of Java libraries, APIs and products, which dwarfs anything available for Ruby or Python, especially in the enterprise domain.
Also, keep in mind that JRuby and Jython are faster in a lot of benchmarks than the regular (C) implementations of the languages, especially Ruby (even Ruby 1.9).
Having multiple languages targeting the same virtual machine has a lot of benefits, such as leveraging a common infrastructure, code reuse, shared APIs, the ability to use whatever language is conceptually best for you, or for a specific problem domain, etc.
The same thing happens in the .NET space, with multiple languages targeting the CLR. The Parrot (vaporware) VM project also aims at the same thing, and it's a stated goal of the LLVM project too.
The reason is Hotspot.
It is an engineering tour de force.
The other reason, which not many have mentioned, is existing infrastructure related to the JVM - if you already have a server running Java stuff, why not use it instead of bringing in yet another platform (like Rails)?
I've encountered this and also been baffled by it, and here's my theory.
Enterprise software is full of Java programmers. Like programmers of all stripes, many Java programmers are convinced that their language is the fastest, the most flexible and the easiest to use--they're not too familiar with other languages but are convinced that those who practice them must be savages and barbarians, because any enlightened person would, of course, use Java.
These people have built vast, complicated Java infrastructures: Rube Goldberg machines of frameworks and auto-generated code full of byzantine inheritance structures and very, very large XML files.
So, when someone comes along and says "Hey! Let's use a C-based interpreted language! It's fast and has neat libraries and is much quicker for scripting and prototyping!", the Java guy is firstly like "I have to run a make file to configure this? QUELLE HORREUR!" Then the reality of having to deploy and host this on servers that are running dated OSes and dated versions of Tomcat and nothing else starts to set in.
"Hey, I know! There's a java version of this interpreted language! It may break down in the fast lane on the bridge in rush-hour, and it sometimes catches on fire, but I can get Tomcat to run it. I don't have to dirty my hands with learning non-java stuff, and I can shoehorn it into the existing infrastructure! Win!"
So, is this the "right" reason for choosing a java implementation of a scripting language? Probably not. Depends on your definition of "right". But, I suspect that it's the reason they're chosen more often than snobs like me would like to believe.

Resources