Is Erlang the right choice for a webcrawler? - erlang

I am planning to write a webcrawler for a NLP project, that reads in the thread structure of a forum everytime in a specific interval and parses each thread with new content. Via regular expressions, the author, the date and the content of new posts is extracted. The result is then stored in a database.
The language and plattform used for the crawler have to match the following criteria:
easily scalable on multiple cores and cpus
suited for high I/O loads
fast regular expression matching
easily to maintain/few operational overhead
After some research I think Erlang might be a fitting candidate, but I read it's not very good at string processing (and so regular expression matching). Neither do I have any expirience about the maintenance factor.
Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?

I am also evaluating erlang for use as a web crawler and it looks good so far.
There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.
And other people are interested in the same use case, so you can learn from them.
However if this is just a one off project I recommend Python / Ruby / Perl because it will be easier to get started with.

If you're familiar and comfortable with erlang then I'd stick with it if I were you, although I'm not familiar with erlang. With that noted, I'll give you some pointers:
Don't use regular expressions to parse HTML, use XPATH instead.
HTML, while structured, is still quite difficult to parse in the wild and regular expressions are fairly slow and unreliable for parsing HTML.
Determine what your crawler architecture is going to be and what is your re-visit policy.
Find the best selection policy for you and implement it.
A web crawler is a fairly complex system to build and you have to be concerned about speed, performance, scalability and concurrency. Some of the most notable crawlers are written in C++ and Java, but I have not heard of any crawlers written in erlang.

Erlang is fine for this. Its regex library delegates (nearly all) work to PCRE, which should be fast enough. But avoid strings and use binaries instead! They both use a lot less memory and are faster to translate to C strings.

Related

Cross-platform parser development - What are the options?

I'm currently working on a project that makes use of a custom language with a simple context-free grammar.
Due to the project's characteristics the same language will have to be used on several platforms, especially mobile ones. Currently, I'm using my small hand-written Java parser (for the Android platform). Soon, I'll have to write basically the same parser for JavaScript and later possibly also for C# (Windows Phone) and Objective C (iOS). There is an additional chance that I'll also have to write it for PHP.
My question is: What options are there to simplify the parser development process? Do I really have to write basically the same parser for each platform or is there a less work-intensive way?
From a development process point of view the best alternative would enable me to write a grammar definition which would then automatically be compiled into a parser.
However, basically the only cross-platform parser generator I've found so far it the GOLD Parser which supports two of my target platforms (Java and C#). It would really be awesome if you could point me to other alternatives.
In case you don't know about other cross-platform compiler-compilers: Do you have hints how to structure the code towards future language extensibility?
I commend https://en.wikipedia.org/wiki/Comparison_of_parser_generators to your attention: if we restrict the domain to Java and C/C++, it suggests APG, GOLD, SableCC, and SLK (amongst others) as being cross-language enough for your stated goals. (I'm also requiring that the action code be separated from the grammar rather than inline, since the latter would defeat the purpose.) If you want JavaScript as well, it looks like your choices are APG (GPL-licensed) and WaxEye (MIT-licensed).
If your language is reasonably simple then I would say to just go with whichever you think will be easiest to integrate into your build environment(s) and has a reasonable match with how you think. Unless parsing time is a huge fraction of your application's total workload, parsing speed should not be an issue -- although table size and memory usage might matter in a mobile context. If your grammar is "simple enough," (i.e. not Perl, for instance) I would expect any of those tools to work.
Have a look in Antlr, I am using it for transforming java code and it is really great. Moreover you can find different grammars here.
REx parser generator supports the required targets, except for Objective C and PHP (code generators for those might be possible). It has not yet been published as open source, though, and there is no decent documentation, just sample grammars. But there are projects that are using it successfully, e.g. xqlint. Here is a paper describing the experience from that project.

"Erlang" vs "zeromq+any language" for Embedded Applications

I want to write actor style code for embedded processors and I am trying to decide between writing everything in Erlang vs writing everything in zeromq+any language. Using zeromq looks to be very powerful in the sense that I can use any programming language and make my development a lot easier(many available libs) but then I am not sure if there is any gotcha in this power? I understand that Erlang represent actor model much better especially with OTP concepts but then it seems easy to represent similar actor model with zeromq? Am I looking at this correctly?
1.What do I really lose not using Erlang for embedded applications (where distributed processing, a power point of Erlang, is NOT required) and just build things on top a generic messaging framework like zeromq?
2.Is Erlang offering more than a coordinated messaging framework for a non-distributed embedded application?
3.What specific capabilities of Erlang could took too long to implement with zeromq?
You're comparing apples and oranges. Part of the advantage of using Erlang is the language; if you're going to put it up against zmq + some other language, the other language in that comparison really matters. zmq + ARM assembly? Erlang brings all the wonderful advantages of not hand-coding ASM.
As for what else Erlang brings to the table, Embedded Erlang? Absolutely argues that Erlang has advantages in fault tolerance, hot code loading, rapid development by leveraging Erlang and OTP, easy interaction with C libraries, and simple debugging by live REPL and copy-paste of terms.
Some of those things, such as hot reload, on-device REPL, and established libraries, will definitely take some real hacking to reproduce from the ground up.
My point would be that you will have to work very hard to get the same kind of error handling in Zmq. Erlang has some really nice built-in error handling when things begins to go bad. There has been considerable time spent in Erlang optimizing that part and making it robust.
Zmq on the other hand, is probably faster in some combinations with some languages when you make simple benchmarks. There is less overhead, so it may process messages faster than what Erlang can provide.
But chances are that you will end up re-implementing large parts of Erlang in the language of your choice. And you will probably not do a job as good as 6-10 developers working on Erlang/OTP for 15 years.
On the other hand, Erlang is not a simple language to learn. There is way more to it than just learning how to program in a functional style. Especially the concurrency patterns and failure handling can take some time getting used to.
ZeroMQ =/= Erlang covers many differences. The claim there is that ZeroMQ only provides the messaging aspect, not the light-weight processes, process monitoring and other aspects.

Custom programming language: how?

Hopefully this question won't be too convoluted or vague. I know what I want in my head, so fingers crossed I can get this across in text.
I'm looking for a language with a syntax of my own specification, so I assume I will need to create one myself. I've spent the last few days reading about compilers, lexers, parsers, assembly language, virtual machines, etc, and I'm struggling to sort everything out in terms of what I need to accomplish my goals (file attached at the bottom with some specifications). Essentially, I'm deathly confused as to what tools specifically I will need to use to go forward.
A little background: the language made would hopefully be used to implement a multiplayer, text-based MUD server. Therefore, it needs easy inbuilt functionality for creating/maintaining client TCP/IP connections, non-blocking IO, database access via SQL or similar. I'm also interested in security insofar as I don't want code that is written for this language to be able to be stolen and used by the general public without specialist software. This probably means that it should compile to object code
So, what are my best options to create a language that fits these specifications
My conclusions are below. This is just my best educated guess, so please contest me if you think I'm heading in the wrong direction. I'm mostly only including this to see how very confused I am when the experts come to make comments.
For code security, I should want a language that compiles and is run in a virtual machine. If I do this, I'll have a hell of a lot of work to do, won't I? Write a virtual machine, assembler language on the lower-level, and then on the higher-level, code libraries to deal with IO, sockets, etc myself, rather than using existing modules?
I'm just plain confused.
I'm not sure if I'm making sense.
If anyone could settle my brain even a little bit, I'd sincerely appreciate it! Alternatively, if I'm way off course and there's a much easier way to do this, please let me know!
Designing a custom domain-specific programming language is the right approach to a problem. Actually, almost all the problems are better approached with DSLs. Terms you'd probably like to google are: domain specific languages and language-oriented programming.
Some would say that designing and implementing a compiler is a complicated task. It is not true at all. Implementing compilers is a trivial thing. There are hordes of high-quality compilers available, and all you need to do is to define a simple transform from your very own language into another, or into a combination of the other languages. You'd need a parser - it is not a big deal nowdays, with Antlr and tons of homebrew PEG-based parser generators around. You'd need something to define semantics of your language - modern functional programming langauges shines in this area, all you need is something with a support for ADTs and pattern matching. You'd need a target platform. There is a lot of possibilities: JVM and .NET, C, C++, LLVM, Common Lisp, Scheme, Python, and whatever else is made of text strings.
There are ready to use frameworks for building your own languages. Literally, any Common Lisp or Scheme implementation can be used as such a framework. LLVM has all the stuff you'd need too. .NET toolbox is ok - there is a lot of code generation options available. There are specialised frameworks like this one for building languages with complex semantics.
Choose any way you like. It is easy. Much easier than you can imagine.
Writing your own language and tool chain to solve what seems to be a standard problem sounds like the wrong way to go. You'll end up developing yet another language, not writing your MUD.
Many game developers take an approach of using scripting languages to describe their own game world, for example see: http://www.gamasutra.com/view/feature/1570/reflections_on_building_three_.php
Also see: https://stackoverflow.com/questions/356160/which-game-scripting-language-is-better-to-use-lua-or-python for using existing languages (Pythong and LUA) in this case for in-game scripting.
Since you don't know a lot about compilers and creating computer languages: Don't. There are about five people in the world who are good at it.
If you still want to try: Creating a good general purpose language takes at least 3 years. Full time. It's a huge undertaking.
So instead, you should try one of the existing languages which solves almost all of your problems already except maybe the "custom" part. But maybe the language does things better than you ever imagined and you don't need the "custom" part at all.
Here are two options:
Python, a beautiful scripting language. The VM will compile the language into byte code for you, no need to waste time with a compiler. The syntax is very flexible but since there is a good reason for everything in Python, it's not too flexible.
Java. With the new Xtext framework, you can create your own languages in a couple of minutes. That doesn't mean you can create a good language in a few minutes. Just a language.
Python comes with a lot of libraries but if you need anything else, the air gets thin, quickly. On a positive side, you can write a lot of good and solid code in a short time. One line of python is usually equal to 10 lines of Java.
Java doesn't come with a lot of frills but there a literally millions of frameworks out there which do everything you can image ... and a lot of things you can't.
That said: Why limit yourself to one language? With Jython, you can run Python source in the Java VM. So you can write the core (web server, SQL, etc) in Java and the flexible UI parts, the adventures and stuff, in Python.
If you really want to create your own little language, a simpler and often quicker solution is to look at tools like lex and yacc and similar systems (ANTLR is a popular alternative), and then you can generate code either to an existing virtual machine or make a simple one yourself.
Making it all yourself is a great learning-experience, and will help you understand what goes on behind the scenes in other virtual machines.
An excellent source for understanding programming language design and implementation concepts is Structure and Interpretation of Computer Programs from MIT Press. It's a great read for anyone wanting to design and implement a language, or anyone looking to generally become a better programmer.
From what I can understand from this, you want to know how to develop your own programming language.
If so, you can accomplish this by different methods. I just finished up my own a few minutes ago and I used HTML and Javascript (And DOM) to develop my very own. I used a lot of x.split and x.indexOf("code here")!=-1 to do so... I don't have much time to give an example, but if you use W3schools and search "indexOf" and "split" I am sure that you will find what you might need.
I would really like to show you what I did and past the code below, but I can't due to possible theft and claim of my work.
I am pretty much just here to say that you can make your own programming language using HTML and Javascript, so that you and other might not get their hopes too low.
I hope this helps with most things....

Source of parsers for programming languages?

I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are based on a very crude algorithm (traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level a class or method began on, consider it exited). However, there are many problems with this procedure, and a "simple" way of detecting when your depth has changed is not always effective.
To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)
If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other langauges; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.
Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets IMHO, although as far as I can tell most of those parsers are not-quite-production capable. There's always Bison, which has been around long enough so you'd expect a library of langauge definitions to be collected somewhere, but I've never seen one.
I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building another such library, called the DMS Software Reengineering Toolkit. It has production quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine but these are used daily with DMS to carry out mass change tasks on large bodies of code.
I don't know of any others where the set of langauge definitions are mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer out the machinery or the language definitions.
If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that's harder than it looks to make it work right in most cases (check out Python's, Perl's and PHP crazy string syntaxes). When all is said and done, even C is a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
Because DMS has consistently-defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same langauges. We actually build a Source Code Search Engine (SCSE) that provides fast search across large bodies of codes in multiple languages that works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE just so happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these langauage accurate lexers to use.
You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.

Appropriate uses for yacc/byacc/bison and lex/flex

Most of the posts that I read pertaining to these utilities usually suggest using some other method to obtain the same effect. For example, questions mentioning these tools usual have at least one answer containing some of the following:
Use the boost library (insert appropriate boost library here)
Don't create a DSL use (insert favorite scripting language here)
Antlr is better
Assuming the developer ...
... is comfortable with the C language
... does know at least one scripting
language (e.g., Python, Perl, etc.)
... must write some parsing code in almost
every project worked on
So my questions are:
What are appropriate situations which
are well suited for these utilities?
Are there any (reasonable) situations
where there is not a better
alternative to a problem than yacc
and lex (or derivatives)?
How often in actual parsing problems
can one expect to run into any short
comings in yacc and lex which are
better addressed by more recent
solutions?
For a developer which is not already
familiar with these tools is it worth
it for them to invest time in
learning their syntax/idioms? How do
these compare with other solutions?
The reasons why lex/yacc and derivatives seem so ubiquitous today are that they have been around for much longer than other tools, that they have far more coverage in the literature and that they traditionally came with Unix operating systems. It has very little to do with how they compare to other lexer and parser generator tools.
No matter which tool you pick, there is always going to be a significant learning curve. So once you have used a given tool a few times and become relatively comfortable in its use, you are unlikely to want to incur the extra effort of learning another tool. That's only natural.
Also, in the late 1960s and early 1970s when lex/yacc were created, hardware limitations posed a serious challenge to parsing. The table driven LR parsing method used by Yacc was the most suitable at the time because it could be implemented with a small memory footprint by using a relatively small general program logic and by keeping state in files on tape or disk. Code driven parsing methods such as LL had a larger minimum memory footprint because the parser program's code itself represents the grammar and therefore it needs to fit entirely into RAM to execute and it keeps state on the stack in RAM.
When memory became more plentiful a lot more research went into different parsing methods such as LL and PEG and how to build tools using those methods. This means that many of the alternative tools that have been created after the lex/yacc family use different types of grammars. However, switching grammar types also incurs a significant learning curve. Once you are familiar with one type of grammar, for example LR or LALR grammars, you are less likely to want to switch to a tool that uses a different type of grammar, for example LL grammars.
Overall, the lex/yacc family of tools is generally more rudimentary than more recent arrivals which often have sophisticated user interfaces to graphically visualise grammars and grammar conflicts or even resolve conflicts through automatic refactoring.
So, if you have no prior experience with any parser tools, if you have to learn a new tool anyway, then you should probably look at other factors such as graphical visualisation of grammars and conflicts, auto-refactoring, availability of good documentation, languages in which the generated lexers/parsers can be output etc etc. Don't pick any tool simply because "this is what everybody else seems to be using".
Here are some reasons I could think of for using lex/yacc or flex/bison :
the developer is already familiar with lex/yacc or flex/bison
the developer is most familiar and comfortable with LR/LALR grammars
the developer has plenty of books covering lex/yacc but no books covering others
the developer has a prospective job offer coming up and has been told that lex/yacc skills would increase his chances to get hired
the developer could not get buy-in from project members/stake holders for the use of other tools
the environment has lex/yacc installed and for some reason it is not feasible to install other tools
Whether it's worth learning these tools or not will depend heavily (almost entirely on how much parsing code you write, or how interested you are in writing more code on that general order. I've used them quite a bit, and find them extremely useful.
The tool you use doesn't really make as much difference as many would have you believe. For about 95% of the inputs I've had to deal with, there's little enough difference between one and another that the best choice is simply the one with which I'm most familiar and comfortable.
Of course, lex and yacc produce (and demand that you write your actions in) C (or C++). If you're not comfortable with them, a tool that uses and produces a language you prefer (e.g. Python or Java) will undoubtedly be a much better choice. I, for one, would not advise trying to use a tool like this with a language with which you're unfamiliar or uncomfortable. In particular, if you write code in an action that produces a compiler error, you'll probably get considerably less help from the compiler than usual in tracking down the problem, so you really need to be familiar enough with the language to recognize the problem with only a minimal hint about where compiler noticed something being wrong.
In a previous project, I needed a way to be able to generate queries on arbitrary data in a way that was easy for a relatively non-technical person to be able to use. The data was CRM-type stuff (so First Name, Last Name, Email Address, etc) but it was meant to work against a number of different databases, all with different schemas.
So I developed a little DSL for specifying the queries (e.g. [FirstName]='Joe' AND [LastName]='Bloggs' would select everybody called "Joe Bloggs"). It had some more complicated options, for example there was the "optedout(medium)" syntax which would select all people who had opted-out of receiving messages on a particular medium (email, sms, etc). There was "ingroup(xyz)" which would select everybody in a particular group, etc.
Basically, it allowed us to specify queries like "ingroup('GroupA') and not ingroup('GroupB')" which would be translated to an SQL query like this:
SELECT
*
FROM
Users
WHERE
Users.UserID IN (SELECT UserID FROM GroupMemberships WHERE GroupID=2) AND
Users.UserID NOT IN (SELECT UserID GroupMemberships WHERE GroupID=3)
(As you can see, the queries aren't as effecient as possible, but that's what you get with machine generation, I guess).
I didn't use flex/bison for it, but I did use a parser generator (the name of which has escaped me at the moment...)
I think it's pretty good advice to eschew the creation of new languages just to support a Domain specific language. It's going to be a better use of your time to take an existing language and extend it with domain functionality.
If you are trying to create a new language for some other reason, perhaps for research into language design, then these tools are a bit outdated. Newer generators such as antlr, or even newer implementation languages like ML, make language design a much easier affair.
If there's a good reason to use these tools, it's probably because of their legacy. You might already have a skeleton of a language you need to enhance, which is already implemented in one of these tools. You might also benefit from the huge volumes of tutorial information written about these old tools, for which there is not so great a corpus written for newer and slicker ways of implementing languages.
We have a whole programming language implemented in my office. We use it for that. I think it's meant to be a quick and easy way to write interpreters for things. You could conceivably write almost any sort of text parser using them, but a lot of times it's either A) easier to write it yourself quick or B) you need more flexibility than they provide.

Resources