What technology for large scale scraping/parsing? [closed] - parsing

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
We're designing a large scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database.
What language would you recommend for doing this on a large scale(tens of millions of pages?).
.
We're using MongoDB for the database, so anything with solid MongoDB drivers is a plus.
So far, we have been using(don't laugh) PHP, curl, and Simple HTML DOM Parser but I don't think that's scalable to millions of pages, especially as PHP doesn't have proper multithreading.
We need something that is easy to develop in, can run on a Linux server, has a robust HTML/DOM parser to easily extract that tag, and can easily download millions of webpages in a reasonable amount of time.
We're not really looking for a web crawler, because we don't need to follow links and index all content, we just need to extract one tag from each page on a list.

If you're really talking about large scale, then you'll probably want something that lets you scale horizontally, e.g., a Map-Reduce framework like Hadoop. You can write Hadoop jobs in a number of languages, so you're not tied to Java. Here's an article on writing Hadoop jobs in Python, for instance. BTW, this is probably the language I'd use, thanks to libs like httplib2 for making the requests and lxml for parsing the results.
If a Map-Reduce framework is overkill, you could keep it in Python and use multiprocessing.
UPDATE:
If you don't want a MapReduce framework, and you prefer a different language, check out the ThreadPoolExecutor in Java. I would definitely use the Apache Commons HTTP client stuff, though. The stuff in the JDK proper is way less programmer-friendly.

You should probably use tools used for testing web applications (WatiN or Selenium).
You can then compose your workflow separated from the data using a tool I've written.
https://github.com/leblancmeneses/RobustHaven.IntegrationTests
You shouldn't have to do any manual parsing when using WatiN or Selenium. You'll instead write an css querySelector.
Using TopShelf and NServiceBus you can scale the # of workers horizontally.
FYI: With mono these tools i mention can run on Linux. (although miles may vary)
If JavaScript doesn't need to be evaluated to load data dynamically:
Anything requiring the document to be loaded in memory is going waste time. If you know where your tag is, all you need is a sax parser.

I do something similar using Java with the HttpClient commons library. Although I avoid the DOM parser because I'm looking for a specific tag which can be found easily from a regex.
The slowest part of the operation is making the http requests.

what about c++? there are many large scale libraries can help you.
boost asio can help you do the network.
TinyXML can parse XML files.
I have no idea about database, but almost all database have interfaces for c++, it is not a problem.

Related

Why learn Ruby on Rails [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
One of my college professors said that ruby on rails is used a lot for web, and I'm wondering how much Ruby on Rails is actually used vs JQuery, Node.js, PHP, etc. Also, what are the benefits?
You are mixing some stuff:
Ruby on Rails is a framework to create server side web applications using the Ruby language
jQuery is a client side JavaScript library that simplifies writing JavaScript web clients
Node.js is a server for the execution of server side JavaScript, thus providing a server version of JavaScript
PHP is a language popular for server side web application development
Thus: Ruby on Rails is a mature framework which offers a template engine, MVC architecture, a mapper between language objects and some relational database and a routing facility between URIs and controller.
Similiar designed frameworks exist for many programming languages / environments, e.g. Django for Python, or see Rails-inspired PHP frameworks in case of PHP.
About its popularity, see e.g. http://hotframeworks.com/
Benefits: IMHO it is a very elegant framework and as the plethora of inspired frameworks shows, has found many developers who like it.
The concepts and techniques learned here might also turn out to be useful when working with other modern frameworks.
And I should note as well, that there are web applications that need less features, e.g. see the Sinatra framework for a lighter alternative.
Also, what are the benefits?
There are a lot of things that websites have in common, e.g. html pages with forms, various javascript features, database interactions, security issues, logging in, etc. If you start from scratch, and try to program all those things yourself, it will be difficult and time consuming, and most likely your code will be full of exploitable security holes.
The other option is to use a web framework. Ruby on Rails is a web framework for the ruby programming language. All the various server side programming languages, such as ruby, python, php, perl, java, etc., have web frameworks(and usually many different frameworks to choose from!). A lot of smart people have come up with the best code for various things that websites need, and you get to use their code for free in your website.
The disadvantage of frameworks is that they are often large and complex, e.g. Ruby on Rails, Java Servlets+JSP, so it can take awhile to learn how to use them. Even then, you will probably not have a good grasp of their inner workings, so you are always sort of feeling around in the dark trying to get them to work the way you want them to. It's sort of like trying to push a large boulder which is at rest to another spot of your choosing: sometimes the boulder rolls cleanly into position, and other times the boulder seems to have a mind of its own and wanders off course.

Why use Node.js if I don't require real-time functionality? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I'm considering using Node.js with a framework such as express, meteor, or sails (a directory with social features such as sharing, messaging and uploading media). I don't have any features planned that explicitly require real-time functionality, so does it make sense to use Node.js anyway instead of Rails?
There's so much buzz around Node.js that I am tempted to use it just so that I don't get left behind.
As DHH wisely noticed regarding Node vs Rails, "everything can be used instead of everything else". That's somewhat true in a sense that, for example, a site in Rails with promptly set up caching can be as fast as one written in Node.js.
Besides, Node is not necessarily about real-time. It's more about being able to handle many light (in terms of processing time needed) requests. If you expect high level of concurrency (I mean, really, expect, not just are dreaming of it) and every request is supposed to be relatively small, then you could consider using Node, just because handling bigger load (up to some point) will require less work with Node.
Bottom line, use what you are good at. Unless you want to try something new. And Node.js is definitely worth trying.
You're question does lack quite a bit of context.
This question all depends on the context.
If this is contract work or something you want to make money with in the near future and you're not sufficiently skilled with any of the mentioned nodejs frameworks.
Then I would recommend you use whatever you're already good at.
If this is a private project for fun or any other non serious purpose.
Then I would seriously recommend you to try one of mentioned nodejs frameworks.
In my opinion nodejs is currently the cutting edge web technology. As a developer
you should always try to stay on the cutting edge. That way when you learn how nodejs can be used you might find ways to use those things in your professional environment.
I've lately been using meteor a lot and I can highly recommend it, once you get the hang of it you can do truly amazing things that you could never even imagine doing(in a reasonable timespan) in a classic php project.
Also according to some meteor will replace RoR alltogether blog
Aside from real-time, main plus I've seen is having same team being able to develop JavaScript client code and server side code for UI web applications. "Code sharing" between client and server seems a pipe dream to me, but same language is really nice.

An alternative web crawler to Nutch [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
Improve this question
I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is:
using Nutch as the web crawler,
using Solr as the search engine,
the front-end and the site logic is coded with Wicket.
The problem is that I find Nutch quite complex and it's a big piece of software to customise, despite the fact that a detailed documentation (books, recent tutorials.. etc) does just not exist.
Questions now:
Any constructive criticism about the hole idea of the site?
Is there a good yet simple alternative to Nutch (as the crawling part of the site)?
Thanks
Scrapy is a python library that crawls web sites. It is fairly small (compared to Nutch) and designed for limited site crawls. It has a Django type MVC style that I found pretty easy to customize.
For the crawling part, I really like anemone and crawler4j. They both allow you to add your custom logic for links selection and page handling. For each page that you decide to keep, you can easily add the call to Solr.
It depends on how many web sites and so URLs you think crawl. Apache Nutch stores page documents on Apache HBase (which relies on Apache Hadoop), it's solid but very hard to setup and administrate.
Since a crawler is just a page fetch (like a CURL) and retrieve list of links to feed your URLs data base, I am sure you can write a crawler on your own (especially if you have a few web sites), use a simple MySQL database (maybe a queue software like RabbitMQ to schedule the crawl jobs).
On other side, a crawler could be more sophisticated, you could want to remove from your HTML document the HEAD part, and keep only the real "content" of the page etc...
Also, Nutch can rank your pages, with a PageRank algo., you could use Apache Spark to do the same thing (more efficiently because Spark can cache data in memory).
In, C#, but a lot simpler and you can communicate directly with the author. (me)
I used to use Nutch and you are correct; it is a bear to work with.
http://arachnode.net
I do believe the nutch is the best choice for you application, but if you want, there is a simple tool: Heritrix.
Besides that, I recommand js for the front-end language, because solr returns json which is easily handled by js.

What are some interesting projects to solve in Erlang for learning purposes? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I recently discovered Erlang and am now working my way through a couple of tutorials. By now I'm looking forward to actually implement something as a hobby project. I'm not really interested in yet another chat server. I would like to code something more interesting (yes I'm aware that this is a rather fuzzy term) which is also manageable, so I can finish it in my spare time.
Any suggestions?
Edit: The project should preferably highlight Erlang's strenghts (concurrency, distributed).
Build a distributed system that searches twitter feeds in real time and allows anyone to perform searches from a web front end.
Build a distributed file system. Implement distributed B*Trees or B+Trees as the base of this file system. Do it in erlang.
Build a distributed key value store on top of the distributed file system built in step 2.
Build a distributed web index (to be used by a distributed web search engine) on top of the key value store.
Build a distributed linker. Advanced build automation offers remote agent processing for distributed builds and/or distributed processing.
Build a MMORPG backend that relies on distributed storage of the game/player state and distributed processing of user requests.
For something for yourself, consider writing a simple server; something that, for example, services date/time requests or -- a little fancier -- an HTTP daemon that serves only static content.
The best part of Erlang is the way it handles concurrency; exercize that.
Project Euler, for sure.
Some things from my copious ToDo list that would both be good learning exercises and helpful to the erlang community at large:
Profile all the available Key/Value stores:
Write a library for testing insert, lookup, delete, search times for a variety of K/V stores
Create a benchmark suite people can run
Make it work with ets, dets, proplists, gb_trees, dict, orddict, redblack trees, bdb, tokyocabinet, ...
Produce pretty graphs
Make it easy to update, contribute to and run on anyone's machine
write a new io_lib:format routine that uses named parameters:
io_lib:nformat("Hi there ~{name}s~n.", [{name, "Bob"}]).
This is useful for internationalisation if the position of parameters changes when the language of the format string changes.
Extend erl -make (make.erl)
Allow adding code paths (so that you don't need to do erl -pa LibraryPath -make)
Compile/load behaviour modules before modules that implement those behaviours
Handle hierarchal modules correctly (output path in particular)
This doesn't exactly answer your question, but if you are looking for an interesting free, open-source project that is written in Erlang, you should definitely check out CouchDB. From the website:
Apache CouchDB is a distributed,
fault-tolerant and schema-free
document-oriented database accessible
via a RESTful HTTP/JSON API. Among
other features, it provides robust,
incremental replication with
bi-directional conflict detection and
resolution, and is queryable and
indexable using a table-oriented view
engine with JavaScript acting as the
default view definition language.
CouchDB is written in Erlang, but can
be easily accessed from any
environment that provides means to
make HTTP requests. There are a
multitude of third-party client
libraries that make this even easier
for a variety of programming languages
and environments.
The CouchDB website has more details. Happy coding!
find something erlang doesn't have that you understand and like. I did that with etap https://github.com/ngerakines/etap/ Now nick has taken over management and it's used internally at EA games. It was fun to make and like a previous poster it was something real so I learned to serve real world problems working on it.
File indexing/search system. This was going to by intro project but I've switched over to something else.
Once you've got it working you could move the indexes to mnesia, and then spread the thing out other nodes to a have a whole network index.

Web development for a Computer Scientist [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I have BS in Computer Science, and thus have experience developing software that runs at the command line or with a basic GUI. However, I have no experience making real, functional, websites. It has become apparent to me that I need to expand my skills to encompass web development. I have been using Ruby to develop applications, but I am aware that it is quite popular for web development. I want to use my skills as a programmer to assist me in developing a personal website for a band.
I have experience with HTML, but very little with CSS. I want to leverage my skills with programming languages to create a website containing pictures, audio clips, a dynamic calendar, a scheduling request tool, and other features common to band websites.
What kind of resources are available for a competent desktop programmer to learn the entire process for developing a website? Is it best to use free CSS templates and WordPress as a foundation for my site or make it from scratch? Should I use GUI tools or write it all in Vim/Emacs? Is Ruby on Rails sufficient for my personal website, or should I consider a more mature development platform?
My main goal for this project is to come up to speed on current web design technology, and actually understand the entire process for building a website.
I think one really important thing to understand in web development is HTTP. HTML and CSS are important, but I think it's more critical to understand the stateless nature of the web, and how each of the HTTP verbs work, and what they can/can't do.
http://www.freeprogrammingresources.com/http.html
A good tool for seeing how HTTP works is Fiddler.
If it's as much a learning exercise as anything then take an iterative approach. Build revise. Build revise. My (very) rough guideline below:
Client
Start with the structure of a website and concentrate on the client.
Use notepad and build a bunch of static pages for your band. i.e. Hand code initially. Try to build all your pages with CSS. No table markup. Then play around with some Javascript to bring things to life. (Navigational menu\ Calendar selections\etc). Learn about how to import and link to Javascript and CSS files.... and how these files are treated re:caching etc.
Try to learn up to the limits of what you can do on the client (generally). Factor in the nuances of 3-4 browsers (Firefox/IE6/IE8/Chrome) re:DOM and client side eventing.
Server
Then start looking for what you might want to change across pages/sessions. i.e. what needs to be manipulated server side. And pick a server side technology.
Start with basic post-back processing. Forget databases at this point. Learn how your framework of choice maintains state..... not just the name of the technology but the real nuts and bolts of it. One of your single greatest assets as a web developer is understanding the state model(s) of the technology you're using.
Then go for a deep dive on the web server technology of choice (and in general). Understand the full request pipeline from client to server and back. This will teach you forms, http and its verbs, web server, filters and modules, server to framework hand off, page and control life cycles, back to the client.
Now start working on dynamic content injection and the like. How to make and use reusable components in your web pages.
Databases, caching, performance and diagnostics.
Then get into into all the fun stuff like ajax, etc. Replace your javascript with jQuery, etc.
Then you got the whole Webservices\XML\JSON\etc side of things to discover.
Resources
Well the web obviously. For client side stuff, going to the sites of companies who make third party web controls can be quite interesting. Asking how the hell they did that? Viewsource is your friend. Look at how they structure and build their pages. Pick a couple of good web designer sites, and you find a plethora of rants about browser wars etc that will give you good (under the hood) info.
Once you hit server side, I'd go for white paper type learning from your vendor of choice for your technologies.i.e. webservers/frameworks/etc. Again find a 3rd party howto/evangelist site (I used to use a lot of "4 guys from Rolla" for example) that will demonstrate how to do various things. Language learning is ongoing. Basically just do the best you can till you find a better way.... and always be on the lookout for a better way.
You really need to understand html, forms and css to get anywhere. I say forms as this will give you the round trip needed to understanding the stateless nature of web dev.
To further labour the point, I have interviewed many people who think you can only have one form on a page and can only have one submit button per form. This is all down to a lack of foundation knowledge.
So for that I'd recommend starting with htmldog.com.
After that, a lot of web development is done with frameworks. Gone are the days where you do it yourself (well mostly) but my above point still stands. You do need to be able to peep under the hood with some confidence.
I've been doing web development for 12 years and started out with Perl on Solaris and Linux. Since then I've also done Java and more recently ASP.NET. However, I'm slowly falling for Django in my private projects.
What I've found over the years is that the inherent problems - cookies, javascript, presentation, state, authentication are all the same but just handled differently. So ultimately its down to you and your language preference. Plus a little of enlightened self interest when it comes to potential employment.
Programming aside, you should also become familiar with web servers (Apache and IIS spring to mind), Http codes and headers, Mime types and encoding and FTP. As well as Javascript (mentioned already), plugins, browser platforms and good development practises such as using Firebug, Fiddler and so on. It also wouldn't hurt to have a good idea of the image formats available, image optimisation, CSS sprites, content compression, caching and the like.
All depends on where you want to start!
For a newbie, I'd pick Django and (obviously) Python. Good, clean language with cheap startup options, low cost IDEs (ie free) and hosting your sites is very affordable.
But that's just a subjective opinion.
If your goal is to
My main goal for this project is to
come up to speed on current web design
technology, and actually understand
the entire process for building a
website.
Then start from scratch in Ruby, PHP, Java, ASP.NET, etc...
When you run into a design problem or just want to know how others have approached something, then look at the frameworks.
Once your up to speed, and your website is starting to grow, then segway into a framework, to get up to speed on the frameworks.
I agree with John on this one.
As you know from your own experience in pursuing your BSc, understanding the basics of any language is what makes you even more capable in expanding that knowledge or specializing.
With that in mind, it would be best to understand the basics of HTML and CSS.
Understanding the syntax and overall language will help in the future when you want to pursue large projects using frameworks like Django and Rails. The basics will also especially help with tweaking CMS' like Wordpress to be more customizable to your needs.
One thing in particular that I'd like to mention is that web programming, like many other forms of programming has its own special structure and "proper" way of doing things.http://www.w3.org is a great way to ensure that your work is passing general web design standards, most sites don't follow this because it is tedious, but from a learning perspective it ensures that you get a nice strong foundation.
www.w3schools.com is also a great resource for detailed help on web programming. Lastly, I like colourful code, so I like using basic text editors such as notepad++ or notepad2 or gedit to do my web coding. GUIs like dreamweaver may tend to fill up your code with extra junk and spaces, so I don't recommend them, but they are still great tools.
Don't bother with Rails yet -- write CGI scripts in Ruby. It will be very similar to what you have done for class.
After you have about thirty of those under your belt, you'll know what you want out of a web framework.
I'm a Computer Scientist and a web programmer and I would suggest you learn both HTTP and CGI:
CGI Made Really Easy
HTTP Made Really Easy
As the titles of the above tutorials claim, they made the concepts "really easy" for me.
Once you've got CGI and HTTP down pat, I'd suggest checking out following sites that provide a wealth of articles and references for web programming:
webmonkey
w3schools
Mozilla Developer Center
Assuming you want to concentrate on writing web apps, then Perl, PHP, Python and Ruby are all fine choices (I myself use Perl predominantly) and I'd suggest doing some research into the popular web frameworks available for each language.
Most importantly, pick something simple as your first web app, e.g. a form and a page that shows the results of submitting that form. Some good examples (using Perl's CGI module) can be found here:
CGI.pm - a Perl5 CGI Library -- see the first set of links on this page.
When you want to start writing web apps that use a database, read up on SQL and popular libraries/modules in your chosen language for database manipulation, especially ORM (Object-Relational-Mapping) interfaces that allow you to deal with records in an Object-Oriented fashion.
Good luck with it! Being a web programmer is fun because your audience is teh intarwebz! :)
If you are starting from scratch as per John MacIntyre's suggestion, you may lean towards PHP. With all of its shortcomings, it does have one really good user manual. It is also easy to get started with and is installed on pretty much every host and goes well with Apache.
Also, w3 schools is good to begin learning about CSS and XHTML but don't forget to check out the specs at W3C.
Also, please read this Stack Overflow question & answers.
For what you're describing, Rails or Django might be slight overkill but it wouldn't hurt to learn them. Django, in particular, might be good because of the notion of a project containing multiple apps (e.g. calendar).
Whether you use a framework or write everything yourself, though, you'll need to know HTML and CSS. CSS is extremely simple if you have a BS in CS...you could read a tutorial and know it in five minutes.

Resources