An alternative web crawler to Nutch [closed] - search-engine

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is:
using Nutch as the web crawler,
using Solr as the search engine,
the front end and the site logic are coded with Wicket.
The problem is that I find Nutch quite complex: it's a big piece of software to customise, and detailed documentation (books, recent tutorials, etc.) just doesn't exist.
Questions now:
Any constructive criticism about the whole idea of the site?
Is there a good yet simple alternative to Nutch (as the crawling part of the site)?
Thanks

Scrapy is a Python library that crawls web sites. It is fairly small (compared to Nutch) and designed for limited-scope site crawls. It has a Django-like MVC style that I found pretty easy to customize.

For the crawling part, I really like Anemone and crawler4j. They both allow you to add your custom logic for link selection and page handling. For each page you decide to keep, you can easily add the call to Solr.

It depends on how many web sites (and thus URLs) you expect to crawl. Apache Nutch stores page documents in Apache HBase (which relies on Apache Hadoop); it's solid but very hard to set up and administer.
Since a crawler just fetches a page (like cURL does) and retrieves the list of links to feed your URL database, I am sure you could write a crawler on your own (especially if you only have a few web sites), using a simple MySQL database (and maybe queue software like RabbitMQ to schedule the crawl jobs).
On the other hand, a crawler can be more sophisticated: you may want to strip the HEAD part from your HTML documents and keep only the real "content" of the page, etc.
Also, Nutch can rank your pages with a PageRank algorithm; you could use Apache Spark to do the same thing (more efficiently, because Spark can cache data in memory).
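The fetch-links-and-feed-the-database loop described above is small enough to sketch. Everything here (the seed URL, the injected `fetch` function, the in-memory frontier) is illustrative; a real crawler would add politeness delays, robots.txt handling, and persistent storage such as the MySQL/RabbitMQ combination suggested above:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def crawl(seed, fetch, limit=100):
    """Breadth-first crawl: fetch(url) -> html string. Returns the visited set."""
    frontier, seen = deque([seed]), {seed}
    while frontier and len(seen) <= limit:
        url = frontier.popleft()
        for link in extract_links(url, fetch(url)):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```

In a real setup, `fetch` would wrap `urllib.request.urlopen` (the cURL-style page fetch), and the `seen` set and frontier would live in the database and queue rather than in memory.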

It's in C#, but a lot simpler, and you can communicate directly with the author (me).
I used to use Nutch and you are correct; it is a bear to work with.
http://arachnode.net

I do believe Nutch is the best choice for your application, but if you want, there is a simpler tool: Heritrix.
Besides that, I'd recommend JavaScript for the front end, because Solr returns JSON, which is easily handled in JavaScript.

Why learn Ruby on Rails [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
One of my college professors said that Ruby on Rails is used a lot for the web, and I'm wondering how much Ruby on Rails is actually used vs. jQuery, Node.js, PHP, etc. Also, what are the benefits?
You are mixing some things up:
Ruby on Rails is a framework for creating server-side web applications using the Ruby language
jQuery is a client-side JavaScript library that simplifies writing JavaScript web clients
Node.js is a runtime for executing JavaScript on the server, thus providing a server-side version of JavaScript
PHP is a language popular for server-side web application development
Thus: Ruby on Rails is a mature framework that offers a template engine, an MVC architecture, a mapper between language objects and a relational database, and a routing facility between URIs and controllers.
Similarly designed frameworks exist for many programming languages/environments, e.g. Django for Python, or see Rails-inspired PHP frameworks in the case of PHP.
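The "routing facility between URIs and controllers" mentioned above is the same idea in every Rails-like framework. A toy, framework-agnostic sketch in Python (all route patterns and controller names here are made up for illustration):

```python
import re

class Router:
    """Maps URI patterns like '/posts/<id>' to controller functions."""
    def __init__(self):
        self.routes = []

    def add(self, pattern, controller):
        # Turn '/posts/<id>' into a regex with a named capture group.
        regex = re.sub(r"<(\w+)>", r"(?P<\1>[^/]+)", pattern)
        self.routes.append((re.compile(f"^{regex}$"), controller))

    def dispatch(self, path):
        for regex, controller in self.routes:
            match = regex.match(path)
            if match:
                # Captured path segments become controller arguments.
                return controller(**match.groupdict())
        return "404 Not Found"

router = Router()
router.add("/posts", lambda: "all posts")
router.add("/posts/<id>", lambda id: f"post {id}")
```

Here `router.dispatch("/posts/42")` invokes the second controller with `id="42"`, much as Rails routes `/posts/:id` to a controller action.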
About its popularity, see e.g. http://hotframeworks.com/
Benefits: IMHO it is a very elegant framework, and as the plethora of inspired frameworks shows, it has found many developers who like it.
The concepts and techniques learned here might also turn out to be useful when working with other modern frameworks.
And I should note as well that there are web applications that need fewer features; see e.g. the Sinatra framework for a lighter alternative.
Also, what are the benefits?
There are a lot of things that websites have in common, e.g. HTML pages with forms, various JavaScript features, database interactions, security issues, logging in, etc. If you start from scratch and try to program all those things yourself, it will be difficult and time-consuming, and most likely your code will be full of exploitable security holes.
The other option is to use a web framework. Ruby on Rails is a web framework for the Ruby programming language. All the various server-side programming languages, such as Ruby, Python, PHP, Perl, Java, etc., have web frameworks (and usually many different frameworks to choose from!). A lot of smart people have come up with good code for the various things that websites need, and you get to use their code for free in your website.
The disadvantage of frameworks is that they are often large and complex, e.g. Ruby on Rails or Java Servlets+JSP, so it can take a while to learn how to use them. Even then, you will probably not have a good grasp of their inner workings, so you are always sort of feeling around in the dark trying to get them to work the way you want them to. It's sort of like trying to push a large boulder at rest to another spot of your choosing: sometimes the boulder rolls cleanly into position, and other times it seems to have a mind of its own and wanders off course.

Where can I learn about website load balancing? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
My initial website will not experience heavy traffic during beta. But assuming success, as traffic builds I will need a plan for handling the increase, from becoming aware of it to actually dealing with it. I'd like to start studying that now.
There is an amazing wealth of information about this on the web. I was hoping someone might help me cut through it by pointing me toward articles/walkthroughs/etc. that are more practical and less theoretical. And of course any direct guidance on the issue would certainly be appreciated.
I am currently using a hosting provider, not running my own IIS server.
It's very difficult to predict where your scaling bottlenecks will be. If you're missing a database index for example, queries might run slow, and load balancing your web server won't help.
To start with, you should get comfortable with profiling your application. There are a lot of great tools for the backend, including the Visual Studio Profiler, ANTS Profiler and my favorite, dotTrace.
Next (or maybe first, it doesn't matter), you'll want to profile the client side. Chrome Developer Tools works great, or you can use the Firefox developer tools. This will show you response times and how long it's taking to load assets like your CSS/JavaScript/images/etc.
After doing both, you should have a good idea where your problem is. But in general, the "easiest" way to improve scaling is:
Bundle/minify/compress your asset files. You can remove hundreds of KB from page loads.
Use a CDN. Browsers are limited by domain in the number of simultaneous connections they can make to fetch assets. With a CDN, you can split the requests between domains, and for popular libraries like jQuery, they're more likely to already be cached.
Cache data as appropriate. If there's some mostly static content that never changes, cache it instead of querying your database every time. Take advantage of things like Output Caching, where your entire rendered view is cached.
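The "cache data as appropriate" point can be as simple as a time-based memoization wrapper around an expensive render or query. This sketch is illustrative (the decorator name, the 60-second TTL, and `render_homepage` are all made up), but it is the core idea behind output caching:

```python
import time
from functools import wraps

def ttl_cache(seconds=60):
    """Cache a function's results for `seconds`, then recompute."""
    def decorator(func):
        store = {}  # args -> (expiry_time, value)

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]          # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now + seconds, value)
            return value
        return wrapper
    return decorator

@ttl_cache(seconds=60)
def render_homepage(user_tier):
    # Stand-in for an expensive database query plus view rendering.
    return f"<html>homepage for {user_tier}</html>"
```

Real frameworks hang this off HTTP machinery (cache headers, rendered-view caches) rather than a dict, but the trade-off is the same: stale-for-N-seconds content in exchange for skipping the database on most requests.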
Check out some of the checklist items in this post.
Once you've taken all those steps, if you still have problems, then you can look at things like load balancing and better hardware. Otherwise, you risk throwing money away when it might not make a difference at all.
It's great to familiarize yourself with all aspects of a web application. However, load balancing is one of those things that can be tricky to set up, but is nigh impossible to set up right without a very comprehensive knowledge of server and networking architecture and experience. Even the big boys like Twitter and Facebook struggle with handling scale. It's very much a learn as you go process and extremely particular to individual circumstances. An absolutely perfect setup for one application may be completely useless for another.
If you're successful enough to need load-balancing, you're likely also successful enough to hire an infrastructure expert to take care of it for you. Short of that, you can utilize services like Azure, which while they have their own learning curve, provide load balancing nearly out of the box.

is node js used only for mobile apps? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I want to understand the scope of Node.js.
Is it used only for mobile web apps to handle the server side, or can it be used to develop a full-fledged web app for all device configurations (i.e. as a replacement for Ruby on Rails)?
I found some examples, but they all seem to be mobile web apps.
Did companies like Airbnb, LinkedIn, etc. develop two sites, one with Node.js as the backend and the other with Ruby on Rails, routing to one or the other depending on the device?
Please give me some input to clear up my confusion.
Node.js is not just for mobile apps. The question How to decide when to use Node.js? gives a good analysis of when to use it.
But, to sum up a bit:
Node is good when you have a lot of short-lived requests that don't require heavy CPU processing.
Node is good if you want to use JavaScript everywhere.
One disadvantage of Node is that there are a lot of JavaScript packages that do similar things; the environment isn't as mature or as standardized as other languages' (maybe this isn't a negative to you, though).
Mobile applications often lend themselves to Node because they tend to follow the pattern of many short-lived requests, e.g. looking something up in a database.
There are tradeoffs, such as that you will be using a fully dynamic language across your entire stack (not for the faint of heart).
So to recap: Node is not just for mobile apps, but you should do some research to understand why you might use it.
Node can be used to produce many more solutions than just serving mobile devices. Some examples include command-line tools (e.g. Grunt), applications (e.g. crawlers), web services (e.g. RESTful services), and full-fledged web sites (e.g. Hummingbird). Lastly, if you want an example framework for constructing standard HTML (desktop or otherwise) web apps in Node, see Jade.
As to which framework a company chooses: often several APIs are provided, and since their web services communicate with standard XML or JSON documents, the servers don't necessarily need to be written in the same language to talk to each other.

What technology for large scale scraping/parsing? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 6 years ago.
We're designing a large scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database.
What language would you recommend for doing this on a large scale (tens of millions of pages)?
We're using MongoDB for the database, so anything with solid MongoDB drivers is a plus.
So far, we have been using (don't laugh) PHP, cURL, and Simple HTML DOM Parser, but I don't think that's scalable to millions of pages, especially as PHP doesn't have proper multithreading.
We need something that is easy to develop in, can run on a Linux server, has a robust HTML/DOM parser to easily extract that tag, and can easily download millions of webpages in a reasonable amount of time.
We're not really looking for a web crawler, because we don't need to follow links and index all content, we just need to extract one tag from each page on a list.
If you're really talking about large scale, then you'll probably want something that lets you scale horizontally, e.g., a Map-Reduce framework like Hadoop. You can write Hadoop jobs in a number of languages, so you're not tied to Java. Here's an article on writing Hadoop jobs in Python, for instance. BTW, this is probably the language I'd use, thanks to libs like httplib2 for making the requests and lxml for parsing the results.
If a Map-Reduce framework is overkill, you could keep it in Python and use multiprocessing.
UPDATE:
If you don't want a MapReduce framework, and you prefer a different language, check out the ThreadPoolExecutor in Java. I would definitely use the Apache Commons HTTP client stuff, though. The stuff in the JDK proper is way less programmer-friendly.
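The pool-of-workers approach from the two answers above can be sketched in Python (roughly what Java's ThreadPoolExecutor gives you, via `concurrent.futures`). The `fetch` function and the title tag are stand-ins; a real job would make HTTP requests and write each result to MongoDB:

```python
import re
from concurrent.futures import ThreadPoolExecutor

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Pull the single tag we care about; no full DOM parse needed."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None

def scrape_all(urls, fetch, workers=32):
    """fetch(url) -> html string; returns {url: extracted tag} via a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads work well here because the workload is I/O-bound:
        # workers spend most of their time waiting on the network.
        results = pool.map(lambda u: (u, extract_title(fetch(u))), urls)
        return dict(results)
```

In production, `fetch` would wrap httplib2 or `urllib.request`, and each result would go to MongoDB (e.g. via pymongo) instead of being held in a dict.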
You should probably use tools used for testing web applications (WatiN or Selenium).
You can then compose your workflow separated from the data using a tool I've written.
https://github.com/leblancmeneses/RobustHaven.IntegrationTests
You shouldn't have to do any manual parsing when using WatiN or Selenium; you'll write a CSS selector instead.
Using Topshelf and NServiceBus, you can scale the number of workers horizontally.
FYI: with Mono, the tools I mention can run on Linux (although your mileage may vary).
If JavaScript doesn't need to be evaluated to load data dynamically:
Anything requiring the whole document to be loaded into memory is going to waste time. If you know where your tag is, all you need is a SAX-style parser.
I do something similar using Java with the Commons HttpClient library, although I avoid the DOM parser because I'm looking for a specific tag that can be found easily with a regex.
The slowest part of the operation is making the http requests.
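The SAX-style point above is easy to demonstrate: Python's `html.parser` is event-driven, so you can pull one tag out of a page without ever building a DOM. The `h1` tag here is just an example of "the particular tag" from the question:

```python
from html.parser import HTMLParser

class SingleTagExtractor(HTMLParser):
    """Event-driven (SAX-style) extraction of the text inside one tag."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:           # only keep text between <tag> and </tag>
            self.chunks.append(data)

def extract_tag_text(html, tag="h1"):
    parser = SingleTagExtractor(tag)
    parser.feed(html)
    return "".join(parser.chunks)
```

Because the parser sees the document as a stream of events, memory use stays flat no matter how large the page is.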
What about C++? There are many large-scale libraries that can help you.
Boost.Asio can help you do the networking.
TinyXML can parse XML files.
I have no idea about the database side, but almost all databases have interfaces for C++, so it is not a problem.

Web development for a Computer Scientist [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I have a BS in Computer Science, and thus have experience developing software that runs at the command line or with a basic GUI. However, I have no experience making real, functional websites. It has become apparent to me that I need to expand my skills to encompass web development. I have been using Ruby to develop applications, but I am aware that it is quite popular for web development. I want to use my skills as a programmer to help me develop a personal website for a band.
I have experience with HTML, but very little with CSS. I want to leverage my skills with programming languages to create a website containing pictures, audio clips, a dynamic calendar, a scheduling request tool, and other features common to band websites.
What kind of resources are available for a competent desktop programmer to learn the entire process for developing a website? Is it best to use free CSS templates and WordPress as a foundation for my site or make it from scratch? Should I use GUI tools or write it all in Vim/Emacs? Is Ruby on Rails sufficient for my personal website, or should I consider a more mature development platform?
My main goal for this project is to come up to speed on current web design technology, and actually understand the entire process for building a website.
I think one really important thing to understand in web development is HTTP. HTML and CSS are important, but I think it's more critical to understand the stateless nature of the web, and how each of the HTTP verbs work, and what they can/can't do.
http://www.freeprogrammingresources.com/http.html
A good tool for seeing how HTTP works is Fiddler.
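One concrete way to demystify HTTP is to run a tiny server and watch one raw request/response cycle. This sketch uses only Python's standard library; the handler's behaviour is illustrative, and note that the server keeps no state between requests, which is exactly the statelessness mentioned above:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Each GET is handled independently: no memory of earlier requests.
        body = f"you asked for {self.path}".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

with urlopen(f"http://127.0.0.1:{server.server_port}/hello") as resp:
    status, text = resp.status, resp.read().decode()

server.shutdown()
```

Watching the same exchange in Fiddler shows the identical verb, headers, and body on the wire; statelessness is why cookies and sessions had to be invented.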
If it's as much a learning exercise as anything, then take an iterative approach: build, revise; build, revise. My (very) rough guideline is below:
Client
Start with the structure of a website and concentrate on the client.
Use Notepad and build a bunch of static pages for your band, i.e. hand-code initially. Try to build all your pages with CSS and no table markup. Then play around with some JavaScript to bring things to life (navigational menus, calendar selections, etc.). Learn how to import and link to JavaScript and CSS files... and how those files are treated re: caching, etc.
Try to learn up to the limits of what you can do on the client (generally). Factor in the nuances of 3-4 browsers (Firefox/IE6/IE8/Chrome) re: the DOM and client-side eventing.
Server
Then start looking for what you might want to change across pages/sessions, i.e. what needs to be manipulated server-side, and pick a server-side technology.
Start with basic post-back processing. Forget databases at this point. Learn how your framework of choice maintains state... not just the name of the technology, but the real nuts and bolts of it. One of your single greatest assets as a web developer is understanding the state model(s) of the technology you're using.
Then go for a deep dive on your web server technology of choice (and web servers in general). Understand the full request pipeline from client to server and back. This will teach you forms, HTTP and its verbs, web servers, filters and modules, server-to-framework hand-off, page and control life cycles, and the trip back to the client.
Now start working on dynamic content injection and the like. How to make and use reusable components in your web pages.
Databases, caching, performance and diagnostics.
Then get into all the fun stuff like Ajax; replace your hand-rolled JavaScript with jQuery, etc.
Then you've got the whole web services/XML/JSON/etc. side of things to discover.
Resources
Well, the web, obviously. For client-side stuff, going to the sites of companies who make third-party web controls can be quite interesting, asking "how the hell did they do that?" View Source is your friend. Look at how they structure and build their pages. Pick a couple of good web designer sites, and you'll find a plethora of rants about browser wars, etc. that will give you good (under-the-hood) info.
Once you hit the server side, I'd go for white-paper-type learning from your vendor of choice for your technologies, i.e. web servers/frameworks/etc. Again, find a third-party how-to/evangelist site (I used to use "4 Guys from Rolla" a lot, for example) that demonstrates how to do various things. Language learning is ongoing: basically, just do the best you can until you find a better way... and always be on the lookout for a better way.
You really need to understand HTML, forms, and CSS to get anywhere. I say forms because they will give you the round trip needed to understand the stateless nature of web dev.
To further labour the point, I have interviewed many people who think you can only have one form on a page, or only one submit button per form. This all comes down to a lack of foundation knowledge.
So for that, I'd recommend starting with htmldog.com.
After that: a lot of web development is done with frameworks. Gone are the days where you do it all yourself (well, mostly), but my above point still stands. You do need to be able to peep under the hood with some confidence.
I've been doing web development for 12 years and started out with Perl on Solaris and Linux. Since then I've also done Java and more recently ASP.NET. However, I'm slowly falling for Django in my private projects.
What I've found over the years is that the inherent problems (cookies, JavaScript, presentation, state, authentication) are all the same; they're just handled differently. So ultimately it's down to you and your language preference, plus a little enlightened self-interest when it comes to potential employment.
Programming aside, you should also become familiar with web servers (Apache and IIS spring to mind), HTTP codes and headers, MIME types and encodings, and FTP, as well as JavaScript (mentioned already), plugins, browser platforms, and good development practices such as using Firebug, Fiddler and so on. It also wouldn't hurt to have a good idea of the image formats available, image optimisation, CSS sprites, content compression, caching and the like.
All depends on where you want to start!
For a newbie, I'd pick Django and (obviously) Python: a good, clean language with cheap startup options, low-cost IDEs (i.e. free), and very affordable hosting for your sites.
But that's just a subjective opinion.
If your goal is to "come up to speed on current web design technology, and actually understand the entire process for building a website", then start from scratch in Ruby, PHP, Java, ASP.NET, etc...
When you run into a design problem or just want to know how others have approached something, then look at the frameworks.
Once you're up to speed and your website is starting to grow, then segue into a framework, to get up to speed on the frameworks.
I agree with John on this one.
As you know from your own experience in pursuing your BSc, understanding the basics of any language is what makes you even more capable in expanding that knowledge or specializing.
With that in mind, it would be best to understand the basics of HTML and CSS.
Understanding the syntax and overall language will help in the future when you want to pursue large projects using frameworks like Django and Rails. The basics will also especially help with tweaking CMSes like WordPress to be more customizable to your needs.
One thing in particular that I'd like to mention is that web programming, like many other forms of programming, has its own special structure and "proper" way of doing things. http://www.w3.org is a great way to ensure that your work passes general web design standards; most sites don't follow this because it is tedious, but from a learning perspective it ensures that you get a nice strong foundation.
www.w3schools.com is also a great resource for detailed help on web programming. Lastly, I like colourful code, so I use basic text editors such as Notepad++, Notepad2 or gedit to do my web coding. GUIs like Dreamweaver tend to fill up your code with extra junk and spaces, so I don't recommend them, but they are still great tools.
Don't bother with Rails yet -- write CGI scripts in Ruby. It will be very similar to what you have done for class.
After you have about thirty of those under your belt, you'll know what you want out of a web framework.
I'm a Computer Scientist and a web programmer and I would suggest you learn both HTTP and CGI:
CGI Made Really Easy
HTTP Made Really Easy
As the titles of the above tutorials claim, they made the concepts "really easy" for me.
Once you've got CGI and HTTP down pat, I'd suggest checking out following sites that provide a wealth of articles and references for web programming:
webmonkey
w3schools
Mozilla Developer Center
Assuming you want to concentrate on writing web apps, then Perl, PHP, Python and Ruby are all fine choices (I myself use Perl predominantly) and I'd suggest doing some research into the popular web frameworks available for each language.
Most importantly, pick something simple as your first web app, e.g. a form and a page that shows the results of submitting that form. Some good examples (using Perl's CGI module) can be found here:
CGI.pm - a Perl5 CGI Library -- see the first set of links on this page.
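The CGI model in those tutorials is tiny: the server puts the request into environment variables (and stdin for POST bodies), and the script prints headers, a blank line, then the body. A minimal sketch (shown in Python here; the same shape applies in Ruby or Perl, and the parameter name is just an example):

```python
import os
from urllib.parse import parse_qs

def handle_request(environ):
    """A CGI 'script' as a function: query string in, full HTTP response out."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    name = params.get("name", ["world"])[0]
    body = f"<html><body>Hello, {name}!</body></html>"
    # Headers, a blank line, then the body -- that's the whole protocol.
    return f"Content-Type: text/html\r\n\r\n{body}"

if __name__ == "__main__":
    # A real web server sets QUERY_STRING and friends before running the script.
    print(handle_request(os.environ), end="")
```

Submitting a form with `action` pointing at this script would round-trip the form fields through `QUERY_STRING`, which is exactly the stateless request/response cycle the HTTP tutorial walks through.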
When you want to start writing web apps that use a database, read up on SQL and popular libraries/modules in your chosen language for database manipulation, especially ORM (Object-Relational-Mapping) interfaces that allow you to deal with records in an Object-Oriented fashion.
Good luck with it! Being a web programmer is fun because your audience is teh intarwebz! :)
If you are starting from scratch as per John MacIntyre's suggestion, you may lean towards PHP. With all of its shortcomings, it does have one really good user manual. It is also easy to get started with and is installed on pretty much every host and goes well with Apache.
Also, W3Schools is good for beginning to learn about CSS and XHTML, but don't forget to check out the specs at the W3C.
Also, please read this Stack Overflow question & answers.
For what you're describing, Rails or Django might be slight overkill but it wouldn't hurt to learn them. Django, in particular, might be good because of the notion of a project containing multiple apps (e.g. calendar).
Whether you use a framework or write everything yourself, though, you'll need to know HTML and CSS. CSS is extremely simple if you have a BS in CS... you could read a tutorial and know it in five minutes.