Machine learning to understand website structure in Python

I have been working on crawling webpages and extracting the elements of the website.
Ex:
Given a website, the crawler should return the following sections: header, menu, footer, content, etc.
I was thinking that it would be great if I could use machine learning to train the code to learn how to classify websites.
I tried looking at Python Machine learning libraries (ex: PyBrain) but the examples are very complex.
Can anyone suggest a library and a tutorial with some simple examples for getting started with machine learning in Python?
Thanks!

MLPy may be a simpler start for you.
Here is a link to the documentation on classification. By the way, if you don't know what the classes should look like, maybe you need to cluster your pages rather than classify them.
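For illustration, here is a minimal sketch of what classifying page blocks could look like; it uses scikit-learn rather than MLPy, and the hand-crafted features and labels are entirely made up:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features for each DOM block of a page, e.g.
# [relative vertical position, link density, text length, number of <li> tags].
X_train = [
    [0.05, 0.8, 40, 0],    # near the top, mostly links        -> header
    [0.10, 0.9, 30, 12],   # top, link-heavy, many list items  -> menu
    [0.50, 0.1, 900, 2],   # middle, long running text         -> content
    [0.95, 0.7, 60, 4],    # bottom, link-heavy                -> footer
]
y_train = ["header", "menu", "content", "footer"]

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

# A new, unseen block: near the bottom of the page and link-heavy.
print(clf.predict([[0.90, 0.6, 80, 3]]))  # likely ['footer']
```

If the classes are not known in advance, the same feature vectors could instead be fed to a clustering algorithm such as k-means to group similar blocks together.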

Related

How do I make a script to teach a machine how to play a game like this? (YouTube)

https://www.youtube.com/channel/UCXe-BTXAnQ9VaQQZnlC608A
This guy made a machine learning script and taught a machine how to play Super Mario and complete each level.
There's an FAQ document in the description of each of his videos saying that he uses Lua to make the script, but I don't even know where to start and can't find any tutorial on YouTube on how to make something like this.
My goal is to make a machine learning script for other games and watch the machine learn how to play and complete various levels.
Could you please guide me on where I can start and what I should learn to make a script like this?
I'd also prefer a programming language that is easier to learn, if there is another option.
Consider the following:
Are you a beginner programmer? If so, diving into machine learning will prove to be frustrating. You need a solid foundation first in programming principles.
The next level would be to use a machine learning library. The core ML algorithms are then already written and optimized and you just need to learn how to integrate them into your own programs. Don't be fooled though - to do this well still requires a solid understanding of ML principles. There are existing libraries for Lua and Python. That is a good place to start.
I would consider looking at Andrew Ng's introductory material (book and videos) on ML. Warning! He knows what he is talking about and does offer up quite a bit of advanced material as well. Don't start with that.
There are excellent, advanced books on the subject that explain the mathematical principles behind the apparent magic of ML. If you already did all of the above and are still sticking with it, then this would be the way to go next. You could work on your own implementation of ML and apply that to any domain.
I'm gonna guess that you are at the first bullet point still. Be patient. Learn to program well. Get exposure to ML ideas. Plan on that video game ML application some years from now.

Need answers to some machine learning related questions

Recently, we planned to build a system for image processing to extract information from images. At present we are using AWS Rekognition to do that, but in some cases we are not getting accurate information from AWS, so we've planned to build our own custom system.
We have 4-5 months to do that, at least a POC version. We've also planned to use TensorFlow. None of us has prior experience with machine learning or deep learning, but we all have 5-6 years of experience in computer programming with different languages.
Currently, I'm studying ML from a Udemy course, and my approach to solving this problem is:
Learn Machine Learning(ML)
Learn Deep Learning(DL)
After learning ML and DL, I hope I'll be ready to understand the whole thing and be able to build a system for image processing.
In the abstract, what I've understood is: I have to write a deep learning program in Python using TensorFlow, use that program to build a model, train that model on some training data, and then, once the model achieves a certain level of accuracy, evaluate it on some test data.
Now, there are some places where I'm a bit confused, and here are my questions:
I know TensorFlow is a library, but in some places it is also described as a system. So is it really just a library (a piece of code), or something more than that?
I found some image processing Python code in the TensorFlow tutorial section (https://www.tensorflow.org/tutorials/image_recognition). We've tested that code and it works much the way the AWS Rekognition service does. So my doubt is: can I use this Python code as it is in our production work?
After training a model with some training data, does that training data become part of the whole system, or does the machine learning model extract some meta-information from the training data and keep that, rather than the whole raw training data (in my case, raw images)?
Can I do all this ML and DL programming on my Linux system? It has a Pentium 4 with 8 GB RAM.
Also, I want to know whether the approach I've mentioned for building a solution to my problem is sufficient, or whether I need to do something else as well.
I need some guidance to clear up all this confusion.
Thanks
1: TensorFlow is like anything else we have worked with (like NumPy), but the main difference is that you first define what you want to compute and only then run it: everything in TensorFlow lives in a computational graph, and evaluating anything in that graph requires a Session. We could call it a library because it is just a piece of code with a Python interface, and a system because of all the machinery it uses underneath.
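As a rough illustration of the graph-plus-Session idea described above, here is a minimal sketch using the TensorFlow 1.x-style API (the values are made up):

```python
import tensorflow as tf  # TensorFlow 1.x-style API

# Building the graph only *describes* the computation; nothing runs yet.
a = tf.constant([[1.0, 2.0]])        # 1x2 matrix
b = tf.constant([[3.0], [4.0]])      # 2x1 matrix
product = tf.matmul(a, b)            # a node in the graph, not a value

# Evaluating a node in the graph requires a Session.
with tf.Session() as sess:
    result = sess.run(product)
    print(result)  # [[11.]]
```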
2:
"Can I use this Python code as it is in our production work?" Why not!
3:
Yes, you could do that with your system, but the main advantage of tools like TensorFlow and Theano is that you can run your code on a GPU, which is much faster than a CPU because the GPU can handle far more matrix multiplications and similar operations.
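A minimal sketch of how one might pin an operation to a GPU with the TensorFlow 1.x-style API (assuming a CUDA-capable GPU is available; on a CPU-only machine it simply falls back to the CPU):

```python
import tensorflow as tf  # TensorFlow 1.x-style API

# Pin the matrix multiplication to the first GPU if one is available.
with tf.device('/gpu:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    product = tf.matmul(a, b)

# allow_soft_placement falls back to the CPU when no GPU is present;
# log_device_placement prints which device each op actually ran on.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(product)
```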
4:
You don't have to learn all of machine learning to build an image recognition system; it may take you years to understand everything that is going on there. The Udemy course is a very good source, but I highly recommend the machine learning courses on Coursera. There are two courses there about machine learning: the great Andrew Ng course and the Emily Fox course. The first one is more theoretical than practical, and the second one is more practical.
As for deep learning, there is nothing fancy about it; it's just a family of methods within machine learning. After you gain some experience in machine learning and understand the basics (or you could even do it right now), go to fast.ai; it has a really good course about deep learning for coders, and it's also free.
I hope this will help you

Should I implement a content-based recommender from scratch or use a machine learning library like Mahout?

I am new to Apache Mahout, but I read one article which said that Apache Mahout 1.0 provides content-based recommendation (http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html). It now turns out that it does not provide content-based recommendation; rather, it gives recommendations based on different user actions on a website.
Amazon and Netflix may be using content-based recommenders, and they probably implemented them from scratch, but now my question is:
Is there any machine learning library which provides content-based recommendation, or do I have to implement it myself?
By content-based recommendation I mean that there is a feature vector for each item and a behaviour vector for each user, and by multiplying them we get a recommendation score for a particular user.
Please recommend something to me,
Thanks in advance.
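A minimal sketch of the item-feature-times-user-profile scoring described above, using NumPy (the feature values and the user profile are made up for illustration):

```python
import numpy as np

# Hypothetical item feature vectors (e.g. topic weights per web page).
item_features = np.array([
    [1.0, 0.0, 0.5],   # item 0
    [0.0, 1.0, 0.2],   # item 1
    [0.7, 0.3, 0.0],   # item 2
])

# Hypothetical user behaviour/preference vector over the same features,
# e.g. accumulated from the items the user clicked or rated.
user_profile = np.array([0.9, 0.1, 0.4])

# Content-based scoring: multiply the item features by the user profile.
scores = item_features @ user_profile

# Recommend items in descending score order.
ranking = np.argsort(-scores)
print(scores)    # [1.1  0.18 0.66]
print(ranking)   # [0 2 1]
```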

Web page recommender system

I am trying to build a recommender system which would recommend webpages to the user based on their actions (Google searches, clicks; they can also explicitly rate webpages). To get an idea of how this looks: the way Google News does it, it displays news articles from the web on a particular topic. In technical terms that is clustering, but my aim is similar: it will be content-based recommendation based on the user's actions.
So my questions are:
How can I crawl the internet to find related webpages?
What algorithm should I use to extract data from a webpage? Are textual analysis and word frequency the only ways to do it?
Lastly, what platform is best suited for this problem? I have heard of Apache Mahout and that it comes with some reusable algorithms; does it sound like a good fit?
As Thomas Jungblut said, one could write several books on your questions ;-)
I will try to give you a list of brief pointers - but be aware there will be no ready-to-use off-the-shelf solution ...
Crawling the internet: There are plenty of toolkits for doing this, like Scrapy for Python, crawler4j and Heritrix for Java, or WWW::Robot for Perl. For extracting the actual content from web pages, have a look at boilerpipe.
http://scrapy.org/
http://crawler.archive.org/
http://code.google.com/p/crawler4j/
https://metacpan.org/module/WWW::Robot
http://code.google.com/p/boilerpipe/
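A minimal sketch of what a crawler built with Scrapy could look like (the start URL and CSS selectors are made up for illustration):

```python
import scrapy

class PageSpider(scrapy.Spider):
    """Crawl a site, yield each page's URL and title, and follow its links."""
    name = "pages"
    # Hypothetical starting point; replace with the site you want to crawl.
    start_urls = ["https://example.com/"]

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow links to other pages and parse them the same way.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Running it with `scrapy runspider spider.py -o pages.json` would dump the collected records to a JSON file; the page text could then be cleaned up with something like boilerpipe before analysis.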
First of all, often you can use collaborative filtering instead of content-based approaches. But if you want to have good coverage, especially in the long tail, there will be no way around analyzing the text. One thing to look at is topic modelling, e.g. LDA. Several LDA approaches are implemented in Mallet, Apache Mahout, and Vowpal Wabbit.
For indexing, search, and text processing, have a look at Lucene. It is an awesome, mature piece of software.
http://mallet.cs.umass.edu/
http://mahout.apache.org/
http://hunch.net/~vw/
http://lucene.apache.org/
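As a rough illustration of the topic-modelling suggestion above, here is a minimal LDA sketch; it uses scikit-learn rather than the toolkits listed, and the documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up page texts standing in for crawled, boilerplate-stripped content.
docs = [
    "python machine learning tutorial classifier training data",
    "football match score league goal season",
    "machine learning model training neural network",
    "league season team player transfer football",
]

# Bag-of-words counts, then a two-topic LDA model.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a document's distribution over the two topics;
# pages about similar things end up with similar topic distributions.
print(doc_topics.round(2))
```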
Besides Apache Mahout which also contains things like LDA (see above), clustering, and text processing, there are also other toolkits available if you want to focus on collaborative filtering: LensKit, which is also implemented in Java, and MyMediaLite (disclaimer: I am the main author), which is implemented in C#, but also has a Java port.
http://lenskit.grouplens.org/
http://ismll.de/mymedialite
https://github.com/jcnewell/MyMediaLiteJava
This should be a good read: Google news personalization: scalable online collaborative filtering
It's focused on collaborative filtering rather than content based recommendations, but it touches some very interesting points like scalability, item churn, algorithms, system setup and evaluation.
Mahout has very good collaborative filtering techniques, which cover what you describe as using the behaviour of the users (clicks, reads, etc.), and you could introduce some content-based logic using the rescorer classes.
You might also want to have a look at Myrrix, which is in some ways the evolution of the taste (aka recommendations) portion of Mahout. In addition, it also allows applying content based logic on top of collaborative filtering using the rescorer classes.
If you are interested in Mahout, the Mahout in Action book would be the best place to start.

Good research papers and tutorials for creation of image processing tool/application (free)

I'm looking for good, free research papers and tutorials for creating an image processing tool/application. They don't need to have a full-blown description; papers and tutorials dedicated to a single feature are good enough. Thanks in advance.
There aren't any papers for it, but if you take a look at the source code for AForge.NET you will be able to see how several image processing algorithms are implemented.
The project comprises the core library and a GUI application that lets you try out the filters, so it will give you an idea of what is involved.
