How do I extract the main image from random articles? - machine-learning

I am trying to build a news aggregation system where I will have to process web pages from new news portals every day. How can I extract the main image of a news article from the web pages without writing HTML extraction handlers for each portal? How can I guess which is the main image of an article when most pages will have 10-15 unrelated ads and side images in them? I tried selecting the largest image on each page, but that did not work out well and gave many false positives.

There is no such thing as a "main" image on a site. This concept is fully context dependent; for news it could be "the image related to the text", but even that is a very specific situation - what if there are many images inside the article showing the same event?
As it is very hard to define what you really mean, a machine learning based approach seems reasonable, since "learning by example" should be easier to do.
I would extract the most promising features of each image:
Its size relative to the other images on the page
Its distance from the news container in the DOM of the webpage
Whether its name contains keywords like "news" or "main"
Whether its name does not contain "bad" keywords like "ad", "logo", "menu"
And then train the simplest possible classifier (Naive Bayes or Logistic Regression) on a collected set of labeled samples.
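A minimal sketch of this idea in Python with scikit-learn; the image record schema, keyword lists, and toy labels are assumptions for illustration, not a fixed recipe:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    GOOD_WORDS = ("news", "main", "article")      # assumed keyword lists
    BAD_WORDS = ("ad", "logo", "menu", "banner")

    def image_features(img, all_images):
        # img is a dict you would build while parsing the HTML (hypothetical schema)
        return [
            img["area"] / max(i["area"] for i in all_images),        # relative size
            img["dom_distance"],                                     # DOM hops to the news container
            float(any(w in img["src"].lower() for w in GOOD_WORDS)),
            float(not any(w in img["src"].lower() for w in BAD_WORDS)),
        ]

    # toy hand-labeled page: y = 1 marks the human-chosen main image
    page = [
        {"src": "/img/main-photo.jpg", "area": 90000, "dom_distance": 1},
        {"src": "/banners/ad1.gif", "area": 30000, "dom_distance": 6},
        {"src": "/static/logo.png", "area": 4000, "dom_distance": 8},
    ]
    X = np.array([image_features(img, page) for img in page])
    y = np.array([1, 0, 0])

    clf = LogisticRegression().fit(X, y)
    # at serving time: pick the image with the highest predicted probability
    best = max(page, key=lambda img: clf.predict_proba([image_features(img, page)])[0, 1])
    print(best["src"])

In practice you would train on images pooled from many labeled pages, but the feature/classifier split stays the same.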

Related

Using NLP or machine learning to extract keywords off a sentence

I'm new to the ML/NLP field so my question is what technology would be most appropriate to achieve the following goal:
We have a short sentence - "Where to go for dinner?" or "What's your favorite bar?" or "What's your favorite cheap bar?"
Is there a technology that would enable me to train it providing the following data sets:
"Where to go for dinner?" -> Dinner
"What's your favorite bar?" -> Bar
"What's your favorite cheap restaurant?" -> Cheap, Restaurant
so that next time we have a similar question about an unknown activity, say, "What is your favorite expensive [whatever]" it would be able to extract "expensive" and [whatever]?
The goal is to train it with hundreds (or thousands) of variations of the question asked and the relevant output data expected, so that it can work with everyday language.
I know how to make it even without NLP/ML if we have a dictionary of expected terms like Bar, Restaurant, Pool, etc., but we also want it to work with unknown terms.
I've seen examples with Rake and scikit-learn for classification of "things", but I'm not sure how I would feed text into those, and all those examples had predefined outputs for training.
I've also tried Google's NLP API, Amazon Lex and Wit to see how good they are at extracting entities, but the results are disappointing to say the least.
Reading about summarization techniques, I'm left with the impression it won't work with small, single-sentence texts, so I haven't delved into it.
As #polm23 mentioned, for simple stuff you can use POS tagging to do the extraction. The services you mentioned, like LUIS, Dialogflow etc., use what is called Natural Language Understanding. They make use of intents & entities (a detailed explanation with examples can be found here). If you are concerned about your data going online, or you sometimes have to work offline, you can always go for RASA.
Things you can do with RASA:
Entity extraction and sentence classification. You specify which particular term is to be extracted from a sentence by tagging the word positions across a variety of sentences. Then, even if a word appears that differs from what you gave in the training set, it will still be detected.
It uses rule-based learning and also a Keras LSTM for detection.
One downside compared with the online services is that you have to manually tag the position numbers in the JSON training file, as opposed to the click-and-tag features of the online services.
You can find the tutorial here.
E.g., I have trained RASA with a variety of sentences for identifying body part and symptom (I have limited it to 2 entities only; you can add more). Then when an unknown sentence like "I am having pain in my leg" appears, it will correctly identify "pain" as "symptom" and "leg" as "body part".
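As a rough illustration of the manual position tagging, here is a Python sketch that builds one training example in the style of the legacy Rasa NLU JSON format; the exact field names depend on your RASA version, and the intent name is hypothetical:

    import json

    text = "I am having pain in my leg."

    def entity(text, value, label):
        # compute character offsets instead of counting them by hand
        start = text.find(value)
        return {"start": start, "end": start + len(value), "value": value, "entity": label}

    example = {
        "text": text,
        "intent": "report_symptom",  # hypothetical intent name
        "entities": [entity(text, "pain", "symptom"), entity(text, "leg", "body_part")],
    }

    print(json.dumps({"rasa_nlu_data": {"common_examples": [example]}}, indent=2))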
Hope this answers your question!
Since "hundreds to thousands" sound like you have very little data for training a model from scratch. You might want to consider training (technically fine-tuning) a DialogFlow Agent to match sentences ("Where to go for dinner?") to intents ("Dinner"), then integrating via API calls.
Alternatively, you can invest time in fine-tuning a small pre-trained model like "Distilled BERT classifier" from "HuggingFace" as you won't need the 100s of thousands to billions of data samples required to train a production-worthy model. This can also be assessed offline and will equip you to solve other NLP problems in the future without much low-level understanding of the underlying statistics.
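A minimal fine-tuning sketch with the HuggingFace transformers library, assuming PyTorch; the intent labels, the tiny demo batch, and the epoch count are placeholders for your real data:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    LABELS = ["Dinner", "Bar", "Restaurant"]  # hypothetical intent labels
    texts = ["Where to go for dinner?", "What's your favorite bar?"]
    y = torch.tensor([0, 1])

    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=len(LABELS))

    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

    model.train()
    for _ in range(3):  # a few passes over the tiny demo batch
        loss = model(**batch, labels=y).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

    # inference on an unseen question
    model.eval()
    with torch.no_grad():
        logits = model(**tok(["What's your favorite cheap restaurant?"], return_tensors="pt")).logits
    print(LABELS[logits.argmax(-1).item()])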

Specialization of an image classifying model with user image tagging

I have a conceptual question regarding a software process/architecture setup for machine learning. I have a web app, and I am trying to incorporate some machine learning algorithms that work like Facebook's face recognition (except with objects in general), so that the model gets better at classifying specific images uploaded to my service (like how FB can classify specific persons, etc.).
The rough outline is:
event: user uploads an image; the system attempts to classify it
if failure: draw a bounding box around the object in the image; return the image
interaction: user tags the object in the box; the image is sent back to the server with the tag
????: somehow this new image/label pair will fine-tune the image classifier
I need help with the last step. Typically in transfer learning or training in general, a programmer has a large database full of images. In my case, I have a pretrained model (google's inception-v3) but my fine-tuning database is non-existent until a user starts uploading content.
So how could I use that tagging method to build a specialized database? I'm sure FB ran into this problem and solved it, but I can't find their solution. After some thought (and inconclusive research), the only strategies I can think of are to either:
A) stockpile tagged images and do a big batch train
B) somehow incrementally input a few tagged images as they get uploaded, and slowly over days/weeks, specialize the image classifier.
Ideally, I would like to avoid option A, but I'm not sure how realistic B is, nor whether there are other ways to accomplish this task. Thanks!
Yes, this sounds like a classic example of online learning.
For deep conv nets in particular, given some new data, one can just run a few iterations of stochastic gradient descent on it, for example. It is probably a good idea to adjust the learning rate if needed as well (so that one can adjust the importance of a given sample, depending on, say, one's confidence in it).
You could also, as you mentioned, save up "mini-batches" with which to do this (depends on your setup).
Also, if you want to allow a little more specialization with your learner (e.g. between users), look up domain adaptation.
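A rough sketch of option B in Python with Keras on top of Inception-v3; the head size, learning rate, iteration count, and the idea of weighting a sample by your confidence in the user's tag are all assumptions to adapt:

    import numpy as np
    import tensorflow as tf

    NUM_CLASSES = 10  # hypothetical number of object tags in your service

    base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling="avg")
    base.trainable = False  # online updates touch only the new classification head
    model = tf.keras.Sequential([base, tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy")

    def learn_from_tag(image_299x299, label_id, confidence=1.0):
        # a few SGD iterations on one freshly tagged image (online learning)
        x = tf.keras.applications.inception_v3.preprocess_input(
            image_299x299[np.newaxis].astype("float32"))
        y = np.array([label_id])
        for _ in range(3):
            model.train_on_batch(x, y, sample_weight=np.array([confidence]))

    # e.g. called from the upload handler after the user tags the bounding box
    learn_from_tag(np.random.rand(299, 299, 3) * 255, label_id=3)

Your mini-batch variant would simply buffer (image, label) pairs and run the same update on the whole batch.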

Comparing images using OpenCV or something more useful

I need to compare two images in a project.
The images would be two fruits of the same kind - let's say two different images of two different apples.
To be more clear, the database will have images of the stages an apple goes through from the day it was picked from a tree until it gets rotten...
The user would upload an image of the apple they have, and the software should compare it to all those images in the database, retrieve the data of the matching image, and tell the user which stage it is at...
I have compared images before using OpenCV (Emgu CV), but I really don't have much knowledge of whether that is the best way...
I need expert advice: is what I described in the project even possible? Or will all the database images match the user's image?
And is this "image processing" or something else?
And are there any suggested tutorials to learn how to do this?
I know it doesn't seem totally clear yet, but it's just a crazy idea that I wish I can find a way to bring to life!
N.B. the project will be an Android application.
This is an example of a supervised image classification problem, which is a pretty broad field. You can read up on image classification here.
The way you would approach this problem is to define a few stages of decay (fresh, starting to rot, half rotten, completely rotten), put together a dataset of many images of the fruit at each stage, and train an image classifier to distinguish those stages. The dataset should contain images of many different pieces of fruit in many different settings. If you want to support different types of fruit, you would need to train a separate classifier for each fruit.
There are many image classification tools out there. To name a few:
OpenCV's Haar classifier
dlib's HOG classifier
MATLAB's Computer Vision System Toolbox
VLFeat
It would be up to you to look into which approach would work best for your situation.
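To give a feel for what such a classifier looks like, here is a hedged Python sketch using HOG features with an SVM (scikit-image + scikit-learn rather than the tools above); the stage labels, image size, and the synthetic stand-in data are assumptions:

    import numpy as np
    from skimage.feature import hog
    from skimage.transform import resize
    from sklearn.svm import SVC

    STAGES = ["fresh", "starting_to_rot", "half_rotten", "completely_rotten"]  # assumed labels

    def features(image):
        # HOG descriptor over a fixed-size grayscale image
        return hog(resize(image, (128, 128)),
                   orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

    # synthetic stand-in data; in practice load labeled photos for each stage
    rng = np.random.default_rng(0)
    images = [rng.random((128, 128)) for _ in range(40)]
    stage_ids = [i % len(STAGES) for i in range(40)]

    X = np.array([features(im) for im in images])
    clf = SVC().fit(X, np.array(stage_ids))

    def classify(user_image):
        return STAGES[int(clf.predict([features(user_image)])[0])]

    print(classify(rng.random((200, 200))))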
Given that this is a fairly broad problem, I wouldn't expect to come up with a solid solution quickly unless you've had experience with image classification. If you are trying to develop a product, I would recommend getting in touch with a computer vision expert that you could contract to solve it.
If you are just looking to learn more about image classification, however, this could be a fun way to play around with different tools and get a feel for what's out there. You may want to start by learning about Machine Learning in general. Caltech offers a free online course that gives a pretty good intro to the subject.

Is longer text (eg. article content) or shorter text (eg. article title) better for classification?

I'm currently doing a project to collect and classify news articles, and I'm only interested in a small subset (for example sports-related news) of all the articles collected.
I'm new to Machine Learning and Text Classification. Should I classify the articles based on their titles or actual contents? A human being can usually tell with fair amount of confidence if the news article is relevant by just looking at the title. Hence I'm wondering if titles, instead of content, would give similar or better accuracy in automatic text classification?
The reason for this question is that overall performance will improve a lot if the program analyses titles first when it finds a link, instead of retrieving every page from the URLs and then analysing the contents.
The title is unlikely to provide enough information to classify an article. You can, however, analyse the title and, if you're confident enough that you've got an accurate classification, classify it; otherwise look at the content.
Take something like "Manchester in trouble". If you don't know that Manchester is a sports team, the article could be economic or political, or probably one of a few other categories too. I suspect a lot of titles can only easily be classified by people because they're familiar with the proper nouns relating to that category, and it could be difficult to get proper training data to train an agent to do this well.
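A sketch of that title-first cascade with scikit-learn in Python; the toy training data, the 0.9 confidence threshold, and the content-fetching callable are placeholders you would replace:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # toy labeled data; in practice this comes from your collected articles
    titles = ["United wins the derby", "Parliament passes budget",
              "Striker breaks scoring record", "New tax law announced"]
    labels = ["sports", "politics", "sports", "politics"]

    title_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(titles, labels)
    content_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
        ["full text of a sports article ...", "full text of a politics article ..."],
        ["sports", "politics"])

    THRESHOLD = 0.9  # assumed confidence cut-off; tune on a validation set

    def classify(title, get_content):
        # get_content: zero-arg callable that downloads the page text only on demand
        probs = title_clf.predict_proba([title])[0]
        if probs.max() >= THRESHOLD:
            return title_clf.classes_[probs.argmax()]    # cheap path: title alone
        return content_clf.predict([get_content()])[0]   # expensive path: full content

    print(classify("Striker signs new contract", lambda: "full text ..."))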
There's no general answer. A lot depends on the algorithms you're going to use. I'd suggest you start with only the title and try to squeeze the maximum out of it. If you still fail to achieve the desired quality, try adding the text into the mix.
If we are talking about an article's title then, of course, a very short text is worse for classification, because it contains less information. But you could combine analysis of the article's title with analysis of its content. This could give you a small increase in accuracy.

Architecture & Essential Components of StumbleUpon's Recommendation Engine

I would like to know how StumbleUpon recommends articles to its users.
Is it using a neural network or some other sort of machine-learning algorithm, is it recommending articles based on what the user 'liked', or is it simply recommending articles based on the tags in the interests area? By tags I mean using something like item-based collaborative filtering, etc.
First, I have no inside knowledge of S/U's Recommendation Engine. What I do know, I've learned from following this topic for the last few years, from studying the publicly available sources (including StumbleUpon's own posts on their company site and on their blog), and, of course, as a user of StumbleUpon.
I haven't found a single source, authoritative or otherwise, that comes anywhere close to saying "here's how the S/U Recommendation Engine works". Still, this is arguably the most successful Recommendation Engine ever: the statistics are insane, with S/U accounting for over half of all referrals on the Internet, substantially more than Facebook despite having a fraction of Facebook's registered users (15 million versus 800 million). What's more, S/U is not really a site with a Recommendation Engine, like, say, Amazon.com; instead, the Site itself is a Recommendation Engine. There is also a substantial volume of discussion and gossip among the fairly small group of people who build Recommendation Engines, and if you sift through it, I think it's possible to reliably discern the types of algorithms used, the data sources supplied to them, and how these are connected in a working data flow.
The description below refers to my Diagram at bottom. Each step in the data flow is indicated by a roman numeral. My description proceeds backwards--beginning with the point at which the URL is delivered to the user, hence in actual use step I occurs last, and step V, first.
salmon-colored ovals => data sources
light blue rectangles => predictive algorithms
I. A Web Page recommended to an S/U user is the last step in a multi-step flow
II. The StumbleUpon Recommendation Engine is supplied with data (web pages) from three distinct sources:
web pages tagged with topic tags matching your pre-determined Interests (topics a user has indicated as interests, which are available to view/revise by clicking the "Settings" tab in the upper right-hand corner of the logged-in user page);
socially Endorsed Pages (pages liked by this user's Friends); and
peer-Endorsed Pages (pages liked by similar users);
III. Those sources in turn are results returned by StumbleUpon predictive algorithms (Similar Users refers to users in the same cluster as determined by a Clustering Algorithm, which is perhaps k-means).
IV. The data fed to the Clustering Engine to train it consists of web pages annotated with user ratings.
V. This data set (web pages rated by StumbleUpon users) is also used to train a Supervised Classifier (e.g., multi-layer perceptron, support-vector machine). The output of this supervised classifier is a class label applied to a web page not yet rated by a user.
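To make step III concrete, here is a toy Python sketch of clustering users by their page ratings with k-means (scikit-learn); this is purely illustrative of the speculated architecture, not StumbleUpon's actual code, and the matrix shape and cluster count are arbitrary:

    import numpy as np
    from sklearn.cluster import KMeans

    # ratings[u, p] = 1 if user u liked page p, else 0 (synthetic stand-in data)
    rng = np.random.default_rng(0)
    ratings = rng.integers(0, 2, size=(1000, 500))

    km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(ratings)

    def similar_users(user_id):
        # users in the same cluster: candidates for "peer-Endorsed Pages"
        return np.flatnonzero(km.labels_ == km.labels_[user_id])

    print(similar_users(42)[:10])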
The single best source I have found that discusses S/U's Recommendation Engine in the context of other Recommender Systems is this BetaBeat post.
