I'm indexing websites' content and I want to implement some categorization based solely on the urls.
I would like to tell apart content view pages from navigation pages.
By 'content view pages' I mean webpages where one can typically see the details of a product or a written article.
By 'navigation pages' I mean pages that (typically) consist of lists of links to content pages or to other more specific list pages.
Although some sites use a site-wide key system to map their content, most sites do it bit by bit and scope their key mapping, so this should be possible.
In practice, what I want to do is take the list of urls from a site and group them by similarity. I believe this can be done with machine learning, but I have no idea how.
Machine learning appears to be a broad topic; what should I start reading about in particular?
Which concepts, which algorithms, which tools?
If you want to discover these groups automatically, I suggest you find an implementation of a clustering algorithm (K-Means is probably the most popular; you don't say what language you want to do this in). You know there are two categories, so something that allows you to specify the number of categories a priori will make the problem easier.
After that, define a bunch of features for your webpages and run them through k-means to see what kinds of groups are produced. Tweak the features you use until you get something that looks satisfactory. If you have access to the webpages themselves, I'd strongly recommend using features defined over the whole page, rather than just the URLs.
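If Python is an option, here is a minimal sketch with scikit-learn (the library choice and the example URLs are just assumptions for illustration); the features here are nothing more than character n-grams of the URL path, which is a reasonable "tweak until satisfactory" starting point:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from urllib.parse import urlparse

urls = [
    "https://example.com/articles/2014/05/how-to-train-a-model",
    "https://example.com/articles/2014/06/another-long-article-title",
    "https://example.com/category/machine-learning",
    "https://example.com/category/machine-learning/page/2",
]

# Crude features: character n-grams of the URL path only.
paths = [urlparse(u).path for u in urls]
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(paths)

# Two groups known a priori: content view pages vs. navigation pages.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for url, label in zip(urls, labels):
    print(label, url)
```

Which cluster ends up being "content" and which "navigation" is something you still have to eyeball, but with decent features the split usually becomes obvious.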
You first need to collect a dataset of navigation/content pages and label them. After that it's quite straightforward.
What language will you be using? I'd suggest you try Weka, a Java-based tool in which you can simply press a button and get back performance measures for 50-odd algorithms. After that you will know which is the most accurate and can deploy it.
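If you end up in Python rather than Java, the same label-then-compare workflow can be sketched with scikit-learn (the URLs and labels below are invented purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

urls = [
    "https://example.com/articles/how-to-train-a-model",      # content
    "https://example.com/products/12345-red-running-shoes",   # content
    "https://example.com/blog/2014/05/a-long-article-title",  # content
    "https://example.com/category/shoes",                     # navigation
    "https://example.com/category/shoes/page/2",               # navigation
    "https://example.com/tag/machine-learning",                # navigation
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = content view page, 0 = navigation page

# Compare a couple of classifiers with cross-validation, Weka-style.
for clf in (LogisticRegression(max_iter=1000), MultinomialNB()):
    pipe = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)), clf)
    scores = cross_val_score(pipe, urls, labels, cv=3)
    print(type(clf).__name__, scores.mean())
```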
I feel like you are trying to classify the Authority and Hub in a HITS algorithm.
Hub is your navigation page;
Authority is your content view page.
By doing link analysis of every web page, you should be able to find out the type of each page by running HITS on all the webpages in a domain. In the graphs below, the left graph shows the link relations between webpages, and the right graph shows the hub/authority scores after running HITS. HITS does not need any labels to start. The update rule is simple: basically one update for the authority score and another for the hub score.
Here is a tutorial discussing PageRank/HITS, from which I borrowed the above two graphs.
Here is an extended version of HITS (BHITS) that combines HITS with information retrieval methods (TF-IDF, the vector space model, etc.). It looks much more promising, but it certainly needs more work. I suggest you start with naive HITS and see how well it does. On top of that, try some of the techniques mentioned in BHITS to improve your performance.
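For reference, the naive update rule is small enough to sketch in a few lines of Python (plain dictionaries, no labels needed; the tiny link graph at the bottom is made up):

```python
def hits(links, iterations=50):
    """links: dict of page -> set of pages it links to (intra-domain graph)."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking to you.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # Hub update: sum of authority scores of pages you link to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores don't blow up.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

links = {
    "/index": {"/article/1", "/article/2", "/article/3"},
    "/article/1": {"/index"},
    "/article/2": {"/index"},
}
hub, auth = hits(links)
print(sorted(hub, key=hub.get, reverse=True)[0])   # top hub: likely a navigation page
print(sorted(auth, key=auth.get, reverse=True)[0]) # top authority: likely a content page
```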
For a bit of context, I recently started working on a personal project that accepts the URL of some recipe web page, pulls the HTML, converts the HTML to simplified markdown (this is the GPT-3 part), then sends that markdown to a thermal receipt printer in my kitchen, which prints it out.
Recipe web pages have a wide variety of structures, and they are notorious for including long and often irrelevant articles before the recipe, for the sake of SEO.
My plan was to use the fine-tuning API for davinci2, and feed it a bunch of straight up recipe HTML as input and cleaned, recipe-only markdown as output. I notice though that the maximum input token count for both training and inference is 4096. The HTML for a web page can be much larger than that, like 20k tokens.
I am wondering if anyone has found a workaround for training and driving GPT-3 with more tokens than 4096.
I'm open to other suggestions as well. For instance, I've considered passing just the visible text on the page, rather than the full HTML tree, but there is much less context present in that form, and the model seems more easily confused by all of the links and other navigational elements present in the page. I have also considered only allowing this project to accept "printer-friendly" versions of recipes, which tend to be much smaller and would easily come in under the 4096-token limit, but not all sites offer a printer-friendly version, and I don't want this to be a limitation.
I don't know of any workarounds, but have you thought of filtering out HTML elements based on some basic rules? You could include only paragraph elements, or elements with certain characteristics, like having a list within them, which is something most recipes have.
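A rough sketch of that filtering idea with BeautifulSoup (assuming Python; the exact set of elements to keep or drop is just a guess to tune against real recipe pages):

```python
from bs4 import BeautifulSoup

def strip_to_recipe_candidates(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that are almost never part of the recipe itself.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Keep paragraphs, headings, and list containers -- ingredient and
    # instruction blocks are usually <ul>/<ol>.
    kept = []
    for el in soup.find_all(["p", "h1", "h2", "h3", "ul", "ol"]):
        text = el.get_text(" ", strip=True)
        if text:
            kept.append(text)
    return "\n".join(kept)
```

Even a crude pass like this usually shrinks the page well below the 4096-token limit before you hand it to the model.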
This framework might be useful to you: https://github.com/Xpitfire/symbolicai
The basic idea is:
You could stream over your input data and build up a stack of chunks on the side.
Next, in your training procedure, you need to account for having loosely connected chunks of data. You could overcome this by indexing or clustering the chunks before designing your prompts.
This means, if you want to create a query for a question that is related to your long data stream, you could search through your indexes and retrieve the related information.
Now you need to piece together your few-shot learning prompt, with one "section" of the prompt that relates to your query and another for the facts you want to include.
Finally, you can then feed that into your model and provide examples of what you want your model to be tuned to.
I know this is explained at a rather high level, but if you follow the link I provided, things might become clearer.
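To make the idea a bit more concrete, here is a framework-agnostic sketch in Python; the embed() function is only a toy stand-in for whatever embedding model or API you would actually plug in:

```python
def chunk(text, size=2000, overlap=200):
    """Split a long document into overlapping chunks that fit the context window."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed(text):
    # Toy stand-in: a bag-of-words hash vector. Replace with a real embedding
    # model or API for anything beyond a demo.
    vec = [0.0] * 64
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5 or 1.0
    nb = sum(x * x for x in b) ** 0.5 or 1.0
    return dot / (na * nb)

def retrieve(query, chunks, top_k=3):
    # Pick the chunks most related to the query; these become the "facts"
    # section of the few-shot prompt.
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), qv), reverse=True)[:top_k]

document = "..."  # your long HTML / markdown stream
facts = "\n---\n".join(retrieve("ingredients and steps", chunk(document)))
prompt = f"Facts:\n{facts}\n\nTask: extract the recipe as markdown.\n"
```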
I'm trying to build a vertical (meta) search engine for a particular industry. I'm trying to do something similar to "indeed.com" (a job search engine) and "hotelscombined.com" (a hotel search engine). I would like to know how these two search engines build up their search results.
1) Is it using APIs of the other websites they serve results from? (seems odd to me since some results come from small and primitive sites).
2) Do other website post updates to these search engines? (Also seems odd as above)
3) Do they internally understand and create a map for each website they serve results from? (if so, then maybe they need to constantly monitor the structure of these sites for any changes. Seems error prone to me).
4) Any other possibilities?
I don't even know where to start, so any pointers in the right direction are much appreciated (books, tutorials, hints, ideas...).
Thanks
It is mostly a mix of 1 and 3. Ideally, the site will have some sort of API they expose and document. If not, you have to do data scraping. Basically, you reverse-engineer their page. If they get results asynchronously via an undocumented API, you can use that API as well (until they make a breaking change). Otherwise, it's simply a matter of pulling the text straight out of the HTML.
I don't know of any more advanced techniques since I don't do this myself, but several of my acquaintances have gone on to work on mobile apps that need to do this sort of thing with sports scores and such (not for searching, but same requirements - get someone else's data into our database). The low tech "pull it from the HTML until they change the HTML and break everything" is standard practice where they work.
2 is possible, but to do it you have to either make business arrangements with every source of data you want to use, or gain enough market presence for everyone to want to upload their data.
Also, you don't do this while actually searching (unless you have other constraints, as Charles Duffy points out in his comment). You run a process that regularly goes out, gets all the data it can find, and inserts it into your own database, which you then search. This allows you to decouple data gathering from data searching - your search page won't have to know about and handle errors from the scraper, and the scraper only has to "get all the data" from each source instead of having to transform queries from your site into searches against each source.
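A minimal sketch of that decoupling in Python (the source URL, table layout, and CSS selector are all hypothetical):

```python
import sqlite3
import requests
from bs4 import BeautifulSoup

SOURCES = ["https://example-jobs-site.com/listings"]  # hypothetical source

def refresh_database(db_path="listings.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS listings (source TEXT, title TEXT, url TEXT)")
    for source in SOURCES:
        html = requests.get(source, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.select("a.listing"):  # selector depends entirely on the site
            conn.execute(
                "INSERT INTO listings (source, title, url) VALUES (?, ?, ?)",
                (source, link.get_text(strip=True), link.get("href")),
            )
    conn.commit()
    conn.close()

# Run refresh_database() from cron or a scheduler; the search page then only
# ever queries the listings table, e.g. SELECT * FROM listings WHERE title LIKE ?
```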
I'm not sure if this question has a single answer, or even a concise one, but I thought I would ask nonetheless. The problem isn't language specific either, but it may have some sort of pseudo-algorithm as an answer.
Basically I'm trying to learn about how spiders work, and from what I can tell, no spider I've found manages hierarchy. They just list the content or the links, with no ordering.
My question is this: we look at a site and can easily determine visually what links are navigational, content related or external to a site.
How could we automate this? How could we programmatically help a spider determine parent and child pages?
Of course the first answer would be to use the URL's directory structure.
E.g. www.stackoverflow.com/questions/spiders
spiders is a child of questions, questions is a child of the base site, and so on.
But nowadays hierarchy is usually flat with ids being referenced in URL.
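For what it's worth, the directory-structure approach is only a few lines of Python (it just stops working once the hierarchy is flattened into IDs):

```python
from urllib.parse import urlparse

def parent_of(url):
    # Derive the parent purely from the URL path.
    parts = urlparse(url).path.strip("/").split("/")
    return "/" + "/".join(parts[:-1]) if len(parts) > 1 else "/"

print(parent_of("https://www.stackoverflow.com/questions/spiders"))  # -> /questions
```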
So far I have 2 answers to this question and would love some feedback.
1: Occurrence.
The links that occur most often across all pages would be dubbed navigational (see the sketch after these two options). This seems like the most promising design; I can see issues popping up with dynamic links and the like, but they seem minuscule.
2: Depth.
That is, how many times do I need to click through a site to get to a certain page? This seems doable, but if some information is advertised on the home page yet actually lives at the bottom level, it would be treated as a top-level page or node.
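A sketch of option 1 in Python (the page-to-links map below is invented; in practice it would come from your crawl):

```python
from collections import Counter

def classify_links(pages):
    """pages: dict of page URL -> set of link URLs found on that page."""
    occurrence = Counter()
    for links in pages.values():
        occurrence.update(links)
    # Links that appear on at least half the pages are treated as navigational.
    threshold = 0.5 * len(pages)
    nav = {link for link, count in occurrence.items() if count >= threshold}
    content = set(occurrence) - nav
    return nav, content

pages = {
    "/": {"/about", "/questions"},
    "/questions": {"/about", "/questions/1234-some-title", "/questions/5678-other"},
    "/questions/1234-some-title": {"/about", "/questions"},
}
nav, content = classify_links(pages)
print("navigation:", sorted(nav))
print("content:", sorted(content))
```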
So, has anyone got any thoughts or constructive criticism on how to make a spider judge hierarchy in links?
(If anyone is really curious, the back end part of the spider will most likely be Ruby on rails)
What is your goal? If you want to crawl a smaller number of websites and extract useful data for some kind of aggregator, it's best to build a focused crawler (write a crawler for every site).
If you want to crawl millions of pages... well, then you must be very familiar with some advanced concepts from AI.
You can start from this article http://www-ai.ijs.si/SasoDzeroski/ECEMEAML04/presentations/076-Znidarsic.pdf
I know that some websites are applications, but not all websites are applications (some may just be brochure-viewing sites).
Is there an in-depth dummy use case for a brochure-type site that would be beneficial to work through?
When it comes to a corporate front-facing website, for example, I suffer from feature blindness, although for an actual database-driven application (for example, a purchase order system) I feel within my element.
Are there any resources that can help me view "brochure" sites in the same light as I do database-driven applications?
This is a really useful thread. I have always battled with use cases for brochure sites, despite totally espousing the use of UML... I often feel caught between UX agency outputs and trying to ensure the whole Requirements Spec ties together, especially when agencies tend not to use UML.
There are several use cases that do apply beyond view menu / view brochure page - site functionality like print page, search site etc, sometimes accept a cookie to view specific content - but not much on classic brochure-ware. (All that ties well into user journeys / personas without having to restate the UX deliverables)
However, once a system such as a CMS is used to create the website content, I think the use cases get properly useful (as per the comments above): there are not only (usually) several actors, including the system, but also varying cases per content type, so you can reference those UX deliverables without duplication and start filling in the gaps. You can also tie up content-strategy deliverables (e.g. workflow and governance) by looking into the business processes and the system/user interactions. At the end of the modelling and specifications you can get useful test matrices this way, plus class diagrams that relate objects to taxonomies (more agency deliverables to tie together at the Functional Requirements/Specs stage).
That's the way I'm trying to tackle it these days.
Use cases can be used to model the requirements of a system. A system is a structure with input and output mappings, so if you have a static web page, you cannot interact with it in any way other than to view it.
As discussed in the comments, if you think you did not understand the goals of the stakeholders (what that Word document sent by your boss meant...), you have to ask more questions and find them out; use cases are a good technique for this.
In a cycle, discover actors (systems and roles interacting with the system you have to develop) and use cases (which needs of those actors the developed system should satisfy). Every time you find an actor, you may ask what other needs (possible use cases) they have, and when you find a use case, you should ask who will participate in it and who is interested in it (who is the next actor and who are the stakeholders). Then you can define the scope boundaries and prioritize...
I'm not talking about HTML tags, but tags used to describe blog posts, or youtube videos or questions on this site.
If I was crawling just a single website, I'd just use an xpath to extract the tag out, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting that a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text from any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics: there is sufficient difference between the text content of the pages and the surrounding ads/menus etc. Other examples include tools that scrape emails or addresses; there, there are patterns that can be detected and locations that can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text; it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
Some blogs like tumblr do have tags whose urls have the word "tagged" in them that you could use. Wordpress similarly has ".../tag/..." type urls for tags. Solutions like this would work for a large number of blogs independent of their individual page layout but they won't work everywhere.
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
Another option is to parse each web page and look for tags formatted according to the rel="tag" microformat.
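A sketch of both suggestions in Python, assuming feedparser and BeautifulSoup are acceptable dependencies:

```python
import feedparser
from bs4 import BeautifulSoup

def tags_from_feed(feed_url):
    # Categories/labels exposed in RSS/Atom feeds show up as entry "tags".
    feed = feedparser.parse(feed_url)
    tags = set()
    for entry in feed.entries:
        for t in entry.get("tags", []):
            tags.add(t.get("term"))
    return tags

def tags_from_microformat(html):
    # rel="tag" is the microformat many blog engines use for tag links.
    soup = BeautifulSoup(html, "html.parser")
    return {a.get_text(strip=True) for a in soup.select('a[rel~="tag"]')}
```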
Damn, was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for Wordpress, then see their link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes it a lot easier than writing XPaths, by using simple visual constraints to identify pieces of a web page.
This is impossible because there isn't a well-known, followed specification. Even different versions of the same engine could create different outputs - hey, using Wordpress a user can create his own markup.
If you're really interested in doing something like this, you should know it's going to be a really time-consuming and ongoing project: you're going to create a lib that detects which "engine" is being used on a page and parses it. If you can't detect a page for some reason, you create new rules to parse it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this since it's a complete framework for scraping: well documented and really extensible.
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful mark-up [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily, but presumably they must either have developed generic rules, such as the tag/text ratios #dunelmtech suggested, which can work for article detection, or they might be using a combination of text-segmentation algorithms (from the Natural Language Processing field) such as TextTiler and C99, which could be quite useful for article detection - see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and google for more info on both [published in the academic literature - Google Scholar].
However, it seems that detecting "tags" as you require is a difficult problem (for the reasons already mentioned in the comments above). One approach I would try would be to use one of the text-segmentation algorithms (C99 or TextTiler) to detect the article start/end and then look for DIVs / SPANs / ULs with CLASS or ID attributes containing "tag" in them; since, in terms of page layout, tags tend to sit just underneath the article and just above the comment feed, this might work surprisingly well.
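A sketch of that second step with BeautifulSoup (the "class or id contains 'tag'" test is the same brittle heuristic from the question, just intended as a fallback once you have located the article boundaries):

```python
from bs4 import BeautifulSoup

def candidate_tag_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for el in soup.find_all(["div", "span", "ul"]):
        # Combine class and id values and look for "tag" in either.
        attrs = " ".join(el.get("class", []) or []) + " " + (el.get("id") or "")
        if "tag" in attrs.lower():
            blocks.append([a.get_text(strip=True) for a in el.find_all("a")])
    return blocks
```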
Anyway, would be interesting to see whether you got somewhere with the tag detection.
Martin
EDIT: I just found something that might really be helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html] and stands for Vision-Based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements, and decoration, can easily be removed because it is often placed in certain positions on a page. This could help you detect the tag block quite accurately!
There is a term extractor module for Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.