I'm not sure if this question will have a single answer, or even a concise one, but I thought I would ask nonetheless. The problem isn't language-specific either, but it may have some sort of pseudo-algorithm as an answer.
Basically I'm trying to learn about how spiders work, and from what I can tell no spider I've found handles hierarchy. They just list the content or the links, with no ordering.
My question is this: we look at a site and can easily determine visually which links are navigational, content-related, or external to the site.
How could we automate this? How could we programmatically help a spider determine parent and child pages?
Of course the first answer would be to use the URL's directory structure.
E.g. www.stackoverflow.com/questions/spiders
spiders is a child of questions, questions is a child of the base site, and so on.
But nowadays hierarchy is usually flat, with IDs being referenced in the URL.
So far I have 2 answers to this question and would love some feedback.
1: Occurrence.
The links that occur most often across all pages would be dubbed navigational. This seems like the most promising design, but I can see issues popping up with dynamic links and the like, though they seem minor.
2: Depth.
For example: how many times do I need to click through a site to get to a certain page? This seems doable, but if some information advertised on the home page actually lives at the bottom level, it would be treated as a top-level page or node.
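To make this concrete, here is a minimal Python sketch of both heuristics, assuming the crawl has already produced a simple page-to-outgoing-links map; the site dict and the 0.8 threshold are made up for illustration.

# Heuristic sketch: occurrence counting and click depth over a crawled link map.
from collections import Counter, deque

site = {
    "/":            ["/about", "/products", "/contact", "/products/42"],
    "/about":       ["/", "/products", "/contact"],
    "/products":    ["/", "/about", "/contact", "/products/42", "/products/43"],
    "/products/42": ["/", "/about", "/products", "/contact"],
    "/products/43": ["/", "/about", "/products", "/contact"],
    "/contact":     ["/", "/about", "/products"],
}

# 1: Occurrence. Links that appear on most pages are probably navigation.
occurrences = Counter(link for links in site.values() for link in set(links))
nav_links = {url for url, count in occurrences.items() if count >= 0.8 * len(site)}

# 2: Depth. Breadth-first search from the home page gives a click depth, which can
# stand in for a page's level in the hierarchy (with the caveat that anything
# promoted on the home page ends up at depth 1).
depth = {"/": 0}
queue = deque(["/"])
while queue:
    page = queue.popleft()
    for link in site.get(page, []):
        if link not in depth:
            depth[link] = depth[page] + 1
            queue.append(link)

print("navigational:", sorted(nav_links))
print("depths:", depth)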
So, has anyone got any thoughts or constructive criticism on how to make a spider judge hierarchy in links?
(If anyone is really curious, the back end of the spider will most likely be Ruby on Rails.)
What is your goal? If you want to crawl a smaller number of websites and extract useful data for some kind of aggregator, it's best to build focused crawlers (write a crawler for every site).
If you want to crawl millions of pages... well, then you must be very familiar with some advanced concepts from AI.
You can start from this article http://www-ai.ijs.si/SasoDzeroski/ECEMEAML04/presentations/076-Znidarsic.pdf
I've just started using SpecFlow. It's a tool for creating business-understandable test scenarios in a BDD manner; basically, it turns user stories into unit tests.
I'm a beginner with user stories and I wonder about their length. Is it good practice to create very precise user stories? Here's an example:
In order to get help
As a StackOverflow user
I want to add post
with name and content
and add tags to it
and format the content
and the information about my post edits to be stored in the system
and some more things like that
Should I keep my stories compact? If so, how can I manage detailed requirements? Or is there nothing wrong with a very long and precise "I want" section in a user story?
If you could develop an entire system in a couple of weeks, and do that reliably, nobody would ever worry about "user stories". They'd just get you to develop the system, sit with you, and tweak it as it went.
User stories only exist in order to get feedback from people who can't be with you all the time, and to help you learn what it is that your users (and other stakeholders) really want.
Here's how I treat a list like this:
In order to get help
As a StackOverflow user
I want to add post
with name and content
and add tags to it
and format the content
and the information about my post edits to be stored in the system
You want to get help. Which of these actually add to your ability to get help? Is it you wanting help, or do you want to offer help to other people? Do you want recognition for the help you're offering other people? The top part of this seems false (and it's why it's really difficult to have these conversations with fake requirements).
I think there are multiple requirements here, and far beyond the scope of just one user story. With an analyst hat on, here's how I might break this down:
In order to award great content with appropriate recognition,
as Stack Exchange,
we want people's usernames to appear with their content.
Of course, the users want this too, but they're not paying for it (except through adverts). So work out who's paying for this, and why.
In order to get more page impressions and keep people on the site for longer,
as Stack Exchange,
we want users to be able to find similar content really easily.
Hm. This one's a bit trickier. See, the user doesn't really want to spend their entire life on StackOverflow. It's just that if we give them the appropriate recognition, and make it easier for others to find their content, they might do that. Not all "user stories" actually benefit users. Find out who's paying for them, and why; then you find your real stakeholder. It's also OK for a story to benefit more than one stakeholder, and it's easy to see how to rephrase this from the user's point of view as well.
format the content
Honestly not sure about this one. It might be about being able to emphasise important points, etc. There are a ton of aesthetic ideals that don't lend themselves well to BDD and automated scenarios. Sometimes the only way to do this is to try, and get feedback.
In order to avoid retyping my request every time
As the user
I want the information about my post edits to be stored in the system
Well, yes, that would be nice.
The thing is that each of these can be developed independently. If you can think of any feature, any item that you could get rid of and still have the release be valuable, put it in a separate story.
If you can replace "I want to..." with "I want to be able to..." it's likely that what you have there isn't a story, but an entire capability. Most people do this instinctively. Lots of people call those "epics".
I've just shown you how I break them down. It's a pretty simple process.
First, look at your requirements. If there's anything for which you can say, "I want to be able to..." or "Someone wants to be able to..." then you know that's a completely different capability, which means it's going to be a separate story.
You can then separate those into contexts. So you might have stories like:
In order to free up our junior traders
We want them to be able to generate contracts automatically
So that they can help with the trade analysis instead of typing.
If that seems too big for the feedback cycle (typically a two-week sprint), you can divide it further.
In order to free up our junior traders
We want them to be able to generate *orange juice* contracts automatically
So that they can help with the trade analysis instead of typing.
Here, we're focusing on being able to trade orange juice, but we could equally narrow the story down to the FTSE, or the US, or the NY stock exchange. This is how we focus the efforts on the thing that will deliver: protecting revenue, lowering costs or generating value.
To turn these into scenarios, I ask, "Can you give me an example of an OJ trade on the NY stock exchange?" If I see anything generic that I don't understand, I ask, "Can you give me an example of that?"
That example becomes my first scenario. The context (given) is defined by the limits of the story. The event (when) is the performance of the capability. The outcome (then) is the resulting value.
In answer to your question - yes, I think it's important to create precise user stories. That means knowing why it's valuable, defining the context that you're going to cover, and suggesting an example of what the outcome might be.
The example you gave is more than just one story, though. It's not precise enough. Hopefully the advice here will help you to narrow stories down to something useful. One or two days is a good length for a story, but if you're starting down this path and find they're a bit longer, that's OK.
Your changes are also stories.
I always advise the following:
Try cutting your stories into scenarios. The more scenarios, the better you can pinpoint where something is going wrong. Give every scenario a descriptive name.
Take your test as an example: if step 1 goes wrong, none of your other steps are going to get tested.
Also use the Given, When and Then keywords so your scenarios are easy to read.
So instead, you could say:
Feature: As a StackOverflow user I want to add a post

Scenario: I go to the stackoverflow website
  Given I open the browser
  And I go to the stackoverflow website
  When I click New Post
  Then a new page appears to insert my data

Scenario: I fill in data for my post - name and content
  Given I do not modify this page
  When I fill in name
  And I fill in content
  Then I add tags to it
  And I format the content

Scenario: Check if information about post edits is stored in the system
  Given...
I guess you see where this is going :-)
There is no right detail level of user stories, as user stories shrink in size (scope) and grow in detail (specifications) over time. This slide shows a nice visualization from Gojko Adzic about this: http://www.slideshare.net/chassa/2015-0214agile-reqend2endcomplete/6
For the question of how precise and detailed a Gherkin scenario should be: scenarios should reveal interesting aspects of the user story being implemented. They should use concrete (key) examples rather than abstract descriptions, and each example should focus on the aspect it is meant to illustrate. The scenario title should be an abstract description of the rule or aspect that the example(s) in the scenario illustrate.
You usually start with a main aspect (happy path) scenario, and then try to “break the model” by coming up with new examples (cases) that explore other aspects of the story. You start by asking the questions “How would you try out the story when it was implemented?” (happy path) and “What should happen if …?” to collect potential scenarios to consider (probably defining some of the questions to be out of scope for this story).
After that, you’re trying to answer these questions (scenario title) and illustrate them with concrete examples (scenario steps). This slide gives an idea of “break the model”: http://www.slideshare.net/chassa/2015-0214agile-reqend2endcomplete/61
I'm trying to build a vertical (meta) search engine for a particular industry. I'm trying to do something similar to "indeed.com" (job search engine) and "hotelscombined.com" (hotel search engine). I would like to know how these two search engines build up their search results.
1) Is it using APIs of the other websites they serve results from? (seems odd to me since some results come from small and primitive sites).
2) Do other websites push updates to these search engines? (Also seems odd, as above.)
3) Do they internally understand and create a map for each website they serve results from? (if so, then maybe they need to constantly monitor the structure of these sites for any changes. Seems error prone to me).
4) Any other possibilities?
I don't even know where to start, so any pointers in the right direction are much appreciated (books, tutorials, hints, ideas...).
Thanks
It is mostly a mix of 1 and 3. Ideally, the site will have some sort of API they expose and document. If not, you have to do data scraping: basically, you reverse-engineer their page. If they get results asynchronously via an undocumented API, you can use that API as well (until they make a breaking change). Otherwise, it's simply a matter of pulling the text straight out of the HTML.
I don't know of any more advanced techniques since I don't do this myself, but several of my acquaintances have gone on to work on mobile apps that need to do this sort of thing with sports scores and such (not for searching, but same requirements - get someone else's data into our database). The low tech "pull it from the HTML until they change the HTML and break everything" is standard practice where they work.
2 is possible, but to do it you have to either make business arrangements with every source of data you want to use, or gain enough market presence for everyone to want to upload their data.
Also, you don't do this while actually searching (unless you have other constraints as Charles Duffy points out in his comment). You run a process that regularly goes out, gets all the data it can find, and inserts it into your own database, which you then search. This allows you to decouple data gathering from data searching - your search page won't have to know about and handle errors from the scraper, and the scraper has to only "get all the data" from each source instead of being able to transform queries from your site to search each source.
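To make that split concrete, here is a rough Python sketch of a scraper that could run on a schedule and write into a local database; the URL, the CSS selectors and the table layout are all hypothetical.

# Sketch: gather data on a schedule, store it locally, search the local copy later.
import sqlite3

import requests
from bs4 import BeautifulSoup

db = sqlite3.connect("aggregator.db")
db.execute("CREATE TABLE IF NOT EXISTS listings (source TEXT, title TEXT, url TEXT)")

def scrape_example_source():
    # Hypothetical source page whose listings sit in <div class="listing"> blocks.
    html = requests.get("https://example.com/jobs", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.select("div.listing"):   # breaks whenever they change their HTML
        link = item.select_one("a")
        if link is not None:
            yield ("example.com", link.get_text(strip=True), link.get("href"))

# Run this from a cron job or scheduled task, not from the search request itself;
# the search page only ever queries the listings table.
db.executemany("INSERT INTO listings VALUES (?, ?, ?)", scrape_example_source())
db.commit()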
I'm indexing websites' content and I want to implement some categorization based solely on the URLs.
I would like to tell apart content view pages from navigation pages.
By 'content view pages' I mean webpages where one can typically see the details of a product or a written article.
By 'navigation pages' I mean pages that (typically) consist of lists of links to content pages or to other more specific list pages.
Although some sites use a site-wide key system to map their content, most sites do it bit by bit and scope their key mapping, so this should be possible.
In practice, what I want to do is take the list of URLs from a site and group them by similarity. I believe this can be done with machine learning, but I have no idea how.
Machine learning appears to be a broad topic; what should I start reading about in particular?
Which concepts, which algorithms, which tools?
If you want to discover these groups automatically, I suggest you find yourself an implementation of a clustering algorithm (K-Means is probably the most popular, you don't say what language you want to do this in). You know there are two categories, so something that allows you to specify the number of categories a priori will make the problem easier.
After that, define a bunch of features for your webpages and run them through k-means to see what kind of groups are produced. Tweak the features you use until you get something that looks satisfactory. If you have access to the webpages themselves, I'd strongly recommend using features defined over the whole page rather than just the URLs.
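As a rough sketch of that in Python with scikit-learn (the example URLs and the hand-picked URL features below are only illustrations; whole-page features would slot in the same way):

# Sketch: cluster a site's URLs into two groups with K-Means over simple features.
from urllib.parse import urlparse

from sklearn.cluster import KMeans

urls = [
    "http://example.com/",
    "http://example.com/category/books",
    "http://example.com/category/books?page=2",
    "http://example.com/books/1234-some-article-title",
    "http://example.com/books/5678-another-article",
]

def features(url):
    parsed = urlparse(url)
    path = parsed.path
    return [
        path.count("/"),                 # path depth
        sum(c.isdigit() for c in path),  # numeric ids often hint at content pages
        path.count("-"),                 # slugs often hint at content pages
        len(parsed.query),               # query strings often mean list/navigation pages
    ]

X = [features(u) for u in urls]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for url, label in zip(urls, labels):
    print(label, url)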
First you need to collect a dataset of navigation/content pages and label them. After that it's quite straightforward.
What language will you be using? I'd suggest you try Weka, a Java-based tool in which you can simply press a button and get back performance measures for 50-odd algorithms. After that you will know which is the most accurate and can deploy that.
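This isn't Weka, but the same labelled-data workflow in Python with scikit-learn looks roughly like this; the tiny feature matrix and labels are invented just to show the shape of it.

# Sketch: try a few classifiers on labelled pages and keep the most accurate one.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X = [[2, 0, 0], [1, 0, 4], [3, 8, 2], [3, 6, 3], [1, 0, 1], [4, 9, 5]]  # page features
y = ["nav", "nav", "content", "content", "nav", "content"]              # hand-made labels

for model in (GaussianNB(), LogisticRegression(max_iter=1000), RandomForestClassifier()):
    scores = cross_val_score(model, X, y, cv=3)
    print(type(model).__name__, round(scores.mean(), 2))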
I feel like you are trying to classify the Authorities and Hubs of the HITS algorithm.
Hub is your navigation page;
Authority is your content view page.
By doing a link analysis of every web page in a domain, you should be able to work out each page's type by running HITS on all of them. In the two graphs from the tutorial referenced below, the left graph shows the link relations between web pages, and the right graph shows each page's hub/authority score after running HITS. HITS does not need any labels to start. The updating rule is simple: basically one update for the authority score and another update for the hub score.
Here is a tutorial discussing PageRank/HITS, from which I borrowed those two graphs.
Here is an extended version of HITS (BHITS) that combines HITS with information retrieval methods (TF-IDF, the vector space model, etc.). It looks much more promising, but it certainly needs more work. I suggest you start with naive HITS and see how well it does; on top of that, try some of the techniques mentioned in BHITS to improve your performance.
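For reference, a minimal Python sketch of those two HITS updates; the tiny link graph is invented, and for real use networkx.hits() implements the same algorithm.

# Sketch: iterate the HITS updates until the hub/authority scores settle.
links = {
    "home":      ["list-a", "list-b", "article-1"],
    "list-a":    ["article-1", "article-2", "article-3"],
    "list-b":    ["article-2", "article-3"],
    "article-1": ["home"],
    "article-2": ["home"],
    "article-3": ["home"],
}

hub = {page: 1.0 for page in links}
auth = {page: 1.0 for page in links}

for _ in range(50):
    # authority update: sum the hub scores of the pages linking to me
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    norm = sum(v * v for v in auth.values()) ** 0.5
    auth = {p: v / norm for p, v in auth.items()}
    # hub update: sum the authority scores of the pages I link to
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    norm = sum(v * v for v in hub.values()) ** 0.5
    hub = {p: v / norm for p, v in hub.items()}

print("hub-like (navigation):", sorted(hub, key=hub.get, reverse=True))
print("authority-like (content):", sorted(auth, key=auth.get, reverse=True))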
At my new job, I was given some MVC work. There is only one controller with nine action methods (six are for AJAX rendering). The page was a bit large, so I divided it into small user controls and used RenderPartial to render them. Some user controls are rendered through AJAX as well. Most of the controls are little more than foreach loops rendering some data from tables, no more than 10-15 lines. The main index page passes the model to all the controls. My main page looked very clean and easy to maintain.
But my team members are saying I should put everything in the main page rather than building small controls. Their point is that the number of files will grow a lot if we start creating controls like this. They also say that if we are not reusing these controls somewhere else, there is no point creating them separately.
I would like to know what is better approach for this kind of scenario. Any good links which can help us to understand things better, or any book we can read to clarify our questions.
Help will be appreciated.
Regards
Parminder
As a preface to my answer, let me mention the important value of maintainability. Software evolves over time... and must change to fit the need of the application.
Maintainability in code does not magically appear... Sacrifices (with a touch of paranoia sometimes) must be made in your coding style now, to have the flexibility you'd like in the future.
There may be a large page in your project. Some may say that if it works, there's no need to fix it. But that's looking at it from a short-term perspective. You may need some of those UI pieces in other places in the future. What some people do (rather than make partials) is copy that code into the places where they need it, thus causing the same bloat over time that they were trying to avoid.
If you're on the project in the long haul, you'll more fully appreciate the need for flexibility over time. You can see that there are patterns that you'll want to re-use.
My suggestion then: Partials and controls are good things... they are good investments for your ease in the future. If you forecast reusability, that's a good sign for using them.
But use them sparingly. Don't micromanage everything on a page. Some things may be itching to be 'component-ized' but sometimes it's best to SSFL (Save some for later). Like everything in life, balance is important.
Having clean, concise code is the way to go. Your code will be a lot more readable if you utilize:
sections
templates
partial views
Just remember it's always easier to navigate a folder structure than to read hundreds or thousands of lines of code.
I recommend watching "Putting your controllers on a diet" by Jimmy Bogard.
Also read "Fat Controllers" by Ian Cooper.
These two links will give you a good idea of how to structure your MVC apps.
I'm not talking about HTML tags, but tags used to describe blog posts, or youtube videos or questions on this site.
If I was crawling just a single website, I'd just use an xpath to extract the tag out, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics. There is sufficient difference between the text content of a page and the surrounding ads/menus, etc. Other examples include tools that scrape emails or addresses; there, there are patterns that can be detected and locations that can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text; it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
Some blogs, like Tumblr, do have tags whose URLs contain the word "tagged", which you could use. WordPress similarly has ".../tag/..."-style URLs for tags. Solutions like this would work for a large number of blogs independent of their individual page layouts, but they won't work everywhere.
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
Another option is to parse each web page and look for tags formatted according to the rel="tag" microformat.
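A rough BeautifulSoup sketch combining those two hints (the rel="tag" microformat plus /tag/ or /tagged/ URL segments); the sample HTML is invented and real pages will need more care.

# Sketch: collect anchors that look like tag links.
import re

from bs4 import BeautifulSoup

html = """
<article>...</article>
<div class="meta">
  <a href="/tag/python" rel="tag">python</a>
  <a href="/tagged/web-scraping">web-scraping</a>
  <a href="/about">About us</a>
</div>
"""

def extract_tags(html):
    soup = BeautifulSoup(html, "html.parser")
    tags = set()
    for a in soup.find_all("a", href=True):
        rel = a.get("rel") or []
        if "tag" in rel or re.search(r"/tag(?:ged)?/", a["href"]):
            tags.add(a.get_text(strip=True))
    return tags

print(sorted(extract_tags(html)))   # ['python', 'web-scraping']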
Damn, was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for Wordpress, then see their link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes it a lot easier than writing XPaths, by using simple visual constraints to identify pieces of a web page.
This is impossible because there isn't a well-known, widely followed specification. Even different versions of the same engine can produce different output; hey, with WordPress a user can create his own markup.
If you're really interested in doing something like this, you should know it's going to be a really time-consuming and ongoing project: you're going to create a lib that detects which "engine" is being used on a page and parses it. If you can't detect a page for some reason, you create new rules for parsing it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this: it's a complete, well-documented and really extensible scraping framework.
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful markup [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily. But presumably they must either have developed generic rules, such as the tag/text ratios #dunelmtech suggested, which can work for article detection, or they might be using a combination of text-segmentation algorithms (from the natural language processing field) such as TextTiling and C99, which could be quite useful for article detection - see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and google for more info on both [published in the academic literature - google scholar].
It seems, however, that detecting "tags" as you require is a difficult problem (for the reasons already mentioned in the comments above). One approach I would try would be to use one of the text-segmentation algorithms (C99 or TextTiling) to detect the article start/end, and then look for DIVs / SPANs / ULs with CLASS or ID attributes containing "tag" in them. Since, in terms of page layout, tags tend to sit just underneath the article and just above the comment feed, this might work surprisingly well.
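A small BeautifulSoup sketch of that class/id heuristic, assuming the article boundary has already been found somehow; the sample HTML and the id="article" marker are invented.

# Sketch: below the article, collect link text from elements whose class/id mentions "tag".
from bs4 import BeautifulSoup

html = """
<div id="article">Long article text...</div>
<ul class="post-tags">
  <li><a href="/t/crawling">crawling</a></li>
  <li><a href="/t/heuristics">heuristics</a></li>
</ul>
<div id="comments">...</div>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find(id="article")

tags = []
for sibling in article.find_next_siblings(True):   # only look below the article
    attrs = " ".join(sibling.get("class", []) + [sibling.get("id") or ""])
    if "tag" in attrs.lower():
        tags.extend(a.get_text(strip=True) for a in sibling.find_all("a"))

print(tags)   # ['crawling', 'heuristics']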
Anyway, would be interesting to see whether you got somewhere with the tag detection.
Martin
EDIT: I just found something that might really be helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html] and stands for Vision-based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information such as navigation, advertisements and decoration can easily be removed, because it is usually placed in certain positions on a page. This could help you detect the tag block quite accurately!
There is a term extractor module for Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.