I'm investigating different tools for building a vertical search engine. I've come across http://www.sphider.eu/ and I think it works fine.
However, I'm not sure how many pages it can index. There seems to be no information about this in the documentation. I will be indexing around a hundred sites for this particular search engine, and I will have to reindex them at least once a day.
Can Sphider manage this? Or should I look for other solutions?
I am trying to implement synonym searching in the Examine search engine that comes with Umbraco 8 out of the box.
Does anyone have any experience with implementing synonym searching in Examine/Umbraco 8? The options I have been considering after looking around are:
A package that can be installed in Umbraco 8 that offers this extended functionality (if one exists).
Implementing a custom index (currently just using the out-of-the-box 'ExternalIndex') that somehow applies synonym searching during analysis (via a custom analyzer implementation etc. - if that is even possible).
Manually formatting multiple search terms by checking for synonyms in the string beforehand, running all searches and consolidating the results after (really a nasty, last resort option - you don't have to tell me how bad this is, I already know).
I have been trawling the forums for a definitive answer on this and cannot really find one. Essentially I want to stick with the Examine engine for simplicity; however, I am starting to think that the best way to achieve what I am after would be to move to a new engine completely (Elasticsearch, for example).
Many thanks in advance.
Have you considered Algolia? It has a free tier and will do what you need easily: https://www.algolia.com/
Examine is built on the Lucene search index. Lucene is known to not really handle synonyms, I'm afraid (read why, and a potential solution, here).
Your thinking is probably correct. Examine is good at what it does, but if you want more advanced searching you will be better off using a more advanced search provider. There are loads of options; Algolia is SaaS and comes with a free plan depending on your usage. It's easy to set up and you can query it directly from the front end.
You could also look into Azure Cognitive Search or Solr. These are probably harder to implement but will also do the job.
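If you do end up staying with Examine for now, the query-side expansion you describe as option 3 can at least be kept small and contained. Here is a rough sketch of the idea (Python purely for illustration; the SYNONYMS map, the search() callable and the result merging are all placeholders you would replace with your own C# code against Examine's searcher):

```python
# Rough sketch of query-side synonym expansion (illustration only).
# SYNONYMS, search() and the merge step are placeholders, not Examine APIs.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "job": ["position", "vacancy"],
}

def expand_terms(query):
    """Return the original terms plus any known synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

def search_with_synonyms(query, search):
    """Run one search per expanded term and merge results, deduplicating by id."""
    merged = {}
    for term in expand_terms(query):
        for hit in search(term):  # search() is whatever your engine exposes
            # keep the best score seen for each document
            if hit["id"] not in merged or hit["score"] > merged[hit["id"]]["score"]:
                merged[hit["id"]] = hit
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)
```

It is still the brute-force option you already dislike, but isolating it behind one function makes it easy to rip out later if you do move to Elasticsearch or Algolia.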
I work at a big company and we have a lot of JIRA projects. I would like to have a dashboard, or some other way, to know whether the projects that exist in JIRA are actually used, e.g. whether there are any issues in them (I don't need to see the issues, just a count).
Can I do this without accessing the database? Do I need a plugin, or is there a built-in way to get this information? :)
thanks a lot
best regards
Adrien k
You can easily do this with the built-in Two Dimensional Filter Statistics gadget:
First, search for all issues in your JIRA instance. There may be an easier way to do this, but you can certainly use JQL like project = ABC OR project != ABC, which matches everything.
Save the search as a filter.
Go to a dashboard and add a new Two Dimensional Filter Statistics gadget. Select your newly saved filter, select "Project" for one axis, and something small in number (like Issue Type) for the other axis. You'll also need to adjust "Number of Results" to exceed the number of issue types in your system.
Save the gadget.
Note that the Projects gadget also provides somewhat-similar information with fewer configuration requirements, but as far as I know, it doesn't show the numeric issue totals unless you hover the mouse pointer over the bars.
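If you would rather script it than use a dashboard gadget, the JIRA REST API can also give you an issue count per project without touching the database. A minimal sketch (Python; the base URL and credentials are placeholders, and the exact REST paths can differ between JIRA Server and Cloud):

```python
import requests

BASE_URL = "https://jira.example.com"      # your JIRA base URL (placeholder)
AUTH = ("user", "api-token-or-password")   # placeholder credentials

def issue_counts_per_project():
    """Return {project key: number of issues} using maxResults=0 so only totals are fetched."""
    counts = {}
    projects = requests.get(f"{BASE_URL}/rest/api/2/project", auth=AUTH).json()
    for project in projects:
        key = project["key"]
        resp = requests.get(
            f"{BASE_URL}/rest/api/2/search",
            params={"jql": f"project = {key}", "maxResults": 0},
            auth=AUTH,
        ).json()
        counts[key] = resp["total"]
    return counts

if __name__ == "__main__":
    for key, total in sorted(issue_counts_per_project().items()):
        print(f"{key}: {total} issues")
```

Using maxResults=0 means JIRA returns only the total for each query, not the issues themselves, so the requests stay cheap even on a big instance.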
The company I work for makes a plugin that can do that - Structure.
For example, you can build a structure containing all issues in the available projects, group them by project, and add a column showing the number of sub-items (issues) in each group (project).
You can also add a structure to a dashboard or a Confluence page.
On a large JIRA instance it would be a bit on the expensive side to use it just for that alone, though...
I'm trying to build a vertical (meta) search engine for a particular industry, something similar to indeed.com (a job search engine) and hotelscombined.com (a hotel search engine). I would like to know how these two search engines build up their search results.
1) Do they use APIs of the other websites they serve results from? (Seems odd to me, since some results come from small and primitive sites.)
2) Do the other websites push updates to these search engines? (Also seems odd, as above.)
3) Do they internally understand and create a map for each website they serve results from? (If so, they probably need to constantly monitor the structure of those sites for changes. Seems error-prone to me.)
4) Any other possibilities?
I don't even know where to start, so any pointers in the right direction are much appreciated (books, tutorials, hints, ideas...).
Thanks
It is mostly a mix of 1 and 3. Ideally, the site will have some sort of API that they expose and document. If not, you have to do data scraping: basically, you reverse-engineer their pages. If they fetch results asynchronously via an undocumented API, you can use that API as well (until they make a breaking change). Otherwise, it's simply a matter of pulling the text straight out of the HTML.
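To make the scraping route concrete, here is a minimal sketch (Python with requests and BeautifulSoup; the URL and CSS selectors are invented, since every site's markup is different, and this is exactly the part that breaks when they change their HTML):

```python
import requests
from bs4 import BeautifulSoup

def scrape_listings(url):
    """Pull listings straight out of a page's HTML (selectors are hypothetical)."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for item in soup.select("div.job-listing"):  # made-up selector
        listings.append({
            "title": item.select_one("h2.title").get_text(strip=True),
            "company": item.select_one("span.company").get_text(strip=True),
            "url": item.select_one("a")["href"],
        })
    return listings

# Example usage (hypothetical site):
# for job in scrape_listings("https://example-jobs.com/search?q=engineer"):
#     print(job["title"], "-", job["company"])
```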
I don't know of any more advanced techniques since I don't do this myself, but several of my acquaintances have gone on to work on mobile apps that need to do this sort of thing with sports scores and such (not for searching, but the same requirement: get someone else's data into your own database). The low-tech "pull it from the HTML until they change the HTML and break everything" approach is standard practice where they work.
Option 2 is possible, but to do it you have to either make business arrangements with every source of data you want to use, or gain enough market presence that everyone wants to upload their data to you.
Also, you don't do this while actually searching (unless you have other constraints, as Charles Duffy points out in his comment). You run a process that regularly goes out, gets all the data it can find, and inserts it into your own database, which you then search. This lets you decouple data gathering from data searching: your search page won't have to know about and handle errors from the scraper, and the scraper only has to "get all the data" from each source instead of having to translate queries from your site into searches against each source.
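A sketch of that decoupling (Python with sqlite3; the schema is invented, the scrape argument would be a per-site scraper like the one sketched above, and how you schedule refresh_database - cron, a worker queue - is up to you):

```python
import sqlite3

def refresh_database(sources, scrape, db_path="listings.db"):
    """Gathering side: run on a schedule, fill the local database.
    sources: list of URLs to scrape; scrape: a function returning dicts with title/company/url."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (title TEXT, company TEXT, url TEXT PRIMARY KEY)"
    )
    for source in sources:
        try:
            rows = scrape(source)
        except Exception as exc:
            # Errors stay in the scraper process; the search page never sees them.
            print(f"skipping {source}: {exc}")
            continue
        conn.executemany(
            "INSERT OR REPLACE INTO listings (title, company, url) VALUES (:title, :company, :url)",
            rows,
        )
    conn.commit()
    conn.close()

def search(term, db_path="listings.db"):
    """Searching side: only ever talks to the local database."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT title, company, url FROM listings WHERE title LIKE ?", (f"%{term}%",)
    ).fetchall()
    conn.close()
    return rows
```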
When we add internal links on a website, should the links include the domain or start with /? Which would be better from the SEO point of view? For example, if my page is www.testdomain.com/about.htm and I give an internal link to this page from another page, should I write the internal link as
<a href="http://www.testdomain.com/about.htm">About</a>
or
<a href="/about.htm">About</a>
Which would suit SEO better? Thanks in advance.
From an SEO standpoint: no difference whatsoever.
From a maintenance standpoint: please go with the relative form, <a href="/about.htm">About</a>.
Although there is no difference between the two styles, I would say stick to the standard method. In this case, the second one (the relative link, /about.htm) is the right way.
There are a couple of really good reasons to code relative URLs
1) It is much easier and faster to code.
When you are a web developer and you're building a site and there are thousands of pages, coding relative versus absolute URLs is a way to be more efficient. You'll see it happen a lot.
2) Staging environments
Another reason why you might see relative versus absolute URLs is some content management systems -- and SharePoint is a great example of this -- have a staging environment that's on its own domain. Instead of being example.com, it will be examplestaging.com. The entire website will basically be replicated on that staging domain. Having relative versus absolute URLs means that the same website can exist on staging and on production, or the live accessible version of your website, without having to go back in and recode all of those URLs. Again, it's more efficient for your web development team. Those are really perfectly valid reasons to do those things. So don't yell at your web dev team if they've coded relative URLs, because from their perspective it is a better solution.
Relative URLs will also cause your page to load slightly faster. However, in my experience, the SEO benefits of having absolute versus relative URLs in your website far outweigh the teeny-tiny bit longer that it will take the page to load. It's very negligible. If you have a really, really long page load time, there's going to be a whole boatload of things that you can change that will make a bigger difference than coding your URLs as relative versus absolute.
Page load time, in my opinion, is not a concern here. However, it is something that your web dev team may bring up when you try to explain to them that, from an SEO perspective, coding your website with relative rather than absolute URLs, especially in the nav, is not a good solution.
There are even better reasons to use absolute URLs
1) Scrapers
If you have all of your internal links as relative URLs, it would be very, very, very easy for a scraper to simply scrape your whole website and put it up on a new domain, and the whole website would just work. That sucks for you, and it's great for that scraper. But unless you are out there doing public services for scrapers, for some reason, that's probably not something that you want happening with your beautiful, hardworking, handcrafted website. That's one reason. There is a scraper risk.
2) Preventing duplicate content issues
But the other reason why it's very important to use absolute rather than relative URLs is that it really mitigates the duplicate content risk you get when the different versions of your website don't all resolve to one version. Google could potentially enter your site on any one of the four versions of a page (http:// and https://, with and without www.). To you they're the same page on the same domain; to Google they're four different pages on four different domains.
But once they enter your site, if all of your URLs are relative, they can then crawl and index your entire site using whichever version they came in on. Whereas if you have absolute links coded, even if Google enters your site on the www. version and that resolves, once they crawl to another page every link points explicitly to the version you coded (say, without the www.), so Google is not going to assume that all of the other pages on your website live at the www. version. That really cuts down on having different versions of each page of your website. If you have relative URLs throughout and haven't fixed this problem, you basically have four different websites.
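A quick illustration of why relative links spread the problem (Python, just using urljoin to show how the same relative link resolves against whichever version of the domain a crawler happens to enter on; example.com is a stand-in):

```python
from urllib.parse import urljoin

entry_points = [
    "http://example.com/",
    "http://www.example.com/",
    "https://example.com/",
    "https://www.example.com/",
]

# The same relative link in your navigation...
relative_link = "/about.htm"

# ...becomes a different absolute URL depending on where the crawler came in:
for entry in entry_points:
    print(urljoin(entry, relative_link))
# http://example.com/about.htm
# http://www.example.com/about.htm
# https://example.com/about.htm
# https://www.example.com/about.htm

# An absolute link resolves to the same URL no matter the entry point:
absolute_link = "https://www.example.com/about.htm"
print(urljoin(entry_points[0], absolute_link))  # https://www.example.com/about.htm
```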
Again, it's not always a huge issue. Duplicate content, it's not ideal. However, Google has gotten pretty good at figuring out what the real version of your website is.
You do want to think about internal linking, when you're thinking about this. If you have basically four different versions of any URL that anybody could just copy and paste when they want to link to you or when they want to share something that you've built, you're diluting your internal links by four, which is not great. You basically would have to build four times as many links in order to get the same authority. So that's one reason.
3) Crawl Budget
The other reason why it's pretty important not to do this is crawl budget.
When we talk about crawl budget, basically what that is, is that every time Google crawls your website, there is a finite depth they will go to. There's a finite number of URLs that they will crawl and then they decide, "Okay, I'm done." That's based on a few different things. Your site authority is one of them. Your actual PageRank, not toolbar PageRank, but how good Google actually thinks your website is, is a big part of that. But also how complex your site is, how often it's updated, things like that are also going to contribute to how often and how deeply Google is going to crawl your site.
It's important to remember when we think about crawl budget that, for Google, crawl budget costs actual dollars. One of Google's biggest expenditures as a company is the money and the bandwidth it takes to crawl and index the Web. All of that energy that goes into crawling and indexing the Web lives on servers, that bandwidth comes from servers, and that means using bandwidth costs Google actual, real dollars.
So Google is incentivized to crawl as efficiently as possible, because when they crawl inefficiently, it costs them money. If your site is not efficient to crawl, Google is going to save itself some money by crawling it less frequently and crawling fewer pages per crawl. That can mean that if you have a site that's updated frequently, your site may not be updated in the index as often as you're updating it. It may also mean that Google, while it's crawling and indexing, may be crawling and indexing a version of your website that isn't the version you really want it to crawl and index.
So having four different versions of your website, all of which are completely crawlable to the last page, because you've got relative URLs and you haven't fixed this duplicate content problem, means that Google has to spend four times as much money in order to really crawl and understand your website. Over time they're going to do that less and less frequently, especially if you don't have a really high authority website. If you're a small website, if you're just starting out, if you've only got a medium number of inbound links, over time you're going to see your crawl rate and frequency impacted, and that's bad. We don't want that. We want Google to come back all the time, see all our pages. They're beautiful. Put them up in the index. Rank them well. That's what we want. So that's what we should do.
I'm not sure if this question will have a single answer, or even a concise one-size-fits-all answer, but I thought I would ask nonetheless. The problem isn't language-specific either, but it may have some sort of pseudo-algorithm as an answer.
Basically I'm trying to learn about how spiders work, and from what I can tell no spider I've found handles hierarchy. They just list the content or the links, with no ordering.
My question is this: we look at a site and can easily determine visually what links are navigational, content related or external to a site.
How could we automate this? How could we programmatically help a spider determine parent and child pages?
Of course the first answer would be to use the URL's directory structure.
E.g. www.stackoverflow.com/questions/spiders
spiders is a child of questions, questions is a child of the base site, and so on.
But nowadays hierarchy is usually flat, with IDs being referenced in the URL.
So far I have 2 answers to this question and would love some feedback.
1: Occurrence.
The links that occur the most across all pages would be dubbed navigational. This seems like the most promising design, but I can see issues popping up with dynamic links and the like, though they seem minuscule. (There's a rough sketch of this below, after option 2.)
2: Depth.
For example, how many times do I need to click through a site to get to a certain page? This seems doable, but if some information is advertised on the home page yet actually lives at the bottom level, it would be wrongly classified as a top-level page or node.
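To make option 1 concrete, here is roughly what I have in mind (a minimal sketch in Python, even though the real back end will likely be Ruby on Rails; the 0.8 threshold is just a guess):

```python
from collections import Counter

def classify_links(pages, nav_threshold=0.8):
    """pages: {page_url: set of links found on that page}.
    Links appearing on at least nav_threshold of all pages are dubbed navigational;
    everything else is treated as content."""
    counts = Counter()
    for links in pages.values():
        counts.update(links)

    total_pages = len(pages)
    navigational, content = set(), set()
    for link, occurrences in counts.items():
        if occurrences / total_pages >= nav_threshold:
            navigational.add(link)
        else:
            content.add(link)
    return navigational, content

# e.g. classify_links({
#     "/": {"/about", "/contact", "/questions/1"},
#     "/questions": {"/about", "/contact", "/questions/2"},
#     "/about": {"/about", "/contact"},
# })
# -> navigational: {"/about", "/contact"}, content: {"/questions/1", "/questions/2"}
```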
So has anyone got any thoughts or constructive criticism on how to make a spider judge hierarchy in links?
(If anyone is really curious, the back end of the spider will most likely be Ruby on Rails.)
What is your goal? If you want to crawl a smaller number of websites and extract useful data for some kind of aggregator, it's best to build a focused crawler (write a crawler for every site).
If you want to crawl millions of pages... well, then you must be very familiar with some advanced concepts from AI.
You can start with this article: http://www-ai.ijs.si/SasoDzeroski/ECEMEAML04/presentations/076-Znidarsic.pdf
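To give a feel for the focused-crawler approach: each site gets its own small set of extraction rules, and the crawler is just a loop over those rules. A minimal sketch (Python; the site names, URLs and selectors are all invented):

```python
import requests
from bs4 import BeautifulSoup

# One entry per site you care about -- the "write a crawler for every site" part.
# All URLs and selectors here are invented for illustration.
SITE_RULES = {
    "example-board": {
        "start_url": "https://jobs.example.com/latest",
        "item_selector": "li.posting",
        "title_selector": "a.title",
    },
    "another-board": {
        "start_url": "https://careers.example.org/openings",
        "item_selector": "div.opening",
        "title_selector": "h3",
    },
}

def crawl_site(rules):
    """Fetch one site's listing page and extract items using its own selectors."""
    soup = BeautifulSoup(requests.get(rules["start_url"], timeout=10).text, "html.parser")
    for item in soup.select(rules["item_selector"]):
        title = item.select_one(rules["title_selector"])
        if title:
            yield title.get_text(strip=True)

def crawl_all():
    for site, rules in SITE_RULES.items():
        for title in crawl_site(rules):
            print(site, "->", title)
```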