How to archive old nodes in Umbraco 8 - umbraco

We're working on a news website with a large amount of data: around 1 million news items and 500 GB of media. We did some research on best practices, but there is a lack of resources on handling this issue.
We concluded that we should archive the old news items that are rarely visited by unpublishing the old year containers (2009, 2010, ... 2014), to keep the website fast both in the backoffice and on the front end. After unpublishing the containers, we noticed that the Examine index files are still large, and the news nodes are no longer available via their original URLs because we unpublished their parents.
Please provide me with any insight that can help.

If you unpublish the parent folder, the news articles will no longer be available, precisely because you've unpublished the nodes. If you unpublish any part of a path in Umbraco, the pages beneath it will no longer be served.
The indexes will still be large, because unpublished content is still stored in some of the indexes.
I know of a few agencies that have dealt with something similar to what you are trying to do, and they dealt with it by archiving off the old articles to Elastic or another similar external indexing service. The original articles are then deleted to keep the site fast. The archive page for the site then serves up the archived articles from the external index, rather than from Umbraco. This does however mean that the older articles effectively become read only.
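A rough sketch of that archiving pattern, assuming Elasticsearch as the external index (the endpoint, index name, and field names below are placeholders, and the glue code is something you would write yourself rather than anything Umbraco or Examine ships with): copy each old article into the external index before deleting its node, then have the archive page query the index instead of Umbraco.

```python
import requests

ES = "http://localhost:9200"   # assumed Elasticsearch endpoint
INDEX = "news-archive"         # assumed index name

def archive_article(node_id, title, body, published):
    """Copy one old news item into the external index before deleting its node."""
    doc = {"title": title, "body": body, "published": published}
    requests.put(f"{ES}/{INDEX}/_doc/{node_id}", json=doc).raise_for_status()

def search_archive(term, size=20):
    """Called by the archive page instead of querying Umbraco/Examine."""
    query = {"query": {"match": {"body": term}}, "size": size}
    resp = requests.get(f"{ES}/{INDEX}/_search", json=query)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```

The archived articles really are read-only at that point, which is the trade-off described above.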

Related

How do "Indeed.com" and "hotelscombined.com" search other sites?

I'm trying to build a vertical (meta) search engine for a particular industry. I'm trying to do something similar to "indeed.com" (job search engine) and "hotelscombined.com" (hotel search engine). I would like to know how these two search engines build up their search results.
1) Is it using APIs of the other websites they serve results from? (seems odd to me since some results come from small and primitive sites).
2) Do other websites push updates to these search engines? (Also seems odd, as above.)
3) Do they internally understand and create a map for each website they serve results from? (if so, then maybe they need to constantly monitor the structure of these sites for any changes. Seems error prone to me).
4) Any other possibilities?
I don't even know where to start, so any pointers in the right direction are much appreciated. (books, tutorials, hints, ideas...)
Thanks
It is mostly a mix of 1 and 3. Ideally, the site will have some sort of API they expose and document. If not, you have to do data scraping. Basically, you reverse-engineer their page. If they get results asynchronously via an undocumented API, you can use that API as well (until they make a breaking change). Otherwise, it's simply a matter of pulling the text straight out of the HTML.
I don't know of any more advanced techniques since I don't do this myself, but several of my acquaintances have gone on to work on mobile apps that need to do this sort of thing with sports scores and such (not for searching, but the same requirement - get someone else's data into your database). The low-tech "pull it from the HTML until they change the HTML and break everything" approach is standard practice where they work.
2 is possible, but to do it you have to either make business arrangements with every source of data you want to use, or gain enough market presence for everyone to want to upload their data.
Also, you don't do this while actually searching (unless you have other constraints, as Charles Duffy points out in his comment). You run a process that regularly goes out, gets all the data it can find, and inserts it into your own database, which you then search. This lets you decouple data gathering from data searching - your search page won't have to know about and handle errors from the scraper, and the scraper only has to "get all the data" from each source instead of having to translate queries from your site into searches against each source.
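A minimal sketch of that decoupled approach (the source URL and the CSS selector are made-up placeholders for whatever sites you end up scraping): a scheduled job fetches each source, parses it, and writes rows into your own database; the search page only ever queries that database.

```python
import sqlite3
import requests
from bs4 import BeautifulSoup

DB = "listings.db"
SOURCES = ["https://jobs.example-source.test/listings"]  # placeholder source

def init_db():
    with sqlite3.connect(DB) as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS listings
                        (url TEXT PRIMARY KEY, title TEXT, source TEXT)""")

def scrape_once():
    """Run from cron or a scheduler, never from the search request itself."""
    with sqlite3.connect(DB) as conn:
        for source in SOURCES:
            html = requests.get(source, timeout=30).text
            soup = BeautifulSoup(html, "html.parser")
            # The selector is a guess at the source's markup and will break
            # whenever that markup changes -- the usual scraping risk.
            for link in soup.select("a.listing-title"):
                conn.execute(
                    "INSERT OR REPLACE INTO listings VALUES (?, ?, ?)",
                    (link.get("href"), link.get_text(strip=True), source))

if __name__ == "__main__":
    init_db()
    scrape_once()
```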

Internal linking with domain or with / for SEO

When we mention internal links on a website, should the internal links be written with the full domain or starting with /? Which would be better from an SEO point of view? For example, if my page is www.testdomain.com/about.htm and I give an internal link to this page from another page, should I write the internal link as
<a href="http://www.testdomain.com/about.htm">About</a>
or
<a href="/about.htm">About</a>
Which would suit SEO better? Thanks in advance.
From an SEO standpoint: no difference whatsoever.
From a maintenance standpoint: please go with the relative form, <a href="/about.htm">About</a>
Although there is no difference between the two styles, I would say stick to the standard method. In this case the second one, <a href="/about.htm">About</a>, is the right way.
There are a couple of really good reasons to code relative URLs
1) It is much easier and faster to code.
When you are a web developer and you're building a site and there are thousands of pages, coding relative rather than absolute URLs is a way to be more efficient. You'll see it happen a lot.
2) Staging environments
Another reason why you might see relative versus absolute URLs is that some content management systems -- and SharePoint is a great example of this -- have a staging environment that's on its own domain. Instead of being example.com, it will be examplestaging.com. The entire website will basically be replicated on that staging domain. Having relative rather than absolute URLs means that the same website can exist on staging and on production, or the live accessible version of your website, without having to go back in and recode all of those URLs. Again, it's more efficient for your web development team. Those are perfectly valid reasons to do these things. So don't yell at your web dev team if they've coded relative URLs, because from their perspective it is a better solution.
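As a small illustration of the staging point (both domains below are hypothetical, Python standard library only): the same relative href resolves correctly on whichever host happens to serve the page, so nothing needs to be recoded between environments.

```python
from urllib.parse import urljoin

# The same relative link, rendered on staging and on production
# (both domains are hypothetical).
for base in ("https://examplestaging.com/products/widget",
             "https://example.com/products/widget"):
    print(urljoin(base, "/about.htm"))
# -> https://examplestaging.com/about.htm
# -> https://example.com/about.htm
```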
Relative URLs will also cause your page to load slightly faster. However, in my experience, the SEO benefits of having absolute versus relative URLs in your website far outweigh the teeny-tiny bit longer that it will take the page to load. It's very negligible. If you have a really, really long page load time, there's going to be a whole boatload of things that you can change that will make a bigger difference than coding your URLs as relative versus absolute.
Page load time, in my opinion, is not a concern here. However, it is something that your web dev team may bring up when you try to make the case that, from an SEO perspective, coding your website with relative rather than absolute URLs, especially in the nav, is not a good solution.
There are even better reasons to use absolute URLs
1) Scrapers
If you have all of your internal links as relative URLs, it would be very, very, very easy for a scraper to simply scrape your whole website and put it up on a new domain, and the whole website would just work. That sucks for you, and it's great for that scraper. But unless you are out there doing public services for scrapers, for some reason, that's probably not something that you want happening with your beautiful, hardworking, handcrafted website. That's one reason. There is a scraper risk.
2) Preventing duplicate content issues
But the other reason why it's very important to have absolute rather than relative URLs is that it really mitigates the duplicate content risk that arises when you don't have all of the versions of your website resolving to one version. Google could potentially enter your site on any one of the four common variants of a URL: http or https, with or without www. To you, those are the same page on the same domain; to Google, they are four different pages on four different domains.
If all of your URLs are relative, Google can then crawl and index your entire site using whichever of those formats it happened to enter on. Whereas if you have absolute links coded, even if Google enters your site on the www. version, as soon as it crawls to a page whose links are coded without the www., it is not going to assume that all of the other pages on your website and their internal link juice live at the www. version. That really cuts down on different versions of each page of your website. If you have relative URLs throughout and you haven't fixed this problem, you basically have four different websites.
Again, it's not always a huge issue. Duplicate content, it's not ideal. However, Google has gotten pretty good at figuring out what the real version of your website is.
You do want to think about internal linking when you're thinking about this. If there are basically four different versions of any URL that anybody could copy and paste when they want to link to you or share something that you've built, you're diluting those links by four, which is not great. You basically would have to build four times as many links in order to get the same authority. So that's one reason.
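To make the "four versions" point concrete, here is a small illustration (the domain is hypothetical, Python standard library only) of how a relative link inherits whichever variant a crawler entered on, while an absolute link always resolves to the one canonical host:

```python
from itertools import product
from urllib.parse import urljoin

# The four common entry points for the same site:
# http vs https, with and without "www." (hypothetical domain).
entry_points = [f"{scheme}://{host}/news/"
                for scheme, host in product(("http", "https"),
                                            ("example.com", "www.example.com"))]

relative_href = "/about.htm"
absolute_href = "https://www.example.com/about.htm"

for page in entry_points:
    # A relative link resolves against whichever version was crawled...
    print(urljoin(page, relative_href))
    # ...while an absolute link always points at one version.
    print(urljoin(page, absolute_href))
```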
3) Crawl Budget
The other reason why it's pretty important not to do this is crawl budget.
When we talk about crawl budget, basically what that means is that every time Google crawls your website, there is a finite depth, a finite number of URLs, that it will crawl before it decides, "Okay, I'm done." That's based on a few different things. Your site authority is one of them. Your actual PageRank, not toolbar PageRank, but how good Google actually thinks your website is, is a big part of that. But also how complex your site is, how often it's updated, things like that are also going to contribute to how often and how deep Google is going to crawl your site.
It's important to remember when we think about crawl budget that, for Google, crawl budget costs actual dollars. One of Google's biggest expenditures as a company is the money and the bandwidth it takes to crawl and index the Web. All of the energy that goes into crawling and indexing the Web lives on servers, that bandwidth comes from servers, and that means that using bandwidth costs Google actual, real dollars.
So Google is incentivized to crawl as efficiently as possible, because when it crawls inefficiently, it costs them money. If your site is not efficient to crawl, Google is going to save itself some money by crawling it less frequently and crawling fewer pages per crawl. That can mean that if you have a site that's updated frequently, your site may not be updated in the index as frequently as you're updating it. It may also mean that Google, while it's crawling and indexing, may be crawling and indexing a version of your website that isn't the version that you really want it to crawl and index.
So having four different versions of your website, all of which are completely crawlable to the last page, because you've got relative URLs and you haven't fixed this duplicate content problem, means that Google has to spend four times as much money in order to really crawl and understand your website. Over time they're going to do that less and less frequently, especially if you don't have a really high authority website. If you're a small website, if you're just starting out, if you've only got a medium number of inbound links, over time you're going to see your crawl rate and frequency impacted, and that's bad. We don't want that. We want Google to come back all the time, see all our pages. They're beautiful. Put them up in the index. Rank them well. That's what we want. So that's what we should do.

Proper Amazon AWS S3 usage

Here's a quick rundown. We have two apps. Both apps are different from each other in every aspect.
Our first app is a company profile (4 page layout: home, products, about, contact). I don't think it will generate the same traffic as social web sites online. It allows staff members to post product photos. Is it better to have an AWS S3 account to store this content on? Or would I be better off storing the files on the local web server?
Our second app is more social oriented. Here, we have decided to use an S3 bucket for this app. Since we already have an AWS account, should we create a bucket for the first app on this AWS account? Or will this just render more expenses in the long run? What are your thoughts?
On another note: should app-related content (logo, icons, background images, buttons, etc.) be stored on the S3 account or on the local server? What is the general consensus on this?
I would keep anything that is "hard-coded" into your site, i.e. the images in your CSS and anything in your HTML/ERB/HAML views that is referenced by file name, as part of the web server itself. You don't have to do this, but one reason to do it is that if there's any interruption of Amazon S3, or any kind of configuration issue, etc., at any point in the future, these "basic" visual resources won't be affected, so your site will still look pretty much intact.
Plus, this kind of content has a specific appropriate place to live (your public folder) and Rails helpers that make it really easy to access.
I just can't see a reason to mess with S3 if it's not going to be a high volume of data and bandwidth.
On the other hand, if it IS going to be very high bandwidth and you're using a cheap hosting platform that doesn't give you tons of bandwidth, moving those images to S3 would save you some. But are you really in danger of exceeding your bandwidth limits on the server?
Lastly, I think your OTHER app is using S3 as expected; that's a good use of S3. I would not necessarily recommend mixing buckets with other apps, though. The simplest reason for this is, let's say one of your apps grows and at some point merges with, partners with, or is sold to another party. Really, imagine any scenario where another interest becomes involved in either app.
Now you've got no way to maintain separation of access between the resources your two different apps are using. So you have to either migrate to a new platform for one of them, or you have to deal with them being 'joined at the hip' so to speak.
In other words, I just don't see an upside to sharing AWS resources between two apps that aren't directly related to each other. No reason to make them a package deal if they aren't naturally a package deal.
On the other hand if it DOES make sense for them to be totally linked, then it doesn't really matter either way.
Hope that helps you think about it. Good luck!
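As a concrete sketch of the bucket-separation point above (bucket names, the IAM user name, and region handling are all assumptions): give each app its own bucket and scope each app's credentials to its own bucket only, so the two can later be split apart without a migration.

```python
import json
import boto3

s3 = boto3.client("s3")
iam = boto3.client("iam")

# One bucket per app (names are placeholders and must be globally unique;
# outside us-east-1, create_bucket also needs a CreateBucketConfiguration).
s3.create_bucket(Bucket="companyprofile-uploads")
s3.create_bucket(Bucket="socialapp-uploads")

# Limit the social app's deploy user to its own bucket only.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        "Resource": "arn:aws:s3:::socialapp-uploads/*",
    }],
}
iam.put_user_policy(UserName="socialapp-deploy",   # hypothetical IAM user
                    PolicyName="socialapp-s3-only",
                    PolicyDocument=json.dumps(policy))
```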
S3 is ridiculously cheap, so I'd put any image uploads on there.
In the past I've put static images, stylesheets, JavaScript and so forth on S3, but these days I just leave them on the web server. The reason for this is Jammit, which is oh so good at handling the packaging, updating, compressing, etc. of assets.

How do I safely and inexpensively allow images on my site?

I have developed a social networking website for gardeners, and am interested in giving users the ability to add images to their "tweets".
If I allow them to upload images to the actual site, it seems like this will quickly become expensive (this is a side project, not funded by anyone other than myself and my own obsessions). Let's say the site becomes moderately popular, with 100K users posting one image a week of only 250 KB in size. That's (100,000 × 52 × 250 KB ≈ 1.2 TB) per year in storage (and that doesn't take into account increased bandwidth). Plus I'd have to increase the server load to scale the images. I'm not sure if I should just go ahead with this, or if there are better possibilities.
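A quick script version of that back-of-envelope estimate (the S3 price below is an assumed ballpark for standard storage, not a quoted rate):

```python
users = 100_000
images_per_week = 1
image_kb = 250

gb_per_year = users * images_per_week * 52 * image_kb / 1024 / 1024
print(f"~{gb_per_year:,.0f} GB added per year")   # roughly 1,240 GB

# Assumed S3 standard-storage price of ~$0.023 per GB-month.
print(f"~${gb_per_year * 0.023:,.0f}/month once a full year's images are stored")
```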
Linking to other sites seems better in some ways. You do risk broken links, but a larger concern for me is security: XSS.
The application is on Rails 3, using MongoDB / Mongoid as the backend, if that matters.
I'm looking for solutions such as:
APIs that store images on external sites. What would be ideal is the ability to upload an image to my site, and make an API call to store it on an external site.
APIs (perhaps Javascript APIs) that make it easy to link to one or more external image hosting sites securely.
Markdown or similar markup that allow linking to external images securely. I am interested in giving users the ability to format their posts in limited ways, so this might solve two problems at the same time. I notice that this is what Stack Overflow does.
Security libraries that whitelist image URL patterns
Advice on why I am thinking about this problem wrong. For example, maybe I should just store the images. Storage is really not all that expensive, and it does allow me to create a very clean user experience.
My objectives are (in order):
- Secure, both for my own site, and to not allow XSS attacks against other sites
- Best possible user experience
- Easy to maintain and implement
What have you done to allow user-supplied images on your site?
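For the URL-whitelist idea mentioned in the question, a minimal sketch (the allowed hosts are placeholders; a real implementation would also want length limits and HTML-escaping of everything else in the post):

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"i.imgur.com", "live.staticflickr.com"}   # placeholder hosts
IMAGE_PATH = re.compile(r"\.(png|jpe?g|gif)$", re.IGNORECASE)

def safe_image_url(url: str) -> bool:
    """Accept only https image URLs on known hosts with image extensions."""
    parts = urlparse(url)
    return (parts.scheme == "https"
            and parts.hostname in ALLOWED_HOSTS
            and bool(IMAGE_PATH.search(parts.path)))

# Render <img src=...> only when safe_image_url(url) is True;
# otherwise fall back to a plain, escaped text link.
```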
You're thinking about the problem wrong ;) or rather not at the right time.
Don't worry about the bandwidth now, when you don't have that many users yet. Concentrate on making the site user-friendly and popular first. Performance, bandwidth, disk space - these are the things you'll work on when they become problems. By the time you have 100k users, the cost of buying that space and bandwidth on, say, Amazon S3 may not be an issue anymore.
Why not use a service like Amazon S3? It's cheap, very cheap (with Reduced Redundancy Storage), and the most important plugins, like Paperclip, support it out of the box...
You will need to look at the T&Cs of picture hosts (Flickr etc.) and see if your usage is acceptable. Flickr has an API; I'm not sure about the others, just search for "<host> API".
Flickr's API is at:
http://www.flickr.com/services/api/

SQL Databases Quota Website Hosting

Is a 10 GB total SQL database quota enough for a site that only has comments, no pictures or videos? Say each post was (in LONGTEXT, under MySQL) about a paragraph - would this be enough for, say, a few million or a hundred million posts? Roughly how many? Really appreciate the help - I found a good host, "http://www.ixwebhosting.com/index.php/v2/pages.hostingPlans", but it only offers the 10 GB.
The basic answer is no. The much more complicated answer is not even close. :)
You can get a rough sizing guess for database space by figuring out what content you want in your database, and multiplying that by the target number of users.
For instance, if you want users to be able to post an average webpage (say 4 KB), plus their user profile, etc. (perhaps another 1-2 KB), that would be 6 KB per user (in this very contrived example). If the users could upload additional content (forum posts, blog entries, friends lists, etc.), this would be on top of that.
This doesn't include things like your own maintenance tables, other database storage requirements, etc.
This of course assumes that you aren't putting pictures, video, etc. in the DB, which would bloat it even quicker.
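A rough worked version of that estimate (the per-post size is an assumption; a paragraph of text plus row overhead and indexes is taken here as ~1 KB, and this covers raw post text only - profiles, other tables, and maintenance data come on top):

```python
quota_gb = 10
bytes_per_post = 1024          # assumed: ~1 KB per paragraph-sized post

for posts in (1_000_000, 10_000_000, 100_000_000):
    gb = posts * bytes_per_post / 1024 ** 3
    verdict = "fits" if gb <= quota_gb else "does not fit"
    print(f"{posts:>11,} posts ≈ {gb:,.1f} GB -> {verdict} in {quota_gb} GB")
```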
It depends on so many things. How many users do you think you will have? What sort of content will they be allowed to upload? What quotas will you give your users?
But without knowing any of these, I'll just guess: no, 10 GB is not going to be enough for a site like Facebook. You would need terabytes of storage if your site became popular. The high-resolution photos of people's cats alone would fill your 10 GB.
