Is a 10 GB total SQL database quota enough for a site that only has comments, not any pictures or videos? Say each post was (in LONGTEXT, under MySQL) about a paragraph - would that be enough for, say, a few million or a hundred million posts? Roughly how many? Really appreciate the help - I found a good host site "http://www.ixwebhosting.com/index.php/v2/pages.hostingPlans", but it only offers the 10 GB.
The basic answer is no. The much more complicated answer is not even close. :)
You can get a rough sizing estimate for database space by figuring out what content you want in your database, and multiplying that by the target number of users.
For instance, if you want users to be able to post an average webpage (say 4 KB), plus their user profile, etc. (perhaps another 1-2 KB), that would be 6 KB per user (in this very contrived example). If the users could upload additional content (forum posts, blog entries, friends lists, etc.), that would be on top of it.
This doesn't include things like your own maintenance tables, other database storage requirements, etc.
This of course assumes that you aren't putting pictures, video, etc. in the database, which would bloat it even more quickly.
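To put rough numbers on the original question: a paragraph of plain text is on the order of 0.5-1 KB, plus per-row overhead for indexes and metadata. Here is a back-of-envelope sketch; the ~1.5 KB per post is an assumption for illustration, not a measurement:

    # Rough capacity estimate for a comments-only site.
    # Assumed figures: ~1 KB of text per post plus ~0.5 KB of row/index overhead.
    BYTES_PER_POST = 1024 + 512
    QUOTA_BYTES = 10 * 1024 ** 3  # 10 GB

    for posts in (1_000_000, 10_000_000, 100_000_000):
        total_gb = posts * BYTES_PER_POST / 1024 ** 3
        verdict = "fits" if posts * BYTES_PER_POST <= QUOTA_BYTES else "does not fit"
        print("%11d posts ~ %7.1f GB -> %s in 10 GB" % (posts, total_gb, verdict))

Under those assumptions a few million posts fit comfortably, but a hundred million posts would need on the order of 150 GB, far beyond the quota.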
It depends on so many things. How many users do you think you will have? What sort of content will they be allowed to upload? What quotas will you give your users?
But without knowing any of these, I'll just guess: no, 10 GB is not going to be enough for a site like Facebook. You would need terabytes of storage if your site became popular. The high-resolution photos of people's cats alone would fill your 10 GB.
I've created a graph model for a social network and need some concrete advice on how the design will scale. Pardon the n00bness of these questions, but I'm not finding very many clear examples out there...
NOTE: the status update and activity nodes/relationships are linked lists, with the newest entries constantly being placed at the head of the list.
Linked lists allow for news feed generation, but there could be hundreds of records per user - I presume the LIMIT clause isn't sufficient even though the data is in descending order by date. Do I have to have a separate linked list that holds only the most recent 10 status/activity updates (constantly replacing the head of that list) to get better activity feed generation, or will one list, properly sorted, do the job (with a LIMIT clause)?
These nodes all have properties (json data with content, IDs, etc) - how do "global" indexes come into play here so that I can find, for example, users that like Depeche Mode without waiting a lifetime for results? I know how to add a node to an index, just wondering if I'm missing a part of the picture here..
Security - logins and passwords. I presume a graph database could store them, but I'd guess it's a security risk at this point - would it be better to keep these in Postgres, etc.?
How would you improve this model to handle scalability? Imagine 20 million users banging away on this..
Imagine 40 million users - what's wrong with this model when it comes to scalability?
Part 1.
You can write cypher or gremlin queries that do what you want. Remember that you can traverse forwards and backwards on edges. Given a user, it should always be relatively constant time to pull up the last ten things they did.
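As a rough illustration of Part 1: the labels, relationship types and property names below (User, POSTED, created_at) are assumptions about the model, not something from the question, but a date-ordered query with a LIMIT is usually all a per-user feed needs:

    # Minimal sketch using the official neo4j Python driver; schema names are assumed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    FEED_QUERY = """
    MATCH (u:User {id: $user_id})-[:POSTED]->(s:Status)
    RETURN s.content AS content, s.created_at AS created_at
    ORDER BY s.created_at DESC
    LIMIT 10
    """

    def latest_activity(user_id):
        # Returns the ten most recent status updates for one user.
        with driver.session() as session:
            return [record.data() for record in session.run(FEED_QUERY, user_id=user_id)]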
Part 2.
If you are representing a band as an entity of a certain type, index on that attribute. Then you'll be able to pull out that node and traverse outwards to find all the users who like that band. If you don't have an independent entity, or it is somehow implicit, you'll want to enable full text search for your respective graph database.
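Continuing the same sketch (again with assumed labels and relationship types, and reusing the driver from the previous snippet), an index on the band name lets the query start from a single node and then walk the LIKES edges outward:

    # Assumed schema: (:User)-[:LIKES]->(:Band {name: ...}).
    # One-time setup so the name lookup is an index hit rather than a scan
    # (index syntax varies between Neo4j versions):
    #   CREATE INDEX ON :Band(name)
    FANS_QUERY = """
    MATCH (b:Band {name: $band_name})<-[:LIKES]-(u:User)
    RETURN u.id AS user_id, u.name AS name
    """

    def fans_of(band_name):
        with driver.session() as session:
            return [record.data() for record in session.run(FANS_QUERY, band_name=band_name)]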
Part 3.
Learn more about security. The only thing you would be storing would be a properly hashed string of the user's password. At that point you would be fine using any graph db and good security practices.
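For what that storage looks like in practice, here is a minimal sketch using only the Python standard library (PBKDF2 via hashlib; the iteration count is an illustrative figure, not a recommendation):

    import hashlib
    import hmac
    import os

    ITERATIONS = 200_000  # illustrative; tune to your hardware and threat model

    def hash_password(password):
        # Store salt and hash together; the plain-text password is never stored.
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
        return salt.hex() + ":" + digest.hex()

    def verify_password(password, stored):
        salt_hex, digest_hex = stored.split(":")
        salt = bytes.fromhex(salt_hex)
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
        return hmac.compare_digest(candidate.hex(), digest_hex)

Whether those strings live in a graph node property or a Postgres column matters much less than never storing the password itself.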
Part 4/5.
Once you have one user, worry about the next thousand.
When you have a thousand users, worry about the next hundred thousand.
When you have one hundred thousand, worry about the next million.
When you have a million users, you can start worrying about the questions you asked.
Until you have at least 0.1% of the users/volume you want to scale to, it's mental masturbation to try and ask questions about how to scale up to a certain size.
My team is building an application and a question came up during a brainstorming meeting that I was unsure about so I figured I'd reach out to the Stack Overflow community to get some more opinions. Here's the gist of it:
Our application creates an account for organizations which have multiple users. Those users have posts and comments.
The original database design was to have a table for organizations, with each user related to it by the organization's id.
One of the developers suggested that we use a separate database for each organization's account to separate data between organizations and to increase performance. I'd never seen any Rails app that used multiple databases that way and I wasn't sure how to even do that in Rails.
My question to you then is:
Would we gain any benefit from using separate databases, or is this adding an unnecessary level of complexity to the application?
Example
Say there are 100 organizations. Each organization has 100 users. Each user has 100 posts and 100 comments.
Would querying through these tables be a major performance drain or would having 100 separate databases be unwieldy and cause more issues than it would be worth? Does this cause issues with migrations? The schema would be identical between organizations.
I'm not sure if that was a clear enough question so let me know if you need more information before answering.
I did read the following Stack Overflow articles but they really didn't help me with this decision.
Single or separate databases for separate customer accounts?
Use one large database or use single databases per customer
Don't do it. It will just add a level of unnecessary complexity.
Don't optimize for a problem before it becomes a problem. If your application gets to a point where database performance is a huge bottleneck, address that issue then. For now, concentrate on writing good, fast queries.
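To put the example numbers from the question in perspective (assuming one row per post and per comment), the totals are modest:

    # Row counts implied by the question's example.
    organizations = 100
    users = organizations * 100      # 10,000 users
    posts = users * 100              # 1,000,000 posts
    comments = users * 100           # 1,000,000 comments
    print(users, posts, comments)

A couple of million rows in properly indexed tables is well within what a single MySQL or PostgreSQL database handles comfortably.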
37 Signals just posted this article about how, with good hardware, they have managed to avoid sharding and the associated sys admin overhead.
You can shard the database, and there are various gems available for that, but in my opinion there is no need to create separate databases for your app. It's up to you, though; you can also improve DB performance in various other ways.
I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL, look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution is not going to scale up much further, with tens of millions of rows. We query things like "tell me the sum of all the value1 yesterday" or "show me the average of value2 in the last 8 hours". We do this in SQL but can happily change to doing it in code. SimpleDB's "eventual consistency" appears fine for our purposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search for solutions to this, I did come across the Stack Overflow question What is the best open source solution for storing time series data?, which was useful.
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and eventually got access to the beta programme for what eventually became DynamoDB, but was not able to talk about it.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index/range - e.g. timestamps. This gives you much greater confidence in searching, i.e. "show me all the data for device X between Monday and Friday".
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
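To illustrate the hash-key plus range-key pattern described above, here is a minimal sketch using boto3 (the current AWS SDK for Python); the table and attribute names are assumptions based on the row format in the question:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("sensor_readings")  # assumed table name

    def readings_between(device_id, start_ts, end_ts):
        # device_id is the partition (hash) key, utc_timestamp the sort (range) key.
        response = table.query(
            KeyConditionExpression=Key("device_id").eq(device_id)
            & Key("utc_timestamp").between(start_ts, end_ts)
        )
        return response["Items"]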
In my opinion, Amazon SimpleDB, as well as Microsoft Azure Tables, is a fine solution as long as your queries are quite simple. As soon as you try to do stuff that's absolutely a non-issue on relational databases, like aggregates, you begin to run into trouble. So if you are going to do some heavy reporting it might get messy.
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time-variable data in such a way that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time-series information.
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store your data in a way that most of your queries can be executed from a single domain without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is talked about here
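One way such a partitioning scheme could look, sketched with the boto 2.x SimpleDB API (the domain naming, attribute names and zero-padding format are all assumptions; since SimpleDB stores everything as strings and has no aggregate functions, values are padded so comparisons work and sums are computed client-side):

    import boto

    sdb = boto.connect_sdb()

    def put_reading(device_id, utc_timestamp, value1, value2):
        # One domain per month, e.g. sensors_2012-03, keeps each domain small.
        domain = sdb.create_domain("sensors_" + utc_timestamp[:7])
        item_name = "%s_%s" % (device_id, utc_timestamp)
        domain.put_attributes(item_name, {
            "device_id": device_id,
            "utc_timestamp": utc_timestamp,   # ISO-8601 sorts lexicographically
            "value1": "%012.3f" % value1,     # zero-padded so string comparison works
            "value2": "%012.3f" % value2,
        })

    def sum_value1(month, start_ts, end_ts):
        # SimpleDB has no SUM(), so fetch the matching rows and add them up in code.
        domain = sdb.get_domain("sensors_" + month)
        query = ("select value1 from `sensors_%s` "
                 "where utc_timestamp between '%s' and '%s'" % (month, start_ts, end_ts))
        return sum(float(item["value1"]) for item in domain.select(query))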
I have developed a social networking site for gardeners, and am interested in giving users the ability to add images to their "tweets".
If I allow them to upload images to the actual site, it seems like this will quickly become expensive (this is a side project, not funded by anyone other than myself and my own obsessions). Let's say the site becomes moderately popular, with 100K users posting one image a week, of only 250 KB in size. That's (100,000 * 250 KB * 52), or roughly 1.2 TB/year in storage (and that doesn't take into account increased bandwidth). Plus I'd have to increase the server load to scale the images. I'm not sure if I should just go ahead with this, or if there are better possibilities.
Linking to other sites seems better in some ways. You do have broken links, but a larger concern for me is security: XSS.
The application is on Rails 3, using MongoDB / Mongoid as the backend, if that matters.
I'm looking for solutions such as:
APIs that store images on external sites. What would be ideal is the ability to upload it to my site, and make an API call to store it on an external site.
APIs (perhaps Javascript APIs) that make it easy to link to one or more external image hosting sites securely.
Markdown or similar markup that allow linking to external images securely. I am interested in giving users the ability to format their posts in limited ways, so this might solve two problems at the same time. I notice that this is what Stack Overflow does.
Security libraries that whitelist image URL patterns (see the sketch after this list)
Advice on why I am thinking about this problem wrong. For example, maybe I should just store the images. Maybe a terabyte or so a year is not prohibitively expensive, and it does allow me to create a very clean user experience.
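On the whitelisting idea mentioned above, a minimal sketch of the sort of check such a library performs (the allowed hosts and the https-only rule are assumptions to illustrate the approach, not a complete defence against XSS):

    # Naive allow-list check for user-supplied image URLs (illustrative only).
    from urllib.parse import urlparse

    ALLOWED_HOSTS = {"i.imgur.com", "live.staticflickr.com"}   # assumed hosts
    ALLOWED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")

    def is_allowed_image_url(url):
        parsed = urlparse(url)
        return (
            parsed.scheme == "https"
            and parsed.hostname in ALLOWED_HOSTS
            and parsed.path.lower().endswith(ALLOWED_EXTENSIONS)
        )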
My objectives are (in order):
- Secure, both for my own site, and to not allow XSS attacks against other sites
- Best possible user experience
- Easy to maintain and implement
What have you done to allow user-supplied images on your site?
You're thinking about the problem wrong ;) or rather not at the right time.
Don't worry about the bandwidth now, when you don't have that many users yet. Concentrate on making the site user-friendly and popular first. Performance, bandwidth, disk space - these are the things you'll work on when they become problems. By the time you have 100K users, the cost of buying that space and bandwidth on, say, Amazon S3 may not be an issue anymore.
Why not use a service like Amazon S3? It's cheap, very cheap (with Reduced Redundancy Storage), and most importantly, plugins like Paperclip support it out of the box...
You will need to look at the T&Cs of picture hosts (Flickr, etc.) and see if your usage is acceptable. Flickr has an API; I'm not sure about the others - just search for "HOST api".
Flickr's API is at:
http://www.flickr.com/services/api/
I'm trying to figure out how to best implement a public data hosting service.
How do websites that let users upload pictures enforce their terms of service regarding obscene pictures? Do they use image processing algorithms to flag potential violations (too many skin-colored pixels)? I think Imageshack looks at the websites that their pictures are hotlinked on, and checks for keywords. If it detects anything porn related, then it removes the picture and bans the account. Are there other methods?
Is enforcement largely automated or is it based more on user reports?
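As a rough illustration of the "skin-coloured pixels" heuristic mentioned above, here is a naive sketch using Pillow; the RGB rule and the 40% cut-off are made-up figures, it is very prone to false positives, and in practice it would only queue images for human review:

    from PIL import Image

    def looks_skin_toned(r, g, b):
        # Very crude RGB rule of thumb for skin-like colours.
        return r > 95 and g > 40 and b > 20 and r > g and r > b and (r - min(g, b)) > 15

    def flag_for_review(path, threshold=0.4):
        img = Image.open(path).convert("RGB")
        img.thumbnail((256, 256))            # downsample; only a rough ratio is needed
        pixels = list(img.getdata())
        skin = sum(1 for r, g, b in pixels if looks_skin_toned(r, g, b))
        return skin / len(pixels) > threshold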
I suppose it depends on the scale of your "public data hosting service".
If it's something small, with maybe a couple of hundred pictures per day flowing in, you can moderate them on your own.
If it's a couple of hundred thousand, you'll need a number of human beings sorting the weeds out - either a moderator team or the users themselves submitting abuse reports.
Which way to go can depend on your budget and the financial success of your service, as well as on the type of service. If it's something simple like Rapidshare, where one user does not see what another does, the chances that users will see each other's content and thereby notice and hopefully report unacceptable content are small. If it's something very social like Flickr, you can bet reports will be flowing in.
I suppose you could automate something, but it's almost an impossible task. You can't automatically detect porn. You can't automatically detect images violating copyrights - building fingerprints of copyrighted material in order to compare them with the uploaded content is a real challenge even for companies with the resources of Rapidshare, YouTube and others. For now this kind of work can effectively be done only by humans.
There are also legal issues to it. In some countries the service owner is not liable for what users contribute (well, provided he's cooperative enough to delete certain content on request); in others he will face charges himself for not having pre-moderated all the incoming content. Think about this too, with regard to whatever you are going to launch and wherever you are going to launch it.
I don't have links, but while it's certainly a difficult task prone to errors, software to detect improper content does exist. Or at least that's what the Security Manager at NASA told me - whether it was just a means to scare me, I don't know ;-)