Let's say I have a server with 500 GB of disk space. Suppose I have to create 500 VMs, each with a 50 GB virtual disk. How can I do it?
I was reading through a few puzzles and came across this question.
Most VMs will allow you to create dynamic disks. So you create a dynamic disk of a maximum size (say 10GB), but it only actually uses what is written to the disk (which is generally much less).
Of course if you fill the disk with 10GB of data, then it uses 10GB of real storage.
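To put numbers on the puzzle, here is a minimal sketch of the overcommit arithmetic, using the figures from the question (500 GB physical, 500 VMs, 50 GB per virtual disk). The function name and the returned keys are made up for illustration.

```python
def thin_provisioning_budget(physical_gb: float, vm_count: int, virtual_disk_gb: float) -> dict:
    """Summarise how far a host is overcommitted when every VM uses a dynamic disk."""
    provisioned_gb = vm_count * virtual_disk_gb
    return {
        # Total space promised to the guests.
        "provisioned_gb": provisioned_gb,
        # How many times the promised space exceeds the real disk.
        "overcommit_ratio": provisioned_gb / physical_gb,
        # Average real space each VM may consume before the host fills up.
        "avg_real_gb_per_vm": physical_gb / vm_count,
    }


budget = thin_provisioning_budget(physical_gb=500, vm_count=500, virtual_disk_gb=50)
print(budget)
```

So the setup only works as long as the VMs write, on average, no more than about 1 GB each; past that the host runs out of real storage.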
What do I get by running multiple nodes on a single host? I am not getting availability, because if the host is down, the whole cluster goes with it. Does it make sense regarding performance? Doesn't one instance of ES take as many resources from the host as it needs?
Generally no, but if you have machines with ridiculous amounts of CPU and memory, you might want that to properly utilize the available resources. Avoiding big heaps with Elasticsearch is a good thing generally since garbage collection on bigger heaps can become a problem and in any case above 32 GB you lose the benefit of pointer compression. Mostly you should not need big heaps with ES. Most of the memory that ES uses is through memory mapped files, which relies on the OS cache. So just because you aren't assigning memory to the heap doesn't mean it is not being used: more memory available for caching means you'll be able to handle bigger shards or more shards.
So if you run more nodes, that advantage goes away: you waste memory on redundant heaps, and the nodes compete with each other for resources. As always, base these decisions on actual memory, cache, and CPU usage.
It depends on your host and how you configure your nodes.
For example, Elastic recommends allocating no more than about 32GB of RAM to the Elasticsearch heap (because of how Java compresses pointers) and leaving another 32GB for the operating system (mostly for disk caching).
Assuming you have more than 64GB of RAM on your host, let's say 128GB, it makes sense to run two nodes on the same machine, each configured with a 32GB heap, leaving the other 64GB for the operating system.
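The rule of thumb above can be written down as a small calculation. This is only a sketch of that reasoning (one heap of at most 32GB per node, plus an equal amount left to the OS cache); `plan_nodes` and its parameters are hypothetical names, not anything Elasticsearch provides.

```python
def plan_nodes(host_ram_gb: float, max_heap_gb: float = 32):
    """Return (node_count, heap_gb_per_node, os_cache_gb) for one host.

    Assumption: every node gets a heap of at most max_heap_gb and the same
    amount again is left to the operating system for disk caching.
    """
    nodes = max(1, int(host_ram_gb // (2 * max_heap_gb)))
    # On small hosts a single node gets half the RAM as heap.
    heap_gb = min(max_heap_gb, host_ram_gb / (2 * nodes))
    os_cache_gb = host_ram_gb - nodes * heap_gb
    return nodes, heap_gb, os_cache_gb


print(plan_nodes(128))  # the 128GB host from the example
print(plan_nodes(64))
print(plan_nodes(16))
```

Treat the result as a starting point only; whether a second node pays off still depends on measured heap, cache, and CPU usage.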
I have a GlusterFS setup with two nodes (node1 and node2) serving a replicated volume.
The volume contains many small files, 8kb - 200kb in size. When I subject node1 to heavy read load, the glusterfsd and glusterfs processes together use ~100% CPU on both nodes.
There is no write load on any of the nodes. But why is the CPU load so high, on both nodes?
As I understand it all the data is replicated to both nodes, so it "should" perform like a local filesystem.
This is commonly related to small files, e.g. if you have PHP apps running from a gluster volume.
This one bit me in the rear once. It mostly comes down to the fact that many PHP frameworks issue a lot of stat calls to see whether a file exists at a given spot; if it does not, they stat one level (directory) higher, or try a slightly different name. Repeat 1000 times. Per file.
Now here's the catch: if you use replication, that lookup for whether the file exists does not just happen on that node / the local brick, but on ALL the nodes / bricks involved. The cost can explode fast (especially on some cloud platforms, where IOPS are capped).
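As a rough back-of-the-envelope model of that fan-out (not a description of GlusterFS internals), the number of lookups grows as requests times probed paths times bricks. The function and parameter names below are made up for illustration.

```python
def brick_lookups(requests: int, paths_probed_per_request: int, bricks: int) -> int:
    """Estimate stat-style lookups hitting the bricks of a replicated volume.

    Assumption: every existence check is sent to every brick, as described above.
    """
    return requests * paths_probed_per_request * bricks


# One page load that probes 1000 candidate paths on a two-node replica:
print(brick_lookups(requests=1, paths_probed_per_request=1000, bricks=2))
# The same page served 50 times per second:
print(brick_lookups(requests=50, paths_probed_per_request=1000, bricks=2))
```

Even a read-only workload therefore keeps both nodes busy, which matches the CPU load seen on node2.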
This article helped me out significantly. In the end there was still a small penalty, but the benefits outweighed that.
I'd like to use a neo4j database in a Docker container on an Odroid XU4. The database is not big; approximately 20,000 nodes will be in it. The Odroid has only 2GB of memory, and I'd like to run a Samba server, some Node.js applications and at least one PgSQL database too, so the system is short on memory. I read in the neo4j manual that 2GB of memory is the minimum, but in the Docker examples it is run with 512MB, so I am a little confused about this. What is the minimum amount of memory I can run the neo4j Docker image with?
I have similar trouble with disk space. The system is on a 32GB SD card. I'd like to store the database data there and back it up to an external hard drive, so I could spend at most 16GB on neo4j. The data certainly does not require that much space, and I am not sure why neo4j needs it (according to the manual again).
First, you can use http://neo4j.com/hardware-sizing-calculator/ to get a rough estimate of memory and disk usage.
The second option is to do some math. You can use the information on page 12 of http://graphaware.com/assets/bachman-msc-thesis.pdf
Keep in mind that, for performance reasons, it's good to have all the data in memory.
From my point of view you shouldn't have a problem with memory, but you can't expect great performance.
It's better to try it yourself before you ask here ;)
We are a small bootstrapped ISP in a third world country where bandwidth is usually expensive and slow. We recently got a customer who needs a storage solution for tens of TB of mostly video files (it's a TV station). I know my way around Linux, but I have never done anything like this before. We have a Backblaze 3 storage pod casing which we are thinking of using as a storage server. The server will be connected directly to the customer, so the traffic will not go through the internet, because 100+ Mbps speeds are unheard of in this part of the world.
I was thinking of using 4TB HDDs, all formatted with ext4, and using LVM to make them one large volume (50-70TB at least). The customer logs in with an FTP-like client and dumps whatever files he/she wants, but only sees a single volume, and we can add space as the requirements increase. Of course this is just on paper, based on preliminary research, as I don't have prior experience with this kind of system. I also have to take cost into consideration, so I can't go for any proprietary solution.
My questions are:
Is this the best way to handle this, or are there equally good or better solutions out there?
For large storage solutions (at least large for me), what are my cost-effective options when it comes to dealing with data corruption and HD failure?
Would love to hear any other solutions and tips you guys might have. thanks!
ZFS might be a good option but there is no native bug-free solution for Linux, yet. I would recommend other operating systems in that case.
Today I would recommend Linux MD raid5 on enterprise disks or raid6 on consumer/desktop disks. I would not assign more than 6 disks to an array. LVM can then be used to tie the arrays to a logical volume suitable for ext4.
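To see what that layout means for the 50-70TB target, here is a small sketch of the capacity arithmetic, assuming 4TB disks and 6-disk RAID 6 arrays as suggested above. It ignores filesystem and LVM overhead, and the function names are made up for illustration.

```python
import math

# Number of disks' worth of capacity lost to parity per array.
PARITY_DISKS = {"raid5": 1, "raid6": 2}


def usable_tb(level: str, disks: int, disk_tb: float) -> float:
    """Usable capacity of one array, before filesystem overhead."""
    return (disks - PARITY_DISKS[level]) * disk_tb


def arrays_needed(target_tb: float, level: str, disks: int, disk_tb: float) -> int:
    """How many identical arrays LVM has to tie together to reach target_tb."""
    return math.ceil(target_tb / usable_tb(level, disks, disk_tb))


per_array = usable_tb("raid6", disks=6, disk_tb=4)
print(per_array)                              # usable TB per 6-disk RAID 6 array
print(arrays_needed(50, "raid6", 6, 4))       # arrays for 50TB
print(arrays_needed(70, "raid6", 6, 4))       # arrays for 70TB
```

With these assumptions each array yields 16TB, so 50TB takes four arrays (24 disks) and 70TB takes five (30 disks).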
The ext4 filesystem is well tested and stable, while XFS might be better for large file storage. The downside to XFS is that it is not possible to shrink an XFS filesystem. I would prefer ext4 because of its more flexible nature.
Please also take into consideration that backups are still required even if you are storing your data on raid-arrays. The data can silently corrupt or be accidentally deleted.
In the end, everything depends on what the customer wants. Telling the customer the price of the service usually has an effect on the requirements.
I would like to add to the answer that mingalsuo gave. As he stated, it really comes down to the customer requirements. You don't say what, specifically, the customer will do with this data. Is it for archive only? Will they be actively streaming the data? What is your budget for this project? These types of answers will better determine the proposed solution. Here are some options based on a great many assumptions. Maybe one of them will be a good fit for your project.
In this case, you are not that concerned about performance but more interested in capacity, so the number of spindles doesn't really matter much. As Mingalsuo stated, put together a set of RAID-6 SATA arrays and use LVM to produce a large volume.
In this case, you need performance. The customer is going to store files but also requires the ability for a small number of simultaneous data streams. Here you want as many spindles as possible. For streaming, it does little good to focus on the size of the controller cache. Just focus on the number of spindles. You want as many as possible. Keep in mind that the time to rebuild a failed drive increases with the size of the drive. And, during a rebuild, your performance will suffer. For these reasons I'd suggest smaller drives. Maybe 1TB drives at most. This will provide you with faster rebuild times and more spindles for streaming.
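The link between drive size and rebuild time can be sketched with a simple lower bound: the whole replacement drive has to be written once. The 100 MB/s sustained rate below is an assumption for illustration, not a measured figure, and real rebuilds under load usually take longer.

```python
def min_rebuild_hours(drive_tb: float, rebuild_mb_per_s: float = 100.0) -> float:
    """Lower bound on rebuild time: drive capacity divided by sustained write rate."""
    drive_mb = drive_tb * 1_000_000  # decimal TB to MB, as drive vendors count
    return drive_mb / rebuild_mb_per_s / 3600


for size_tb in (1, 4):
    print(f"{size_tb}TB drive: at least {min_rebuild_hours(size_tb):.1f} hours")
```

At the same rate a 4TB drive takes four times as long to rebuild as a 1TB drive, and the array runs degraded for that whole period.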
Here you need high performance, similar to what an enterprise demands, with many simultaneous data streams. In this case, I would stay away from SATA drives and use 900GB or 1.2TB SAS drives instead. I would also suggest that you consider abstracting the storage layer from the server layer. Create a Linux server and use iSCSI (or fibre) to connect to the storage device. This will allow you to load balance if possible, or at the very least make recovery from disaster easier.
You stated that the environment has few high-speed connections to the internet. Again, depending on the requirements, you still might consider cloud storage. Hear me out :) Let's assume that the files will be uploaded today, used for the next week or month, and then rarely read. In this case, these files are sitting on (potentially) expensive disks for no reason except archive. Wouldn't it be better to keep those active files on expensive (local) disk until they "retire" and then move them to less expensive disk? There are solutions that do just that. One, for example, is called StorSimple. This is an appliance that contains SAS (and even flash) drives and uses cloud storage to automatically migrate "retired" data from the local storage to cloud storage. Because this data is retired it wouldn't matter if it took longer than normal to move it to the cloud. And, this appliance automatically pulls it back from the cloud to local storage when it is accessed. This solution might be too expensive for your project but there are similar ones that you might find will work for you. The added benefit of this is that your data is automatically backed up by the cloud provider and you have an unlimited supply of storage at your disposal.
Can I set up a replica set in MongoDB 1.8 using servers with different amounts of RAM?
server1: 5gb
server2: 2gb
server3: 4gb
If yes, what are the pros and cons?
No, you do not need equal RAM. (Yes, you could set up a replica set as described.)
MongoDB uses memory-mapped files for all caching, which means that cache paging is handled by the operating system. The replicas with more memory will keep more of the database in memory; those with less will page more to disk.
MongoDB will eventually bring the entire database into memory if it can. If you're using two replicas for reads and one for writes, you might want to use the 5gb and 4gb machines for reads, so they are more likely to be hitting RAM.
Yes, you can configure a replica set this way.
If yes, what are the pros and cons?
Here's a doc explaining the major features of replica sets. Let's take a look at these in light of the RAM differences.
More computers means better data redundancy. Having that 2GB node at least means that you have one more copy of the data.
Having a full 3 nodes on a replica set makes it easier to take one down for maintenance.
Having servers of different sizes isn't great for automated failover. Let's say that your 5GB server is the primary. What happens when it goes down and the 2GB server wins the election? You still have automated fail-over, but your performance has probably dropped dramatically.
Read scaling may not work very well. Depending on your read patterns, sending reads to the 2GB server may result in lots of extra disk hits and slower performance.
So the big problem here is really one of performance. If you're just doing this for a dev setup, then it will basically work. But in production you run the risk of completely tanking your app. If your app is used to living on 4GB+ of RAM and then suddenly drops to 2GB, it may become unusable.
Most production setups want to fail over to another "equally-powered" computer.
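As a rough illustration of the failover concern, the sketch below estimates what fraction of the data each member could hold in RAM. The 4GB dataset size is a hypothetical figure (the question does not give one), and the model ignores memory used by the OS and other processes.

```python
def ram_coverage(dataset_gb: float, members: dict) -> dict:
    """Fraction of the dataset each replica set member could keep in RAM (capped at 1.0)."""
    return {name: min(1.0, ram_gb / dataset_gb) for name, ram_gb in members.items()}


# RAM sizes from the question; the dataset size is an assumption.
members = {"server1": 5, "server2": 2, "server3": 4}
coverage = ram_coverage(dataset_gb=4, members=members)
print(coverage)
```

Under that assumption server2 can hold only half of the data in memory, so if it wins an election, roughly half of the reads may end up hitting disk.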