If I am building a graph on AWS Neptune with 20 million nodes and 100 million edges, how much RAM and disk space would I require? Can someone give me a rough order-of-magnitude estimate?
Storage capacity in Amazon Neptune is dynamically allocated as you write data into a Neptune cluster. A new cluster starts out with 10GB allocated and then grows in 10GB segments as your data grows, so there's no need to pre-provision or calculate storage capacity before use. A Neptune cluster can hold up to 64TB of data, which is on the order of hundreds of billions of vertices, edges, and properties (or triples, if using RDF on Neptune).
RAM (and CPU, for that matter) needs are driven by query complexity, not by graph size. RAM is also used for the buffer pool cache, which holds the most recently queried vertices, edges, and properties.
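For a rough order-of-magnitude feel (the per-element byte count below is an assumption for illustration only, not a published Neptune figure), a quick back-of-envelope puts 20M vertices plus 100M edges in the tens-of-GB range, far below the 64TB ceiling:

```java
public class NeptuneSizingEstimate {
    public static void main(String[] args) {
        long vertices = 20_000_000L;
        long edges = 100_000_000L;
        // Assumed average on-disk footprint per element, including properties and
        // index overhead. This is a guess for illustration, not a Neptune number.
        long bytesPerElement = 500L;

        long totalBytes = (vertices + edges) * bytesPerElement;
        System.out.printf("Rough storage estimate: %.0f GB%n", totalBytes / 1e9);
        // About 60 GB with these assumptions, i.e. tens of GB, nowhere near 64TB.
    }
}
```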
Related
As the image shows, as memory capacity increases, the access time also increases.
Does it make sense that access time depends on memory capacity?
No. The images show that technologies with a lower cost per GB are slower. Within a given level (tier) of the memory hierarchy, performance does not depend on size. You can build systems with wider buses and so on to get more bandwidth out of a given tier, but having more capacity is not inherently slower.
Having more disks, or larger disks, doesn't make disk access slower; their latency is close to constant, determined by the nature of the technology (a rotating platter).
In fact, larger-capacity disks tend to have better bandwidth once they do seek to the right place, because more bits per second are flying under the read/write heads. And with multiple disks you can run RAID to use them in parallel.
Similarly for RAM: having multiple memory channels on a big many-core Xeon increases aggregate bandwidth. (It unfortunately hurts latency, due to a more complicated interconnect than simpler quad-core "client" CPUs have: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?) But that's a secondary effect, and simply using DIMMs with more bits per DIMM doesn't change latency or bandwidth, assuming you use the same number of DIMMs in the same system.
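To put a number on "latency depends on which tier you hit, not on how much total RAM is installed", here is a minimal (and deliberately unscientific) pointer-chasing sketch in Java; the working-set sizes and iteration count are arbitrary choices of mine. Per-access time jumps as the working set spills out of each cache level, not because the machine has more DIMMs installed.

```java
import java.util.Random;

// Minimal pointer-chasing sketch: per-access latency is set by which cache/DRAM
// tier the working set lands in, not by the total RAM installed in the machine.
public class PointerChase {
    public static void main(String[] args) {
        int[] sizesKB = {16, 256, 8 * 1024, 256 * 1024}; // roughly L1, L2, L3, DRAM
        for (int kb : sizesKB) {
            int n = kb * 1024 / 4;                 // number of int slots
            int[] next = new int[n];               // the 256MB case may need -Xmx1g
            for (int i = 0; i < n; i++) next[i] = i;
            // Sattolo's algorithm: build one big random cycle so the hardware
            // prefetcher can't predict the next access.
            Random rnd = new Random(42);
            for (int i = n - 1; i > 0; i--) {
                int j = rnd.nextInt(i);
                int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
            }
            int p = 0;
            long iters = 20_000_000L;
            long t0 = System.nanoTime();
            for (long k = 0; k < iters; k++) p = next[p];
            long t1 = System.nanoTime();
            System.out.printf("%8d KB working set: %.2f ns/access (sink=%d)%n",
                    kb, (t1 - t0) / (double) iters, p);
        }
    }
}
```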
I see very high memory usage when a shuffle occurs in my Spark job.
The following figure shows the memory metrics when I use 700MB of data and just three rdd.map operations.
(I use Ganglia as the monitoring tool and show just three nodes of my cluster. The x-axis is time; the y-axis is memory usage.)
[figure: Ganglia memory usage of three nodes during the map-only job]
The following figure shows the memory metrics when I use the same data with three rdd.groupBy and three rdd.flatMap operations (order: groupBy1 -> flatMap1 -> groupBy2 -> flatMap2 -> groupBy3 -> flatMap3).
[figure: Ganglia memory usage of three nodes during the groupBy/flatMap job]
As you can see, memory usage on all three nodes increases considerably (by several GB) even though I use just 700MB of data. In fact I have 8 worker nodes, and memory usage increases considerably on all 8 workers.
I think the main cause is the shuffle, since rdd.map involves no shuffle but rdd.groupBy does.
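For reference, the shape of the two jobs is roughly as follows (the input path, key function, and map bodies here are placeholders, not my real code; Java 8 lambdas are used for brevity):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical shape of the two pipelines described above (Spark 1.6 Java API).
public class ShuffleMemoryDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("shuffle-memory-demo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("wasb:///data/input-700mb.txt"); // placeholder path

        // Pipeline 1: three narrow maps -- no shuffle, so little extra memory is used.
        JavaRDD<String> mapped = lines
                .map(s -> s.trim())
                .map(s -> s.toLowerCase())
                .map(s -> s + "!");
        System.out.println("map-only count: " + mapped.count());

        // Pipeline 2: groupBy -> flatMap three times. Every groupBy is a wide
        // dependency, so the whole dataset is shuffled across the cluster each time.
        JavaRDD<String> current = lines;
        for (int i = 0; i < 3; i++) {
            JavaPairRDD<String, Iterable<String>> grouped =
                    current.groupBy(s -> s.isEmpty() ? "empty" : s.substring(0, 1)); // placeholder key
            current = grouped.flatMap(kv -> kv._2()); // Spark 1.x flatMap returns an Iterable
        }
        System.out.println("groupBy/flatMap count: " + current.count());

        sc.stop();
    }
}
```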
In this situation, I wonder about the three points below:
Why is so much memory used? (More than 15GB is used across my worker nodes for just 700MB of data.)
Why does it seem that the memory used by old shuffles is not released until the Spark application finishes?
Is there any way to reduce memory usage, or to release the memory taken by old shuffles?
P.S. My environment:
Cloud platform: MS Azure (8 worker nodes)
Spec of one worker: 8-core CPU, 16GB RAM
Language: Java
Spark version: 1.6.2
Java version: 1.7 (development), 1.8 (execution)
Runs in Spark standalone mode (not using YARN or Mesos)
In Spark, the operating system decides whether shuffle data stays in its buffer cache or is spilled to disk. Each map task creates as many shuffle spill files as there are reducers. Spark doesn't merge and partition shuffle spill files into one big file, which is what Apache Hadoop does.
Example: if there are 6000 reducers (R) and 2000 map tasks (M), there will be M*R = 2000*6000 = 12 million shuffle files, because in Spark each map task creates as many shuffle spill files as there are reducers. This causes performance degradation.
Please refer to this post, which explains this in detail, continuing from the explanation above.
You can also refer to the Optimizing Shuffle Performance in Spark paper.
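On the question of reducing shuffle memory: if the per-key work can be expressed as an aggregation, one general Spark pattern is to replace groupBy + flatMap with reduceByKey (or aggregateByKey), which combines values on the map side before anything is written to shuffle files. A minimal sketch with a made-up key function:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ReduceShuffleVolume {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("reduce-shuffle").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("apple", "avocado", "banana"));

        // Map-side combining: each value is folded into a per-partition partial sum
        // before the shuffle, so far less data hits shuffle files (and memory)
        // than with groupBy, which ships every individual value to the reducer.
        JavaPairRDD<String, Long> countsPerKey = lines
                .mapToPair(s -> new Tuple2<>(s.substring(0, 1), 1L)) // hypothetical key
                .reduceByKey((a, b) -> a + b);

        countsPerKey.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
```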
~Kedar
I have been using Neo4j recently. My data size is only moderate: a little less than 5 million nodes, around 24 million edges, and 30 million properties. This is not huge by the standards of traditional relational databases such as MySQL or Oracle. But when I run Neo4j, it seems quite memory-demanding. To me, a database should not be memory-demanding: if you have plenty of memory and allow it to be used, it will perform faster, but if you don't have much memory, it should still work. With Neo4j, however, the process is sometimes interrupted due to low memory (not consistently, but often enough to be annoying, since I expect a database to be much more reliable).
To be more specific, I have a Linux machine with 8G of memory. I allow only an initial and maximum heap size of 2G for running the graph database.
Anyone experiencing something similar? Any solutions?
Neo4j uses off-heap RAM to cache the graph to speed up reading nodes, relationships and properties.
You can tweak the amount of memory being used for caching by setting dbms.memory.pagecache.size.
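For example, in conf/neo4j.conf (assuming a Neo4j version that uses this setting; the 4g value is only an illustration for an 8G machine running with a 2G heap, leaving the rest for the OS):

```
# Give a chunk of the remaining RAM to the off-heap page cache,
# leaving room for the 2G heap and the operating system.
dbms.memory.pagecache.size=4g
```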
Can anyone here help me compare the price per month of these two Elasticsearch hosting services?
Specifically, what is the equivalent of the Bonsai10 plan that costs $50/month when compared with Amazon Elasticsearch pricing?
I just want to know which of the two services saves me money on a monthly basis for my Rails app.
Thanks!
Bonsai10 is 8 cores, 1GB memory, 10GB disk, limited to 20 shards and 1 million documents.
Amazon's Elasticsearch Service doesn't have comparable sizing/pricing; everything will be more expensive.
If you want 10GB of storage, you could run a single m3.large.elasticsearch (2 cores, 7.5GB memory, 32GB disk) at US$140/month.
If you want 8 cores, a single m3.2xlarge.elasticsearch (8 cores, 30GB memory, 160GB disk) runs US$560/month.
Elastic's cloud is more comparable: 1GB memory and 16GB disk will run US$45/month. They don't publish the CPU count.
Of the other better hosted Elasticsearch providers (better because they list the actual resources you receive; full list below), Qbox offers the lowest-cost comparable plan at US$40/month for 1GB memory and 20GB disk. No CPU count is published: https://qbox.io/pricing
Objectrocket
Compose.io (an IBM company)
Qbox
Elastic
Is Couchbase a kind of storage that can handle GroupBy-based reads and writes of 4TB worth of data with low latency? If not, what size of data is Couchbase good for when low-latency access is required?
Couchbase can definitely handle 4TB of data. It will be fast to the degree that you can keep your working set in RAM. You can have more disk than memory, but you want a really low cache-miss rate, which Couchbase lets you monitor. If you see that percentage get too high, it is time to grow your cluster so that more RAM becomes available.
4TB should take a few tens of nodes. At that scale, disk throughput starts to be the limiting factor (e.g. slow disks take too long to warm up lots of RAM). So for really hot workloads people use SSDs, but for the majority of apps EC2 is plenty fine.
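As a rough illustration of that working-set reasoning (every number below is an assumption made up for the arithmetic, not Couchbase sizing guidance):

```java
public class CouchbaseSizingSketch {
    public static void main(String[] args) {
        double dataTB = 4.0;
        double hotFraction = 0.20;       // assume ~20% of the data is actively accessed
        double usableRamPerNodeGB = 48;  // assume 64GB nodes with ~75% given to the bucket
        double replicas = 1;             // one replica copy also competes for RAM/disk

        double hotDataGB = dataTB * 1024 * hotFraction * (1 + replicas);
        int nodesForRam = (int) Math.ceil(hotDataGB / usableRamPerNodeGB);
        System.out.printf("~%d nodes just to keep the hot set (plus replicas) in RAM%n",
                nodesForRam);
        // With these made-up numbers: 4TB * 20% * 2 is about 1638 GB of hot data,
        // i.e. ~35 nodes -- "a few tens of nodes", matching the estimate above.
    }
}
```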