How NuoDB manages storage size increases - storage

Say my data store is going to grow. As the data increases, how would the storage manager handle it? Does the storage manager split the data across different domain machines (surely that is not the case)?
How exactly would the process work? What is the recommended approach in this area: a key-value store?

If you have a storage manager that is about to run out of disk space, you can start up a new storage manager with a larger disk subsystem, or one that points to extensible cloud storage such as Amazon S3. Once the new storage manager is up to date, the old one can be taken offline. This entire operation can be done while the database is running. In general, we also recommend that you always run at least 2 storage managers for redundancy.
If you have more questions, feel free to direct them to the NuoDB forum:
http://www.nuodb.com/community

NuoDB supports multiple back-end storage devices, including the Hadoop Distributed File System (HDFS). If you start a broker configured for HDFS, you can use HDFS tools to expand distributed storage on-the-fly and there's no need for any NuoDB admin operations. As Duke described, you can transition from a file-based Storage Manager to an HDFS one without interrupting services.
NuoDB interfaces with the storage layer using filesystem semantics for discrete storage units called "atoms". These map easily into the HDFS directory structure, simplifying administration on that end.

Related

How to separate application and data syncing implementations in Kubernetes?

I want to build an application/service that uses static application data which can be updated over time. Currently, I implement this by keeping both the application and the data in the same container. However, this requires a redeployment whenever either the application or the data changes. I want to separate the app and data volume implementations so that I can update the application and the data independently, meaning I won't have to rebuild the application layer when the application data is updated, and vice versa.
Here are the characteristics of the Application Data and its Usage:
Data is not frequently updated, but read very frequently by the application
Data is not a database, it's a collection of file objects with size ranging from 100MB to 4GB which is initially stored in a cloud storage
Data stored in the cloud storage serves as a single source of truth for the application data
The application will only read from the Data. The process of updating data in cloud storage is an external process outside the application scope.
So here, we are interested in syncing the data in cloud storage to a volume in a Kubernetes deployment. What's the best way to achieve this in Kubernetes?
I have several options in mind:
1. Using one app container in one deployment, in which the app will also include the logic for data loading and update which pulls data from cloud storage to the container --> simple but tightly coupled with the storage read-write implementation
2. Using the cloud store directly from the app --> this doesn't require a container volume, but I was concerned with the huge file size because the app is an interactive service which requires a quick response
3. Using two containers in one deployment sharing the same volume --> allows great flexibility for the storage read-write implementation
   - one container for the application service reading from the shared volume
   - one container for updating data and listening to update data requests, which writes data to the shared volume --> this process will pull data from cloud storage to the shared volume
4. Using one container with a Persistent Disk
   - an external process which writes to the persistent disk (not sure how to do this yet with cloud storage/file objects, need to find a way to sync GCS to the persistent disk)
   - one container for the application service which reads from the mounted volume
5. Using Fuse mounts
   - an external process which writes to cloud storage
   - a container which uses fuse mounts
I am currently leaning towards option 3, but I am not sure if it's the common practice of achieving my objective. Please let me know if you have better solutions.
Yes, option 3 is the most common approach, but make sure you use an initContainer to copy the data from your cloud storage to a local volume before the application starts. That local volume can be any of the volume types supported by Kubernetes.
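As a rough illustration of what the data-updating container might do on each pass, here is a minimal Python sketch that decides which objects to fetch and which local files to drop. It assumes you can build a manifest (object name to checksum) for both the cloud bucket and the shared volume; the function name `plan_sync` and the manifest shape are hypothetical, and the actual download step depends on your cloud provider's client library, so it is left out.

```python
from typing import Dict, List, Tuple


def plan_sync(remote: Dict[str, str], local: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare remote and local manifests (object name -> checksum).

    Returns (to_download, to_delete), each sorted for deterministic output.
    """
    # Fetch anything that is missing locally or whose checksum differs.
    to_download = sorted(
        name for name, checksum in remote.items() if local.get(name) != checksum
    )
    # Remove local files that no longer exist in the source of truth.
    to_delete = sorted(name for name in local if name not in remote)
    return to_download, to_delete


if __name__ == "__main__":
    # Hypothetical manifests, for illustration only.
    remote = {"model-a.bin": "c1", "model-b.bin": "c2"}
    local = {"model-a.bin": "c1", "model-b.bin": "old", "stale.bin": "c9"}
    print(plan_sync(remote, local))
```

Because the files are large (100MB to 4GB), it is probably worth downloading each object to a temporary name on the shared volume and renaming it into place once complete, so the application container never reads a half-written file.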

Is there a way to have a shared (temp) folder between apps or multiple instances of apps on Bluemix?

I am running a Rails app on Bluemix and want to use carrierwave for file uploads. So far this is no problem, as I am using external storage to persist the files (ftp, s3, webdav etc.). However, to keep performance acceptable I need to enable caching with carrierwave_backgrounder, and this is where it gets tricky. I need to specify a temp folder for backgrounding the upload process (the folder where the file stays before it is persisted to the actual storage), and that folder must be shared between all possible workers and app instances. Can this be achieved at all, and if so, how?
Check out Object Storage - you can store files and then delete them when you no longer have a need for them. Redis is another option, as are any of the noSQL databases available on Bluemix.
Typically, in any cloud you never store files on the file system of your VM or PaaS environment. The reason is that when you scale out you have multiple VMs, and a file written on one VM will not be available on the others when hundreds of VMs come up. The recommended practice is to look for the storage services that the cloud platform provides. In Bluemix you have storage options such as Cloud Object Storage, File Storage and Block Storage.
As suggested before, you can take a look at Cloud Object Storage and use that service. Here is the documentation for Cloud Object Storage: https://ibm-public-cos.github.io/crs-docs/?&cm_mc_uid=06526771022514957173672&cm_mc_sid_50200000=1498597403&cm_mc_sid_52640000=1498599343. It contains a quick start guide and covers storing, retrieving and API usage. Hope this helps.

How to properly manage storage in Jelastic

Okay, another question.
In AWS I have EBS, which allows me to create volumes, define IOPS/size for them, mount them on the desired EC2 machines and take snapshots.
How can I achieve the same features in Jelastic? I have option to create "Storage Container" but it belongs only to one environment. How can I backup this volume?
Also, what's the best practice of managing storage devices for things like databases? Should I use a separate storage container?
I have option to create "Storage Container" but it belongs only to one environment.
Yes the Storage Container belongs to 1 environment (either part of one of your other environments, or its own), but you can mount it in 1+ other containers (i.e. inside containers of other environments).
You can basically consider a storage container to be similar to AWS EBS: it can be mounted anywhere you like (multiple times even) in containers within environments in the same region.
How can I backup this volume?
Check your hosting provider's backup policy. In our case we perform backups of all containers for our customers for free. Customers do not need to take additional backups themselves. No need for those extra costs and steps... It might be different at some other Jelastic providers so please check this with your chosen provider(s).
If you wish to make your own backups, you can define a script to do it and set it in cron for example. That script can transfer archives to S3 or anything you wish.
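As a minimal sketch of such a backup script (assuming Python is available in the container; the function name `make_backup` and the paths in the example are hypothetical), the following writes a timestamped archive of a directory. Transferring the resulting archive to S3 or elsewhere would be a separate step using whatever tooling you prefer, and is not shown.

```python
import tarfile
import time
from pathlib import Path


def make_backup(source_dir: str, backup_dir: str) -> Path:
    """Write a timestamped .tar.gz of source_dir into backup_dir and return its path."""
    source = Path(source_dir)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = target_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(source, arcname=source.name)
    return archive


if __name__ == "__main__":
    # Hypothetical paths; adjust to where your storage container is mounted.
    print(make_backup("/data", "/backups"))
```

A cron entry would then simply run this script on whatever schedule suits your data.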
what's the best practice of managing storage devices for things like databases?
Just like with AWS, you may experience performance issues if you use remote storage for database access. Jelastic should generally give you lower latency than EBS, but even so I recommend keeping your database storage local (not on storage containers).
Unlike AWS EC2, you do not have the general risk of local storage disappearing (i.e. your Jelastic containers' local storage is not ephemeral; you can safely write data there and expect it to persist).
If you need multiple database nodes, it is recommended to use database software level clustering features (master-master or master-slave replication for example) instead of sharing the filesystem.
Remember that any shared filesystem is a shared (single) point of failure. What you gain in application / software convenience you may also lose in reliability / high availability. It's often worth making the extra steps in your application to handle this issue another way, or perhaps consider using lsyncd (there are Jelastic marketplace addons for this) to replicate parts of your filesystem instead of mounting a shared storage container.

Apache Samza local storage - OrientDB / Neo4J graph instead of KV store

Apache Samza uses RocksDB as the storage engine for local storage. This allows for stateful stream processing and here's a very good overview.
My use case:
I have multiple streams of events that I wish to process taken from a system such as Apache Kafka.
These events create state - the state I wish to track is based on previous messages received.
I wish to generate new stream events based on the calculated state.
The input stream events are highly connected and a graph such as OrientDB / Neo4J is the ideal medium for querying the data to create the new stream events.
My question:
Is it possible to use a non-KV store as the local storage for Samza? Has anyone ever done this with OrientDB / Neo4J and is anyone aware of an example?
I've been evaluating Samza and I'm by no means an expert, but I'd recommend reading the official documentation, and even reading through the source code; other than the fact that it's in Scala, it's remarkably approachable.
In this particular case, toward the bottom of the documentation's page on State Management you have this:
Other storage engines
Samza’s fault-tolerance mechanism (sending a local store’s writes to a replicated changelog) is completely decoupled from the storage engine’s data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the StorageEngine interface. Samza’s model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task.
Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), approximate algorithms such as bloom filters and hyperloglog, or full-text indexes such as Lucene. (Patches accepted!)
I actually read through the code for the default StorageEngine implementation about two weeks ago to gain a better sense of how it works. I definitely don't know enough to say much intelligently about it, but I can point you at it:
https://github.com/apache/samza/tree/master/samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv
https://github.com/apache/samza/tree/master/samza-kv/src/main/scala/org/apache/samza/storage/kv
The major implementation concerns seem to be:
Logging all changes to a topic so that the store's state can be restored if a task fails.
Restoring the store's state in a performant manner
Batching writes and caching frequent reads in order to save on trips to the raw store.
Reporting metrics about the use of the store.
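To make the first two concerns concrete, here is a small Python sketch of a store that logs every change and can be rebuilt by replaying that log. This is not Samza's StorageEngine interface or its actual implementation; the class `LoggedStore` is hypothetical, and a plain Python list stands in for the replicated changelog topic.

```python
from typing import Dict, List, Optional, Tuple

# A changelog entry is (key, value); a value of None marks a delete.
Change = Tuple[str, Optional[str]]


class LoggedStore:
    """In-memory store that records every write to a changelog."""

    def __init__(self, changelog: List[Change]) -> None:
        self._data: Dict[str, str] = {}
        self._changelog = changelog  # stands in for a replicated topic

    def put(self, key: str, value: str) -> None:
        self._changelog.append((key, value))
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._changelog.append((key, None))
        self._data.pop(key, None)

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    @classmethod
    def restore(cls, changelog: List[Change]) -> "LoggedStore":
        """Rebuild state after a failure by replaying the changelog in order."""
        store = cls(changelog)
        for key, value in list(changelog):
            if value is None:
                store._data.pop(key, None)
            else:
                store._data[key] = value
        return store
```

A custom graph-backed engine would need the same property: every mutation must be expressible as a log entry that can be replayed to reconstruct the embedded database.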
Do the input stream events define one global graph, or multiple graphs, one for each matching Kafka/Samza partition? This matters because Samza state is local, not global.
If it's one global graph, you can update/query a separate graph system from the Samza task's process method. Titan on Cassandra would be one such graph system.
If it's multiple separate graphs, you can use the current RocksDB KV store to mimic graph database operations. Titan on Cassandra does just that: it uses the Cassandra KV store to store and query the graph. Graphs are stored either as a matrix (set [i,j] to 1 if nodes i and j are connected) or as an adjacency list. For the latter, use each node as the key and store its set of neighbors as the value.
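The node-as-key, neighbors-as-value layout can be sketched in a few lines of Python. A dict of bytes to bytes stands in for the RocksDB store here, JSON is an arbitrary choice of serialization, and the class name `KVGraph` is hypothetical; none of this is Samza or Titan API.

```python
import json
from typing import Dict, List


class KVGraph:
    """Directed graph stored as an adjacency list in a byte-oriented KV store."""

    def __init__(self) -> None:
        # Stand-in for a local KV store: keys and values are raw bytes.
        self._kv: Dict[bytes, bytes] = {}

    def neighbors(self, node: str) -> List[str]:
        raw = self._kv.get(node.encode("utf-8"))
        return json.loads(raw) if raw is not None else []

    def add_edge(self, src: str, dst: str) -> None:
        # Read-modify-write of the neighbor set stored under the source node.
        current = set(self.neighbors(src))
        current.add(dst)
        self._kv[src.encode("utf-8")] = json.dumps(sorted(current)).encode("utf-8")


if __name__ == "__main__":
    g = KVGraph()
    g.add_edge("a", "b")
    g.add_edge("a", "c")
    print(g.neighbors("a"))
```

Traversals then become repeated key lookups, which works well as long as all the nodes of a given graph live in the same partition's local store.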

Cloud with shared memory

Do you know of any well-known clouds, e.g. Amazon, Azure, Google App Engine, that have a shared memory feature? That is, you can access data fast (from memory) and it is automatically synchronized with the other nodes (machines, or whatever).
Not quite shared memory, but Windows Azure has a Cache you can use. It's configurable from 128MB to 4GB, and exists outside of a specific deployment, letting you share cache content across instances, deployments, even on-premises applications.
More info on Cache is here.
