I am currently developing a thesis project for my Master's in Computer Engineering. The project is being developed in a business environment and consists of the creation of an abstract module based on the concept of blockchain, so that it can be integrated into several of the company's products. In the course of my research a number of questions have arisen:
In the blockchain concept there are several nodes that share a ledger, and each node participates in the network (inserting data into the ledger and validating that data). Does it make sense for only one node to enter data while the other nodes just serve the consensus mechanism? If this makes no sense, what are the alternatives?
Does it make sense to have a ledger common to all customers, where the customers put data into the ledger, but the ledger is distributed not among those customers but among other entities that are responsible for maintaining the ledger and serving the consensus mechanism?
Can any node read the ledger data? Do these restrictions depend only on the technology used?
In the blockchain concept there are several nodes that share a ledger, and each node participates in the network (inserting data into the ledger and validating that data). Does it make sense for only one node to enter data while the other nodes just serve the consensus mechanism? If this makes no sense, what are the alternatives?
Yes, it's possible. Not every participating node has to be a mining node. There are basically two types of nodes in the network: mining nodes and transactional nodes.
Does it make sense to have a ledger common to all customers, where the customers put data into the ledger, but the ledger is distributed not among those customers but among other entities that are responsible for maintaining the ledger and serving the consensus mechanism?
I don't understand this question. Please rephrase it.
Can any node read the ledger data? Do these restrictions depend only on the technology used?
Any participating node can read the data. In fact, anyone in the world can read the ledger without running a node, i.e. through block explorers like Etherscan.
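To make the writer/validator split above concrete, here is a minimal sketch in Python. It is purely illustrative and not tied to any specific blockchain framework; the names (WriterNode, ValidatorNode, block_hash) are made up, and real consensus among validators is of course far more involved than the hash checks shown here.

    # One "transactional" writer appends blocks; the other nodes only validate
    # the chain they receive. Illustrative sketch, not a real blockchain stack.
    import hashlib
    import json
    import time

    def block_hash(block: dict) -> str:
        """Deterministic hash of a block's contents (excluding its own hash)."""
        payload = {k: v for k, v in block.items() if k != "hash"}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    class WriterNode:
        """The single node that inserts data into the ledger."""

        def __init__(self):
            genesis = {"index": 0, "timestamp": 0, "data": None, "prev_hash": ""}
            genesis["hash"] = block_hash(genesis)
            self.chain = [genesis]

        def append(self, data) -> dict:
            prev = self.chain[-1]
            block = {
                "index": prev["index"] + 1,
                "timestamp": time.time(),
                "data": data,
                "prev_hash": prev["hash"],
            }
            block["hash"] = block_hash(block)
            self.chain.append(block)
            return block

    class ValidatorNode:
        """A node that never writes; it only checks the chain it is handed."""

        @staticmethod
        def validate(chain: list) -> bool:
            for prev, curr in zip(chain, chain[1:]):
                if curr["prev_hash"] != prev["hash"]:
                    return False  # broken linkage
                if curr["hash"] != block_hash(curr):
                    return False  # tampered block
            return True

    writer = WriterNode()
    writer.append({"customer": "acme", "event": "order-created"})
    print(ValidatorNode.validate(writer.chain))  # True

In a permissioned setup such as Hyperledger Fabric, the validating peers would additionally run a consensus protocol among themselves before accepting a new block; this sketch only shows the append-and-verify split.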
I am looking for an answer to this question in the context of the VIPER architectural pattern:
If you have an application that talks to both a web API and a database, how many data managers should you have: one, two, or three?
Case
a) dataManager
b) APIDataManager and LocalDataManager
c) dataManager, APIDataManager and LocalDataManager
Where:
a) The interactor talks to a single dataManager that talks to any services you may have (remote or local).
b) The interactor knows the difference between local and remote information - and calls either the APIDataManager or the LocalDataManager, which talk to remote and local services respectively.
c) The interactor only talks to a general dataManager; the general dataManager then talks to the APIDataManager and the LocalDataManager.
EDIT
There may be no definitive solution. But any input would be greatly appreciated.
Neither VIPER nor The Clean Architecture dictate that there must be only one data manager for all interactors. The referenced VIPER article uses only one manager just as an example that the specific storage implementation is abstracted away.
The interactor objects implement the application-specific business rules. If what the app does is talk to the server, then turn around and talk to the local disk store, then it’s perfectly normal for an interactor to know about this. Even more, some of the interactors have to manage exactly this.
Don't forget that the normal object composition rules apply to the interactors as well. For example, you start with one interactor that gets data from the server and saves it to the local store. If it gets too big, you can create two new interactors, one doing the fetching and another one saving to the local store. Then your original interactor would contain these new ones and delegate all its work to them. If you follow the rules for defining the boundaries, when doing the extract-class refactoring you won't even have to change the objects that work with the new composite interactor.
Also, note that in general it is suggested not to name objects with manager or controller endings, because their roles become unclear. You might name the interface that talks to the server something like APIClient, and the one that abstracts your local storage something like EntityGateway or EntityRepository.
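As a rough sketch of that composition, here is what an interactor talking to an APIClient and an EntityGateway could look like. Python is used only to keep the example language-neutral (VIPER code would normally be Swift or Objective-C), and the method names (fetch_orders, save_orders, load_orders) are hypothetical, not part of any VIPER library.

    # Structural sketch: the interactor knows *that* it fetches remotely and
    # persists locally, but not *how* either boundary does its work.
    from typing import List, Protocol

    class APIClient(Protocol):
        """Talks to the remote web API."""
        def fetch_orders(self) -> List[dict]: ...

    class EntityGateway(Protocol):
        """Abstracts the local disk store."""
        def save_orders(self, orders: List[dict]) -> None: ...
        def load_orders(self) -> List[dict]: ...

    class SyncOrdersInteractor:
        """Application-specific rule: fetch from the server, then persist locally."""

        def __init__(self, api: APIClient, gateway: EntityGateway):
            self.api = api
            self.gateway = gateway

        def execute(self) -> List[dict]:
            orders = self.api.fetch_orders()
            self.gateway.save_orders(orders)
            return orders

If SyncOrdersInteractor grows too big, it can be split into a fetching interactor and a saving interactor and composed, exactly as the extract-class refactoring described above.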
It depends on where the abstraction lies within your app, that is, on distinguishing what you do from how you do it. Who defines that there are two different data stores?
If local and remote data stores are part of the problem domain itself (e.g. sometimes the problem requires fetching remote data, and other times it requires fetching local data), it is sensible for the interactor to know about the two different data stores.
If the Interactor only cares about what data is requested, but it does not care about how the data is retrieved, it would make sense for a single data manager to make the determination of which data source to use.
There are two different roles at play here—the business designer, and the data designer. The interactor is responsible for satisfying the needs of the business designer, i.e. the business logic, problem domain, etc. The data layer is responsible for satisfying the needs of the data designer, i.e. the server team, IT team, database team, etc.
Who is likely to change where you look to retrieve data, the business designer, or the data designer? The answer to that question will guide you to which class owns that responsibility.
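For the opposite split, where the interactor only states what data it wants and a single data manager decides how to get it, a sketch could look like the following. Again, Python is only for illustration, and the freshness rule and names are invented for the example.

    # The data manager hides the local-vs-remote decision from the interactor.
    import time

    class DataManager:
        MAX_AGE_SECONDS = 300  # hypothetical freshness window

        def __init__(self, api_data_manager, local_data_manager):
            self.api = api_data_manager
            self.local = local_data_manager

        def orders(self) -> list:
            cached, fetched_at = self.local.cached_orders()
            if cached is not None and time.time() - fetched_at < self.MAX_AGE_SECONDS:
                return cached                      # served locally
            fresh = self.api.fetch_orders()        # served remotely
            self.local.store_orders(fresh, time.time())
            return fresh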
Using Delphi XE2: I have used AbsoluteDB for years with good success for smallish needs, but it does not scale well to large datasets. I need recommendations for a DB engine for the following conditions:
Large (multi-gigabyte) dataset, few tables with lots of small records. This is industrial-equipment historical data; most tables have new records written once a minute with a device ID, date, time and status; a couple of tables have these records with a single data point per record, and three others have 10 to 28 data points per record depending on the device type. One of the single-data-point tables adds events asynchronously and might have a dozen or more entries per minute. All of this has to be kept available for up to a year. Data is usually accessed by device ID(s) and date window.
Multi-user. A system service retrieves the data from the devices, but the trending display is a separate application and may be on a separate computer.
Fast. Able to pull any 48-hour cluster of data in at most a half-dozen seconds.
Not embedded.
Single-file if possible. Critical that backups and restores can be done programmatically. Replication support would be nice but not required.
Can be integrated into our existing InstallAware packages, without user intervention in the install process.
Critical: no per-install licenses. I'm fine with buying per-seat developer licenses, but we're an industrial-equipment company, not a software company - we're not set up for keeping track of that sort of thing.
Thanks in advance!
I would use
either PostgreSQL (more proven than MySQL for such huge data)
or MongoDB
The main criterion is what you would do with the data, and you did not say much about that. Would you do individual queries by data point? Would you need to compute aggregates (sum/average...) over data points per type? If "Data is usually accessed by device ID(s) and date window", then I would perhaps not store the data in individual rows, one row per data point, but gather data within "aggregates", i.e. objects or arrays stored in a "document" column.
You may store those aggregates as BLOBs, but that may not be efficient. Both PostgreSQL and MongoDB have powerful object and array functions, including indexes within the documents.
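As an illustration of the "aggregates in a document column" idea, here is a small sketch using MongoDB via pymongo; the collection layout, field names and values are invented for the example, and a PostgreSQL JSONB column could be used the same way. One document holds all data points for a device for one minute, and retrieval follows the "device ID(s) and date window" access pattern from the question:

    from datetime import datetime, timedelta
    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")
    samples = client.historian.samples

    # One document per device per minute, with all data points gathered in a
    # sub-document instead of one row per point.
    samples.create_index([("device_id", ASCENDING), ("ts", ASCENDING)])
    samples.insert_one({
        "device_id": "press-07",
        "ts": datetime(2015, 6, 1, 12, 34),
        "status": "RUNNING",
        "points": {"temperature": 71.2, "speed": 1450, "flow_rate": 12.8},
    })

    # "Data is usually accessed by device ID(s) and date window":
    start = datetime(2015, 6, 1)
    window = samples.find({
        "device_id": "press-07",
        "ts": {"$gte": start, "$lt": start + timedelta(hours=48)},
    }).sort("ts", ASCENDING)
    for doc in window:
        print(doc["ts"], doc["points"]["temperature"])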
Don't start from the DB; start from your logic: what data are you gathering? How is it acquired? How is it accessed later on? Then design high-level objects, and let your DB store those objects in an efficient way.
Also consider the CQRS pattern: it is a good idea to store your data in several places, in several layouts, and make a clear distinction between writes (Commands) and reads (Queries). For instance, you may send all individual data points in a database, but gather the information, in a ready-to-use form, in other databases. Don't hesitate to duplicate the data! Don't rely on a single-database-centric approach! This is IMHO the key for fast queries - and what all BigData companies do.
Our open-source mORMot framework is ideal for such a process. I'm currently working on a project gathering information in real time from thousands of remote devices connected via the Internet (alarm panels, in fact), then consolidating this data in a farm of servers. We use SQLite3 for local storage on each node (up to some GB), and consolidate the data in MongoDB servers. All the logic is written in high-level Delphi objects, and the framework does all the needed plumbing (including real-time replication and callbacks).
I can see that two big companies, Klarna and WhatsApp, are using Mnesia as their in-memory database (I am not sure how they persist data with Mnesia given its 2 GB limit). My question is: why do companies like those, and maybe others I don't know of, use Mnesia instead of Riak or CouchDB? Both are Erlang, and both databases support faster in-memory operation, better and more painless persistence, and many more features. Am I missing something here?
You are missing a number of important points:
First of all, Mnesia has no 2-gigabyte limit. It is limited on a 32-bit architecture, but hardly any of those are used for real work anymore, and on 64-bit you are not limited to 2 gigabytes. I have seen databases on the order of several hundred gigabytes. The only problem is the initial start-up time for those.
Mnesia is built to handle:
Very low-latency K/V lookup, not necessarily linearizable.
Proper transactions with linearizable changes (the C in the CAP theorem). These are allowed to run at a much worse latency as they are expected to be relatively rare.
On-line schema change
Survival even if nodes fail in a cluster (where cluster is smallish, say 10-50 machines at most)
The design is such that you avoid a separate process since data is in the Erlang system already. You have QLC for datalog-like queries. And you have the ability to store any Erlang term.
Mnesia fares well if the above is what you need. Its limits are:
You can't get a machine with more than 2 terabytes of memory, and loading 2 terabytes from scratch is going to be slow.
Since it is a CP system and not an AP system, the loss of nodes requires manual intervention. You may not need transactions as well. You might also want to be able to seamlessly add more nodes to the system and so on. For this, Riak is a better choice.
It uses optimistic locking, which gives trouble if many processes try to access the same row in a transaction.
My normal go-to trick is to start out with Mnesia in Erlang systems and then switch over to another system as the data size grows. If data sizes grow slowly, then you can keep everything in memory in Mnesia and get up and running extremely quickly.
As for the persistent storage capacity of Mnesia, "the 2 GB limit for disk tables" is a common misconception.
Read the post What is the storage capacity of a Mnesia database? very attentively. There is no actual limit on Mnesia disk table size.
Mnesia is free, unlike Riak (for commercial usage).
Read about the CAP theorem. You can build your own CA, CP, or AP database using plain Mnesia as a backend. But if you take a particular DBMS, say CouchDB, it is designed to be AP out of the box, and you can't make it, say, CA (as far as I know).
As far as I can tell, neither Riak (see the note about Bitcask in the comments) nor CouchDB supports in-memory databases. I could be wrong on Riak, but I work on CouchDB, so I am very sure.
Engineers are choosing Mnesia over Riak or CouchDB because it solves a different problem.
Whether they are big companies is no factor in this.
I am working on a project to implement an historian.
I can't really find a difference between an historian and a data warehouse.
Any details would be useful.
Data Historian
Data historians are groups of tables within a database that store historical information about a process or information system.
Data historians are used to keep historical data about a manufacturing system. This data can include changes in the state of a data point, current values, and summary data for those points. Usually this data comes from automated systems like PLCs, DCSs, or other process control systems, although some historian data can be human-entered.
There are several historians available for commercial use; however, many of the most common historians have tended to be custom-developed. The commercial versions would be products like OsiSoft's PI or GE's Data Historian.
Some examples of data that could be stored in a data historian are items (or tags) like:
- Total products manufactured for the day
- Total defects created on a particular crew shift
- Current temperature of a motor on the production line
- Set point for the maximum allowable value being monitored by another tag
- Current speed of a conveyor
- Maximum flow rate of a pump over a period of time
- Human-entered marker showing a manual event occurred
- Total amount of a chemical added to a tank
These items are some of the important data tags that might be captured. However, once captured, the next step is the presentation or reporting of that data. This is where the work of analysis is of great importance. The date/time stamp of one tag can have a huge correlation to another tag or other tags. Carefully storing this in the historian's database is critical to good reporting.
The retrieval of data stored in a data historian is the slowest part of the system to be implemented. Many companies do a great job of putting data into a historian, but then do not go back and retrieve any of it. Many times this author has gone into a site that claims to have a historian, only to find that the data is "in there somewhere" but that no report has ever been run against it to validate its accuracy.
The rule-of-thumb should be to provide feedback on any of the tags entered as soon as possible after storage into the historian. Reporting on the first few entries of a newly added tag is important, but ongoing review is important too. Once the data is incorporated into both a detailed listing and a summarized list the data can be reviewed for accuracy by operations personnel on a regular basis.
This regular review process by the operational personnel is very important. The finest data gathering systems that might historically archive millions of data points will be of little value to anyone if the data is not reviewed for accuracy by those that are experts in that information.
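As a minimal illustration of the kind of tag storage and review queries described above, here is a sketch using Python's built-in sqlite3. It is not how a commercial historian such as PI stores data internally; the table layout and tag names are invented for the example.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE tag_values (
            tag     TEXT NOT NULL,      -- e.g. 'line1.motor.temperature'
            ts      TEXT NOT NULL,      -- ISO-8601 date/time stamp
            value   REAL,
            quality TEXT DEFAULT 'GOOD',
            PRIMARY KEY (tag, ts)
        )
    """)

    db.executemany(
        "INSERT INTO tag_values (tag, ts, value) VALUES (?, ?, ?)",
        [
            ("line1.motor.temperature", "2014-03-02T08:00:00", 71.5),
            ("line1.motor.temperature", "2014-03-02T08:01:00", 72.1),
            ("line1.total_defects",     "2014-03-02T08:01:00", 3.0),
        ],
    )

    # Detailed listing for one tag: feedback soon after storage, as suggested above.
    detail = db.execute(
        "SELECT ts, value FROM tag_values WHERE tag = ? ORDER BY ts",
        ("line1.motor.temperature",),
    ).fetchall()

    # Summarized list over a shift, for the regular review by operations personnel.
    summary = db.execute(
        "SELECT MIN(value), MAX(value), AVG(value) FROM tag_values "
        "WHERE tag = ? AND ts BETWEEN ? AND ?",
        ("line1.motor.temperature", "2014-03-02T00:00:00", "2014-03-02T23:59:59"),
    ).fetchall()
    print(detail, summary)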
Data Warehouse
Data warehousing combines data from multiple, usually varied, sources into one comprehensive and easily manipulated database. Different methods can then be used by a company or organization to access this data for a wide range of purposes. Analysis can be performed to determine trends over time and to create plans based on this information. Smaller companies often use more limited formats to analyze more precise or smaller data sets, though warehousing can also utilize these methods.
Accessing Data Through Warehousing
Common methods for accessing systems of data warehousing include queries, reporting, and analysis. Because warehousing creates one database, the number of sources can be nearly limitless, provided the system can handle the volume. The final result, however, is homogeneous data, which can be more easily manipulated. System queries are used to gain information from a warehouse and this creates a report for analysis.
Uses for Data Warehouses
Companies commonly use data warehousing to analyze trends over time. They might use it to view day-to-day operations, but its primary function is often strategic planning based on long-term data overviews. From such reports, companies make business models, forecasts, and other projections. Because the data stored in data warehouses is intended to provide overview-like reporting, it is routinely read-only.
I am currently looking for a good middleware to build a solution for a monitoring and maintenance system. We are tasked with the challenge of monitoring, gathering data from, and maintaining a distributed system consisting of up to 10,000 individual nodes.
The system is clustered into groups of 5-20 nodes. Each group produces data (as a team) by processing incoming sensor data. Each group has a dedicated node (blue boxes) acting as a facade/proxy for the group, exposing data and state from the group to the outside world. These clusters are geographically separated and may connect to the outside world over different networks (one may run over fiber, another over 3G/Satellite). It is likely we will experience both shorter (seconds/minutes) and longer (hours) outages. The data is persisted by each cluster locally.
This data needs to be collected (continuously and reliably) by external and centralized server(s) (green boxes) for further processing, analysis and viewing by various clients (orange boxes). We also need to monitor the state of all nodes through each group's proxy node. It is not required to monitor each node directly, even though it would be good if the middleware could support that (handle heartbeat/state messages from ~10,000 nodes). In case of proxy failure, other methods are available to pinpoint individual nodes.
Furthermore, we need to be able to interact with each node to tweak settings etc. but that seems to be more easily solved since that is mostly manually handled per-node when needed. Some batch tweaking may be needed, but all-in-all it looks like a standard RPC situation (Web Service or alike). Of course, if the middleware can handle this too, via some Request/Response mechanism that would be a plus.
Requirements:
1000+ nodes publishing/offering continuous data
Data needs to be reliably (in some way) and continuously gathered to one or more servers. This will likely be built on top of the middleware using some kind of explicit request/response to ask for lost data. If this could be handled automatically by the middleware this is of course a plus.
More than one server/subscriber needs to be able to be connected to the same data producer/publisher and receive the same data
Data rate is at most in the range of 10-20 messages per second per group
Messages sizes range from maybe ~100 bytes to 4-5 kbytes
Nodes range from embedded constrained systems to normal COTS Linux/Windows boxes
Nodes generally use C/C++, servers and clients generally C++/C#
Nodes should (preferably) not need to install additional SW or servers, i.e. one dedicated broker or extra service per node is expensive
Security will be message-based, i.e. no transport security needed
We are looking for a solution that can handle the communication between primarily proxy nodes (blue) and servers (green) for the data publishing/polling/downloading and from clients (orange) to individual nodes (RPC style) for tweaking settings.
There seem to be a lot of discussions and recommendations for the reverse situation, distributing data from server(s) to many clients, but it has been harder to find information related to the situation described here. The general solution seems to be to use SNMP, Nagios, Ganglia etc. to monitor and modify large numbers of nodes, but the tricky part for us is the data gathering.
We have briefly looked at solutions like DDS, ZeroMQ, RabbitMQ (broker needed on all nodes?), SNMP, various monitoring tools, Web Services (JSON-RPC, REST/Protocol Buffers) etc.
So, do you have any recommendations for an easy-to-use, robust, stable, light, cross-platform, cross-language middleware (or other) solution that would fit the bill? As simple as possible but not simpler.
Disclosure: I am a long-time DDS specialist/enthusiast and I work for one of the DDS vendors.
Good DDS implementations will provide you with what you are looking for. Collection of data and monitoring of nodes is a traditional use-case for DDS and should be its sweet spot. Interacting with nodes and tweaking them is possible as well, for example by using so-called content filters to send data to a particular node. This assumes that you have a means to uniquely identify each node in the system, for example by means of a string or integer ID.
Because of the hierarchical nature of the system and its sheer (potential) size, you will probably have to introduce some routing mechanisms to forward data between clusters. Some DDS implementations can provide generic services for that. Bridging to other technologies, like DBMS or web-interfaces, is often supported as well.
Especially if you have multicast at your disposal, discovery of all participants in the system can be done automatically and will require minimal configuration. This is not required though.
To me, it looks like your system is complicated enough to require customization. I do not believe that any solution will "fit the bill easily", especially if your system needs to be fault-tolerant and robust. Most of all, you need to be aware of your requirements. A few words about DDS in the context of the ones you have mentioned:
1000+ nodes publishing/offering continuous data
This is a big number, but should be possible, especially since you have the option to take advantage of the data-partitioning features supported by DDS.
Data needs to be reliably (in some way) and continuously gathered to one or more servers. This will likely be built on top of the middleware using some kind of explicit request/response to ask for lost data. If this could be handled automatically by the middleware this is of course a plus.
DDS supports a rich set of so-called Quality of Service (QoS) settings specifying how the infrastructure should treat the data it is distributing. These are name-value pairs set by the developer. Reliability and data availability are among the supported QoS policies. This should take care of your requirement automatically.
More than one server/subscriber needs to be able to be connected to the same data producer/publisher and receive the same data
One-to-many or many-to-many distribution is a common use-case.
Data rate is at most in the range of 10-20 messages per second per group
Adding up to a total maximum of 20,000 messages per second is doable, especially if data-flows are partitioned.
Messages sizes range from maybe ~100 bytes to 4-5 kbytes
As long as messages do not get excessively large, the number of messages is typically more limiting than the total amount of kbytes transported over the wire -- unless large messages are of very complicated structure.
Nodes range from embedded constrained systems to normal COTS Linux/Windows boxes
Some DDS implementations support a large range of OS/platform combinations, which can be mixed in a system.
Nodes generally use C/C++, servers and clients generally C++/C#
These are typically supported and can be mixed in a system.
Nodes should (preferably) not need to install additional SW or servers, i.e. one dedicated broker or extra service per node is expensive
Such options are available, but the need for extra services depends on the DDS implementation and the features you want to use.
Security will be message-based, i.e. no transport security needed
That certainly makes life easier for you -- but not so much for those who have to implement that protection at the message level. DDS Security is one of the newer standards in the DDS ecosystem that provides a comprehensive security model transparent to the application.
It seems ZeroMQ will fit the bill easily, with no central infrastructure to manage. Since your monitoring servers are fixed, it's really quite a simple problem to solve. This section in the 0MQ Guide may help:
http://zguide.zeromq.org/page:all#Distributed-Logging-and-Monitoring
You mention "reliability", but could you specify the actual set of failures you want to recover from? If you are using TCP then the network is by definition "reliable" already.
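As an illustration of that suggestion, here is a small sketch of the collection path using pyzmq; the real proxy nodes would use the C/C++ bindings, and the endpoints, ports and message layout are invented for the example. Each proxy PUB-connects to every fixed server, so more than one server receives the same data:

    import json
    import time
    import zmq

    SERVER_ENDPOINTS = ["tcp://server-a:6000", "tcp://server-b:6000"]  # known, fixed

    def run_proxy(group_id: str) -> None:
        """Runs on each group's facade/proxy node."""
        ctx = zmq.Context.instance()
        pub = ctx.socket(zmq.PUB)
        for endpoint in SERVER_ENDPOINTS:   # PUB connects, SUB binds: fine in ZeroMQ
            pub.connect(endpoint)
        while True:
            sample = {"group": group_id, "ts": time.time(), "state": "OK", "points": [1.2, 3.4]}
            pub.send_multipart([group_id.encode(), json.dumps(sample).encode()])
            time.sleep(0.1)                 # ~10 messages/second per group

    def run_server(bind_addr: str = "tcp://*:6000") -> None:
        """Runs on each central collection server."""
        ctx = zmq.Context.instance()
        sub = ctx.socket(zmq.SUB)
        sub.bind(bind_addr)
        sub.setsockopt_string(zmq.SUBSCRIBE, "")   # receive all groups
        while True:
            group, payload = sub.recv_multipart()
            sample = json.loads(payload)
            print(group.decode(), sample["ts"], sample["state"])

Note that plain PUB/SUB drops messages while a proxy is disconnected, so the explicit request/response channel for back-filling lost data that the question already anticipates would still be needed on top of this.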