Since Cassandra is based off of the Dynamo paper (distributed, self-balancing hash table) + BigTable and there are spatial indexes that would fit nicely into that paradigm (quadkey or geohash). Is there a reason that Geospatial support hasn't been implemented?
You could add a GeoPoint datatype as a tuple with an internal geohash and specify a CF as containing geo data. From there you can choose the behavior as having the geo data being a secondary index, or a denormalized SCF. That could lay the ground work for geospatial development and you could start by implementing some low hanging fruit such as .nearby() which could just return columns that share the same geohash. (I know that wouldn't give you the "nearest", you'd have to do a walk of surrounding geohashes or use a shape and a space filling curve for that which could be implemented later, but is a general operation for finding some nearby columns)
I know SimpleGeo/Urban Airship built geo support into Cassandra, but it doesn't look like that was ever opened up. Also, let me know if there's a better place to ask this (quora, mailing lists, etc...)
I think there are two parts to the answer.
The reason for why it's not there, is because nobody who commits code into Cassandra has thought of this feature, or thought that this capability is of high enough priority to spend major time on it. Most of the development in Cassandra is done by Datastax, and they, being a commercial entity, are privy to user demands and suggestions and also pretty pragmatic about what can give them the most ROI in terms of new features.
If there were a good enough third-party developer (or a team) with enough time on their hands, this could be done, and conceptually C* committers would likely have no problems about adding a major feature like this.
The second aspect is that Cassandra supports blobs (byte arrays), which means that what you're describing can be implemented in the client app/driver in a relatively straightforward manner. The drive would in that case be responsible for translating geo calls into appropriate raw byte operations. I'm also suspecting this would be less work than supporting a whole new data primitive with relevant set of operators in the core storage engine.
Related
With a normal 'graph database' the data is broken up into nodes and edges, and there isn't much of a restriction/schema between the connections. With this, it seems great for modeling straightforward graphs where the relationships are relatively consistent -- Movies with cast and crew; Computer networks with IPs and devices; Social networks with users and connections; etc.
Are there any graph-like databases that can be more specialized? For example to be able to model something like an electrical circuit where each component has a sort of 'schema' or well defined input and output -- i.e., a Resistor has two connections and has various properties:
a Transistor takes has three connections and has various properties, etc.
I'm not asking about particular circuit simulators, such as https://www.falstad.com/circuit/circuitjs.html, but more about whether it's possible in any graph (or pseudo-graph) databases to model and enforce very specific, well-defined relationships in a network, such as circuit design.
Definitely possible.
I've been working on this problem with Neo4j, and Restagraph is the result. It provides a REST API that enforces a schema on any updates to the database, and I've packaged it as a Docker image.
I haven't really promoted it so far, because it's only recently been mature enough for my own use, and I really need to improve the documentation. If you try it out, though, I'd love to hear any feedback you have.
TLDR: in general yes, but it depends.
This is a really broad question, so let me break it down.
While it's a little exaggerating to talk about all graph databases (which are not as standardized as SQL databases - which in turn are not very standardized as well), so take this answer with a grain of salt: Yes, that is possible.
As in SQL databases, you usually can set up constraints to be checked before any changes in data is persisted.
Most graph databases incorporate something along the lines of a "type", similarly to what a table represents in SQL databases. Some allow to constrain relationships to only target specific types, so you could restrict relationships e.g. between a node using a CAN bus and an I2C-bus to the specific types.
If a database does not provide these mechanisms, it's usually possible to constrain relationships to the existence of specific keys and values in the model. To have another example than your circuit one: Imagine a node-based system, which has typed inputs and outputs - an int-based output can only be connected to an int based input, a float based output only to a float based input, etc. Then you could add a field output_type and input_type to the nodes and constrain relationships between the values.
As soon as you add the ability to write (the SQL-similar stored) procedures, you can write very complex data integrity constraints.
So, while it is possible, the question is, if you should.
How much logic you actually want to put into your database is a decades-long heated argument. At some point in your application architecture, you will have to check the validity of the data that you are handling. Handling the data consistency in the database itself solves a lot of problems with race conditions or performance issues through multiple round trips between the application and the database, which would occur if the consistency checks are done in the application layer.
Putting a lot of your logic into the database severely limits your ability to switch databases ("vendor lock-in"), might lead to code duplication between your application layer and your database, and sprays your logic between two (or more) layers of your architecture (which makes it harder to find bugs, introduces temporal coupling, and might re-introduce race conditions and performance problems where you have to use transactions again).
My personal take is along the lines of Steve Wozniak - see your database as another service. If that service can provide you with everything you need to ensure data integrity, it might be a good idea to just use the database directly. But if this increases the problems I mentioned before, you might be better off putting a layer between your database and your business logic.
Objectivity/DB is object/graph database that uses schema. You can absolutely do what you are proposing. It supports complex object definitions including type inheritance and it has a full graph/navigational query language similar to Cypher. www.objectivity.com
I've been tasked with coming up with a recommendation of how to proceed with a EDW and am looking for clarification on what I'm seeing. Everything that I am learning about states that Kimball's approach will bring value quicker to business vs Inmon's. I get that Kimball's approach is a dimensional model from the getgo and different data marts (star schema) are integrated through conformed dimensions... thus the theory is I can simply come up with my immediate DM to solve business need and go on from there.
What I'm learning states that Inmon's model suggests that I have a EDW designed in 3NF. The EDW is not defined by source system but instead the structure of the business, Corporate Factory (Orders, HR, etc.). So data from disparate systems map into this structure. Once the data is in this form, ETLs are then created to produce DMs.
Personally I feel Inmon's approach is a better way. I believe this way is going to ensure that data is going to be consistent and it feels like you can do more with this data. What holds me back with this approach though is everything I'm reading says it's going to take much more time to deliver something but I'm not seeing how that is true. From my narrow view, it feels like no matter what the end result is we need a DM. Regardless of using Kimball's or Inmon's approach the end result is the same.
So then the question becomes how do we get there? In Kimballs approach we will create ETLs to some staging location and generally from there create a DM. In Inmon's approach I feel we just add in another layer... that is from the staging area we load this data into another database in 3NF organized by function. What I'm missing is how this step adds so much time.
I feel I can look at the end DM that needs to be made. Map those back to a DW in 3NF and then as more DMs are requested keep building up the DW in 3NF with more and more data. However if I create a DM in Kimballs model that DM is going to be built around the level of grain decided for that DM and what if the next DM requested wants reporting at even a deeper grain (to me it feels like Kimballs methodology would take more work) and with Inmon's it doesn't matter. I have everything at the transnational level so DMs of varying grains are requested, well I have the data, just ETL it to a DM and all DMs will report the same since they are sourced from the same data.
I dunno... just looking for others views. Everything I read says Kimball's is quicker... I say sure maybe a little bit but there is certainly a cost attributed by going to quicker route. And for sake of argument... let's say it takes a week to get a DM up and running through Kimballs methodology... to me it feels like it should only take 10% maybe 20% longer utilizing Inmon's.
If anyone has any real world experience with the different models and if one really takes so much longer then the other... please share. Or if I have this so backwards tell me that too!
For context; I look after a 3 billion record data warehouse, for a large multi-national. Our data makes its way from the various source systems through staging and into a 3NF db. From here our ELT processes move the data into a dimensionally modelled, star schema db.
If I could start again I would definitely drop the 3NF step. When I first built that layer I thought it would add real value. I felt sure that normalisation would protect the integrity of my data. I was equally confident the 3NF db would be the best place to run large/complex queries.
But in practice, it has slowed our development. Most changes require an update to the stage, 3NF and star schema db.
The extra layer also increases the amount of time it takes to publish our data. The additional transformations, checks and reconciliations all add up.
The promised improvement in integrity never materialised. I realise now that because I control the ETL, and the validation processes within, I can ensure my data is both denormalised and accurate. In reporting data we control every cell in every table. The more I think about that, the more I see it as a real opportunity.
Large and complex queries was another myth that has been busted by experience. I now see the need to write complex reporting queries as a failing of my star db. When this occurs I always ask myself: why isn't this question easy to answer? The answer is most often bad table design. The heavy lifting is best carried out when transforming the data.
Running a 3NF and star also creates an opportunity for the two systems to disagree. When this happens it is often a very subtle difference. Neither is wrong, per se. Instead, it is possible the 3NF and star query are asking slightly different questions, and therefore returning different results. Although technically correct, this can be hard to explain. Even minor and explainable differences can erode confidence, over time.
In defence of our 3NF db, it does make loading into the star easier. But I would happily trade more complex SSIS packages for one less layer.
Having said all of this; it is very hard to recommend an approach to anyone without a deep understanding of their systems, requirements, culture, skills, etc. Having read your question I am sure you have wrestled with all these issues, and many more no doubt! In the end, only you can decide what the best approach for your situation is. Once you've made your mind up, stick with it. Consistency, clarity and a well-defined methodology are more important that anything else.
Dimensions and measures are a well proven method for presenting and simplifying data to end users.
If you present a schema based on the source system (3nf) to an end user, vs a dimensionally modelled star schema (Kimball) to an end user, they will be able to make much more sense of the dimensionally modelled one
I've never really looked into an Inmon decision support system but to me it seems to be just the ODS portion of a full datawarehouse.
You are right in saying "The EDW is not defined by source system but instead the structure of the business". A star schema reflects this but an ODS (a copy of the source system) doesn't
A star schema takes longer to build than just an ODS but gives many benefits including
Slowly changing dimensions can track changes over time
Denormalisation simplifies joins and improves performance
Surrogate keys allow you to disconnect from source systems
Conformed dimensions let you report across business units (i.e. Profit per headcount)
If your Inmon 3NF database is not just an ODS (replica of source systems), but some kind of actual business model then you have two layers to model: the 3NF layer and the star schema layer.
It's difficult nowadays to sell the benefit of even one layer of data modelling when everyone thinks they can just do it all in a 'self service' tool! (which I believe is a fallacy). Your system should be no more complicated than it needs to be because all that complexity adds up to maintenance and that's the real issue - introducing changes 12 months into the build when you have to change many layers
To paraphrase #destination-data: your source system to star schema transformation (and seperation) is already achieved through ETL so the 3nf seems redundant to me. You design your star schema to be independent from source systems by correctly implementing surrogate keys and business keys, and modelling it on the business, not on the source system
With ETL and back-end data wrangling taking up about 70% of the project time for this kind of endeavour, an extra layer makes a big difference. Its an extra layer of transforming from source to target, to agree with the business and to test. It all adds up.
Whilst I'm not saying that dimensional models (the Kimball kind) are always easy to change, you've got a whole lot more inflexibility should you have to always change lots of layers when you want to change your BI.
In fact, where I've been consulting in places that have data warehouses that are considered to be inflexible and expensive to develop for, and not keeping pace with changes to the business, they have without exception included the 3NF layer prior to the DMs. As Nick mentioned, it is hard nowadays to sell the idea of a 'proper' data warehouse as opposed to a Data Discovery Bi tool- and the appeal of these is often driven by DWs being seen to be slow and expensive to develop.
Kimball isn't against having a 3NF layer prior to his DW if it makes sense for a situation, he just doesn't agree with Inmon that there's a point.
One common misunderstanding is that Kimball proposes distinct data marts, so that you'd have to change it each time there is a different reporting request. Instead, Kimball's DMs are based on real life business processes and modelled accordingly. Although its true you will then try and make them suitable for reporting, you try and make them so they can answer forseaable queries. You don't aggregate and store just the aggregates: you work with the transactional data in a Kimball dimensional model.
So no need to be reluctant from that perspective.
If an ODS works for you, then go for it- but a Kimball DW will meet the majority of requirements.
This question pertains to a game I have been developing, but I believe it is a pretty generic concept for which I have not been able to find a clear answer.
I have been trying to figure out how to serialize actors (objects in a game world) to a file, dynamically and at arbitrary times.
Context
To understand my question you need to know how the world is generally constructed. The game is a cell-based world with 3 dimensions divided into smaller, more manageable sections that I'll refer to as chunks. The terrain info is all fixed known length, and I can serialize that information just fine, simply writing/reading to/from a world file with the appropriate offset whenever that chunk needs to be loaded into memory (say a player gets near it). That's all well and good until I have to deal with actors and writing them to a single file.
The Problem
I know that ISerializable is an incredibly useful resource for actually obtaining the data from the actors, but the problem I'm having is committing that to disk dynamically. By that I mean inserting/removing actors from the middle of a big file containing all actors. It would be a lot easier if I could serialize the entire game state and actor tree, but I need to be able to do this on small sections of the world at a time. Some sections will have no actors, some will have many (say up to a couple hundred). These sections are being loaded and saved as the players move around the world. Furthermore, the number of actors and size of their data will change over the course of the game, so I cannot handle it like I do the terrain. I need a way of committing the actor quickly, where I can find it quickly later and am not wasting a lot of file space. One thing that may be of use is that all actors in a chunk are serialized/de-serialized at once, never individually.
Note: These worlds can get very large (16k x 16k x 6) and therefore easily have millions of actors in all.
The Question
Is a database really the best way to do this? I am not opposed to implementing one, but that is an involved process and I want to be sure it is a recommended course of actions before I continue. It seems like there might be serious performance implications.
A tradition database (RDBMS) is not always the right way to go. But alas, you ARE trying to persist data.
Most IT professionals will likely guide you towards a traditional database, simply because for us it ISN'T involved. It is out bread and butter. Further more, there are hundreds of libraries that make our lives easier, the latest generation of which are the full blown ORMs.
However, as you have noted, a full blown RDBMS is a little heavy weight for your application (depending on your particular scaling needs). So I'll suggests a few alternatives.
MongoDB
RavenDB
CouchDB
Cassandra
Redis
Now, it IS true that in many ways, these are much lighter weight than RDBMSs. However these so called NoSQL (I picked Document stores, since they seem to be the closest match to your requirements) are somewhat immature. That is not to say they are buggy, and unreliable (they have higher reliability than RDBMSs), but people don't really know how to work with them.
Again, I need to qualify that statement. RDBMS have several decades of research and best practices behind them. There are vast swathes of plug-ins to the tool chains of each implementation. Every single contributor in SO knows how to use a DB well. But, none of those things is true with NoSQL.
TLDR
So it really boils down to this. YES RDBMS (traditional DBs) are complex, like a modern road car. But like a road car (which you buy), these exists the infrastructure to support them.
The alternative is a NoSQL database, which is like building a small electric go scooter. Yes its simpler. But you take it to a car shop, and they'll still have no clue.
Finally
My advice. Use an off the shelf ORM with a RDBMS. The current generation of ORM can pretty much hide your database from you. The setup won't be very performant (you won't be doing microsecond algo trading with it), but it should be enough for your needs.
I was looking on the scalability of Neo4j, and read a document written by David Montag in January 2013.
Concerning the sharding aspect, he said the 1st release of 2014 would come with a first solution.
Does anyone know if it was done or its status if not?
Thanks!
Disclosure: I'm working as VP Product for Neo Technology, the sponsor of the Neo4j open source graph database.
Now that we've just released Neo4j 2.0 (actually 2.0.1 today!) we are embarking on a 2.1 release that is mostly oriented around (even more) performance & scalability. This will increase the upper limits of the graph to an effectively unlimited number of entities, and improve various other things.
Let me set some context first, and then answer your question.
As you probably saw from the paper, Neo4j's current horizontal-scaling architecture allows read scaling, with writes all going to master and fanning out. This gets you effectively unlimited read scaling, and into the tens of thousands of writes per second.
Practically speaking, there are production Neo4j customers (including Snap Interactive and Glassdoor) with around a billion people in their social graph... in all cases behind an active and heavily-hit web site, being handled by comparatively quite modest Neo4j clusters (no more than 5 instances). So that's one key feature: the Neo4j of today an incredible computational density, and so we regularly see fairly small clusters handling a substantially large production workload... with very fast response times.
More on the current architecture can be found here: www.neotechnology.com/neo4j-scales-for-the-enterprise/
And a list of customers (which includes companies like Wal-Mart and eBay) can be found here: neotechnology.com/customers/ One of the world's largest parcel delivery carriers uses Neo4j to route all of their packages, in real time, with peaks of 3000 routing operations per second, and zero downtime. (This arguably is the world's largest and most mission-critical use of a graph database and of a NOSQL database; though unfortunately I can't say who it is.)
So in one sense the tl;dr is that if you're not yet as big as Wal-Mart or eBay, then you're probably ok. That oversimplifies it only a bit. There is the 1% of cases where you have sustained transactional write workloads into the 100s of thousands per second. However even in those cases it's often not the right thing to load all of that data into the real-time graph. We usually advise people to do some aggregation or filtering, and bring only the more important things into the graph. Intuit gave a good talk about this. They filter a billion B2B transactions into a much smaller number of aggregate monthly transaction relationships with aggregated counts and currency amounts by direction.
Enter sharding... Sharding has gained a lot of popularity these days. This is largely thanks to the other three categories of NOSQL, where joins are an anti-pattern. Most queries involve reading or writing just a single piece of discrete data. Just as joining is an anti-pattern for key-value stores and document databases, sharding is an anti-pattern for graph databases. What I mean by that is... the very best performance will occur when all of your data is available in memory on a single instance, because hopping back and forth all over the network whenever you're reading and writing will slow things significantly down, unless you've been really really smart about how you distribute your data... and even then. Our approach has been twofold:
Do as many smart things as possible in order to support extremely high read & write volumes without having to resort to sharding. This gets you the best and most predictable latency and efficiency. In other words: if we can be good enough to support your requirement without sharding, that will always be the best approach. The link above describes some of these tricks, including the deployment pattern that lets you shard your data in memory without having to shard it on disk (a trick we call cache-sharding). There are other tricks along similar lines, and more coming down the pike...
Add a secondary architecture pattern into Neo4j that does support sharding. Why do this if sharding is best avoided? As more people find more uses for graphs, and data volumes continue to increase, we think eventually it will be an important and inevitable thing. This would allow you to run all of Facebook for example, in one Neo4j cluster (a pretty huge one)... not just the social part of the graph, which we can handle today. We've already done a lot of work on this, and have an architecture developed that we believe balances the many considerations. This is a multi-year effort, and while we could very easily release a version of Neo4j that shards naively (that would no doubt be really popular), we probably won't do that. We want to do it right, which amounts to rocket science.
TL;DR With 2018 is days away neo4j still does not support sharding as it is typically considered.
Details Neo4j still requires all data to fit on a single node. The node contents can be replicated within a cluster - but actual sharding is not part of the picture.
When neo4j talks of sharding they are referring to caching portions of the database in memory: different slices are cached on different replicated nodes. That differs from say mysql sharding in which each node contains only a portion of the total data.
Here is a summary of their "take" on scalability: their product term is "High Availability" https://neo4j.com/blog/neo4j-scalability-infographic/
. Note that High Availability should not be the same as Scalability: so they do not actually support the latter in the traditional understanding of the term.
In a Rails app, I am wondering how to build a reporting solution. I heard that I should use a separated database for reporting purposes but knowing that I will need to store a huge amount of data, I have a lot of questions :
What kind of DBMS should I choose?
When should I store data in the reporting database?
Should the database schema of the production db and reporting db be identical?
I am storing basic data (information about users, about result of operations) and I will need for example to run a report to know how many user failed an operation during the previous month.
In now that it is a vague question, but any hint would be highly appreciated.
Thanks!
Work Backwards
Start from what the end-users want for reporting or how they want to/should visualize data. Once you have some concepts in mind, then start working backwards to how to achieve those goals. Starting with the assumption that it should be a replicated copy in an RBDMS excludes several reasonable possibilities.
Making a Real-time Interface
If users are looking to aggregate values (counts, averages, etc.) on the fly (per web request), it would be worthwhile looking into replicating the master down to a reporting database if the SQL performance is acceptable (and stays acceptable if you were to double the input data). SQL engines usually do a great job aggregation and scale pretty far. This would also give you the capability to join data results together and return complex results as the users request it.
Just remember, replication isn't easy or without it's own set of problems.
This'll start to show signs of weakness in the hundreds of millions of rows range with normalized data, in my experience. At some point, inserts fight with selects on the same table enough that both become exceptionally slow (remember, replication is still a stream of inserts). Alternatively, indexes become so large that storage I/O is required for rekeying, so overall table performance diminishes.
Batching
On the other hand, if reporting falls under the scheme of sending standardized reports out with little interaction, I wouldn't necessarily recommend backing to an RBDMS. In this case, results are combined, aggregated, joined, etc. once. Paying the overhead of RBDMS indexing and storage bloat isn't worth it.
Batch engines like Hadoop will scale horizontally (many smaller machines instead of a few huge machines) so processing larger volumes of data is economical.
Batch to RBDMS or K/V Store
This is also a useful path if a lot of computation is needed to make the records more meaningful to a reporting engine. Alternatively, records could be denormalized before storing them in the reporting storage engine. The denormalized or simple results would then be shipped to a key/value store or RBDMS to make reporting easier and achieve higher performance at the cost of latency, compute, and possibly storage.
Personal Advice
Don't over-design it to start with. The decisions you make on the initial implementation will probably all change at some point. However, design it with the current and near-term problems in mind. Also, benchmarks done by others are not terribly useful if your usage model isn't exactly the same as theirs; benchmark your usage model.
I would recommend to to use some pre-build reporting services than to manually write out if you need a large set of reports.
You might want to look at Tableau http://www.tableausoftware.com/ and other available.
Database .. Yes it should be a separate seems safer , plus reporting is generally for old and consolidated data.. you live data might be too large to perform analysis on.
Database type -- > have to choose based on the reporting services used , though I think mongo is not supported by any of the reporting services , mysql is preferred.
If there are only one or two reports you could just build them on rails