In my system I have a relational DB table with "id" columns, and I am representing some of that same data in Neo4J.
My first approach is to make an "id" attribute in Neo which correlates to the id column.
Is there any reason that this isn't a good practice? Does it conflict technically or conceptually with the node IDs that Neo generates?
If the ids serve the purpose of uniquely distinguishing the nodes that will be generated, then yes, it's good to have one.
But keep in mind that if your graph grows in the future and another DB table needs to be modelled into the graph, some ids in the new table may conflict with ids from the old one, and you will then have trouble maintaining the uniqueness of your nodes.
The node ids that Neo4j generates are not recommended for this purpose, since they can be reused when nodes are deleted.
If you just want to model the DB table in the graph database and don't need to relate the graph data back to your DB table later on, you can use UUID.randomUUID().toString() to generate random UUIDs (with a vanishingly small probability of duplicates) as node ids.
I have set up a time-series / events database using the AWS Firehose -> S3/Glue -> Athena stack. It is being used to track various user actions (session started, action performed, etc.) across a number of our products. My question is about how best to store different types of IDs in this system.
The existing schema is one big 'fact table' with a bunch of different columns. Two of the most important columns are event_type_id and object_id. To use StackOverflow as an example, two events might be:
question_asked - in this case I would be storing the question id in the object_id column.
tag_created - in this case I would be storing the tag id in the object_id column.
My question is - is storing multiple different types of IDs in the same column bad practice? It's working OK for us at the moment, but it does require the person/system performing queries to know what type of object the object_id column refers to, based on the event they are querying.
If bad practice, what other approaches might be better? Multiple columns where they are NULL if not relevant for the event in that row? Or is this where dimension tables would be a better fit?
This isn't necessarily bad practice, depending on how you use it.
It sounds like you're aware of the potential pitfalls of such an approach (i.e. users of the data have to be aware of the context, in this case the event type, to use the values correctly). Since you're using Athena, you could mitigate that by creating views over the source table for different event types, adding a WHERE clause filter on event type and possibly renaming object_id to something more context-specific, e.g. question_id.
This makes it easier for users to work with the data and understand exactly what values they're working with.
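As a rough illustration, a view for the question_asked events might look something like this; the table name events_fact, the event_timestamp column and the string event type value are placeholders for whatever your schema actually uses:
CREATE VIEW question_asked_events AS
SELECT
    event_timestamp,
    user_id,
    object_id AS question_id            -- rename to the context-specific meaning
FROM events_fact                        -- placeholder name for the big fact table
WHERE event_type_id = 'question_asked'; -- adjust if event_type_id is numeric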
In a big data environment I wouldn't recommend creating dimension tables if it can be avoided, as JOINs between tables start to get expensive. Having multiple columns for different IDs is possible, but then you create new problems for users, such as having to account for NULL values in an ID column, and it also potentially makes it harder to add new event types and IDs, since you have to change the schema to accommodate them.
I wonder if it's possible to create logic that automatically creates a denormalized table and its data (and maintains it) based on a specific SQL-like query.
Consider a system where the user can maintain his data model and data. All data are stored in "relational" tables, but those tables are only used for the user to maintain his data. If he wants to display data on a webpage, he has to write a query (SQL) which will automatically be turned into a denormalized table that is also kept up to date when the relational data is updated or deleted.
Let's say I got a query like this:
select t1.a, t1.b from t1 where t1.c = 1
The logic will automatically create a denormalized table with a copy of the needed data according to the query. It's mostly like a view (I wonder whether views would be more performant than my approach). Whenever this query (give it a name) is needed by some business logic, it will be replaced by a simple query on that new table.
Any update in t1 will search for all queries where t1 is involved and update the denormalized data automatically, but for a performance win it will only update the affected rows (in this example just one row). That's the point where I'm not sure whether it's achievable in an automatic way. The example query is simple, but what about queries with joins, aggregation or even subqueries?
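To make the idea concrete, here is a minimal sketch of what such generated logic could emit for the example query, in MySQL-style SQL; the table and trigger names are made up, and a real implementation would also need UPDATE and DELETE triggers:
-- materialize the query result once
CREATE TABLE t1_denorm AS
SELECT t1.a, t1.b FROM t1 WHERE t1.c = 1;

-- keep it up to date on inserts; analogous triggers would be
-- needed for UPDATE and DELETE
CREATE TRIGGER t1_denorm_after_insert AFTER INSERT ON t1
FOR EACH ROW
    INSERT INTO t1_denorm (a, b)
    SELECT NEW.a, NEW.b FROM DUAL WHERE NEW.c = 1;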
Does an approach like this exist in the NoSQL world, and can somebody perhaps share their experience with it?
I would also like to know whether creating one table per query conflicts with any best practices for NoSQL databases.
I have an idea of how to handle simple queries: when data is updated, find the involved entity by its primary key and run the query on that specific entity again (so that joins will be updated, too). But with aggregation and subqueries I don't really know how to determine which denormalized table's entities are involved.
I am new to Neo4j and I am trying to convert a relational model to a graph model. In this model, I have two labels X and Y with a relationship between them. This relationship has a property P. The problem is that P should get its values from an external table (a list of possible values for P). How should I model this so that the property values are obtained from that external table?
I can't say I'm completely following, but at the most basic level: if you already have X and Y nodes modeled and populated (with unique constraints on the primary keys), and if you have a join table with X and Y primary keys and a value that should go on the relationship, then it's a matter of reading in the import file of the join table, matching to the corresponding X and Y nodes via the primary keys, then merging the appropriate relationship between them and adding any additional properties on the relationship as needed.
However, it's always a good idea to check whether this is the best way to model what you want in a graph DB. So far you've only been describing tables and how they relate, but a better description of the big picture of what this data represents and how it logically relates might provide insights for modeling the data in a way that makes more sense for a graph DB. Could you add a more verbal description of what exactly you're trying to model, how it relates to each other, and the kind of questions you want to ask of your data?
I am a DB newbie to the bitemporal world and had a naive question.
Say you have a master-satellite relationship between two tables, where the master stores essential information and the satellite stores information that is relevant to only a few of the records of the master table. An example would be 'trade' as the master table and 'trade_support' as the satellite table, where 'trade_support' only houses supporting information for non-electronic trades (which will be a small minority).
In a non-bitemporal landscape, we would model it as a parent-child relationship. My question is: in a bitemporal world, should such a use case still be modeled as a two-table parent-child relationship with 4 temporal columns on both tables? I don't see a reason why it can't be done, but the question of "should it be done" is quite hazy in my mind. Any gurus to help me out with the rationale behind the choice?
Pros:
Normalization
Cons:
Additional table and temporal columns to maintain and manage via DAOs
Defining performant join conditions
I believe this should be a pretty common use-case and wanted to know if there are any best practices that I can benefit from.
Thanks in advance!
Bitemporal data management and foreign keys can be quite tricky. For a master-satellite relationship between bitemporal tables, an "artificial key" needs to be introduced in the master table that is not unique but is identical for the different temporal or historical versions of an object. This key is referenced from the satellite. When joining the two tables, a bitemporal context (T_TIME, V_TIME), where T_TIME is the transaction time and V_TIME is the valid time, must be given for the join. The join would look something like the following:
SELECT m.*, s.*
FROM master m
LEFT JOIN satellite s
ON m.key = s.master_key
AND <V_TIME> between s.valid_from and s.valid_to
AND <T_TIME> between s.t_from and s.t_to
WHERE <V_TIME> between m.valid_from and m.valid_to
AND <T_TIME> between m.t_from and m.t_to
In this query the valid period is given by the columns valid_from and valid_to, and the transaction period by the columns t_from and t_to, for both the master and the satellite table. The artificial key in the master is given by the column m.key and the reference to this key by s.master_key. A left outer join is used to also retrieve those entries of the master table for which there is no corresponding entry in the satellite table.
As you have noted above, this join condition is likely to be slow.
On the other hand this layout may be more space efficient: if only the master data (in table trade) or only the satellite data (in table trade_support) is updated, this will only require a new entry in the respective table. When using one table for all data, a new entry for all columns in the combined table would be necessary. You will also end up with a table with many null values.
So the question you are asking boils down to a trade-off between space requirements and concise code. The amount of space you are sacrificing with the single-table solution depends on the number of columns of your satellite table. I would probably go for the single-table solution, since it is much easier to understand.
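For illustration, the single-table layout could look roughly like this; the trade and support column names here are made-up placeholders, only the bitemporal columns follow the join above:
CREATE TABLE trade_bitemporal (
    trade_key    BIGINT NOT NULL,      -- artificial key, shared by all versions of a trade
    trade_type   VARCHAR(20),          -- placeholder trade column
    support_desk VARCHAR(50),          -- placeholder support column, NULL for electronic trades
    support_note VARCHAR(200),         -- placeholder support column, NULL for electronic trades
    valid_from   DATE      NOT NULL,   -- valid-time period
    valid_to     DATE      NOT NULL,
    t_from       TIMESTAMP NOT NULL,   -- transaction-time period
    t_to         TIMESTAMP NOT NULL
);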
If you have any chance to switch database technology, a document-oriented database might make more sense. I have written a prototype of a bitemporal Scala layer based on MongoDB, which is available here:
https://github.com/1123/bitemporaldb
This will allow you to work without joins, and with a more flexible structure of your trade data.
In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
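For concreteness, and assuming Rails-style table names (addresses, envelopes, addresses_envelopes), the two variants of that query would be something like:
-- without the duplicated user_id: join only to filter by user
SELECT ae.*
FROM addresses_envelopes ae
JOIN addresses a ON a.id = ae.address_id
WHERE a.user_id = 42;

-- with user_id duplicated (and indexed) on the join table
SELECT *
FROM addresses_envelopes
WHERE user_id = 42;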
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. Sample for the AddressesEnvelopes case:
user(
  #user_id
)
address(
  #user_id
  , #address_num
)
envelope(
  #user_id
  , #envelope_num
)
address_envelope(
  #user_id
  , #address_num
  , #envelope_num
)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering the fact that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, by putting ranges of users in separate tables, or physically, using multiple databases or even machines).
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of an InnoDB table is a clustered index). If you ensure the user_id is always the first column in your index, then for each table all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
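A minimal sketch of what that could look like for the join table in MySQL/InnoDB, using the column names from the sample above:
CREATE TABLE address_envelope (
    user_id      INT NOT NULL,
    address_num  INT NOT NULL,
    envelope_num INT NOT NULL,
    -- InnoDB clusters rows by the primary key, so all rows for one user sit together
    PRIMARY KEY (user_id, address_num, envelope_num),
    FOREIGN KEY (user_id, address_num)  REFERENCES address  (user_id, address_num),
    FOREIGN KEY (user_id, envelope_num) REFERENCES envelope (user_id, envelope_num)
) ENGINE=InnoDB;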
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
As long as you a) get a measurable performance improvement and b) know which parts of your database are real normalized data and which are redundant improvements, there is no reason not to do it!
Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. Only if that is not enough would I look into denormalization.
The denormalizations you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure that you keep the denormalized fields up to date.