HBase Row Key Design - neo4j

I'm using HBase coupled with Phoenix for interactive analytics, and I'm trying to design my HBase row key for an IoT project, but I'm not sure I'm doing it right.
My database can be represented roughly like this:
Client ---> Project ---> Cluster1 ---> Cluster2 ---> Sensor1
Client ---> Project ---> Building ---> Sensor2
Client ---> Project ---> Cluster1 ---> Building ---> Sensor3
What I have done is a composite primary key of (Client_ID, Project_ID, Cluster_ID, Building_ID, Sensor_ID):
(1, 1, 1#2, 0, 1)
(1, 1, 0, 1, 2)
(1, 1, 1, 1, 3)
We can specify multiple clusters or buildings with a '#' separator (e.g. 1#2#454), and if a node is missing at some level we insert 0.
In the column family we have the sensor's value and multiple metadata columns.
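For reference, this kind of design maps to a Phoenix table with a composite primary key roughly like the sketch below (the table and column names are only illustrative, not from the question; cluster_id and building_id are VARCHAR so they can hold the '#'-separated lists):
CREATE TABLE IF NOT EXISTS sensor_data (
    client_id    UNSIGNED_INT NOT NULL,
    project_id   UNSIGNED_INT NOT NULL,
    cluster_id   VARCHAR NOT NULL,      -- e.g. '1#2#454', or '0' when absent
    building_id  VARCHAR NOT NULL,      -- same convention as cluster_id
    sensor_id    UNSIGNED_INT NOT NULL,
    val          DOUBLE,
    unit         VARCHAR,
    CONSTRAINT pk PRIMARY KEY (client_id, project_id, cluster_id, building_id, sensor_id)
);
One thing to keep in mind with such a key: a query that filters only on cluster_id (e.g. "all sensors for cluster 1") cannot use the leading part of the row key, so unless client_id and project_id are also constrained it turns into a full scan or needs a secondary index.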
My question: is this row key design valid for a request like "give me all sensors for the cluster with ID 1"?
I also thought about putting only (Sensor_ID, Timestamp) in the key and moving all the routing (the hierarchy) into the column family, but I'm not sure that design is a good fit for my queries either.
My third idea for this project is to combine Neo4j for the routing (hierarchy) and HBase for the data.
Does anyone have experience with similar problems who could guide me toward the best approach for designing this database?

It seems that you are dealing with time-series data. One of the main risks of using HBase with time-series data (or any other form of monotonically increasing key) is hotspotting: a dangerous scenario in which a single region server takes all the writes and your cluster effectively behaves like a single machine.
You should consider OpenTSDB on top of HBase, as it approaches this problem quite nicely. The single most important thing to understand is how it engineers the HBase schema/key. Note that the timestamp is not in the leading part of the key, and the design assumes that the number of distinct metric_uid values is much larger than the number of slave nodes and region servers (this is essential for a balanced cluster).
An OpenTSDB key has the following structure:
<metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
Depending on your specific use case you should engineer your metric_uid appropriately (maybe a compound key unique to a sensor reading) as well as the tags. Tags will play a fundamental role in data aggregation.
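As a rough sketch (the metric name and tag names below are invented for illustration, not part of OpenTSDB itself), a single reading could be sent with OpenTSDB's telnet-style put command, expressing the hierarchy as tags:
put sensor.reading 1461510400 23.5 client=1 project=1 cluster=1 building=2 sensor=3
OpenTSDB then assembles the <metric_uid><timestamp><tagk><tagv>... row key for you, and a query that aggregates on cluster=1 naturally pulls in every series under that cluster.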
NOTE: As of v2.0 OpenTSDB introduced the concept of Trees, which could be very helpful to 'navigate' your sensor readings and facilitate aggregations. I'm not too familiar with them, but I assume you could create a hierarchical structure that helps determine which sensors are associated with which client, project, cluster, building, and so on.
P.S. I don't think that there is room for Neo4J in this project.

Related

Basic InfluxDB cardinality

I have developed a project using InfluxDB and I am currently trying to understand why my Influx container keeps crashing due to OOM exits.
The way I designed my database is quite basic. I have several buildings, and for each building I need to store time-based values. So I created a database for each building, and a measurement for each type of value (for example, energy consumption).
I do not use tags at all, because with the design described above, all I have left to store is the float values and their timestamp index. I like this design because every building is completely separated from the others (as they should be), and if I want data from one of them, I just connect to that building's database (or bucket) and query it like so:
SELECT * FROM field1,field2 WHERE time>d1 and time<d2
According to this Influx article, if I understand correctly (English isn't my first language), I have a cardinality of:
3 buildings (buckets/databases) * 1000 fields (measurements) * 1 (default tag?) = 3000 cardinality
This doesn't seem like much, so I think I am misunderstanding something.

Neo4j: What is a good practice to store a returned path with additional information (e.g. a concrete train run in a metro network)?

Say there is a metro network with n stops, each represented by a Neo4j node, with the rail connection between two stops represented by a relationship.
I wish to store the fact train_run that, e.g., Train 01234 ran from stop n1 to stop n4 via stops n2 and n3 at a certain time. I wish to store this information in a manner that is consistent with the existing DB information about the metro network, hence preventing the creation of any train_run along a path that doesn't exist (e.g. jumping over stop n3).
What would be a good way to achieve this?
Is there a useful way to store in the Neo4J DB a path p returned from that DB jointly with the properties .train_number and time_stamp? Or should I consider a totally different approach?
Thanks for your thoughts.
You can use a structure like this to represent your data. The Train-to-source and Train-to-destination relationships are optional; they just help you efficiently find the number of trains between a source and a destination. If there are multiple trains between two stops, you need multiple relationships, one for each train number.
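A minimal Cypher sketch of that idea (the Stop and Train labels, the CONNECTED_TO and STOPPED_AT relationship types, and the property names are all assumptions, not from the question): the write only happens if the MATCH finds the full path, which keeps the run consistent with the existing network.
// Record that Train 01234 ran n1 -> n2 -> n3 -> n4, but only if that path exists
MATCH p = (:Stop {name: 'n1'})-[:CONNECTED_TO]->(:Stop {name: 'n2'})
          -[:CONNECTED_TO]->(:Stop {name: 'n3'})-[:CONNECTED_TO]->(:Stop {name: 'n4'})
MERGE (t:Train {train_number: '01234'})
WITH t, nodes(p) AS stops
UNWIND range(0, size(stops) - 1) AS i
WITH t, i, stops[i] AS s
MERGE (t)-[r:STOPPED_AT {seq: i}]->(s)
SET r.time_stamp = datetime('2024-05-01T08:00:00')
RETURN count(r);
If no such path exists in the network (e.g. a run that skips n3), the MATCH returns nothing and no train_run is written.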

Are multiple vertex labels in Gremlin/JanusGraph possible, or is an alternative solution better?

I am working on an import runner for a new graph database.
It needs to work with:
Amazon Neptune - a Gremlin implementation with great infrastructure support in production, but a pain to work with locally, and it does not support Cypher. No visualization tool provided.
JanusGraph - easy to work with locally as a Gremlin implementation, but requires heavy investment to support in production, hence using Amazon Neptune. No visualization tool provided.
Neo4j - excellent visualization tool, and the Cypher language feels very familiar; it even works with Gremlin clients, but it requires heavy investment to support in production, and there appears to be no visualization tool for Gremlin implementations that comes anywhere close to Neo4j's.
So I am creating a graph where the entities (nodes/vertices) have multiple types (labels), some orthogonal to each other and some multi-dimensional.
For example, an entity representing an order made online would be labeled Order, Online, Spend, Transaction.
            | Spend    | Chargeback
------------+----------+-----------
Transaction | Purchase | Refund
Line        | Sale     | Return
Zooming into the Spend column:
         | Online     | Instore
---------+------------+----------------
Purchase | Order      | InstorePurchase
Sale     | OnlineSale | InstoreSale
In Neo4j and its Cypher query language, this proves very powerful for creating relationships/edges across multiple types without explicitly knowing which transaction_id values are in the graph:
MATCH (a:Transaction), (b:Line)
WHERE a.transaction_id = b.transaction_id
MERGE (a)<-[edge:TRANSACTED_IN]-(b)
RETURN count(edge);
The problem is that Gremlin/TinkerPop does not natively support multiple labels for its vertices.
Server implementations like AWS Neptune support this using a delimiter, e.g. Order::Online::Spend::Transaction, and the Gremlin client supports it against a Neo4j server, but I haven't been able to find an example where this works for JanusGraph.
Ultimately, I need to be able to run a Gremlin query equivalent to the Cypher one above:
g
.V().hasLabel("Line").as("b")
.V().hasLabel("Transaction").as("a")
.where("b", eq("a")).by("transaction_id")
.addE("TRANSACTED_IN").from("b").to("a")';
So there are multiple questions here:
Is there a way to make JanusGraph accept multiple vertex labels?
If not possible, or this is not the best approach, should there be an additional vertex property containing a list of labels?
In the case of option 2, should the label name be the high-level label (Transaction) or the low-level label (Order)?
Is there a way to make JanusGraph accept multiple vertex labels?
No, there is not a way to have multiple vertex labels in JanusGraph.
If not possible, or this is not the best approach, should there be an additional vertex property containing a list of labels?
In the case of option 2, should the label name be the high-level label (Transaction) or the low-level label (Order)?
I'll answer these two together. Based on what you have described above, I would create a single label, probably named Transaction, with different properties associated with it, such as Location (Online or InStore) and Type (Purchase, Refund, Return, Chargeback, etc.). Looking at how you describe the problem, you are really talking about a single entity, a Transaction, where all the other items you are using as labels (Online/InStore, Spend/Refund) are just additional metadata about how that Transaction occurred. This approach allows simple filtering on one or more of these attributes to achieve anything that could be done with the multiple labels you are using in Neo4j.
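As a rough Gremlin sketch of that single-label approach (the property names location and type are placeholders, not an established schema):
// one label, with the former "labels" captured as plain properties
g.addV("Transaction").
  property("transaction_id", "tx-1001").
  property("location", "Online").
  property("type", "Purchase").iterate()

// anything the extra labels would have expressed becomes a has() filter
g.V().hasLabel("Transaction").
  has("location", "Online").
  has("type", "Purchase").
  values("transaction_id")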

Integrate multiple same structure datasets in one database

I have 8 different datasets with the same structure. I am using Neo4j and need to query all of them at different points on the website I am developing. What would be the approaches to storing the datasets in one database?
One idea that comes to mind is to give each node an additional property that distinguishes nodes of one dataset from nodes of the others. But that seems too repetitive and wrong to me. The other idea is just to create 8 databases and query them separately, but how could I do that? Running each one on its own port seems crazy.
Any suggestions would be greatly appreciated.
If your datasets are in a tree structure, you could add a different root node to each of them to use for reference, similar to the GraphAware TimeTree. Another option (better than a property, I think) would be to differentiate each dataset by adding a specific label to its nodes (i.e. all nodes from "dataset A" get a :DataSetA label).
I imagine that the specific structure of your datasets may yield other options. For example, if you always begin traversals of a dataset from a few fixed entry points, you only need to be able to determine which dataset those entry points belong to, because once you have entered it, all traversals stay within the same dataset.
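For instance, a label-per-dataset approach could look like this in Cypher (the :Record label, the id property, and the CSV file name are made up for illustration):
// hypothetical load step: everything imported from dataset A gets an extra label
LOAD CSV WITH HEADERS FROM 'file:///datasetA.csv' AS row
MERGE (n:Record:DataSetA {id: row.id});

// queries are then scoped to a single dataset simply by using that label
MATCH (n:DataSetA)
RETURN count(n);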

Data Partitioning & Multi-tenancy With InsightEdge

As I understand data partitioning in XAP, it is defined at the space level, i.e. there are primary and backup copies of a space. This partitioning cannot be controlled at a more granular level, e.g. at the document-type level in the case of dynamic models.
In my use case I have dynamic models that include facts (large datasets) and dimensions (small datasets). I would want to partition the facts but keep a copy of the dimensions on every node in the cluster. By defining a routing index I could specify a property in the fact document to be the partitioning key.
How do I make the dimensions (small datasets) available on all the XAP slave nodes so that they speed up the performance while performing joins with various fact documents?
Can I re-partition a document type if my routing property changes at run-time?
In a multi-tenant deployment (tenant = customer), I imagine designing a space per tenant and securing it with a username/password would be the right approach. If for some reason a space instance for a client gets corrupted, does it affect other spaces? How can I restore one space in a multi-tenant clustered deployment?
Disclaimer: I work for GigaSpaces as the PM for XAP and InsightEdge. Hope this helps:
The typical data modeling approach to facts and dimensions would be to route facts with dimensions. Meaning, the fact's routing key would be the same value as the dimension's key. This guarantees data locality when accessing many dimension objects that are associated with a specific fact. This is a good reference: http://docs.gigaspaces.com/sbp/modeling-your-data.html
If you are looking to join dimensions across many partitions, then there are two approaches: 1) use executor-based remoting services and invoke a method using broadcast mode (http://docs.gigaspaces.com/xap120/executor-based-remoting.html#broadcast-remoting), or, the simpler approach, 2) use Spark SQL from InsightEdge.
The routing property is fixed once you specify the space type descriptor; it cannot be changed at runtime from one field to another. If you're looking to change its value, then a simple "take" operation followed by a value change and a space write should do.
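For illustration only, a routing property on a fact class is typically declared with the @SpaceRouting annotation; the class and field names below are invented, and this is just a sketch of the idea, not code from the referenced docs:
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceRouting;

// A fact object routed by the key of the dimension it belongs to,
// so facts land on the same partition as their dimension entries.
@SpaceClass
public class SaleFact {
    private String id;
    private String dimensionKey;   // hypothetical routing property
    private Double amount;

    @SpaceId(autoGenerate = true)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    @SpaceRouting
    public String getDimensionKey() { return dimensionKey; }
    public void setDimensionKey(String dimensionKey) { this.dimensionKey = dimensionKey; }

    public Double getAmount() { return amount; }
    public void setAmount(Double amount) { this.amount = amount; }
}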
It is the right approach if you want to isolate tenants at the JVM level. No, this won't affect other spaces. The best way to recover a space after a restart (from a persistent store) is to use the Space Data Source API.
