With Sumo Logic, what is the difference between 'cluster' and '_sourceCategory'?
I've tried looking at the documentation but am not finding anything on cluster itself. If you know, please share the knowledge.
There is nothing called cluster in Sumo Logic.
What you want are _sourceCategory and _sourceHost.
_sourceCategory is basically just the name of the category these logs belong to. For example: if you are ingesting logs of a service named X, you can set its _sourceCategory to X and then search for them with the query _sourceCategory=X.
If your cluster is named Y and its nodes are named Y-1, Y-2, ..., Y-10, then you can search with _sourceHost=Y*. That would give you all the logs for cluster Y.
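Putting the two together, here are sketch queries using those placeholder names (X for the service's category, Y for the cluster; count by is a standard Sumo Logic aggregation operator):

_sourceCategory=X | count by _sourceHost
_sourceHost=Y-* | count by _sourceCategory

The first counts log volume per host for service X; the second shows which categories the hosts of cluster Y are emitting.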
I'm using HBase coupled with Phoenix for interactive analytics, and I'm trying to design my HBase row key for an IoT project, but I'm not very sure I'm doing it right.
My database can be represented as something like this:
Client ---> Project ---> Cluster1 ---> Cluster2 ---> Sensor1
Client ---> Project ---> Building ---> Sensor2
Client ---> Project ---> Cluster1 ---> Building ---> Sensor3
What I have done is a composite primary key of (Client_ID, Project_ID, Cluster_ID, Building_ID, Sensor_ID):
(1,1,1#2,0,1)
(1,1,0,1,2)
(1,1,1,1,3)
We can specify multiple clusters or buildings with a # separator (1#2#454, etc.), and if a level is absent we insert 0.
In the column family we store the sensor's value plus multiple pieces of metadata.
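For illustration, a hypothetical helper that flattens this composite key into an HBase row key string; the '|' delimiter and the method name are my own choices, not part of the question:

// Hypothetical: join the key components, using '#' inside a component
// to separate multiple cluster/building IDs and "0" for an absent level.
static String rowKey(String client, String project, String clusters,
                     String building, String sensor) {
    return String.join("|", client, project, clusters, building, sensor);
}
// rowKey("1", "1", "1#2", "0", "1")  ->  "1|1|1#2|0|1"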
My question: is this HBase row key design valid for a request like "give me all sensors for the cluster with ID 1"?
I also thought about just putting (Sensor_ID, Timestamp) in the key and moving all the routing information into the column family, but I'm not sure that design is a good fit for my queries.
My third idea for this project is to combine Neo4j for the routing and HBase for the data.
Does anyone have experience with similar problems who could guide me toward the best approach for designing this database?
It seems that you are dealing with time series data. One of the main risks of using HBase with time series data (or other forms of monotonically increasing keys) is hotspotting: a dangerous scenario that can make your whole cluster behave like a single machine.
You should consider OpenTSDB on top of HBase, as it approaches the problem quite nicely. The single most important thing to understand is how it engineers the HBase schema/key. Note that the timestamp is not in the leading part of the key, and the design assumes the number of distinct metric_uids is much larger than the number of slave nodes and region servers (this is essential for a balanced cluster).
An OpenTSDB key has the following structure:
<metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
Depending on your specific use case you should engineer your metric_uid appropriately (maybe a compound key unique to a sensor reading) as well as the tags. Tags will play a fundamental role in data aggregation.
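For illustration, a single reading for Sensor1 from your question could be written through OpenTSDB's telnet-style put command like this (the metric name, value, and tag names are assumptions):

put sensor.reading 1356998400 23.5 client=1 project=1 cluster=1 sensor=1

Here client/project/cluster/sensor become the tags you would later aggregate on.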
NOTE: As of v2.0, OpenTSDB introduced the concept of Trees, which could be very helpful for 'navigating' your sensor readings and facilitating aggregations. I'm not too familiar with them, but I assume you could create a hierarchical structure that would help determine which sensors are associated with which client, project, cluster, building, and so on...
P.S. I don't think there is room for Neo4j in this project.
I am writing my master's thesis with the Neo4j database and I have run into a problem. I need your help.
The picture on the left shows the data I saved in Neo4j; the whole picture represents how an application could be deployed in the cloud. Every node represents a service.
For example, I have an Apache Module that can be "hosted_on" an Apache Server. The green line represents a possible option, because an Apache Server can be hosted on a Windows system or a Linux system.
So there are two possibilities for deployment, shown on the right.
On the right is what I want. I call it a topology; it defines what an application deployment looks like.
What I want is to retrieve all possible topologies.
How can I get these possible topologies with Cypher or the Java traversal API?
Thanks very much.
I'm not sure if this is what you are getting at, but it might be helpful to consider the "What is related and how?" query:
// What is related, and how
MATCH (a)-[r]->(b)
WHERE labels(a) <> [] AND labels(b) <> []
RETURN DISTINCT head(labels(a)) AS This, type(r) AS To, head(labels(b)) AS That
LIMIT 10
This will return Node labels and relationship names that are connected by at least one relationship in the graph. Is that what you mean by topology?
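If what you actually need is to enumerate the concrete deployment chains rather than the schema, a variable-length path query could be a starting point. This is only a sketch: the Module and OperatingSystem labels and the HOSTED_ON relationship type are assumptions based on your description, not names from your graph.

// Every hosting chain from a module down to an operating system
MATCH path = (m:Module)-[:HOSTED_ON*]->(os:OperatingSystem)
RETURN path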
I'd like to cluster users based on the categories or tags of shows they watch. What's the easiest/best algorithm to do this?
Assuming I have around 20,000 tags and several million watch events I can use as signals, is there an algorithm I can implement using, say, Pig/Hadoop/Mortar, or perhaps on Neo4j?
In terms of data I have users, programs they've watched, and the tags that a program has (usually around 10 tags per program).
At the end I would expect k clusters (maybe a dozen?) or broad buckets that I can use to classify and group my users, and also to gain some insight into how they divide up, with a set of tags representing each cluster.
I've seen some posts suggesting a hierarchical algorithm, but I'm not sure how one would calculate "distance" in that case. Would that be the distance between two users, or between a user and a set of tags, etc.?
You basically want to cluster the users according to their tags.
To keep it simple, assume that you only have 10 tags (instead of 20,000 ones). Assume that a user, say user_34, has the 2nd and 7th tag. For this clustering task, user_34 can be represented as a point in the 10-dimensional space, and his corresponding coordinates are: [0,1,0,0,0,0,1,0,0,0].
In your own case, each user can be similarly represented as a point in a 20,000-dimensional space.
You can use Apache Mahout which contains many effective clustering algorithms, such as K-means.
Since everything is well defined in a mathematical coordinate system, computing the distance between any two users is easy! It can be computed using any distance function, but the Euclidean distance is the de facto standard.
Note: Mahout and many other data-mining programs support formats suitable for SPARSE features, i.e. you do not need to insert ...,0,0,0,0,... in the file; you only need to specify which tags are selected. (See RandomAccessSparseVector in Mahout.)
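As a minimal sketch of that representation in Mahout (the 20,000 cardinality comes from your question; the second user and the tag indices are made up, and indices are 0-based):

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseUserVectors {
    public static void main(String[] args) {
        // user_34 from the example above: only the 2nd and 7th tags are
        // set; the other 19,998 dimensions stay implicit (sparse).
        Vector user34 = new RandomAccessSparseVector(20000);
        user34.set(1, 1.0);
        user34.set(6, 1.0);

        // A second, made-up user for comparison.
        Vector user35 = new RandomAccessSparseVector(20000);
        user35.set(1, 1.0);
        user35.set(9, 1.0);

        // Euclidean distance between the two users.
        double d = new EuclideanDistanceMeasure().distance(user34, user35);
        System.out.println(d);
    }
}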
Note: I assumed you only want to cluster your users. Extracting representative info from the clusters is somewhat tricky. For example, for each cluster you might select the tags that are most common among its users. Alternatively, you could use concepts from information theory, such as information gain, to find out which tags carry the most information about the cluster.
You should consider using Neo4j. You can model your data using the following node labels and relationship types.
If you are not familiar with Neo4j's Cypher language notation, (:Foo) represents a node with the label Foo, and [:BAR] represents a relationship with the type BAR. The arrows around a relationship indicate its directionality. Neo4j efficiently traverses relationships in both directions.
(:Cluster) -[:INCLUDES_TAG]-> (:Tag) <-[:HAS_TAG]- (:Program) <-[:WATCHED]- (:User)
You'd have k Cluster nodes, 20K Tag nodes, and several million WATCHED relationships.
With this model, starting with any given Cluster node, you can efficiently find all its related tags, programs, and users.
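For instance, here is a sketch of pulling every user assigned to one cluster (the id property on Cluster is an assumption):

// All users who watched a program carrying any tag in cluster 1
MATCH (:Cluster {id: 1})-[:INCLUDES_TAG]->(:Tag)<-[:HAS_TAG]-(:Program)<-[:WATCHED]-(u:User)
RETURN DISTINCT u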
In Erlang/OTP, when making a gen_server:call() to another node, you have to pass in the name of the node you are calling.
Let's say I have this use case:
I have two nodes, 'node1' and 'node2', running. Those nodes can make gen_server:call()s to each other.
Now let's say I add two more nodes, 'node3' and 'node4', and ping them so that all nodes can see and make gen_server:calls to each other.
How do the Erlang pros handle dynamically adding new nodes like that, so that they know the new node names to use in gen_server calls? Or is it a requirement to know the names of all the nodes beforehand so they can be hardcoded somewhere like sys.config?
You can use:
erlang:nodes()
to get a "now" view of the node list.
Also, you can use:
net_kernel:monitor_nodes(true)
to be notified as nodes come and go (via ping / crash / etc.).
To see if your module is running on a given node, you can call the gen_server with some kind of ping callback.
Alternatively, you can use the rpc module to call erlang:whereis(name) on the foreign node.
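A minimal sketch tying those pieces together, assuming every node runs a locally registered gen_server under the hypothetical name my_server:

-module(cluster_ping).
-export([ping_all/0]).

%% Subscribe to {nodeup, Node} / {nodedown, Node} messages, then call
%% the (hypothetical) my_server process on each currently connected node.
ping_all() ->
    ok = net_kernel:monitor_nodes(true),
    [{Node, gen_server:call({my_server, Node}, ping)} || Node <- nodes()].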
Given a set of hundreds of thousands of nodes with "likes" relationships, (Foodie) -likes-> (Food), I would like to find logical clusters of Foodie nodes.
For instance, suppose I want to divide the graph into two sets. As output, I would like two sets whose members have the most similar eating habits.
The same logic can be extended to 3, 4, 5 sets, etc. In the case of three sets, each set would contain the foodies with the most alike eating habits. Please note that the sets may NOT have the same number of nodes.
One application, for instance, could be coloring the nodes: if the foodies came from different countries, the node colors could correspond to those countries, assuming people from the same country eat similar food.
I would like to write a Cypher query to extract these nodes. I am stumped as to where to start. Any solutions or pointers would be appreciated.
What about trying the current milestone of Neo4j 2.0 (http://www.neo4j.org/download, milestone section) and assigning your nodes different labels according to their characteristics (http://www.neo4j.org/develop/labels)?
Then, you'll only have to execute Cypher queries like:
MATCH (nodes:MY_LABEL)
WHERE ...
RETURN nodes
so that you can retrieve nodes by clusters.
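For example, once you've decided that a group of foodies belongs together, you could tag them all in one statement; the Foodie and Food labels come from your question, while the LIKES type, the name property, and the BbqLovers label are made up for illustration:

// Label every foodie who likes BBQ so the group can be queried later
MATCH (f:Foodie)-[:LIKES]->(food:Food)
WHERE food.name = 'BBQ'
SET f:BbqLovers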
You might want to look into cliques. This is a general graph-theory idea, but it sounds like what you want is to define certain 'cliques' of foodies: say, BBQ foodies, food truck foodies, etc.