How to design my dataset using Neo4j and Gremlin

I have a dataset containing fields like the ones below:
id  amount  date       s_pName  s_cName  b_pName  b_cName
1   100     2/3/2012   IBM      IBM_USA  Pepsi    Pepsi_USA
2   200     21/3/2012  IBM      IBM_USA  Coke     Coke_UK
3   300     12/3/2012  IBM      IBM_USA  Pepsi    Pepsi_USA
4   1100    22/3/2012  Pepsi    IBM_Aus  IBM      IBM_USA
Here, any of the four fields s_pName, s_cName, b_pName and b_cName can be a seller or a buyer.
How should I model this dataset in Neo4j so that I can query it with Gremlin, along the lines of:
select b_cName, id, amount, date from tableName where s_cName in ('IBM_USA', 'IBM_AUS');

I noted your question on the gremlin-users mailing list as well (where you provided a bit more information about things you'd tried): https://groups.google.com/forum/#!topic/gremlin-users/AxsF2eJvpOA
I'm sure there are a few ways to approach this modelling issue, so I'll just provide some things to consider, and hopefully that will inspire a solution. First, instead of thinking of buyers and sellers, just think about the fact that you have "companies" that sell things to other companies, and that companies have a hierarchy (meaning that a company can have a parent). Your model then comes down to:
company --sellsTo--> company
company --parent--> company
Place your transaction amount and date on the "sellsTo" edge creating one such edge per row in your dataset. Create a key index on the "companyName" field of the company vertex so that you can look up the company. Your Gremlin would then be something like:
['IBM_USA','IBM_AUS'].collect{g.V('companyName',it).next()}._().outE('sellsTo').as('tx').inV.as('buyer').select{[it.id, it.amount, it.date]}{it.companyName}
Breaking that down: you look up the two companies you care about by the key index on companyName and get them into a pipeline with _(). Then you traverse out to the companies those two companies sold to. You use select to grab the tx (transaction edge) and the buyer vertex, executing a closure on each of them to transform them into the fields you want. That will yield something like the following for one result (with your full dataset, your Gremlin would likely return several of these):
[[1,100,2/3/2012],Pepsi_USA]
You could use some Groovy JDK (http://groovy.codehaus.org/groovy-jdk/) operations to transform it further from there if that's not the final format you need.
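To make the shape of that model concrete, here is a minimal in-memory sketch in plain Python (not Gremlin or the Neo4j API). It loads the four rows from the question as "sellsTo" edges carrying id, amount and date, records the "parent" relationship, and then looks up sellers by company name. The function names build_graph and sales_by are made up for illustration, and the seller is spelled IBM_Aus as it appears in the data.

```python
from collections import defaultdict

# Rows copied from the question:
# (id, amount, date, s_pName, s_cName, b_pName, b_cName)
ROWS = [
    (1, 100, "2/3/2012", "IBM", "IBM_USA", "Pepsi", "Pepsi_USA"),
    (2, 200, "21/3/2012", "IBM", "IBM_USA", "Coke", "Coke_UK"),
    (3, 300, "12/3/2012", "IBM", "IBM_USA", "Pepsi", "Pepsi_USA"),
    (4, 1100, "22/3/2012", "Pepsi", "IBM_Aus", "IBM", "IBM_USA"),
]


def build_graph(rows):
    """Return (parent, sells_to) built from the dataset rows.

    parent:   company name -> parent company name   (company --parent--> company)
    sells_to: seller name  -> list of edge dicts    (company --sellsTo--> company)
    """
    parent = {}
    sells_to = defaultdict(list)
    for tx_id, amount, date, s_parent, seller, b_parent, buyer in rows:
        parent[seller] = s_parent
        parent[buyer] = b_parent
        # One sellsTo edge per row; the transaction data lives on the edge.
        sells_to[seller].append(
            {"id": tx_id, "amount": amount, "date": date, "buyer": buyer}
        )
    return parent, sells_to


def sales_by(sells_to, sellers):
    """For each seller, follow its sellsTo edges and return
    ([id, amount, date], buyer) pairs, mirroring the select step."""
    return [
        ([edge["id"], edge["amount"], edge["date"]], edge["buyer"])
        for seller in sellers
        for edge in sells_to.get(seller, [])
    ]


parent, sells_to = build_graph(ROWS)
results = sales_by(sells_to, ["IBM_USA", "IBM_Aus"])
```

The dictionary keyed by seller name plays the role of the key index on companyName; in the real graph the lookup and traversal would be done by the database.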

Related

Combining additive and semi-additive facts in a single report

I'm working on a quarterly report. The report should look something like this:
Column             Calculation                                          Source table
Start_Balance      Sum at start of time period                          Account_balance
Sell Transactions  Sum of all sell values between the two time periods  Transactions
Buy Transactions   Sum of all buy values between the two time periods   Transactions
End Balance        Sum at the end of time period                        Account_balance
So, e.g.:
Calculation        Sum
Start_Balance      1000
Sell Transactions  500
Buy Transactions   750
End Balance        1250
The problem here is that I'm working with a relational star schema, one of the facts is semi-additive and the other is additive, so they behave differently on the time dimension.
In my case I'm using Cognos Analytics, but I think this problem applies to any BI tool. What would be best practice for dealing with this issue? I'm certain I can come up with some SQL query that combines these two tables into one table which the report reads from, but this doesn't seem like best practice, or is it? Another approach would be to create some measures in the BI tool. I'm not a big fan of this approach because it seems to be the least sustainable one, and I'm unfamiliar with it.
For Cognos, you can stitch the tables.
The technique relies on how Cognos aggregates.
Framework Manager joins are typically 1 to N, describing the relationship.
In a star schema, the fact table sits in the middle and represents the N side,
with all of the outer tables describing/grouping the data and representing the 1 side.
Fact tables (quantitative data, the stuff you want to sum) should be on the many side of the relationship.
Descriptive tables (qualitative data, the stuff you want to describe or group by) should be on the 1 side (instead of the many).
To stitch, we have multiple tables that we want to treat as facts.
Take the common tables that you would use for grouping, like the period (there are probably some others, like company or customer).
Connect each of the fact tables to the common tables (aka dimensions) like this:
Account_balance N to 1 Company
Account_balance N to 1 Period
Account_balance N to 1 Customer
Transactions N to 1 Company
Transactions N to 1 Period
Transactions N to 1 Customer
This will cause Cognos to perform a full outer join with a coalesce,
allowing you to handle the fact tables even though they have different levels of granularity.
Remember that with an outer join you may have to handle nulls, and you may need to use the summary filter, depending on your reporting needs.
You need to include the common tables on your report, which might conflict with how you want the report to look.
An easy workaround is to add them to the layout and then set the box type property to None, so the SQL behaves the way you want and the report looks the way you want.
You'll probably need to set up determinants in the Framework Manager model. The following does a good job of explaining this:
https://www.ibm.com/docs/en/cognos-analytics/11.0.0?topic=concepts-multiple-fact-multiple-grain-queries
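To illustrate what the stitch amounts to, here is a rough sketch in plain Python (not Cognos-generated SQL). Each fact is first aggregated to the grain the two facts share, then the two results are combined with a full outer join on that shared key. The sample rows, the (company, period) key and the function names aggregate and stitch are all assumptions made for the example. The balance is only summed within a period, since a semi-additive fact should not be summed across time.

```python
from collections import defaultdict


def aggregate(rows, measure):
    """Sum `measure` per (company, period), the grain both facts share."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["company"], row["period"])] += row[measure]
    return dict(totals)


def stitch(left, right):
    """Full outer join of two aggregated facts on their shared key.

    Taking the union of the keys is the equivalent of coalescing the
    dimension columns; a side with no rows for a key yields None.
    """
    keys = sorted(set(left) | set(right))
    return [(key, left.get(key), right.get(key)) for key in keys]


# Hypothetical sample data for the two fact tables.
account_balance = [
    {"company": "A", "period": "2023Q1", "balance": 1000},
    {"company": "B", "period": "2023Q1", "balance": 400},
]
transactions = [
    {"company": "A", "period": "2023Q1", "amount": 500},
    {"company": "A", "period": "2023Q1", "amount": 250},
    {"company": "C", "period": "2023Q1", "amount": 75},
]

stitched = stitch(
    aggregate(account_balance, "balance"),
    aggregate(transactions, "amount"),
)
```

Company B has a balance but no transactions and company C the reverse, so both rows come back with a None on one side. Those are the nulls you may have to handle in the report.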

Neo4j graph modelling performance and querability, property to a node or as separate node plus relationship

I am teaching myself graph modelling, using a Neo4j 2.2.3 database with Node.js and the Express framework.
I have skimmed through the free Neo4j graph database book and learned how to model a scenario, when to use relationships, when to create nodes, etc.
I have modelled a vehicle-selling scenario with the following structure:
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), the majority of a vehicle's features are created as separate nodes and shared, via relationships, among vehicles that have a feature in common.
Could you please advise whether roles (labels) like color, body type, driving side (left-hand or right-hand drive), gearbox and others should be separate nodes or properties of the vehicle node? Which option is more performance friendly and easier to query?
I want to write JS code that allows querying a graph with the above structure by one or many search criteria. If the majority of those features were properties of the VEHICLE node, querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However, with my existing graph model it is tricky to search, especially when there are multiple criteria that are not properties of the VEHICLE node but separate nodes linked via relationships.
Any ideas and advice on making the existing graph structure more queryable and more performance friendly would be much appreciated. If we imagine a scenario with 1000 VEHICLE nodes, that would generate 15000 relationships, which sounds a bit scary, and if it hits a million VEHICLE nodes, then up to 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs; it looks like you have a decent start.
Don't be concerned with the number of relationships. That's what graph databases are good at, so I wouldn't worry about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's less fast to scan all nodes of a label and find the items where a property=a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There's manuals, and other types...if it's a property value, you won't later easily be able to decide to store 4 other sub-types/sub-aspects of "gearbox". If it were a node, that would later be easy because you could add more properties to the node, or relate other things.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.
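On the question of searching by one or many criteria when features are separate nodes: one option is to build the MATCH pattern from whichever criteria are supplied. Below is a rough sketch in Python rather than JS, only to show the idea. The labels, relationship types and property names come from the question; the build_query function and the FEATURES mapping are made up for illustration. It uses the {param} parameter syntax of Neo4j 2.x (later versions use $param instead), and the result would still need to be passed to whatever driver or REST call you use.

```python
# Maps a search criterion to (relationship type, node label, property),
# using the names from the model in the question.
FEATURES = {
    "color": ("VHAVE_COLOR", "VCOLOR", "color"),
    "gearbox": ("VGEARBOX_IS", "VGEARBOX", "type"),
    "fuel_type": ("VCONSUMES_FUEL_TYPE", "VFUEL_TYPE", "type"),
}


def build_query(criteria, min_price=None, max_price=None):
    """Return (cypher, params) for the given feature criteria and price range.

    Values are passed as parameters, never concatenated into the query text.
    """
    patterns = ["(v:VEHICLE)"]
    conditions = []
    params = {}
    for name in sorted(criteria):
        rel, label, prop = FEATURES[name]  # KeyError for an unknown criterion
        # One extra pattern per feature node the vehicle must be linked to.
        patterns.append("(v)-[:%s]->(%s:%s)" % (rel, name, label))
        conditions.append("%s.%s = {%s}" % (name, prop, name))
        params[name] = criteria[name]
    if min_price is not None:
        conditions.append("v.price > {min_price}")
        params["min_price"] = min_price
    if max_price is not None:
        conditions.append("v.price < {max_price}")
        params["max_price"] = max_price
    cypher = "MATCH " + ", ".join(patterns)
    if conditions:
        cypher += " WHERE " + " AND ".join(conditions)
    cypher += " RETURN v"
    return cypher, params


query, params = build_query(
    {"gearbox": "MANUAL", "fuel_type": "PETROL"}, min_price=1000
)
```

Price stays a property of the vehicle here, in line with the point above that simple primitive values usually make good properties.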

Solr Join - getting data from different index

I'm working on a project where we have 2 million products and 50 clients with different pricing schemes. Indexing 2M x 50 records is not an option at the moment. I have looked at the Solr join and cannot get it to work the way I want it to. I know it's like a self join, so I'm kind of skeptical it would work, but here it is anyway.
here is the sample schema
core0 - product
core1 - client
So, given a client id, I want to display all bags manufactured by Samsonite, sorted by lowest price.
If there's a better way of approaching this, I'm open to redesigning the existing schema.
Thank you in advance.
Solr is not a relational database. You should take a look at the sharding feature and split your indexes. Alternatively, you could write custom plugins that work out the price data based on the client's id/name/whatever at index time (bad: you'll still get a product replicated for each client).
How we do it (so you have an example):
clients are handled by sqlite
products are stored in solr with their "base" price
each client has a "pricing rule" applied via custom query handler when they interrogate the db (it's just a value modifier)
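A rough sketch of that "value modifier" idea in plain Python (this is not a Solr query handler, just the logic it would apply). Products are stored once with a base price, and the client's rule is applied when the client queries. The sample products, the multiplier-style rules and the client_prices function are all made up for illustration; a real pricing rule may be more involved than a single multiplier.

```python
# Hypothetical products, indexed once with their "base" price.
PRODUCTS = [
    {"id": "p1", "manufacturer": "Samsonite", "category": "bag", "base_price": 120.0},
    {"id": "p2", "manufacturer": "Samsonite", "category": "bag", "base_price": 80.0},
    {"id": "p3", "manufacturer": "Other", "category": "bag", "base_price": 50.0},
]

# Hypothetical per-client pricing rules, kept outside the index.
# Here each rule is assumed to be a simple multiplier on the base price.
PRICING_RULES = {"client_a": 0.9, "client_b": 1.25}


def client_prices(products, client_id, manufacturer, category):
    """Return (product id, client price) pairs, lowest price first."""
    modifier = PRICING_RULES[client_id]
    matches = [
        p for p in products
        if p["manufacturer"] == manufacturer and p["category"] == category
    ]
    # Apply the client's rule at query time instead of indexing one
    # copy of every product per client.
    priced = [(p["id"], round(p["base_price"] * modifier, 2)) for p in matches]
    return sorted(priced, key=lambda item: item[1])


bags = client_prices(PRODUCTS, "client_b", "Samsonite", "bag")
```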

SOLR Analysis Query

I have a SOLR instance with millions of documents. The schema is well defined (i.e. all fields are typed). All the searching/faceting etc. works ok without any issues.
However, I am trying to do something new which I "think" is not supported in the current version. I am running SOLR 3.5 on Windows using Jetty.
To simplify the question, my document contains some fields like:
Id,
Name,
City,
JobTitle
Let's say I have sample data like:
P Wood, London, Director
J Smith, London, Project Manager
D Lock, Brighton, Developer
K Pracy, London, Developer
For the sake of example, assume that this is a matching system which allows people to find each other. Also assume that Id is a unique id.
I want to write a "sampling" query which should find me the set of records that will match other records for any given criteria.
So, for example, I want to define criteria like:
Find me the people who will match people in different cities with different job titles.
If the above schema were an RDBMS SQL table (let's say People), the approximate query would be something like this:
SELECT P.Id,
       (
           SELECT COUNT(1)
           FROM People PI
           WHERE PI.Id != P.Id
             AND PI.City != P.City
             AND PI.JobTitle != P.JobTitle
       ) AS FindCount
FROM People P
Well, the query may not be workable, but you get the idea. There are other requirements too, e.g. that FindCount should be greater than x and less than y.
Can someone let me know if this is possible in SOLR, or if this is something SOLR is not meant for? I know SOLR 4 is coming with a Join operator, but that seems to me more like an IN clause, which limits its use. For example, consider that I want the matching Ids as well in the above query, rather than just the counts.
I don't think that is doable in one query, and you might end up running the "inner select" as a separate query for every person.
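To show what "a separate query for every person" would compute, here is a small sketch in plain Python over the sample data from the question. It does not call Solr; in practice find_matches would be one Solr query per person, filtering out that person's own Id, City and JobTitle. The numeric Ids and the function names are assumptions, since the question does not list Ids.

```python
# Sample data from the question; the Ids are made up.
PEOPLE = [
    {"Id": 1, "Name": "P Wood", "City": "London", "JobTitle": "Director"},
    {"Id": 2, "Name": "J Smith", "City": "London", "JobTitle": "Project Manager"},
    {"Id": 3, "Name": "D Lock", "City": "Brighton", "JobTitle": "Developer"},
    {"Id": 4, "Name": "K Pracy", "City": "London", "JobTitle": "Developer"},
]


def find_matches(person, people):
    """The "inner select": Ids of people in a different city
    with a different job title."""
    return [
        other["Id"]
        for other in people
        if other["Id"] != person["Id"]
        and other["City"] != person["City"]
        and other["JobTitle"] != person["JobTitle"]
    ]


def sample(people, min_count, max_count):
    """Map each Id to its matching Ids, keeping only people whose
    match count is greater than min_count and less than max_count."""
    result = {}
    for person in people:
        matches = find_matches(person, people)
        if min_count < len(matches) < max_count:
            result[person["Id"]] = matches
    return result


sampled = sample(PEOPLE, 0, 3)
```

This also returns the matching Ids rather than just the counts, which is the extra requirement mentioned in the question.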

rails 3 + activerecord: is there a single query to count(field1) grouped by field2?

I'm trying to find the best way to summarize the data in a table
I have a table Info with fields
id
region_number integer (NOT associated with another table)
member_name string
member_active T/F
Members belong to a region, have a name, and are either active or not.
I'm wondering if there is a single query that will create a table with 3 columns, and as many rows as there are unique region_numbers:
For each unique region_number:
region_number
COUNT of members in that region
COUNT of members in that region with active=TRUE
Suppose I have 50 regions, I see how to do it with 2x50 queries but that surely is not the right approach!
You can always group on several things if you're prepared to do a tiny bit of post-processing:
SELECT region_number, member_active, COUNT(*) AS instances
FROM info
WHERE region_number IN (?)
GROUP BY region_number, member_active
This allows you to do one query for all region numbers at the same time. For each region there will be one row for the T values and one for the F values, but only if those are present.
If you see a case where you're doing a lot of queries that differ only in identifiers, that's something you can usually execute in one shot like this.
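The "tiny bit of post-processing" could look something like the sketch below. It uses Python with an in-memory SQLite database instead of Rails/ActiveRecord, purely to show the folding step. The table name info, the sample rows and the region_summary function are assumptions for the example, and the WHERE clause is left out so that all regions are summarized.

```python
import sqlite3


def region_summary(conn):
    """Return {region_number: (total members, active members)}
    using a single grouped query plus a small fold in application code."""
    rows = conn.execute(
        "SELECT region_number, member_active, COUNT(*) AS instances "
        "FROM info "
        "GROUP BY region_number, member_active"
    ).fetchall()
    summary = {}
    for region, active, count in rows:
        total, active_total = summary.get(region, (0, 0))
        # Every row adds to the region total; only the active=TRUE row
        # adds to the active count. A missing T or F row contributes 0.
        summary[region] = (total + count, active_total + (count if active else 0))
    return summary


# Hypothetical sample data; member_active is stored as 1/0.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE info ("
    "id INTEGER PRIMARY KEY, region_number INTEGER, "
    "member_name TEXT, member_active INTEGER)"
)
conn.executemany(
    "INSERT INTO info (region_number, member_name, member_active) VALUES (?, ?, ?)",
    [(1, "Ann", 1), (1, "Bob", 0), (1, "Cy", 1), (2, "Dee", 0)],
)

summary = region_summary(conn)
```

Region 2 has no active members, so the query returns no T row for it, and the fold leaves its active count at 0.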
