Ranking PCollection elements - google-cloud-dataflow

I am using the Google Dataflow Java SDK 2.2.0. The use case is as follows:
PCollection pEmployees: employees and their corresponding department name. May contain up to 10 million elements.
PCollection pDepartments: department name and the number of elements to be published per department. Will contain a few hundred elements.
Task: collect elements from pEmployees according to the per-department number from pDepartments, for all departments. This will be a big collection (up to a few hundred thousand elements, or a few GB).
We cannot use the Top transform here, as it would have to be applied one department at a time, whereas we have multiple departments and those are themselves in a PCollection. We can assign a row number to each of the elements from pEmployees, join it with pDepartments and filter out the records where row_number > the target number from pDepartments. This will require a global ranking.
Question: how can we assign rank/row numbers to the elements in a PCollection?

This is very close to the Sample transform, but not quite, because Sample applies the same threshold to all keys when used as .perKey(). Generally, Beam currently doesn't support per-key combines with different combine-function parameters per key.
I'd recommend emulating it by using CoGroupByKey to join pEmployees and pDepartments and obtain tuples (CoGbkResult) containing the department name, N = the number of elements to publish, and all employees in that department. Then simply iterate through the employees, emit the first N and discard the rest.
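A minimal sketch of that approach, assuming pEmployees is a PCollection<KV<String, String>> of (department, employee) pairs and pDepartments is a PCollection<KV<String, Integer>> of (department, N); the keying and element types here are assumptions, not something stated in the question:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

final TupleTag<String> employeesTag = new TupleTag<String>() {};
final TupleTag<Integer> countTag = new TupleTag<Integer>() {};

PCollection<KV<String, String>> topPerDepartment =
    KeyedPCollectionTuple.of(employeesTag, pEmployees)   // (department, employee)
        .and(countTag, pDepartments)                     // (department, N)
        .apply(CoGroupByKey.<String>create())
        .apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String department = c.element().getKey();
            CoGbkResult joined = c.element().getValue();
            // Departments missing from pDepartments publish nothing.
            int n = joined.getOnly(countTag, 0);
            int emitted = 0;
            for (String employee : joined.getAll(employeesTag)) {
              if (emitted >= n) {
                break; // discard the rest
              }
              c.output(KV.of(department, employee));
              emitted++;
            }
          }
        }));

The loop stops as soon as N employees have been emitted for a department, so only the kept elements are written to the output PCollection.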

Related

Sum quantities in tree structure

I have a tree structure containing quantities in a property of each node.
I want to build up the sums by adding the quantities of the child nodes and then multiplying the sum by the quantity of the parent. This calculated quantity is then used in the next parent when it collects its child quantities.
I cannot modify a property within the node, because the structure is used for calculating quantities in different sections of the tree.
I attached virtual nodes to the existing tree containing copies of the quantities. The problem is: I cannot execute matches on the virtual nodes and their relations. Is there a way to use a mixture of "real" nodes and virtual nodes as a database to execute Cypher queries on them?
I am open to alternative solutions...
Thanks

Representing a list of items in Neo4j

Suppose you have a list of items (instructions in a function, posts on a blog, episodes in a TV series etc) that need to be kept in order, what is the recommended way to store them in Neo4j? Two possibilities that come to mind:
Assuming the items don't already have a suitable property for sorting by, assign them incrementing sequence numbers.
Use a linked list of nodes.
Which of these is typically recommended? Or is there a third option I'm missing?
Use a linked list.
Sequence numbers still have to be sorted, which is unnecessary overhead. And to do the sort, Neo4j has to iterate through every node in the sequence, even if you are only interested in a small part of it.
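To make the linked-list shape concrete, here is a minimal sketch using the Neo4j Java driver; the bolt URI, credentials, the Episode label and the NEXT relationship type are placeholder names, not anything from the question:

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class EpisodeList {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Model the order with NEXT relationships instead of a sequence property.
            session.run("CREATE (:Episode {title:'Pilot'})-[:NEXT]->"
                      + "(:Episode {title:'Episode 2'})-[:NEXT]->"
                      + "(:Episode {title:'Episode 3'})");

            // Fetching "the item after X" is a single relationship hop - no sorting involved.
            Result next = session.run(
                "MATCH (:Episode {title:'Pilot'})-[:NEXT]->(n) RETURN n.title AS title");
            System.out.println("After the pilot: " + next.single().get("title").asString());

            // Walking the whole list from the head only touches the nodes on the path.
            Result all = session.run(
                "MATCH p = (:Episode {title:'Pilot'})-[:NEXT*0..]->(e:Episode) "
              + "RETURN e.title AS title ORDER BY length(p)");
            for (Record record : all.list()) {
                System.out.println(record.get("title").asString());
            }
        }
    }
}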

Aggregating and ordering over relationship properties

I'm a Cypher newbie so I might be missing something obvious but after reading all the basic recommendation engine posts/tutorials I could find, I can't seem to be able to solve this so all help is appreciated.
I'm trying to make a recommendation function that recommends Places to a User based on Tags from previous Places she enjoyed. The User has a LIKES relationship to Tag, which carries a weight property. Places have a CONTAINS relationship with Tag, but CONTAINS doesn't have any weight associated with it. Also, the more Tags with LIKES weighted above a certain threshold (0.85) a Place has, the higher it should be ordered, so this would add a SUM aggregator.
(User)-[:LIKES]->(Tag)<-[:CONTAINS]-(Place)
My problem is that I can't wrap my head around how to order Places based on the number of Tags pointing to them that have a LIKES relationship with the User, and then how to use the LIKES weights to order the Places.
Based on the following example neo4j console: http://console.neo4j.org/r/klmu5l
The following query should do the trick:
MATCH (n:User {login:'freeman.williamson'})-[r:LIKES]->(tag)
MATCH (place:Place)-[:CONTAINS]->(tag)
WITH place, sum(r.weight) as weight, collect(tag.name) as tags
RETURN place, size(tags) as rate, weight
ORDER BY rate DESC, weight DESC
Which returns:
(42:Place {name:"Alveraville"}) 6 491767416
(38:Place {name:"Raynorshire"}) 5 491766715
(45:Place {name:"North Kristoffer"}) 5 491766069
(36:Place {name:"Orrinmouth"}) 5 491736638
(44:Place {name:"New Corachester"}) 5 491736001
Explanation:
I match the user and the tags he likes.
I match the places containing at least one tag he likes.
Then I use WITH to pipe the sum of the relationship weights, a collection of the tags, and the place.
Then I return those, except that I use size() to count the number of tags in the collection.
All ordered in descending order.

DB Selection and Modeling Time Series Data with Ad-Hoc queries

I have to develop a system for tracking/monitoring performance in a cellular network.
The domain includes a set of hierarchical elements, and each one has an associated set of counters that are reported periodically (every 15 minutes). The system should collect these counter values (available as large XML files) and periodically aggregate them on two dimensions: time (from 15 minutes to hour and from hour to day) and hierarchy (lower-level to higher-level elements). The aggregation is most often a simple SUM, but sometimes requires average/min/max etc. For the element-dimension aggregation it needs to group by the hierarchy (group all children into one parent record). The user should be able to define and view KPIs (Key Performance Indicators), that is, calculations on the various counters. A KPI could be required for just one element, for several elements (producing a data-series for each) or as an aggregation over several elements (resulting in one data-series of aggregated data).
There will be about 10-15 users of the system with probably 20-30 queries an hour. The query response time should be a few seconds (up to 10-15 for very large reports covering many elements and a long time period).
At a high level, this is the flow:
Parse and Input Counter Data - there is a set of XML files which contain a periodic update of counter data for the elements. The size of all files is about 4 GB / 15 minutes (so roughly 400 GB/day).
Hourly Aggregation - once an hour, all the collected counters for all the elements should be aggregated: every 4 records related to an element are aggregated into one hourly record, which should be stored.
Daily Aggregation - once a day, all collected counters for all elements should be aggregated: every 24 records related to an element are aggregated into one daily record.
Element Aggregation - with each of the time-dimension aggregations it may also be required to aggregate along the hierarchy of the elements: all records of child elements are aggregated into one record for the parent element.
KPI Definitions - there should be some way for the user to define a KPI. A KPI is a definition of a calculation based on counters of the same granularity (time dimension). The calculation could (and will) involve more than one element level (e.g. p1.counter1 + sum(c1.counter1) where p1 is a parent of one or more records in c1).
User Interaction - the user can select one or more elements and one or more counters/KPIs, the granularity to use, the time period to view and whether or not to aggregate the selected data.
In case of aggregation, the result is one data-series that includes the "added up" values for all the selected elements for each relevant point in time. In "SQL":
SELECT p1.time, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime and p1.time < :maxTime AND p1.id in :id_list and join
GROUP BY p1.time
In case there is no aggregation, we need to keep the identifiers from p1 and have a data-series for each selected element:
SELECT p1.time, p1.id, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime and p1.time < :maxTime AND p1.id in :id_list and join
The system has to keep data for 10, 100 and 1000 days for 15-minute, hourly and daily records respectively. Following is a size estimate considering integer-only columns at 4 bytes of storage, with 400 counters for elements of type P, 50 for elements of type C and 400 for type GP.
As it adds up, I assume (based on the DDL; in reality, DBs optimize storage) roughly 3.5-4 TB of data, plus probably about 20-30% extra required for indexes. The child "tables" can get close to 2 billion records per table.
It is worth noting that from time to time I would like to add counters (maybe every 2-3 months) as the network evolves.
I once implemented a very similar system (though probably with less data) using Oracle. This time around I may not use a commercial DB and must resort to open-source solutions. Also, with the increasing popularity of NoSQL and dedicated time-series DBs, maybe relational is not the way to go?
How would you approach such a development? What products could be used?
From a few days of research, I came up with the following:
Use MySQL / Postgres
InfluxDB (or a similar product)
Cassandra + Spark
Others?
How would each solution be used, and what would be the advantages/disadvantages of each approach? If you can, please also elaborate on or suggest the overall (hardware) architecture to support this kind of development.
Comments and suggestions are welcome - preferably from people with hands-on experience with similar projects.
Going with Open Source RDBMS:
Using MySQL or Postgres
The table structure would be (imaginary SQL):
CREATE TABLE LEVEL_GRANULARITY (
TIMESTAMP DATE,
PARENT_ID INT,
ELEMENT_ID INT,
COUNTER_1 INT,
...
COUNTER_N INT,
PRIMARY KEY (TIMESTAMP, PARENT_ID, ELEMENT_ID)
)
For example, we will have P1_HOUR, GP_HOUR, P_DAY, GP_DAY etc.
The tables could be partitioned by date to improve query time and ease data management (whole partitions can be removed).
To facilitate fast loading, use the bulk loaders provided with the DB - these loaders are usually faster and insert data in bulk.
Aggregation could be done quite easily with a "SELECT ... INTO ..." style query (since the scope of the aggregation is limited, I don't think it will be a problem).
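As a sketch, the hourly roll-up could be a single statement run from Java over JDBC, shown here as an INSERT INTO ... SELECT into a pre-created P1_HOUR table; the table and column names follow the imaginary schema above, and only one COUNTER column is shown:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Roll the 15-minute rows of one hour window up into one hourly row per element.
public static void aggregateHour(Connection conn, Timestamp hourStart, Timestamp hourEnd)
        throws SQLException {
    String sql =
        "INSERT INTO P1_HOUR (TIMESTAMP, PARENT_ID, ELEMENT_ID, COUNTER_1) "
      + "SELECT ?, PARENT_ID, ELEMENT_ID, SUM(COUNTER_1) "
      + "FROM P1_15MIN "
      + "WHERE TIMESTAMP >= ? AND TIMESTAMP < ? "
      + "GROUP BY PARENT_ID, ELEMENT_ID";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setTimestamp(1, hourStart); // hour bucket written into the hourly table
        ps.setTimestamp(2, hourStart);
        ps.setTimestamp(3, hourEnd);
        ps.executeUpdate();
    }
}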
Queries are straightforward, as aggregation, grouping and joining are built in. I am not sure about the query performance, considering how large the tables are.
Since the workload is write-intensive, I don't think clustering would help much here.
Pros:
Simple configuration (assuming no clusters etc).
SQL query capabilities - flexible
Cons:
Query performance - will it work?
Management overhead
Rigid Schema
Scaling?
Using InfluxDB (or something like that):
I have not used this DB myself; this is written from having played around with it a bit.
The model would be to create a time-series for every element in every level and granularity.
The data series name will include the identifiers of the element and the granularity.
For example P.P_ElementID.G.15MIN or P.P_ElementID.C.C1_ELEMENT_ID.G.60MIN
The data series will contain all the counters relevant for that level.
The input has to parse the XML and build the data series name before inserting the new data points.
InfluxDB has an SQL-like query language and allows specifying the calculation in an SQL-like manner. It also supports grouping. Grouping by element would be possible by using a regular expression, e.g. SELECT counter1/counter2 FROM /^P\.P_ElementID\.C1\..*G\.15MIN/ to get all children of ElementID.
There is a notion of grouping by time; in general it is made for this kind of data.
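For the write path, a minimal sketch of posting one data point over InfluxDB's HTTP line protocol; the database name, element ID and counter name are made up, and the series-name scheme follows the one described above:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Build the series name from the element and granularity, then POST one point.
public static void writePoint(String elementId, long counter1, long epochNanos) throws Exception {
    String series = "P." + elementId + ".G.15MIN";                       // e.g. P.1234.G.15MIN
    String line = series + " counter1=" + counter1 + "i " + epochNanos;  // line protocol
    URL url = new URL("http://localhost:8086/write?db=network_counters");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
        out.write(line.getBytes(StandardCharsets.UTF_8));
    }
    if (conn.getResponseCode() != 204) { // InfluxDB answers 204 No Content on success
        throw new IllegalStateException("write failed: " + conn.getResponseCode());
    }
    conn.disconnect();
}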
Pros:
Should be fast
Supports queries etc. very similar to SQL
Supports deleting by date (but you have to do it on every series...)
Flexible schema
Cons:
Currently seems not to support clusters very easily
Clusters = more maintenance
Can it support millions of data-series (and still work fast)?
Less common, less documented (currently)

Informix: Limitation on Number of Items in IN Clause?

Is there a limitation to the number of items that can go into an IN clause in an Informix query (like the 1000 item limit in Oracle)?
We have a "large" (perhaps 2000) list of item numbers being passed through a web service for selection, so there isn't really any context available beyond the list of items.
The upper limit is imposed by the space that will be taken to create the IN list and the 64 KiB limit on statements. You can typically get to several thousand smallish (6-7 digit) integers without much problem at the syntactic level.
However, you may find that the performance is not as good as creating a temporary table, inserting the several thousand values into that, and then writing the main query to join with that temporary table.
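If you go the temporary-table route, a rough JDBC sketch could look like the following; the items/tmp_items table and item_num column names are made up, and CREATE TEMP TABLE ... WITH NO LOG is the Informix scratch-table form:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Stage the ~2000 item numbers in a session-private temp table, then join instead of IN (...).
public static void selectByItemList(Connection conn, List<Integer> itemNumbers) throws SQLException {
    try (Statement ddl = conn.createStatement()) {
        ddl.execute("CREATE TEMP TABLE tmp_items (item_num INTEGER) WITH NO LOG");
    }
    try (PreparedStatement insert = conn.prepareStatement("INSERT INTO tmp_items VALUES (?)")) {
        for (Integer item : itemNumbers) {
            insert.setInt(1, item);
            insert.addBatch();
        }
        insert.executeBatch(); // batched inserts avoid building one huge statement
    }
    try (PreparedStatement query = conn.prepareStatement(
             "SELECT i.* FROM items i JOIN tmp_items t ON i.item_num = t.item_num");
         ResultSet rs = query.executeQuery()) {
        while (rs.next()) {
            // process each matching item row
        }
    }
}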
