Update huge Firebase database as a transaction - ios

I am using Firebase database for iOS application and maintain huge database. For some events I have use cloud functions to simultaneous updates of several siblings nodes as a transaction. However some nodes contains huge child nodes (may be one million). Is it worst expanding huge number of records in cloud function?

Firebase does have a limit on the size of data it can GET and POST. Take a look at the Data Tree of this site https://firebase.google.com/docs/database/usage/limits
It mentions the max depth, and size limits of objects.
Maximum depth of child nodes 32
Maximum size of a string 10 MB
If your database has millions or records you should use the query parameters and limit your request to smaller sub sections of your data.

Related

graph data science library stream too slow & how to retrieve node label(type)?

Q1.
We're trying to perform random walk and
I followed the example,
https://github.com/neo4j/graph-data-science-client/blob/main/examples/import-sample-export-gnn.ipynb
our graph consists of 170 million nodes, 1700 million edges
and set enough memory for both heap(256GB) and page(10GB)
we set sampling ratio to sample 10000 nodes,
random walk takes short time,
but retrieving graph to python dataframe via stream takes forever,
is there something that I'm missing here? e.g. indexing, etc
my understanding is, gds basically
project graph from disk to memory by gds.project
perform graph operation and hold the graph path by gds.rwr
fetch the actual node/edge properties by gds.stream
I don't think there's indexing necessary in this circumstance, I'm not an expertise in DB though.
Q2.
We're trying to stream node label from catalog graph, but I can't find any function for that.
How should I fetch node label(type)?
for Q1, after long-wait, We retrieved the graph with desired node count,
I think stream took long time because, it has 40MIL edges
is there some way to limit the edges as well?
I see that there used to be walkLength, walksPerNode
https://community.neo4j.com/t5/neo4j-graph-platform/how-is-the-number-of-random-walks-determined-in-gds/m-p/50250
is there equivalent for gds.alpha.graph.sample.rwr?

Single node with properties takes forever to query

I have a 50K node graph with 10 properties per node. Each node of the same type but with different values. Each of the properties is on an index and I have increased the heap and page cache memory sizes for the database. However using the browser console, creating the nodes takes 6 minutes!
And also a query for all the properties takes a very long time (~2 minutes) to appear in the browser console but when the results do appear the bottom of the browser says that the result of 50K node properties took only 2500 ms.
How do I improve the performance importing/querying 10's of thousands of unique instances a single node with 10 properties each and no relationships?
It takes time to update 10 different indexes for each node that you create. Do you really have use cases that require an index for every single property? If not, get rid of the indexes you do not need. Remember, indexes can speed up finding the first node(s) to initiate a query, but they do not help at all when traversing paths through a graph.
If you really need all 10 indexes, then to speed up the importing step, you can: drop all the indexes, import all 50K nodes, and then create each index one at a time (which will take some time for each index). The overall time will be about the same, but the import itself should be much faster.
It takes the neo4j browser a very long time to generate and display the visualization for a very large result (e.g., 10's of thousands of nodes). The browser is not intended for viewing that much data at one time.
1) Check that you are running a recent version of Neo4j. 3+ has optimised the way that properties are stored and indexed.
2) Check how you're running the query. Maybe your query is not optimised or is problematic in some way. Note in particular that each MATCH generates a 'row'. Multiple MATCH clauses will yield the Cartesian product of all matched sets, which could be problematic with large armounts of data.
3) Check that each of these properties needs to be attached to a node. Neo4j is optimised for searching for relationships, not for properties.
Consider turning nodes that look like this:
(:Train {
maxSpeedInKPH: 350,
fuelType: 'Diesel',
numberOfEngines: 3
})
to
(:Train)
-[:USES_FUEL_TYPE]->(:Fuel {type: 'Diesel'}),
-[:HAS_MAX_SPEED]->(:MaxSpeed {value: 350, unit: 'k/h'}),
-[:HAS_ENGINE]->(:Engine),
-[:HAS_ENGINE]->(:Engine),
-[:HAS_ENGINE]->(:Engine)
There is generally a benefit to spinning properties out into relationships, even if the uniqueness is low. For example if you have a property which has a unique value per node, generally keep that in the node. But if your 50000 nodes have less, say, 25000 unique values in that property, it would probably still be beneficial to spin them out into relationships. This is absolutely the case with integer-type properties, where you'll also be able to add additional "bucket relationships" to provide a form of indexing. In the example above, the max speed was 350. After spinning the property out into a relationship, you could also put an additional relationship of the type [:HAS_MAX_SPEED_ABOVE]-> 300. This would complicate your querying, but should make it faster.
4) If none of the above apply to you, cannot be implemented or do not help, consider switching to a more traditional relational database like SQL. SQL would be a perfect candidate for your use case, i.e. 50k different nodes (rows) with only 10 different properties (columns) and no relationships (joins).

Does bucketing two *large* tables in Hive *in the same way* help perform much more efficient joins?

Imagine the following situation I am planning:
Have two rather large tables stored in Hive, both containing different types of customer related information (say, although this is not exactly the case, a record of customer transactions in one and customer owned data in the other). Let's call the tables A and B.
Tables are large in the sense that none of the tables fits completely in memory. (There are 10 million customers and theres is a few kilobytes of info associated to each of them in each of the two tables)
Be careful enough to bucket both tables in exactly the same way, by a field present in both tables (customer_id, which is a bigint), and using the same number of buckets 100.
I wonder whether this setup will, in any way, guarantee that a join (by customer_id) between both tables will be efficient, in the sense that very little shuffling of information between nodes will be required. I imagine this could the case, if for instance, there were a guarantee that the physical files corresponding to the same bucket in both tables are physically stored in the same (sets of nodes), i.e. if for every bucket i (in [0,99]) the file A/part_0_000i and the file B/part_0_000i were physically stored in the same nodes and the same held for their replicas.
Notes:
I am aware that partitioning and bucketing are different and that the first essentially determines the structure of subdirectories, whereas the second on determines which file each record goes too. This question is about bucketing only
Also, by number 2, map-side joins are not an option here, since, as far as my understading goes, they require loading one of the tables completely within each mapper and doing the join completely there.
Bucketing is used when there are too many levels in your data in which you want to partition by, or there are no good candidate partitions.
A concrete example would be partitioning on customerID in sales data. You may have 20 thousand customers. Partitions would contain small amounts of data which is inefficient and have too many partitions also inefficient. However you can hash the customerID and partition into 50 buckets for example. Then when you are merging on customerID the job will only have to scan against what is in a bucket rather than the entire sum of all your data.
With ideal bucketing your buckets should contain some multiple of the file system block size. Remember also that too many buckets or buckets that are built over varialbes not used as keys can be detrimental for other queries.
I have used them when I need to execute large jobs repeatedly. My queries time has been reduced significantly. I tend to only use when my data is very big. And big is relative to cluster size and capacity.
One great thing about bucketing is that they help ensure the bucketed partitions are of similar size. If you partition over State for example, California will have huge partitions while other states are very small.
Bucketing is tactical and not an appropriate for all use cases. Happy bucketing!
Yes, it will definitely help.
Bucketed tables are partitioned and sorted the same way, so they will be mergesorted, which works in linear time (n), otherwise the tables have to be sorted the same way first, which is usually nlog(n)

DB Selection and Modeling Time Series Data with Ad-Hoc queries

I have to develop a system for tracking/monitoring performance in a cellular network.
The domain includes a set of hierarchical elements, and each one has an associated set of counters that are reported periodically (every 15 minutes). The system should collect these counter values (available as large XML files) and periodically aggregate them on two dimensions: Time (from 15 to hour and from hour to day) and Hierarchy (lower level to higher level elements). The aggregation is most often a simple SUM but sometime requires average/min/max etc. Of course for the element dimension aggregation it needs to group by the hierarchy (group all children to one parent record). The user should be able to define and view KPIs (Key Performance Indicator) - that is, some calculations on the various counters. The KPI could be required for just one element, for several elements (producing a data-series for each) or as an aggregation for several elements (resulting in one data series of aggregated data.
There will be about 10-15 users to the system with probably 20-30 queries an hour. The query response time should be a few seconds (up to 10-15 for very large reports including many elements and long time period).
In high level, this is the flow:
Parse and Input Counter Data - there is a set of XML files which contains a periodical update of counters data for the elements. The size of all files is about 4GB / 15 minutes (so roughly 400GB/day).
Hourly Aggregation - once an hour all the collected counters, for all the elements should be aggregated - every 4 records related to an element are aggregated into one hourly record which should be stored.
Daily Aggregation - once a day, 2 all collected counters, for all elements should be aggregated - every 24 records related to an element are aggregated into one daily record.
Element Aggregation - with each one of the time-dimension aggregation it is possibly required to aggregate along the hierarchy of the elements - all records of child elements are aggregated into one record for the parent element.
KPI Definitions - there should be some way for the user to define a KPI. The KPI is a definition of a calculation based on counters from the same granularity (Time dimension). The calculation could (and will) involved more than one element level (e.g. p1.counter1 + sum(c1.counter1) where p1 is a parent of one or more records in c1).
User Interaction - the user can select one or more elements and one or more counters/KPIs, the granularity to use, the time period to view and whether or not to aggregate the selected data.
In case of aggregation, the results is one data-series that include the "added up" values for all the selected elements for each relevant point in time. In "SQL":
SELECT p1.time SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime and p1.time < :maxTime AND p1.id in :id_list and join
GROUP BY p1.time
In case there is no aggregation need to keep the identifiers from p1 and have a data-series for each selected element
SELECT p1.time, p1.id, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime and p1.time < :maxTime AND p1.id in :id_list and join
The system has to keep data for 10, 100 and 1000 days for 15-min, hour and daily records. Following is a size estimate considering integer only columns at 4 bytes for storage with 400 counters for elements of type P, 50 for elements of type C and 400 for type GP:
As it adds up, I assume the based on DDL (in reality, DBs optimize storage) to 3.5-4 TB of data plus probably about 20-30% extra which will be required for indexes. For the child "tables", can get close to 2 billion records per table.
It is worth noting that from time to time I would like to add counters (maybe every 2-3 month) as the network evolves.
I once implemented a very similar system (though probably with less data) using Oracle. This time around I may not use a commercial DB and must revert to open source solutions. Also with the increase popularity of no-SQL and dedicated time-series DBs, maybe relational is not the way to go?
How would you approach such development? What are the products that could be used?
From a few days of research, I came up with the following
Use MySQL / PostGres
InfluxDB (or a similar product)
Cassandra + Spark
Others?
How could each solution would be used and what would be the advantages/disadvantages for each approach? If you can, elaborate or suggest also the overall (hardware) architecture to support this kind of development.
Comments and suggestions are welcome - preferably from people with hands on experience with similar project.
Going with Open Source RDBMS:
Using MySQL or Postgres
The table structure would be (imaginary SQL):
CREATE TABLE LEVEL_GRANULARITY (
TIMESTAMP DATE,
PARENT_ID INT,
ELEMENT_ID INT,
COUNTER_1 INT
...
COUNTER_N INT
PRIMARY_KEY (TIMESTAMP, PARENT_ID, ELEMENT_ID)
)
For example we will have P1_HOUR, GP_HOUR, P_DAY, GP_DAY etc.
The tables could be partitions by date to enhance query time and ease data management (can remove whole partitions).
To facilitate fast load, use loaders provided with the DB - these loaders are usually faster and insert data in bulks.
Aggregation could be done quite easily with `SELECT ... INTO ...' query (since the scope of the aggregation is limited, I don't think it will be a problem).
Queries are straight forward as aggregation, grouping and joining is built in. I am not sure about the query performance considering how large the tables are.
Since it is a write intensive I don't think the clustering could help here.
Pros:
Simple configuration (assuming no clusters etc).
SQL query capabilities - flexible
Cons:
Query performance - will it work?
Management overhead
Rigid Schema
Scaling?
Using InfluxDB (or something like that):
I have not used this DB and writing from playing around with it some
The model would be to create a time-series for every element in every level and granularity.
The data series name will include the identifiers of the element and the granularity.
For example P.P_ElementID.G.15MIN or P.P_ElementID.C.C1_ELEMENT_ID.G.60MIN
The data series will contain all the counters relevant for that level.
The input has to parse the XML and build the data series name before inserting the new data points.
InfluxDB has an SQL like query language. and allows to specify the calculation in an SQL like manner. It also supports grouping. To group by element would be possible by using regular expression, e.g. SELECT counter1/counter2 FROM /^P\.P_ElementID\.C1\..*G\.15MIN/ to get all children of ElementID.
There is a notion of grouping by time in general it is made for this kind of data.
Pros:
Should be fast
Support queries etc very similar to SQL
Support Deleting by Date (but have to do it on every series...)
Flexible Schema
Cons:
* Currently, seems not to support clusters very easily (
* Clusters = more maintenance
* Can it support millions of data-series (and still work fast)
* Less common, less documented (currently)

Neo4j social graph performance on large respond

I need an advice about performance improving of social graph. The target query works fine with small results number. But it may return large results with more than 1000 rows. Can the performance be tuned on large respond of cypher query?
The cypher query is used:
START givenFriend=node:Nodes('id:709387498'),
item=node:ItemCat1Cat2('category:a.b')
MATCH p = givenFriend-[:FRIEND]-friend1-[:FRIEND]-friend2-[:DATA]->item
RETURN p, item
Neo4j core 1.9.5
The graph contains connected friends:
friend1Node-[:FRIEND]->friend1Node
A friend can have several data items which are represented as nodes with properties:
friendNode-[:DATA]->DataNode
A data node has about 8 properties. Among them is a category property. The data item nodes are indexed by category.
Friend nodes number: 650,772
Friend relationship number: 842,755
Data item nodes number: 5,640
The query which demands improvement should select all paths from a given node id to data item with defined category through 2 friends. The paths have the following view:
givenFriend-friend1-friend2-dataItem
Can traversal improve the performance?
Can migration to 2.0.0 improve the db model and query performance?
**UPD
I use php library https://github.com/jadell/neo4jphp
But I'm open for other variants. Right now I'm looking at neoism(Golang). Also I considered using neo4j extension to perform a query. The target query is tested through the neo4j dashboard as well. So the client layer was absent.
Fresh version of the php lib is using X-Stream. Mine is not. But as a query was tested without a client then this factor can be omitted.
The question was good. I've tuned the query - it returns not a node but properties which I need and the performance is improved a bit.
If I understand you correctly about SLA - such type of requests should work with concurrency 100 and the allowable respond time 2s per request. The query respond time through the dashboard:
LIMIT 1 = 195ms
LIMIT 100 = 564ms
LIMIT 1000 = 1549ms
LIMIT 3000 = 3208ms
SKIP 7000 LIMIT 1 = 2051ms
The respond can contain up to 13K records.
What client do you use?
do you use streaming, i.e. X-Stream:true header
Do only return the data you need, so not path or nodes but only those properties you really need to perform your use case.
2.0.1 would improve performance on the transactional endpoint
What is your SLA and your current response time? How large are the responses?

Resources