Cumulocity data export - iot

I noticed a limit of 2000 records per API call for getting collections out of Cumulocity. Will we be constrained to these limits or is there any other batch API available?

You cannot get more than 2000 records in a single collection request at the moment. However, you can narrow the query, e.g. by time range, and fetch the data in multiple requests if it exceeds 2000 records.
Example:
/measurement/measurements?dateFrom={dateFrom}&dateTo={dateTo}
Another way would be to have the data continuously pushed to you; you can use the real-time API: http://cumulocity.com/guides/reference/real-time-notifications/
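For illustration, here is a minimal sketch of the time-window splitting, assuming a hypothetical tenant URL and Basic-auth credentials and using Java 11's java.net.http client; only the /measurement/measurements route and the dateFrom/dateTo parameters come from the answer above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;
import java.util.Base64;

public class MeasurementExport {
    public static void main(String[] args) throws Exception {
        // Hypothetical tenant URL and Basic-auth credentials -- replace with your own.
        String base = "https://tenant.cumulocity.com";
        String auth = Base64.getEncoder().encodeToString("user:password".getBytes());
        HttpClient client = HttpClient.newHttpClient();

        Instant from = Instant.parse("2018-01-01T00:00:00Z");
        Instant to = Instant.parse("2018-01-08T00:00:00Z");
        Duration window = Duration.ofDays(1);

        // Walk the overall interval in one-day windows so each request stays under
        // the 2000-record cap; shrink the window further if a slice still overflows.
        for (Instant start = from; start.isBefore(to); start = start.plus(window)) {
            Instant end = start.plus(window);
            String url = base + "/measurement/measurements?dateFrom=" + start + "&dateTo=" + end;
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Authorization", "Basic " + auth)
                    .GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(start + " -> " + response.body().length() + " bytes");
        }
    }
}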

Related

Parallelization of user fetching with graph api and delta

I use a delta query to get changes to users for a particular tenant. The algorithm looks like this:
Fetch all users and save delta
Use delta to get only changes
Everything works fine; however, the initial call to fetch all users is very slow, as I need to follow nextLink. If a tenant has a huge number of users (> 1,000,000) and the maximum number of items per page is 999, that synchronization takes a lot of time.
I thought I could parallelize it: use a startswith(mail,'{a}') filter and call the API for every letter of the alphabet. The problem is that with this approach I cannot get a delta link (or I would get a delta for every call).
Is there a better way to speed up user fetching?
Delta on users does not support filtering objects today on any property other than the id. You could request support for filtering by adding an idea on UserVoice.
As a workaround, you could sync the users in parallel with a filter using the GET API (/users) and then issue a delta query with $deltatoken=latest to get a token from that point onward, so you don't have to sync all the changes sequentially. This doesn't guarantee consistency, though.
Lastly, sync can be made faster (using delta, without parallelization) by selecting only the properties you need.
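As a rough sketch of that workaround (not an official pattern), assuming a pre-acquired bearer token and Java 11's java.net.http: the $select properties are arbitrary, and a real implementation would still follow @odata.nextLink within each filtered slice.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelUserSync {
    public static void main(String[] args) {
        // Hypothetical bearer token -- acquire it however your app normally does.
        String token = "eyJ...";
        HttpClient client = HttpClient.newHttpClient();

        // One filtered /users request per starting letter, issued in parallel.
        List<CompletableFuture<HttpResponse<String>>> pages = IntStream
                .rangeClosed('a', 'z')
                .mapToObj(c -> "https://graph.microsoft.com/v1.0/users"
                        + "?$filter=startswith(mail,'" + (char) c + "')"
                        + "&$select=id,mail,displayName")
                .map(url -> client.sendAsync(
                        HttpRequest.newBuilder(URI.create(url))
                                .header("Authorization", "Bearer " + token)
                                .GET().build(),
                        HttpResponse.BodyHandlers.ofString()))
                .collect(Collectors.toList());
        pages.forEach(CompletableFuture::join); // in real code, follow @odata.nextLink per response

        // Separately grab a delta token "from now" so future syncs only see changes.
        HttpRequest deltaBaseline = HttpRequest.newBuilder(URI.create(
                        "https://graph.microsoft.com/v1.0/users/delta?$deltatoken=latest"))
                .header("Authorization", "Bearer " + token)
                .GET().build();
        System.out.println(client.sendAsync(deltaBaseline,
                HttpResponse.BodyHandlers.ofString()).join().body());
    }
}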

Reading bulk data from a database using Apache Beam

I would like to know how JdbcIO would execute a query in parallel if my query returns millions of rows.
I have referred to https://issues.apache.org/jira/browse/BEAM-2803 and the related pull requests, but I couldn't understand it completely.
The ReadAll expand method uses a ParDo. Hence, would it create multiple connections to the database to read the data in parallel? If I restrict the number of connections that can be created to the DB in the datasource, will it stick to that connection limit?
Can anyone please help me understand how this is handled in JdbcIO? I am using version 2.2.0.
Update:
.apply(
    ParDo.of(
        new ReadFn<>(
            getDataSourceConfiguration(),
            getQuery(),
            getParameterSetter(),
            getRowMapper())))
The above code shows that ReadFn is applied with a ParDo. I think the ReadFn will run in parallel. If my assumption is correct, how would I use the readAll() method to read from a DB where I can establish only a limited number of connections at a time?
Thanks
Balu
The ReadAll method handles the case where you have many queries. You can store the queries as a PCollection of strings, where each string is a query; when reading, each item is then processed as a separate query in a single ParDo.
This does not work well for a small number of queries, because it limits parallelism to the number of queries, but if you have many, it will perform much faster. This is the case for most ReadAll calls.
From the code it looks like a connection is made per worker in the setup function. This might cover several queries, depending on the number of workers and the number of queries.
Where is the connection limit you mention set? It should behave similarly with or without ReadAll.
See the JIRA issue for more information: https://issues.apache.org/jira/browse/BEAM-2706
I am not very familiar with JdbcIO, but it seems they implemented the version suggested in the JIRA issue, where the PCollection can be of anything and a callback modifies the query depending on the element in the PCollection. This allows each item in the PCollection to represent a query, but it is a bit more flexible than having a new query as each element.
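To make that concrete, here is a hedged sketch of the readAll() pattern, assuming a Beam release where JdbcIO.readAll() is available (it was introduced around 2.2.0). The employees table, the emp_no/first_name columns, the id ranges, and the coders are illustrative assumptions, not the asker's actual schema.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ReadAllSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Each element of this PCollection becomes one execution of the query below.
        PCollection<KV<Long, Long>> ranges = p.apply(Create.of(
                KV.of(0L, 1_000_000L), KV.of(1_000_000L, 2_000_000L), KV.of(2_000_000L, 3_000_000L))
                .withCoder(KvCoder.of(VarLongCoder.of(), VarLongCoder.of())));

        ranges.apply(JdbcIO.<KV<Long, Long>, KV<Long, String>>readAll()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                        "com.mysql.jdbc.Driver", "jdbc:mysql://<IP>:3306/employees")
                        .withUsername("root").withPassword("root"))
                // One parameterized query; each element supplies its bounds via the setter callback.
                .withQuery("SELECT emp_no, first_name FROM employees WHERE emp_no >= ? AND emp_no < ?")
                .withParameterSetter((element, statement) -> {
                    statement.setLong(1, element.getKey());
                    statement.setLong(2, element.getValue());
                })
                .withRowMapper(resultSet -> KV.of(resultSet.getLong(1), resultSet.getString(2)))
                .withCoder(KvCoder.of(VarLongCoder.of(), StringUtf8Coder.of())));

        p.run().waitUntilFinish();
    }
}

Each element triggers one execution of the parameterized query inside ReadFn, so parallelism comes from how many elements the PCollection holds rather than from the query itself.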
I created a DataSource as follows.
ComboPooledDataSource cpds = new ComboPooledDataSource();
cpds.setDriverClass("com.mysql.jdbc.Driver"); // loads the jdbc driver
cpds.setJdbcUrl("jdbc:mysql://<IP>:3306/employees");
cpds.setUser("root");
cpds.setPassword("root");
cpds.setMaxPoolSize(5);
There is a better way to set this driver now.
I set the database pool size to 5. While doing the JdbcIO transform, I used this datasource to create the connection.
In the pipeline, I set
option.setMaxNumWorkers(5);
option.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
I used a query which would return around 3 million records. While observing the DB connections, I saw the number of connections gradually increase while the program was running; it used at most 5 connections at any point.
I think this is how we can limit the number of connections created to a DB while running a JdbcIO transformation to load a bulk amount of data from a database.
Maven dependency for ComboPooledDataSource:
<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
Please feel free to correct the answer if I missed something here.
I had a similar task.
I got the count of records from the database and split it into ranges of 1000 records.
Then I applied readAll to the PCollection of ranges.
Here is a description of the solution.
And thanks to Balu regarding the datasource configuration.
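A minimal sketch of that splitting step, assuming the total row count has already been fetched (e.g. with a SELECT COUNT(*) run before pipeline construction); the resulting list can be passed to Create.of(...) and the readAll() transform sketched earlier.

import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.values.KV;

public class RangeSplitter {
    // Splits [0, total) into consecutive [start, end) ranges of `step` rows (e.g. 1000),
    // each of which becomes one element, and therefore one query, for JdbcIO.readAll().
    static List<KV<Long, Long>> split(long total, long step) {
        List<KV<Long, Long>> ranges = new ArrayList<>();
        for (long start = 0; start < total; start += step) {
            ranges.add(KV.of(start, Math.min(start + step, total)));
        }
        return ranges;
    }
}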

Update huge Firebase database as a transaction

I am using the Firebase database for an iOS application and maintain a huge database. For some events I use Cloud Functions to simultaneously update several sibling nodes as a transaction. However, some nodes contain a huge number of child nodes (maybe one million). Is it worth expanding a huge number of records in a Cloud Function?
Firebase does have limits on the size of data it can GET and POST. Take a look at the Data Tree section of this page: https://firebase.google.com/docs/database/usage/limits
It mentions the max depth and size limits of objects.
Maximum depth of child nodes: 32
Maximum size of a string: 10 MB
If your database has millions of records, you should use the query parameters and limit your requests to smaller subsections of your data.
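As a rough illustration only (using the Realtime Database REST API rather than the iOS SDK or Cloud Functions), a limited read of one subsection might look like the following; the database URL, node name, and key are made up, and authentication is omitted.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FirebaseLimitedRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical database URL and node name; authentication is omitted here.
        // orderBy / startAt / limitToFirst are standard Realtime Database query parameters.
        String url = "https://your-project.firebaseio.com/events.json"
                + "?orderBy=%22$key%22"       // order by key ("$key", quotes URL-encoded)
                + "&startAt=%22event_5000%22" // hypothetical key to resume from on later pages
                + "&limitToFirst=1000";       // cap the payload at 1000 children
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}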

Valence API Grade Export

I've been testing the grade export functionality using Valence, and while benchmarking I have noticed that the process is very slow (about 1.5-2 seconds per user).
Here is the api call I am using:
/d2l/api/le/(D2LVERSION: version)/(D2LID: orgUnitId)/grades/(D2LID: gradeObjectId)/values/(D2LID: userId)
What I am looking to do is export a large number of grades, upwards of 10k. Is this possible using this API?
An alternative to consider is to get all the grades for a particular user with GET /d2l/api/le/(version)/(orgUnitId)/grades/values/(userId)/.
(In your question, it looks like with the call you're using, you're getting the grade values one at a time for each user.)
In the future, we plan to support paging of results in order to better support the case of large class sizes and a high number of grade items. We also plan to offer a call which retrieves a user's grades across all courses.
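Purely as a sketch of the per-user route suggested above: the hostname, API version, org unit, and user ids below are placeholders, and Valence's required ID-key request signing is left out entirely, so a real client would still need to sign each request.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class GradeExportSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, API version, org unit, and user ids.
        String host = "https://your.brightspace.host";
        String version = "1.0";
        String orgUnitId = "12345";
        List<String> userIds = List.of("1001", "1002", "1003");

        HttpClient client = HttpClient.newHttpClient();
        for (String userId : userIds) {
            // One call per user returns that user's values for every grade item,
            // instead of one call per (user, grade item) pair.
            String route = host + "/d2l/api/le/" + version + "/" + orgUnitId
                    + "/grades/values/" + userId + "/";
            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(route)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(userId + ": " + response.statusCode());
        }
    }
}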

How to display large list in the view without using database slicing?

I have a service that generates a large map through multiple iterations and calculations over multiple tables. My problem is that I cannot use a pagination offset to slice the data, because the data comes from multiple tables and different modifications happen to the data. To display this on the screen, I have to send the map with 10,000-20,000 records to the view, and that is problematic with such a large dataset.
At this time I have on-page pagination but this is very slow and inefficient.
One thing I thought of is to dump it into a table and query it each time, but then I have to deal with concurrent users.
My question is what is the best approach to display this list when I cannot use database slicing (offset, max)?
I am using:
Grails 1.0.3
DataTables and jQuery
Maybe SlickGrid is an option for you. One of their examples works with 50,000 rows and it seems to be fast.
Christian
I ended up writing the result of the map to a table and using data slicing on that table for pagination. It takes some time to save the data, but at least I don't have to worry about performance with the large dataset. I use a timestamp to differentiate between requests; each request's data is saved and retrieved with its timestamp.
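A hedged sketch of that staging-table approach in plain JDBC (not the Grails/GORM code actually used): the report_cache table, its columns, and the connection details are assumptions made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class ReportPageSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical table report_cache(request_ts, row_data) and MySQL URL;
        // the JDBC driver is assumed to be on the classpath.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/app", "user", "pass")) {

            Timestamp requestTs = new Timestamp(System.currentTimeMillis());

            // 1) Dump the generated map into the cache table, tagged with this request's timestamp.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO report_cache (request_ts, row_data) VALUES (?, ?)")) {
                insert.setTimestamp(1, requestTs);
                insert.setString(2, "example row"); // one insert (or batch) per map entry
                insert.executeUpdate();
            }

            // 2) Serve each page by slicing only this request's rows.
            try (PreparedStatement page = conn.prepareStatement(
                    "SELECT row_data FROM report_cache WHERE request_ts = ? LIMIT ? OFFSET ?")) {
                page.setTimestamp(1, requestTs);
                page.setInt(2, 20); // max (page size)
                page.setInt(3, 0);  // offset
                try (ResultSet rs = page.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}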
