This question has been asked a number of times before, but I have been unable to find a full answer. I need to store data in my app using an sqlite3 database, and Core Data is not an option. I want to synchronise the data across devices using iCloud, and the best approach seems to be to send SQL transaction logs to iCloud and use them to keep each device up to date. The process I've come up with so far is as follows:
All database-altering queries (INSERT, UPDATE, DELETE), once executed, are stored in a transactions array, each element of which contains the SQL query and the timestamp at which it was carried out
The database contains a table for logging the point in the transactions array that the app last reached (including the filename of the transactions file stored on iCloud)
The transactions array is saved to a device-unique file on iCloud
When syncing:
Get array of transactions files from iCloud
Create empty array of transactions to be committed
For each file:
Check the database for the last position reached in that transaction file
If none, start from the beginning of the file
Add each transaction from that point onwards to the array of transactions to be committed
Update the database with the new last position of the transaction file so the synced transactions are not repeated
Sort the array of transactions to be committed by transaction timestamp
Execute the commands in the array of transactions to be committed
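A minimal sketch of the local bookkeeping this scheme needs, assuming SQLite; the table and column names below are illustrative placeholders rather than anything prescribed by the steps above:

-- One row per remote transaction-log file, recording how far it has been replayed.
CREATE TABLE IF NOT EXISTS sync_position (
    log_filename TEXT PRIMARY KEY,   -- device-unique file stored on iCloud
    last_index   INTEGER NOT NULL    -- index of the last transaction already committed
);

-- The local transaction log: every INSERT/UPDATE/DELETE is appended here
-- before the file is pushed to iCloud.
CREATE TABLE IF NOT EXISTS transaction_log (
    seq         INTEGER PRIMARY KEY AUTOINCREMENT,
    executed_at TEXT NOT NULL,       -- timestamp used to order transactions on replay
    sql_text    TEXT NOT NULL        -- the query that was executed locally
);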
I am confident that I can get this working in terms of pulling the data down to each device and carrying out the commands to update each local copy. The only problem I envisage is if two devices insert a record to the same table while both offline and then sync. For example:
Device 1 and device 2 both have synchronised copies of the database, with four records each in the table "table1"
Device 1 inserts value "foo" to table "table1" with PK 5
Device 2 inserts value "bar" to table "table1" with PK 5
Device 1 downloads transaction log for device 2 and inserts value "bar" to ID 6
Device 2 downloads transaction log for device 1 and inserts value "foo" to ID 6
We now have a situation where the primary keys for these records are swapped on each device, which will break any links from other tables that rely on the primary key.
I'm still trying to research a solution to this, but in the meantime if anybody has any suggestions I would be extremely grateful!
I've been thinking about this all day and I think I've come up with a solution. I'm posting it here to see if anyone has any comments, and I'll be having a go at implementing it tomorrow.
My idea is to do away with the auto-incrementing integer primary key and replace it with a string-based key. The key will be generated from the device's UUID and a timestamp. This means the key is device-specific and row-unique. So, to rephrase the INSERT example from above using this method (using simplified key strings for ease of reading):
Device 1 and device 2 both have synchronised copies of the database, with four records each in the table "table1"
Device 1 inserts value "foo" to table "table1" with PK "abc123"
Device 2 inserts value "bar" to table "table1" with PK "def456"
Device 1 downloads transaction log for device 2 and inserts value "bar" to ID "def456"
Device 2 downloads transaction log for device 1 and inserts value "foo" to ID "abc123"
This works because the devices know that the transactions will result in the insertion of rows which are keyed using device-specific values. There is therefore no danger of duplicate values in the key column after the insert operation.
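As a rough sketch of how such a key might be built directly in SQLite (the :device_uuid parameter is assumed to be supplied by the app, for example from UIDevice's identifierForVendor, and the column names are illustrative):

-- Key = device UUID + '-' + timestamp with millisecond precision.
INSERT INTO table1 (id, value)
VALUES (:device_uuid || '-' || strftime('%Y%m%d%H%M%f', 'now'), 'foo');

Because the device UUID component differs on every device, two devices can run this concurrently without ever producing the same key.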
Any thoughts on this approach would be welcome!
UPDATE
I'm marking this as the correct answer as it worked. Here's how I modified my existing database (and app) to allow synchronisation across devices. Fortunately this is not a production app yet so the significant changes to the database which were required have not caused a problem.
1. Change all database tables to use a TEXT type column as their primary key
2. Add a "last index" table to the database, keyed by table name, with one other column storing the index number of the last row added to that table
3. Add a method to the app that generates a device- and row-unique TEXT key by retrieving the last inserted index for the table, incrementing it by one, and appending the device's UUID (see the sketch after the syncing steps below)
4. When inserting any rows, invoke the method described in (3) to obtain an appropriate key for the record
5. All database-modifying functions add the executed SQL query to a transaction log; each entry is date- and time-stamped, and the log is saved to a local file using the device's UUID as its filename
6. The device's local transaction log is then pushed up to iCloud
Syncing involves the following:
Download all of the transaction logs from iCloud
Ignoring the device's own transaction log, go through the list of SQL commands in each log and add them to an array of commands, starting from the index where the last sync left off for that log
Store the index reached in each log in the database so the same commands are not re-executed on the next sync
Sort the array of all SQL commands from all logs by date
Execute the commands in order
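A hedged sketch of steps 2 to 4 in SQLite follows; the table and column names are illustrative, and :device_uuid is assumed to be supplied by the app:

-- Step 2: one row per table, holding the index of the last row inserted into it.
CREATE TABLE IF NOT EXISTS last_index (
    table_name TEXT PRIMARY KEY,
    last_row   INTEGER NOT NULL DEFAULT 0
);

-- Steps 3 and 4: bump the index for table1 and use it, combined with the
-- device UUID, as the new row's TEXT primary key.
INSERT OR IGNORE INTO last_index (table_name, last_row) VALUES ('table1', 0);
UPDATE last_index SET last_row = last_row + 1 WHERE table_name = 'table1';

INSERT INTO table1 (id, value)
SELECT :device_uuid || '-' || last_row, 'foo'
FROM last_index
WHERE table_name = 'table1';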
With regard to implementation of this process, I have got as far as step 5. None of the iCloud-specific stuff has been implemented yet. However, I have tested the process by manually copying the transaction logs between devices running copies of the app and I can confirm that the process works. The TEXT primary key containing the device's UUID ensures that no clashes occur. All devices can insert each other's data and the keys will always be unique.
The downside of this is that the database will be larger (because the keys are longer than an integer), and queries will probably take longer because of the string comparisons involved. However, the database I am using is relatively small and there are only a few queries per user interaction, so I do not anticipate a problem here.
I hope this is useful to others who come along with the same question :)
Related
My last record's primary key was 552, but when I added a new record the primary key allotted was 584.
I'm surprised and would like to know the possible reasons for this behavior.
Application Details:
Server: Heroku hobby plan - dyno
Database: Heroku Postgres
Framework: Ruby on Rails
Additional info: I'm using the Rails admin panel to add new records
Possible reasons:
Some records were added and then deleted
The inserting transaction was rolled back for some reason. From the Postgres manual:
Note: Because smallserial, serial and bigserial are implemented using sequences, there may be "holes" or gaps in the sequence of values which appears in the column, even if no rows are ever deleted. A value allocated from the sequence is still "used up" even if a row containing that value is never successfully inserted into the table column. This may happen, for example, if the inserting transaction rolls back.
The corresponding sequence table_name_seq has an increment greater than 1 (probably not your case; this is sometimes useful for sharding)
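A small demonstration of the rollback case, using a throwaway table (all names here are made up for illustration):

CREATE TABLE items (id serial PRIMARY KEY, name text);

INSERT INTO items (name) VALUES ('a');   -- gets id = 1

BEGIN;
INSERT INTO items (name) VALUES ('b');   -- would get id = 2
ROLLBACK;                                -- id 2 is "used up" anyway

INSERT INTO items (name) VALUES ('c');   -- gets id = 3, leaving a gap at 2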
We have a scenario where we want to frequently change the tag of a (single) measurement value.
Our goal is to create a database that stores prognosis values. It should never lose data and must track changes to already written data, such as updates or overwrites.
Our current plan is to have an additional field "write_ts", which indicates at which point in time the measurement value was inserted or changed, and a tag "version" which is updated with each change.
Furthermore the version '0' should always contain the latest value.
name: temperature
-----------------
time                  write_ts (val)  current_mA (val)  version (tag)  machine (tag)
2015-10-21T19:28:08Z  1445506564      25                0              injection_molding_1
So let's assume I have an updated prognosis value for this example value.
So, I do:
SELECT curr_measurement
INSERT curr_measurement with new tag (version = 1)
DROP curr_measurement
//then
INSERT new_measurement with version = 0
Now my question:
If I lose the connection for whatever reason between the SELECT, INSERT, and DROP:
I would get duplicate records.
(Or if I do SELECT, DROP, INSERT: I lose data)
Is there any method to prevent that?
Transactions don't exist in InfluxDB
InfluxDB is a time-series database, not a relational database. Its main use case is not one where users are editing old data.
In a relational database that supports transactions, you are protecting yourself against UPDATE and similar operations. Data comes in, existing data gets changed, you need to reliably read these updates.
The main use case in time-series databases is a lot of raw data coming in, followed by some filtering or transforming to other measurements or databases. Picture a one-way data stream. In this scenario, there isn't much need for transactions, because old data isn't getting updated much.
How you can use InfluxDB
In cases like yours, where there is additional data being calculated based on live data, it's common to place this new data in its own measurement rather than as a new field in a "live data" measurement.
As for version tracking and reliably getting updates:
1) Does the version number tell you anything the write_ts number doesn't? Consider not using it, if it's simply a proxy for write_ts. If version only ever increases, it might be duplicating the info given by write_ts, minus the usefulness of knowing when the change was made. If version is expected to decrease from time to time, then it makes sense to keep it.
2) Similarly, if you're keeping old records: does write_ts tell you anything that the time value doesn't?
3) Logging. Do you need to over-write (update) values? Or can you get what you need by adding new lines, increasing write_ts or version as appropriate? The latter is a more "InfluxDB-ish" approach.
4) Reading values. You can read all values as they change with updates. If a client app only needs to know the latest value of something that's being updated (and the time it was updated), querying becomes something like:
SELECT LAST(write_ts), current_mA, machine FROM temperature
You could also try grouping the machine values together:
SELECT LAST(*) FROM temperature GROUP BY machine
So what happens instead of transactions?
In InfluxDB, inserting a point with the same tag keys and timestamp over-writes any existing data with the same field keys, and adds new field keys. So when duplicate entries are written, the last write "wins".
So instead of the traditional SELECT then UPDATE approach, it's more like: SELECT A, calculate on A, and then INSERT the results as B, possibly with a new timestamp.
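As a hedged illustration of that pattern in InfluxQL, derived values can be written into their own measurement with an INTO clause instead of overwriting the original points (the target measurement name temperature_prognosis is made up):

SELECT LAST(current_mA) AS current_mA
INTO temperature_prognosis
FROM temperature
GROUP BY machine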
Personally, I've found InfluxDB excellent for its ability to accept streams of data from all directions, and its simple protocol and schema-free storage mean that new data sources are almost trivial to add. But if my use case has old data being regularly updated, I use a relational database.
Hope that clears up the differences.
When previousServerToken is null, CKFetchRecordChangesOperation seems to take several passes to download the first set of data, retrying until the moreComing flag is clear.
It isn't because there are too many records; in my testing I only have around 40 member records, each of which belongs to one of the 6 groups.
The first pass gives two badly-formed member records; the second pass sometimes sends a few member records from a group that has not yet been downloaded, or nothing. Only after the third pass does it download all the remaining groups and members as expected.
Any ideas why this might be?
This can happen if the zone has had lots of records deleted in it. The server scans through all of the changes for the zone and then drops changes for records that have been deleted. Sometimes this can result in a batch of changes with zero record changes, but moreComing set to true.
Take a look at the new fetchAllChanges flag on CKFetchRecordZoneChangesOperation in iOS 10/macOS 10.12. CloudKit will pipeline fetch changes requests for you and you'll just see record changes and zone change tokens until everything in the zone has been fetched.
This is the problem it caused, and what I had to do about it...
I have two types of record: groups, and members (which must have a group as their parent).
The problem is that, although CloudKit normally returns parents of records first, it will only do this within a single batch.
Members might therefore be received before their parent group if the group is in a different batch (which can happen if a group has been subsequently edited or renamed, as that moves it later in the processing order).
If you are using arrays on your device to represent the downloaded data, you therefore need to either cache members across a series of batches and process them at the end (after all groups have been received), or allow a record to create a temporary dummy group that is overwritten with that group's name and other data when it eventually arrives.
We are designing a Staging layer to handle incremental load. I want to start with a simple scenario to design the staging.
In the source database there are two tables, e.g. tbl_Department and tbl_Employee. Both of these tables load a single table in the destination database, e.g. tbl_EmployeRecord.
The query which loads tbl_EmployeRecord is:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
Now, we need to identify the incremental load in tbl_Department and tbl_Employee, store it in staging, and load only the incremental data to the destination.
The columns of the tables are,
tbl_Department : DEPARTMENTID,DEPTNAME
tbl_Employee : EMPID,EMPNAME,DEPARTMENTID
tbl_EmployeRecord : EMPID,EMPNAME,DEPTNAME
Kindly suggest how to design the staging for this to handle Insert, Update and Delete.
Identifying Incremental Data
The incremental loading needs to be based on some segregating information present in your source table. Such information helps you identify the incremental portion of the data that you will load. Oftentimes, the load date or last updated date of the record is a good choice for this.
Consider this: your source table has a date column that stores both the date of insertion of a record and the date when any update was made to that record. On any given day during your staging load, you can take advantage of this date to identify which records have been newly inserted or updated since your last staging load, and consider only those changed/updated records as your incremental delta.
Given your table structures, I am not sure which column you could use for this. ID columns will not help, because if a record gets updated you won't know it.
Maintaining Load History
It is important to store information about how much you have loaded today so that you can load the next part in the next load. To do this, maintain a staging table, often called a Batch Load Details table. That table will typically have a structure such as the one below:
BATCH ID | START DATE | END DATE | LOAD DATE | STATUS
------------------------------------------------------
1 | 01-Jan-14 | 02-Jan-14 | 02-Jan-14 | Success
You need to insert a new record into this table every day before you start the data loading. The new record will have a start date equal to the end date of the last successful load, and a null status. Once loading is successful, you update the status to 'Success'.
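A hedged sketch of maintaining that table in SQL; the column names follow the layout above, but details such as seeding the very first batch are left out:

-- Before the load: open a new batch starting where the last successful one ended.
INSERT INTO BATCH_LOAD (BATCH_ID, START_DATE, END_DATE, LOAD_DATE, STATUS)
SELECT MAX(BATCH_ID) + 1, MAX(END_DATE), CURRENT_DATE, CURRENT_DATE, NULL
FROM BATCH_LOAD
WHERE STATUS = 'Success';

-- After the load completes without errors: mark the open batch as successful.
UPDATE BATCH_LOAD
SET STATUS = 'Success'
WHERE STATUS IS NULL;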
Modifying the Data Extraction Query to Take Advantage of the Batch Load Table
Once you maintain your loading history like above, you may include this table in your extraction query,
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
WHERE E.load_date >= (SELECT max(START_DATE) FROM BATCH_LOAD WHERE status IS NULL)
What I am going to suggest is by no means a standard. In fact, you should evaluate my suggestion carefully against your requirements.
Suggestion
Use incremental loading for transaction data, not for master data. Transaction data are generally higher in volume and can easily be segregated into incremental chunks. Master data tend to be more manageable and can be loaded in full every time. In the above example, I am assuming your Employee table behaves like transactional data, whereas your Department table is your master.
I trust this article on incremental loading will be very helpful for you.
I'm not sure what database you are using, so I'll just talk in conceptual terms. If you want to add tags for specific technologies, we can probably provide specific advice.
It looks like you have 1 row per employee and that you are only keeping the current record for each employee. I'm going to assume that EMPIDs are unique.
First, add a field to the query that currently populates the dimension. This field will be a hash of the other fields in the table (EMPID, EMPNAME, DEPTNAME). You can create a view, populate a new staging table, or just use the query. Also add the same hash field to the dimension table. Basically, the hash is an easy way to generate a field that is unique for each record and efficient to compare.
Inserts: These are the records for which the EMPID does not already exist in the dimension table but does exist in your staging query/view.
Updates: These are the records for which the EMPID exists in both the staging query/view and the dimension table, but the hash fields don't match.
Deletes: These are the records for which the EMPID exists in the dimension but does not exist in the staging query/view.
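A hedged sketch of those three comparisons in SQL, assuming a staging view STG_EMPLOYEE, a dimension table DIM_EMPLOYEE with a stored ROW_HASH column, and a generic MD5() function (the hash function and concatenation operator vary by database):

-- Inserts: EMPID in staging but not in the dimension.
SELECT s.EMPID, s.EMPNAME, s.DEPTNAME
FROM STG_EMPLOYEE s
LEFT JOIN DIM_EMPLOYEE d ON d.EMPID = s.EMPID
WHERE d.EMPID IS NULL;

-- Updates: EMPID in both, but the hashes differ.
SELECT s.EMPID, s.EMPNAME, s.DEPTNAME
FROM STG_EMPLOYEE s
JOIN DIM_EMPLOYEE d ON d.EMPID = s.EMPID
WHERE MD5(s.EMPNAME || '|' || s.DEPTNAME) <> d.ROW_HASH;

-- Deletes: EMPID in the dimension but no longer in staging.
SELECT d.EMPID
FROM DIM_EMPLOYEE d
LEFT JOIN STG_EMPLOYEE s ON s.EMPID = d.EMPID
WHERE s.EMPID IS NULL;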
If this will be high-volume, you may want to create new tables to hold the records that should be inserted and the records that should be updated. Once you have identified the records, you can insert/update them all at once instead of one-by-one.
It's a bit uncommon to delete lots of records from a data warehouse, as they are typically used to keep history. I would suggest perhaps creating a column that is a status or a bit field indicating whether the record is active or deleted in the source. Of course, how you handle deletes should depend on your business needs/reporting requirements. Just remember that if you do a hard delete you can never get that data back if you decide you need it later.
Updating the existing dimension in place (rather than creating historical records for each change) is called a Type 1 dimension in dimensional modeling terms. This is fairly common. But if you decide you need to keep history, you can use the hash to help you create the SCD Type 2 records.
I have ten master tables and one transaction table. In my transaction table (it is an in-memory table, just like a ClientDataSet) there are ten lookup fields pointing to my ten master tables.
Now I am trying to dynamically assign key field values to all my lookup key fields (of the transaction table) from a different server (the data is coming as SOAP XML). Before assigning these values I need to check whether the corresponding result value is valid in the master tables. I am using a filter (e.g. status = 1) to check whether it is valid or not.
Currently, before assigning each key field value, we filter the master table using this filter and use the Locate function to check whether the value is there; if it is located, we assign its key field value.
This works fine if there are only a few records in my master tables. But consider my master tables having fifty thousand records each (yes, the customer has that much data); this will lead to a big performance issue.
Could you please help me handle this situation?
Thanks
Basil
The only way to know if it is slow, why, where, and what solution works best is to profile.
Don't make a priori assumptions.
That being said, minimizing round trips to the server and the amount of data transferred is often a good thing to try.
For instance, if your master tables are on the server (not 100% clear from your question), sending only one query (or stored procedure call) that passes all the values to check at once as parameters, does a bunch of "IF EXISTS..." checks, and returns all the answers at once (either as output parameters or as a one-record dataset) would be a good start.
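As a hedged sketch of that idea in a SQL Server-style dialect (the table, column, and parameter names are placeholders, and the real procedure would take one parameter per lookup field):

CREATE PROCEDURE CheckLookupKeys
    @CustomerKey INT,
    @ProductKey  INT,
    @CustomerOk  BIT OUTPUT,
    @ProductOk   BIT OUTPUT
AS
BEGIN
    -- One EXISTS check per master table, all answered in a single round trip.
    SET @CustomerOk = CASE WHEN EXISTS (SELECT 1 FROM Customers
                                        WHERE CustomerKey = @CustomerKey AND Status = 1)
                           THEN 1 ELSE 0 END;
    SET @ProductOk  = CASE WHEN EXISTS (SELECT 1 FROM Products
                                        WHERE ProductKey = @ProductKey AND Status = 1)
                           THEN 1 ELSE 0 END;
END;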
And 50,000 records is not much, so, as I said initially, you may not even have a performance problem. Check it first!