We synchronize data from QuickBooks Desktop to our web service. For every session, we want to pick up only new and modified data (compared to the data already in our database). So we set the FromModifiedDate filter to the latest ModifiedTime among the records in our database.
The problem is that the data returned is not ordered by ModifiedTime; QuickBooks Desktop may return the latest record first. Assume there are n records in the result set, with ModifiedTime t1 to tn (where tn is the latest). On the first iteration, QuickBooks might return record n (ModifiedTime = tn), which we save to the database. Then there is an interruption before the next iteration. On the next run we ask for records from tn onward, and miss all the records that were not synchronized on the previous run (t1, ...).
Is there a way to specify that the result set be ordered by ModifiedTime, so the oldest modified records are always returned first? (e.g., first iteration t1-t5, next t6-t10)
No, QuickBooks Desktop does not support this.
It sounds like this isn't really your issue though. Can you clarify what you mean by:
There is an interruption on the next iteration.
What do you mean by "an interruption"? What sort of interruption are you anticipating? You should be processing every record you get back, every time. Remove the "interruption" from your application and you won't have any issues, right?
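For example, the interruption problem can be avoided on the application side by only advancing the FromModifiedDate watermark after a whole batch has been stored, and by making the save idempotent. A minimal Python sketch (the fetch_since, upsert, load_watermark, and save_watermark callbacks are hypothetical stand-ins for the QBD query and your database layer):

```python
def sync(fetch_since, upsert, load_watermark, save_watermark):
    """One sync pass. All four callbacks are hypothetical stand-ins."""
    watermark = load_watermark()          # latest ModifiedTime fully processed
    records = fetch_since(watermark)      # i.e. FromModifiedDate = watermark
    for rec in records:
        upsert(rec)                       # idempotent: re-saving after a crash is harmless
    if records:
        # Advance the watermark only after the WHOLE batch is stored.
        # If we crash mid-batch, the next run re-fetches from the old
        # watermark and the upserts simply overwrite what was saved.
        save_watermark(max(r["modified"] for r in records))
```

Because the watermark moves only after every record in the batch has been saved, an interruption can at worst cause records to be re-fetched, never skipped.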
Can you please help me understand this with the example below.
Group by cust_id, item_id.
Which records will be processed into the caches (index/data) in each scenario, with sorted input and with unsorted input?
What happens if the cache memory runs out? Which algorithm does it use internally to perform the aggregate calculations?
I don't know about the internal algorithm, but in unsorted mode it is normal for the Aggregator to store all rows in cache and wait for the last row, because the last row could be the first one that must be returned according to the Aggregator rules. The Aggregator will never complain about the order of incoming rows. When using cache, it stores rows in memory first; when the allocated memory is full, it pushes the cache to disk. If it runs out of disk space, the session will fail (and maybe other sessions too, because of the full disk), and you will have to clean up those cache files manually.
In sorted mode there is no such problem: rows come in groups ready to be aggregated, and the aggregated row goes out as soon as all rows from a group have been received, which is detected when one of the key values changes. The Aggregator will complain and stop if the rows are not in the expected order. However, this pushes the problem upstream to the sorting step: that could be a Sorter, which can itself use a lot of cache, or the database with an ORDER BY clause in the SQL query, which could consume resources on the database side.
Be careful also that an SQL ORDER BY may use a different locale (and thus a different sort order) than Informatica.
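To make the difference concrete, here is a rough Python sketch of the two caching behaviours described above, grouping by (cust_id, item_id) and summing a quantity. It illustrates the idea only, not Informatica's actual internals:

```python
def aggregate_unsorted(rows):
    # Unsorted input: must cache every group until the last row arrives,
    # because any group's rows could appear anywhere in the stream.
    cache = {}                          # (cust_id, item_id) -> running sum
    for cust_id, item_id, qty in rows:
        key = (cust_id, item_id)
        cache[key] = cache.get(key, 0) + qty
    return cache                        # emitted only after the final input row

def aggregate_sorted(rows):
    # Sorted input: rows arrive grouped by key, so a group can be emitted
    # as soon as the key changes; only ONE group is ever held in memory.
    result, current_key, total = {}, None, 0
    for cust_id, item_id, qty in rows:
        key = (cust_id, item_id)
        if key != current_key:
            if current_key is not None:
                result[current_key] = total
            current_key, total = key, 0
        total += qty
    if current_key is not None:
        result[current_key] = total     # flush the last group
    return result
```

Both produce the same totals; the difference is that the sorted version's working set is a single group, which is why sorted input avoids the large cache.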
We have a scenario where we want to frequently change the tag of a (single) measurement value.
Our goal is to create a database that stores prognosis values. It should never lose data, and it should track changes to already-written data (edits and overwrites).
Our current plan is to have an additional field, "write_ts", which indicates the point in time at which the measurement value was inserted or changed, and a tag, "version", which is updated with each change.
Furthermore, version '0' should always contain the latest value.
name: temperature
-----------------
time                  write_ts (val)  current_mA (val)  version (tag)  machine (tag)
2015-10-21T19:28:08Z  1445506564      25                0              injection_molding_1
So let's assume I have an updated prognosis value for this example.
So, I do:
SELECT curr_measurement
INSERT curr_measurement with new tag (version = 1)
DROP curr_measurement
//then
INSERT new_measurement with version = 0
Now my question:
If I lose the connection for whatever reason between the SELECT, INSERT, and DROP, I would get duplicate records.
(Or, if I do SELECT, DROP, INSERT instead, I lose data.)
Is there any method to prevent that?
Transactions don't exist in InfluxDB
InfluxDB is a time-series database, not a relational database. Its main use case is not one where users are editing old data.
In a relational database that supports transactions, you are protecting yourself against UPDATE and similar operations. Data comes in, existing data gets changed, you need to reliably read these updates.
The main use case in time-series databases is a lot of raw data coming in, followed by some filtering or transforming to other measurements or databases. Picture a one-way data stream. In this scenario, there isn't much need for transactions, because old data isn't getting updated much.
How you can use InfluxDB
In cases like yours, where there is additional data being calculated based on live data, it's common to place this new data in its own measurement rather than as a new field in a "live data" measurement.
As for version tracking and reliably getting updates:
1) Does the version number tell you anything the write_ts number doesn't? Consider not using it, if it's simply a proxy for write_ts. If version only ever increases, it might be duplicating the info given by write_ts, minus the usefulness of knowing when the change was made. If version is expected to decrease from time to time, then it makes sense to keep it.
2) Similarly, if you're keeping old records: does write_ts tell you anything that the time value doesn't?
3) Logging. Do you need to overwrite (update) values? Or can you get what you need by adding new lines, increasing write_ts or version as appropriate? The latter is the more "InfluxDB-ish" approach.
4) Reading values. You can read all values as they change with updates. If a client app only needs to know the latest value of something that's being updated (and the time it was updated), querying becomes something like:
SELECT LAST(write_ts), current_mA, machine FROM temperature
You could also try grouping the machine values together:
SELECT LAST(*) FROM temperature GROUP BY machine
So what happens instead of transactions?
In InfluxDB, inserting a point with the same tag keys and timestamp over-writes any existing data with the same field keys, and adds new field keys. So when duplicate entries are written, the last write "wins".
So instead of the traditional SELECT, UPDATE approach, it's more like SELECT A, then calculate on A, and put the results in B, possibly with a new timestamp INSERT B.
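For illustration, here is a sketch in InfluxDB line protocol (timestamps and values are made up, reusing the measurement from the question). Both points have the same measurement, tag set, and timestamp, so the second write simply overwrites the first:

```
temperature,machine=injection_molding_1,version=0 current_mA=25,write_ts=1445506564 1445455688000000000
temperature,machine=injection_molding_1,version=0 current_mA=27,write_ts=1445506999 1445455688000000000
```

After both writes, a query at that timestamp returns current_mA=27: the last write wins, with no DROP or UPDATE step involved.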
Personally, I've found InfluxDB excellent for its ability to accept streams of data from all directions, and its simple protocol and schema-free storage means that new data sources are almost trivial to add. But if my use case has old data being regularly updated, I use a relational database.
Hope that clears up the differences.
When previousServerToken is null, CKFetchRecordChangesOperation seems to take several passes to download the first set of data, retrying until the moreComing flag is clear.
It isn't because there are too many records: in my testing I have only around 40 member records, each of which belongs to one of the 6 groups.
The first pass gives two badly-formed member records; the second pass sometimes sends a few member records from a group that has not yet been downloaded, or nothing. Only after the third pass does it download all the remaining groups and members as expected.
Any ideas why this might be?
This can happen if the zone has had lots of records deleted in it. The server scans through all of the changes for the zone and then drops changes for records that have been deleted. Sometimes this can result in a batch of changes with zero record changes, but moreComing set to true.
Take a look at the new fetchAllChanges flag on CKFetchRecordZoneChangesOperation in iOS 10/macOS 10.12. CloudKit will pipeline fetch changes requests for you and you'll just see record changes and zone change tokens until everything in the zone has been fetched.
This is the problem it caused, and what I had to do about it...
I have two types of record: groups, and members (each of which must have a group as its parent).
The problem is that, although CloudKit normally returns parents of records first, it will only do this within a single batch.
Members might therefore be received before their parent group if the group is in a different batch (which can happen if a group has subsequently been edited or renamed, as that moves it later in the processing order).
If you are using arrays on your device to represent the downloaded data, you therefore need either to cache members across a series of batches and process them at the end (after all groups have been received), or to let a member record create a temporary dummy group that is overwritten with the real group's name and other data when it eventually arrives.
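Here is a rough Python sketch of the "cache members across batches" approach (the record shapes and field names are hypothetical, not CloudKit's API; the point is deferring orphans until their parent group appears):

```python
def process_batches(batches):
    """Apply record batches in order, deferring members whose parent
    group has not arrived yet."""
    groups, members, orphans = {}, {}, []
    for batch in batches:
        for rec in batch:
            if rec["type"] == "group":
                groups[rec["id"]] = rec
            elif rec["parent"] in groups:
                members[rec["id"]] = rec
            else:
                orphans.append(rec)         # parent is in a later batch
        # Retry orphans after each batch in case their group just arrived.
        still_orphaned = []
        for rec in orphans:
            if rec["parent"] in groups:
                members[rec["id"]] = rec
            else:
                still_orphaned.append(rec)
        orphans = still_orphaned
    return groups, members, orphans         # orphans left over = missing groups
```

Anything still in orphans after the final batch genuinely has no parent group and can be treated as an error (or given a dummy group, per the alternative above).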
I have a lot of analytics data which I'm looking to aggregate every so often (let's say one minute.) The data is being sent to a process which stores it in an ETS table, and every so often a timer sends it a message to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp. Though I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-increment field with each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
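For concreteness, the two rules could be sketched like this in Python, using a deque rather than ETS (this just pins down the desired behaviour: a fixed capacity plus a periodic TTL sweep):

```python
import time
from collections import deque

class BoundedTtlBuffer:
    """Fixed-capacity buffer that also drops entries older than ttl seconds."""
    def __init__(self, capacity, ttl, clock=time.monotonic):
        self.capacity, self.ttl, self.clock = capacity, ttl, clock
        self.items = deque()                # (timestamp, value), oldest first

    def push(self, value):
        if len(self.items) >= self.capacity:
            self.items.popleft()            # full: drop the oldest entry
        self.items.append((self.clock(), value))

    def trim(self):
        # Periodic sweep (the "timer message" in the question):
        # drop everything older than the TTL.
        cutoff = self.clock() - self.ttl
        while self.items and self.items[0][0] < cutoff:
            self.items.popleft()
```

After every push and trim, the buffer holds at most `capacity` entries, all newer than `ttl` seconds, which is the invariant described above.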
You can determine the amount of memory occupied using ets:info(Tab, memory). The result is in words. But there is a catch: if you are storing binaries, only heap binaries are included. So if you are storing mostly normal Erlang terms, you can use it, and combined with a timestamp as you described, it is a way to go. For the size in bytes, just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
I would do the following: Create a server responsible for
receiving all the data storage messages. These messages should be timestamped by the client process (so it doesn't matter if a message waits a little in the server's message queue). The server then stores them in the ETS table, configured as an ordered_set and using the timestamp, converted to an integer, as the key (if the timestamps are delivered by erlang:now within a single VM they will all be different; if you are using several nodes, you will need to add some extra information, such as the node name, to guarantee uniqueness).
receiving a tick (using, for example, timer:send_interval), then processing the messages received in the last N µsec (computing Key = current time - N, finding the first entry with ets:next(Table, Key), and continuing on to the last message). Finally you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as a node name, it is still possible to use the next function: for example, if the keys are {TimeStamp :: integer(), Node :: atom()}, you can compare against {Time, 0}, since a number is smaller than any atom.
I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products I have with the flag set to false, I can do a call something like this:
Product.where(:exported => false).count
The problem I have is that even the count takes a long time, because the table of 1 million products is constantly being written to. More specifically, exports are happening, so the value I'm interested in counting is ever-changing.
So I'd like to do a dirty read on the table... Not a dirty read always. And I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.
Oh, I should mention: Ruby 1.9.3, Heroku, and PostgreSQL.
Now.. if I'm missing another way to get the count, I'd be excited to try that.
OH SNOT one last thing.. this example is contrived.
PostgreSQL doesn't support dirty reads.
You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
Alternately, use system statistics to get a fast approximation.
Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.
Note that even if Pg did support dirty reads, the read would still not return perfectly up-to-date results, because rows would sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up-to-date count is to prevent concurrent inserts: LOCK TABLE thetable IN EXCLUSIVE MODE.
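The "system statistics" approximation mentioned above can be read directly from the catalog. A sketch, assuming the table behind Product is named products (hypothetical) and that autovacuum/ANALYZE runs regularly, since reltuples is only as fresh as the last ANALYZE:

```sql
-- Whole-table row estimate maintained by VACUUM/ANALYZE
-- (note: not filtered by the exported flag):
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'products';
```

For a filtered estimate, one option is to run EXPLAIN on the real query and read the planner's estimated row count instead of executing the count itself.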
As soon as a query begins to execute, it runs against a frozen read-only snapshot, because that's what MVCC is all about. The values are not changing in that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run; it is operating on data that's locked in time.
If your queries are taking a very long time, it sounds like you need an index on your exported column, or on whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.