When I turn to the training model, I get an error: "Machine learning model training process failed.:This operation cannot be completed because the model size exceeds the subscription storage size 5 GB." Yesterday I deleted all snapshots, but now storage is 6Gb. Why the storage has not decreased so far?
On service details it´s an alert saying that it´s necessary to wait one hour until to get the new storage status.
Usage is updated approximately once per hour.
You can see that on rigth top corner settings icon, into Manage Service Details.
Related
Especially in case of errors it makes sense to write the current data into the log, so that the errors are easier to trace.
In a generic logging solution these logs could easily contain personal, identity or otherwise sensitive information.
In flowground, How long are these logs stored? and should i be concerned about the sensitivity of the data that is logged?
Currently flowground uses a maximum disk-space oriented approach to limit the maximum number of logs to be stored.
We currently plan to change that to a time-based approach, something between 3 to 5 days maximum.
Do you have a preference concerning the maximum number of days #Stephan Häußler?
Let's say we have log data with timestamps that can either be streamed into BigQuery or stored as files in Google Storage, but not streamed directly to the unbounded collection source types that Dataflow supports.
We want to analyse this data based on timestamp, either relatively or absolutely, e.g. "how many hits in the last 1 hour?" and "how many hits between 3pm and 4pm on 5th Feb 2018"?
Having read the documentation on windows and triggers, it's not clear how we would divide our incoming data into batches in a way that is supported by Dataflow if we want to have a large window - potentially we want to aggregate over the last day, 30 days, 3 months, etc.
For example, if our batched source is a BigQuery query, run every 5 mins, for the last 5 mins worth of data, will Dataflow keep the windows open between job runs, even though the data is arriving in 5 min chunks?
Similarly, if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved to the bucket, the same question applies - is the job stopped and started, and all knowledge of previous jobs discarded, or does the large window (e.g. up to a month) remain open for new events?
How do we change/modify this pipeline without disturbing the existing state?
Apologies if these are basic questions, even a link to some docs would be appreciated.
It sounds like you want arbitrary interactive aggregation queries on your data. Beam / Dataflow are not a good fit for this per se, however one of the most common use cases of Dataflow is to ingest data into BigQuery (e.g. from GCS files or from Pubsub), which is a very good fit for that.
A few more comments on your question:
it's not clear how we would divide our incoming data into batches
Windowing in Beam is simply a way to specify the aggregation scope in the time dimension. E.g. if you're using sliding windows of size 15 minutes every 5 minutes, then a record whose event-time timestamp is 14:03 counts towards aggregations in three windows: 13:50..14:05, 13:55..14:10, 14:00..14:15.
So: same way as you don't need to divide your incoming data into "keys" when grouping by a key (the data processing framework performs the group-by-key for you), you don't divide it into windows either (the framework performs group-by-window implicitly as part of every aggregating operation).
will Dataflow keep the windows open between job runs
I'm hoping this is addressed by the previous point, but to clarify more: No. Stopping a Dataflow job discards all of its state. However, you can "update" a job with new code (e.g. if you've fixed a bug or added an extra processing step) - in that case state is not discarded, but I think that's not what you're asking.
if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved
It sounds like you want to ingest data continuously. The way to do that is to write a single continuously running streaming pipeline that ingests the data continuously, rather than to start a new pipeline every time new data arrives. In the case of files arriving into a bucket, you can use TextIO.read().watchForNewFiles() if you're reading text files, or its various analogues if you're reading some other kind of files (most general is FileIO.matchAll().continuously()).
I am ingesting data via pub/sub to a dataflow pipeline which is running in unbounded mode. The data are basically coordinates with timestamps captured from tracking devices. Those messages arrive in batches, where each batch might be 1..n messages. For a certain period there might be no messages arriving, which might be resent later on (or not). We use the time-stamp (in UTC) of each coordinate as an attribute for the pub-sub message. And read the pipeline via a Timestamp label:
pipeline.apply(PubsubIO.Read.topic("new").timestampLabel("timestamp")
An example of coordinates and delay looks like:
36 points wait 0:02:24
36 points wait 0:02:55
18 points wait 0:00:45
05 points wait 0:00:01
36 points wait 0:00:33
36 points wait 0:00:43
36 points wait 0:00:34
A message might look like:
2013-07-07 09:34:11;47.798766;13.050133
After the first batch the Watermark is empty, after the second batch I can see a Watermark in the Pipeline diagnostics, just it doesn't get updated, although new messages arrive. Also according to stackdriver logging PubSub has no undelivered or unacknowledged messages.
Shouldn't the watermark move forward as messages with new event time arrive?
According to What is the watermark heuristic for PubsubIO running on GCD? the WaterMark should also move forward every 2minutes which it doesn't?
[..] In the case that we have not seen data on the subscription in more
than two minutes (and there's no backlog), we advance the watermark to
near real time. [..]
Update to address Bens questions:
Is there a job ID that we could look at?
Yes I just restarted the whole setup at 09:52 CET which is 07:52 UTC, with job ID 2017-05-05_00_49_11-11176509843641901704.
What version of the SDK are you using?
1.9.0
How are you publishing the messages with the timestamp labels?
We use a python script to publish the data which is using the pub sub sdk.
A message from there might look like:
{'data': {timestamp;lat;long;ele}, 'timestamp': '2017-05-05T07:45:51Z'}
We use the timestamp attribute for the timestamplabel in dataflow.
What is the watermark stuck at?
For this job the watermark is now stuck at 09:57:35 (I am posting this around 10:10), although new data is sent e.g. at
10:05:14
10:05:43
10:06:30
I can also see that it may happen that we publish data to pub sub with delay of more than 10 seconds e.g. at 10:07:47 we publish data with a highest timestamp of 10:07:26.
After a few hours the watermark catches up but I cannot see why it is delayed /not moving in the beginning.
This is an edge-case in the PubSub watermark tracking logic that has two work arounds (see below). Essentially, if there is no input for 2 minutes, then the watermark will advance to the current time. But, if data is arriving faster than every 2 minutes but still at a very low QPS, then there isn't enough data to have a keep the estimated watermark up to date.
As I mentioned, there are several work arounds:
If you process more data the issue will naturally be resolved.
Alternatively, if you inject extra messages (say 2 per second) it will provide enough data for the watermark to advance more quickly. These just need to have timestamps, and may be immediately filtered out of the pipeline.
For the record, another thing to have in mind about the previously mentioned edge cases in a direct runner context, is the parallelism of the runner. Having a higher parallelism, which is default especially on multicore machines, seems to need even more data. In my case a setting --targetParallelism=1 helped. Basically transformed a stuck pipeline to in a working one without any other intervention.
I am trying to add recommendations to our e-commerce website using Mahout. I have decided to use Item Based recommender, i have around 60K products, 200K users and 4M user-product preferences. I am looking for a way to provide recommendation by calculating the item similarities offline, so that the recommender.recommend() method would provide results in under 100 milli seconds.
DataModel dataModel = new FileDataModel("/FilePath");
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel,_itemSimilarity));
I was hoping if someone could point out to a method or a blog to help me understand the procedure and challenges with an offline computation of the item similarities. Also what is the recommended procedure was storing the pre-computed results from item similarities, should they be stored in a separate db, or a memcache?
PS - I plan to refresh the user-product preference data in 10-12 hours.
MAHOUT-1167 introduced into (the soon to be released) Mahout 0.8 trunk a way to calculate similarities in parallel on a single machine. I'm just mentioning it so you keep it in mind.
If you are just going to refresh the user-product preference data every 10-12 hours, you are better off just having a batch process that stores these precomputed recommendations somewhere and then deliver them to the end user from there. I cannot give detail information or advice due to the fact that this will vary greatly according to many factors, such as your current architecture, software stack, network capacity and so on. In other words, in your batch process, just run over all your users and ask for 10 recommendations for every one of them, then store the results somewhere to be delivered to the end user.
If you need response within 100 Milli seconds, it's better to do batch processing in the background on your server and that may include the following jobs.
Fetching data from your own user database (60K products, 200K users and 4M user-product preferences).
Prepare your data model based on the nature of your data (number of parameters, size of data, preference values etc..lot more) This could be an important step.
Run algorithm on the data model (need to choose the right algorithm according to your requirement). Recommendation data is available here.
May need to process the resultant data as per the requirement.
Store this data into a database (It is NoSQL in all my projects)
The above steps should be running periodically as a batch process.
Whenever a user requests for recommendations, your service provides a response by reading the recommendation data from the pre-calculated DB.
You may look at Apache Mahout (for recommendations) for this kind of task.
These are the steps in brief...Hope this helps !
I'm familiar with using Reachability to determine the type of internet connection (if any) being used on an iOS device. Unfortunately that's not a decent indicator of connection quality. Wifi with low signal strength is pretty sketchy and 3G with anything less than 3 bars is a disaster (not to mention networks that only allow EDGE connections).
How can I determine the quality of my connection so I can help my users decide if they should be downloading larger files on their current connection?
A pragmatic approach would be to download one moderately large-sized file hosted on a reliable, worldwide CDN, at the start of your application. You know the filesize beforehand, you just have to measure the time it takes, make a simple computation and then you've got your estimate of the quality of the connection.
For example, jQuery UI source code, unminified, gzipped weighs roughly 90kB. Downloading it from http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.14/jquery-ui.js takes 327ms here on my Mac. So one can assume I have at least a decent connection that can handle approximately 300kB/s (and in fact, it can handle much more).
The trick is to find the good balance between the original file size and the latency of the network, as the full download speed is never reached on a small file like this. On the other hand, downloading 1MB right after launching your application will surely penalize most of your users, even if it will allow you to measure more precisely the speed of the connection.
Cyrille's answer is a good pragmatic answer, but is not really in the end a great solution in the mobile context for these reasons:
It involves doing a test "at the start of your application" by which I assume he means when your app launches. But your app may execute for a long while, may go background and then back into the foreground, and all the while the user is changing network contexts with changes in underlying network performance - so that initial test result may bear no relationship to the "current" performance of the network connection.
For the reason he rightly points out, that it is "penalizing" your user by making them download a test file over what may already be constrained network conditions.
You also suggest in your original post that you want your user to decide if they should download based on information you present to them. But I would suggest that this is not a good way to approach interacting with mobile users - that you should not be asking them to make complicated decisions. If absolutely necessary, only ask if they want to download the file if you think it may present a problem, but keep it that simple - "Do you want to download XYZ file (100 MB)?" I personally would even avoid even that.
Instead of downloading a test file, the better solution is to monitor and adapt. Measure the performance of the connection as you go along, keep track of the "freshness" of that information you have about how well the connection is performing, and only present your user with a decision to make if based on the on-going performance of the connection it seems necessary.
EDIT: For example, if you determine a patience threshold that in your opinion represents tolerable download performance, keep track of each download that the user does in order to determine if that threshold is being reached. That way, instead of clogging up the users connection with test downloads, you're using the real world activity as the determining factor for "quality of the connection", which is ultimately about the end-user experience of the quality of the connection. If you decide to provide the user with the ability to cancel downloads, then you have an excellent "input" about the user's actual patience threshold, and can adapt your functionality to that situation, by subsequently giving them the choice before they start the download. If you've flipped into this type of "confirmation" mode, but then find that files are starting to download faster, you could dynamically exit the confirmation mode.
Rob's answer is very good, but for a more specific implementation start with (https://developer.apple.com/library/archive/samplecode/SimplePing/Introduction/Intro.html#//apple_ref/doc/uid/DTS10000716)Apple's Simple Ping example source code
Target the domain for the server that you want to monitor connection quality to. Use the ping library to "ping" it on a regular basis (say 1 or 10 seconds depending upon your UI needs). Measure how long it takes to get a response to your ping (or if it never returns) to develop an estimate of the connection quality to communicate to your user.