Dask: Update published dataset periodically and pull data from other clients

I would like to append data to a published Dask dataset from a queue (like Redis). Other Python programs would then be able to fetch the latest data (e.g. once per second/minute) and do some further operations.
Would that be possible?
Which append interface should be used? Should I load the data into a pd.DataFrame first, or is it better to use some text importer?
What append speeds can be assumed? Is it possible to append, let's say, 1k/10k rows in a second?
Are there other good suggestions for exchanging huge and rapidly updating datasets within a Dask cluster?
Thanks for any tips and advice.

You have a few options here.
You might take a look at the streamz project.
You might also take a look at Dask's coordination primitives.
What append speeds can be assumed? Is it possible to append, let's say, 1k/10k rows in a second?
Dask is just tracking remote data. The speed of your application has a lot more to do with how you choose to represent that data (e.g. Python lists vs. pandas DataFrames) than with Dask itself. Dask can handle thousands of tasks per second. Each of those tasks could hold a single row or millions of rows; it's up to you how you build it.
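For the published-dataset route, here is a minimal sketch of what that could look like, assuming a running distributed scheduler; the scheduler address, the "live-data" dataset name, and the step that drains rows from Redis are illustrative, not a fixed Dask recipe. A producer appends new rows to a named dataset and republishes it; any other client can then poll the same name:

from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
import time

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

def append_rows(new_rows: pd.DataFrame) -> None:
    # Fetch the currently published collection, if any
    try:
        current = client.get_dataset("live-data")
    except KeyError:
        current = None
    part = dd.from_pandas(new_rows, npartitions=1)
    updated = part if current is None else dd.concat([current, part])
    # Republish under the same name (unpublish first to avoid a name clash)
    if current is not None:
        client.unpublish_dataset("live-data")
    client.publish_dataset(updated, name="live-data")

# In another Python program: poll once per second and operate on the result
consumer = Client("tcp://scheduler:8786")
while True:
    try:
        df = consumer.get_dataset("live-data")
        print(len(df))  # any further operation on the dask dataframe
    except KeyError:
        pass  # nothing published yet
    time.sleep(1)

Batching the appends (say, once per second rather than once per row) keeps the scheduler overhead low, which matters more for throughput than the publish mechanism itself.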

Does putting multiple time-series in one BigTable table avoid hotspotting?

Say I have 3 unrelated time-series. Each written row key starts with the current timestamp: timestamp#.....
Having each time-series in a separate table will cause hotspotting because new rows are always added at one extremity (latest timestamp).
If we join all 3 time-series in one BigTable table with prefixes:
series1#timestamp#....
series2#timestamp#....
series3#timestamp#....
Does that avoid hotspotting? Will each cluster node handle one time-series?
I'm assuming that there are 3 nodes per cluster and that each of the 3 time-series will receive similar load and will grow in size evenly.
If yes, is there any disadvantage to having multiple unrelated time-series in one BigTable table?
Because you have a timestamp as the first part of your rowkey, I think you're going to get hotspots either way.
In a Bigtable instance, your data is split into groups of contiguous rowkeys (called tablets), and those are distributed evenly across the nodes. To maximize efficiency with Bigtable, you need writes to be distributed across the nodes, and across the tablets within each node. You get hotspotting when you write to the same row or a contiguous set of rows, since that all happens within one tablet. If the timestamp is a prominent leading part of the key, you will keep writing to the same tablet until it fills up and then move on to the next one, rather than writing to multiple tablets within a node.
The Bigtable documentation has a guide for time-series schema design which recommends a few solutions for a use case like yours:
Field promotion: add an additional field to the rowkey before your timestamp to separate out a group of data (USER_ID#timestamp#...)
Salting: take a hash of the timestamp modulo the number of nodes, then prepend that to the rowkey (SALT_RESULT#timestamp#...); see the sketch after this list
Reverse timestamps: if neither of those works, reverse the timestamp. This works best if your most common query is for the latest values, but it can make other queries more difficult
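To illustrate the salting and reversing ideas, here is a small Python sketch; the hash choice, the node count, and the key layout are assumptions for illustration, not part of the Bigtable API:

import hashlib

NUM_NODES = 3  # hypothetical cluster size

def salted_row_key(series: str, timestamp: int) -> str:
    # Deterministic hash of the timestamp, reduced modulo the node count
    digest = hashlib.md5(str(timestamp).encode()).hexdigest()
    salt = int(digest, 16) % NUM_NODES
    # The salt comes first, so rows written at consecutive timestamps
    # land in different tablet ranges instead of one hot tablet
    return f"{salt}#{series}#{timestamp}"

MAX_TS = 10 ** 13  # hypothetical upper bound used for reversing

def reversed_row_key(series: str, timestamp: int) -> str:
    # Reverse the timestamp so the newest rows sort first
    return f"{series}#{MAX_TS - timestamp}"

The trade-off with salting is on the read side: a time-range scan now needs one scan per salt value, run in parallel and merged.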
Edit:
Your approach is definitely similar to salting, but since your data is already in separate tables, you're not actually getting any extra benefit: the hotspotting happens at the tablet level.
To draw it out more, let's say you have this data in separate tables and start writing data. Each table is going to be composed of tablets, which capture timestamps 0-10, 11-20, and so on. Those tablets will automatically be distributed amongst the nodes for the best performance. If the loads are all similar, the 0-10 tablets should all be on separate nodes, the 11-20 tablets will all be on separate nodes, and so on.
With the way your schema is set up, you are constantly writing to the latest tablet (let's say the time is now 91): you're only writing to the 91-100 tablet while ignoring all the other tablets within that node. Since that 91-100 tablet is the only one getting work, the node isn't going to give you optimized performance, and this is what we refer to as hotspotting. A certain tablet gets a spike, but there won't be enough time for the load balancer to correct it.
If you have it in the same table, we can just focus on one node now. series1#0-10 will first get slammed, then series1#11-20, then series1#21-30. There is always one tablet that is getting too much load and not making use of the full node.
There is some more information about load balancing in the documentation.

How are Dataflow bundles created after GroupBy/Combine?

Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
Problem:
I'm seeing DataStoreIO writer errors as objects with similar keys are present in the same transaction.
Question:
I want to understand how my pipeline combines results into bundles after a group by/combine operation. I would expect a bundle to be created for every window after the combine. But apparently, a bundle can contain two or more occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent on the runner?
Is deduplication an option? if so, how would I best approach that?
Note that I'm not looking for a replacement for the datastore writer at the end of the pipeline, I already know that we can use a different strategy. I'm merely trying to understand how the bundling happens.
There are two answers to your question. One is specific to your use case, and the other is in general about bundling / windowing in streaming.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the user ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will emit one separate element for every user+window pair.
So the question is: What are you trying to insert into datastore?
An individual user's resulting aggregate over all time? In that case, you'd need to use a Global Window.
A user's resulting aggregate for every 30 seconds? Then you need to make the window part of the key you use to insert into Datastore, as sketched below. Does that help / make sense?
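For the second case, here is a sketch of what that could look like with the Beam Python SDK; the key format and the assumption that elements arrive as (user_id, aggregate) pairs are mine, and the same idea applies in the Java SDK:

import apache_beam as beam

class KeyByUserAndWindow(beam.DoFn):
    # Runs after the Combine step; each element is a (user_id, aggregate) pair.
    def process(self, element, window=beam.DoFn.WindowParam):
        user_id, aggregate = element
        # Fold the window boundary into the Datastore key so aggregates
        # of the same user from different 30s windows never collide
        # inside a single write transaction.
        key = f"{user_id}#{window.start.to_utc_datetime().isoformat()}"
        yield key, aggregate

# Usage: keyed = combined | beam.ParDo(KeyByUserAndWindow())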
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
Windows belong to single elements; but a bundle may contain elements from multiple windows. As an example, if the data freshness metric makes a big jump*, a number of windows may be triggered - and elements of the same key in different windows would be processed in the same bundle.
* When can data freshness jump suddenly? A stream with a single element that has a very old timestamp and is very slow to process may hold the watermark back for a long time. Once this element is processed, the watermark may jump a lot, to the next-oldest element (check out this lecture on watermarks).

Not able to create relationships on huge data with Neo4j shell

I have data of about 20 million rows. The shell goes into infinite execution when firing the create-relationship queries. Is there a better approach to handling large data?
If you're importing from CSV, you can add USING PERIODIC COMMIT to your import. This is supposed to allow imports in the area of millions of records per second, and it should prevent memory from being eaten up, since it batches the import for you.
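As a sketch, assuming a Neo4j version where USING PERIODIC COMMIT is available (3.x/4.x), a CSV with from_id/to_id columns, and the official Python driver; all names and credentials here are hypothetical:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Commit every 10k rows so the 20M-row import never holds
# the whole transaction state in memory at once.
query = """
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///relationships.csv' AS row
MATCH (a:Node {id: row.from_id})
MATCH (b:Node {id: row.to_id})
CREATE (a)-[:RELATES_TO]->(b)
"""

with driver.session() as session:
    session.run(query)  # must be an auto-commit transaction for periodic commit
driver.close()

At this scale, an index on :Node(id) matters just as much, since each MATCH would otherwise be a full label scan per CSV row.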

Parse large and multiple fetches for statistics

In my app I want to retrieve a large amount of data from Parse to build a view of statistics. However, in the future, as data builds up, this may become a huge amount.
For example, 10,000 results. Even if I fetched in batches of 1,000 at a time, this would result in 10 fetches. This could rapidly send me over Parse's limit of 30 requests per second, specifically when several other chunks of data need to be collected at the same time for other stats.
Any recommendations/tips/advice for this scenario?
You will also run into limits with the skip and limit query variables, and heavy lifting on a mobile device could also present issues for you.
If you can, you should pre-aggregate these statistics, perhaps once per day, so that you can simply request the pre-computed results directly.
Alternatively, create a Cloud Code function to do some processing for you and return the results. Again, you may well run into limits here, so a cloud job may meet your needs better; in that case you may need to create a request object which is processed by the job, and then poll for completion or send out push notifications on completion.

Implementing offline Item based recommendation using Mahout

I am trying to add recommendations to our e-commerce website using Mahout. I have decided to use an item-based recommender; I have around 60K products, 200K users and 4M user-product preferences. I am looking for a way to provide recommendations by calculating the item similarities offline, so that the recommender.recommend() method provides results in under 100 milliseconds.
// Build the data model from a preference file (FileDataModel expects a java.io.File)
DataModel dataModel = new FileDataModel(new File("/FilePath"));
// Tanimoto similarity suits boolean (viewed/bought) preferences
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
// CachingRecommender caches results from the wrapped item-based recommender
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel, _itemSimilarity));
I was hoping someone could point me to a method or a blog post to help me understand the procedure and challenges of computing the item similarities offline. Also, what is the recommended way to store the pre-computed item similarities: should they be stored in a separate DB, or in a memcache?
PS - I plan to refresh the user-product preference data in 10-12 hours.
MAHOUT-1167, introduced into the (soon to be released) Mahout 0.8 trunk, adds a way to calculate similarities in parallel on a single machine. I'm just mentioning it so you keep it in mind.
If you are only going to refresh the user-product preference data every 10-12 hours, you are better off having a batch process that stores these precomputed recommendations somewhere and then delivers them to the end user from there. I cannot give detailed information or advice, because this will vary greatly with many factors, such as your current architecture, software stack, network capacity and so on. In other words, in your batch process, just run over all your users, ask for 10 recommendations for each of them, and store the results somewhere to be delivered to the end user.
If you need a response within 100 milliseconds, it's better to do batch processing in the background on your server. That may include the following jobs:
Fetch data from your own user database (60K products, 200K users and 4M user-product preferences).
Prepare your data model based on the nature of your data (number of parameters, size of data, preference values and a lot more). This can be an important step.
Run the algorithm on the data model (you need to choose the right algorithm for your requirements). The recommendation data is produced here.
Process the resulting data further as the requirements dictate, if needed.
Store this data in a database (a NoSQL one in all my projects).
The above steps should be running periodically as a batch process.
Whenever a user requests recommendations, your service responds by reading the recommendation data from the pre-calculated DB.
You may look at Apache Mahout (for recommendations) for this kind of task.
These are the steps in brief. Hope this helps!
