I am new to Event Sourcing, Event Store, Message Store and Machine Learning.
We are planning to implement a message store, and the reason given for using a message store (instead of a traditional CRUD database) is that the message store eventually helps with deep learning or machine learning.
I have a basic understanding of event stores and CQRS, but I am unable to understand how they relate to machine learning.
CQRS/Event sourcing
Machine learning has nothing to do with Event Sourcing and CQRS directly. They are design patterns: CQRS segregates read and write data operations, and event sourcing stores every event that happens to your domain instead of just updating state.
Machine Learning
Machine learning is about data: the more data you have, the better your predictions. Since event sourcing means you store every event that happened to your domain, you have more data to analyze and can predict better results.
Example
I have an online shopping store where some people keep orders in the cart for a long period before making the payment, while other people pay straight away. If you have event sourcing in place you can track user behaviour through events like ItemAdded, ItemRemoved, BookingCreated, etc. You can then use machine learning to predict that the people who paid straight away are likely to buy again and send out discounts to them, or learn their shopping behaviour to show them products they are more interested in.
Imagine that instead of event sourcing you deal only in state, where you simply update an order-status field in the database. The behavioural history is gone, so you can never predict this behaviour.
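To make the contrast concrete, here's a rough Java sketch (all event and type names are made up for illustration, not from any particular framework) of the per-order event trail that event sourcing keeps, and one behavioural feature you could feed to a model that a single status field could never give you:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical event types; with event sourcing every one of these is kept.
sealed interface OrderEvent permits ItemAdded, ItemRemoved, OrderPaid {
    Instant occurredAt();
}
record ItemAdded(String sku, Instant occurredAt) implements OrderEvent {}
record ItemRemoved(String sku, Instant occurredAt) implements OrderEvent {}
record OrderPaid(Instant occurredAt) implements OrderEvent {}

class CheckoutFeatures {
    // One example ML feature you can only compute from the event trail:
    // how long the customer hesitated between first adding an item and paying.
    static Duration hesitation(List<OrderEvent> events) {
        Instant firstAdd = events.stream()
                .filter(e -> e instanceof ItemAdded)
                .map(OrderEvent::occurredAt)
                .findFirst()
                .orElseThrow();
        Instant paidAt = events.stream()
                .filter(e -> e instanceof OrderPaid)
                .map(OrderEvent::occurredAt)
                .findFirst()
                .orElseThrow();
        return Duration.between(firstAdd, paidAt);
    }
}
```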
Hope that helps!
In event sourcing, events/facts are stored in the database, which ensures that any change to an aggregate is documented. What is stored is generally the event payload, in a structured format, which is beneficial for machine learning.
Related
I want to develop an app/software which understands text from various inputs and makes decisions according to it. Further, if at any point the system gets confused, the user can manually supply the output for it, and from then on the system must learn to give that output in such scenarios. Basically, the system must learn from its past experience. The job I want to handle with this system is the mundane job of resolving customer technical problems (production L3 tickets). The input in this case would be the customer's problem with an order (the state the order is stuck in and the state he wants it pushed to), and the second input would be the current state of the order (data retrieved for that order from multiple tables of the DB). For these two inputs the output would be the desired action to take, such as updating certain columns and firing an XML for that order. The tools I think would be required are a natural language processing (NLP) library for understanding text, and machine learning so as to learn from past confusing scenarios.
If you want to use Java libraries for your NLP pipeline, have a look at OpenNLP.
It gives you a lot of basic support.
And then there is deeplearning4j, which has a lot of neural network implementations in Java.
Since you want a dynamic model that can learn from past experiences rather than a static one, deeplearning4j has a number of neural network implementations you can play with.
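As a very first step, here's a minimal sketch of tokenizing a ticket description with OpenNLP's SimpleTokenizer (the ticket text is made up; the learning part, mapping tokens to an action, sits on top of this):

```java
import opennlp.tools.tokenize.SimpleTokenizer;

public class TicketTokenizer {
    public static void main(String[] args) {
        // Hypothetical ticket text; in practice this comes from your L3 queue.
        String ticket = "Order 12345 stuck in PENDING, needs to move to SHIPPED";

        // SimpleTokenizer splits on character classes; no model file needed.
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(ticket);

        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```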
Hope this helps!
Let's take the example of product classification. All products need to be classified as vegetable or not. The business logic is: a product is classified as a vegetable if it comes from company A, B, or C; if it is not from those companies it is not a vegetable. There are millions of products. This can be done in a stored proc with a few lines of code, and the operation may take only a few seconds if done synchronously.
As I understand it, DDD goes against the idea of putting the logic in a stored procedure. The logic can instead be put as a behaviour on the product, which can classify itself based on who its source is. To do this, all the million products need to be read into memory, processed, and then saved back to the database.
The problem here is the large amount of memory this operation needs. If the operation is done in chunks of, say, 50,000, the repository first has to figure out how many products need to be classified and tell the domain that the long-running operation has to go in chunks. Surely this approach is going to take more time, and that means a bad experience for the user, who has to wait longer than the stored procedure would take.
What is the reasonable approach in DDD when it comes to long-running processes? Is the delay expected, so the app has to inform the user that the classification is going to take time and let them know when it is complete? And should one avoid the stored procedure and keep the logic as part of the domain?
UPDATE
Just to add some clarity: this classification process is done quite often. The application has to support the classification process; it is not an ETL and can't wait longer. That's why I'm trying to find the trade-offs between using a stored procedure versus DDD.
Also note that it is not a query but a command. The command could be called ClassifyAllProductsCommand(). Before this command is run there is no classification; after the classification, other users of the system should see the new classification. For example, product A is classified as Unavailable, and after the classification it can be Vegetable or Meat.
Classification is an interesting thing. It is a separate thing. Classification should never be implemented as structure... but that is another story :)
Your classification may even be regarded as a bounded context in the same way that reporting may be a bounded context. As such you may wish to handle classification separately. Your classification is not an aggregate root. It plays an auxiliary role. If it has no impact on the consistency in your domain modelling it may not even necessarily be part of your Product aggregate. It may be added and it may even be changed independently (not as bulk) but if it is used to determine the validity of your aggregate then your classification sub-system is going to have to take that into account.
Please bear in mind that it isn't a matter of DDD vs. a stored procedure. You are executing queries against your data store, and whether that is done via a stored procedure or dynamically should not affect your decision. There is nothing preventing, say, a ProductRepository from calling a stored procedure.
You can have your classification sub-system still execute your SP or use DML directly. However, this isn't necessarily going to be part of your domain. You most certainly do not want to classify each product individually if it is something that happens quite often and as a bulk operation. If your current design dictates that these are bulk operations then keep them as such and don't force them into a DDD structure that is going to be prohibitive.
It is a design choice, and sometimes making changes to individual items does not make sense. It should certainly be your aim to work on a single aggregate at a time, but things like reporting or classification are another animal that doesn't always fit cleanly into Domain-Driven Design thinking.
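To illustrate the earlier point that nothing prevents a repository from calling a stored procedure, here's a rough sketch (the repository shape and procedure name are hypothetical) of delegating the bulk classification to the database over plain JDBC:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical repository: the bulk operation stays set-based in the database,
// but the call site still lives behind a domain-friendly interface.
class ProductClassificationRepository {
    private final Connection connection;

    ProductClassificationRepository(Connection connection) {
        this.connection = connection;
    }

    void classifyAllProducts() throws SQLException {
        // "classify_all_products" is a made-up stored procedure name.
        try (CallableStatement call =
                 connection.prepareCall("{call classify_all_products()}")) {
            call.execute();
        }
    }
}
```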
I think you're confusing DDD. If you were looking for Vegetable-type Products, you would call a service that retrieves Products for a particular Company. There would be no need to load all the products into memory.
Application- or domain-centric design just means designing your application around the business domain rather than upwards from a collection of database tables (a data-centric approach).
In contrast to the data-centric approach, you end up with more data associations (joins) done in your application and fewer in monolithic stored procedures. This moves all your business logic into the application and out of the persistence device (the database), which kinda makes a lot of sense.
Also, if you deny yourself huge table joins, you start thinking carefully about things that traditionally cause massive overhead on your database, and you end up moving towards better design: a separate reporting database, message buses, asynchronous tasks, etc.
EDIT
It seems like a common phrase in DDD but "it depends on your specific domain".
Without knowing the detail, I would want to know how often these classifications occur. Can they be done as the Products are created? Are they done often or rarely, planned or unpredictably?
If the classifications are common and must be done across all one million products, it might be best to create a smaller model for the Product, maybe something with just SmallProduct.Id and SmallProduct.CompanyId (probably named something better). Then cache this smaller collection in memory and perform operations against it (see the sketch after these options).
If the check to see whether a product is a Vegetable is common and only one of a few possible classifications, it might be best to keep Classifications in their own table with a linking table to Products. Then the problem becomes more of a one-time data-setup issue.
On the off chance that you're using a document database, you could just store these classifications in a collection on the Product object itself.
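As a rough sketch of the lightweight in-memory projection idea above (SmallProduct, the classifier shape, and the company ids are all invented names for illustration):

```java
import java.util.List;
import java.util.Set;

// Hypothetical slim projection: just enough state to classify.
record SmallProduct(long id, long companyId) {}

class VegetableClassifier {
    // Companies A, B and C from the question, as made-up ids.
    private static final Set<Long> VEGETABLE_COMPANY_IDS = Set.of(1L, 2L, 3L);

    boolean isVegetable(SmallProduct product) {
        return VEGETABLE_COMPANY_IDS.contains(product.companyId());
    }

    long countVegetables(List<SmallProduct> products) {
        return products.stream().filter(this::isVegetable).count();
    }
}
```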
It seems you are interpreting "classification" as your aggregate root, containing products (as entities).
Honestly, it does not feel like a good design decision (I might be wrong; it depends on the specifics of the requirements).
What if you think of the product as the aggregate root (containing suppliers, discounts, etc.)? In that case, you'll need to load only one product at a time.
If the classification/supplier has a complex domain, you should consider having a separate bounded context for that.
Also, in your comment:
Just to add some clarity: this classification process is done quite often. The application has to support the classification process; it is not an ETL and can't wait longer. That's why I'm trying to find the trade-offs between using a stored procedure versus DDD.
Really? You can't fire an event and have the product service update the classification when there's an update on the supplier? The user will see an inconsistent state (say, an "Undefined" category) for a few seconds or minutes. That's not so bad, is it?
But if you are talking about a batch job then, by all means, go with the stored procedure.
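As a sketch of that event-driven option (the event, handler, and service names are invented for illustration):

```java
// Hypothetical domain event raised when a supplier changes.
record SupplierChanged(long supplierId) {}

// Made-up service interface for the sketch.
interface ProductClassificationService {
    void reclassifyProductsOfSupplier(long supplierId);
}

// Hypothetical handler: reclassifies affected products asynchronously,
// tolerating a brief "Undefined" classification in the meantime.
class ReclassifyOnSupplierChange {
    private final ProductClassificationService classifier;

    ReclassifyOnSupplierChange(ProductClassificationService classifier) {
        this.classifier = classifier;
    }

    void handle(SupplierChanged event) {
        classifier.reclassifyProductsOfSupplier(event.supplierId());
    }
}
```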
Until now I've worked on a web app for keeping records of different products from different warehouses, covering inventories, transactions, etc.
I was asked to build an ecommerce front end for selling products from these warehouses, and I would like to know how I should approach this problem.
The warehouse web app has a lot of logic and a lot of products and details, and I don't know whether to use the same database(s) for the second app, mingling the data for user management, sales orders, and so on.
I've tried doing my homework, but for the love of the internet I don't even know what to search for; if I'm placed on the right track I shall retreat to my cave and study.
I'm not very experienced in this matter and would like some help deciding how to approach the problem: go for a unified database or separate one-way linked databases? And how hard would the second approach be to maintain?
Speaking of warehouses, I believe that is what you should do with your data: roll each and every disparate data source into a common set of classes/objects that your eCommerce store consumes and deals with.
To that end, here are some rough pointers:
Abstract the logic currently within your inventory app into a middle-tier WCF service that both your inventory app and your eCommerce app can consume. You don't want your inventory app to be the bottleneck here.
Warehouse your data, i.e. consolidate all of these different data sources into your own classes/data structures that you control. You will need to do this to create an effective MVC pattern that is maintainable and sustainable. You don't want those disparate domain-model inventories to dictate your view-model design.
You also don't want to execute all of that disparate logic every time you want to show a product to the end user, so cache the data in a well-indexed, suitable table as described above for high availability, which you can get at using Entity Framework or similar. Agree with the business on an acceptable delay and kick off your import/update processes on a schedule.
Use Net.Tcp bindings on your services to move your data around internally. It's quick, it's efficient, and there is very little overhead compared to SOAP when dealing with larger data movements.
Depending on the scale required, you may also want to implement a WCF service purely for the back end of your ecommerce store that deals only with customer interactions with the underlying warehoused data sources; this could eventually warrant its own server if the store becomes popular. You could also figure in messaging between your SOA components later down the line.
Profit. No, seriously!
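The pointers above assume a .NET/WCF stack; to illustrate point 2 (consolidating disparate sources into classes you control), here's a small language-agnostic sketch, written in Java for brevity, with all names invented:

```java
import java.math.BigDecimal;

// Hypothetical unified product shape that the eCommerce app owns.
record StoreProduct(String sku, String title, BigDecimal price, int stockLevel) {}

// Stand-in for whatever one warehouse system actually returns.
record RawWarehouseRecord(String itemCode, String description,
                          BigDecimal unitPrice, int quantityOnHand) {}

// Hypothetical adapter: maps one warehouse's raw record into the unified shape,
// so warehouse-specific quirks never leak into the storefront's view models.
class WarehouseAdapter {
    StoreProduct toStoreProduct(RawWarehouseRecord raw) {
        return new StoreProduct(
            raw.itemCode(),
            raw.description().trim(),
            raw.unitPrice(),
            raw.quantityOnHand()
        );
    }
}
```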
I hope this helps. Good luck!
I have the following problem and was thinking I could use machine learning, but I'm not completely certain it will work for my use case.
I have a data set of around a hundred million records containing customer data (names, addresses, emails, phones, etc.) and would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been entered manually through an external system with no validation, so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For instance, we might have 5 different entries for a customer John Doe, each with different contact details.
We also have cases where multiple records representing different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data-entry system requires one, our consultants will use a random email address, resulting in many different customer profiles sharing the same email address; the same applies to phones, addresses, etc.
All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as the machine learning platform (since this is a Java shop) and maybe HBase to store our data (just because it fits with the Hadoop ecosystem; I'm not sure it would be of any real value). But the more I read about it, the more confused I am about how it would work in my case. For starters, I'm not sure what kind of algorithm I could use, since I'm not sure where this problem falls: could I use a clustering algorithm or a classification algorithm? And of course certain rules will have to define what constitutes a profile's uniqueness, i.e. which fields.
The idea is to deploy this initially as a customer-profile de-duplicator service of sorts that our data-entry systems can use to validate and detect possible duplicates when entering a new customer profile, and in the future perhaps develop it into an analytics platform to gather insight about our customers.
Any feedback will be greatly appreciated :)
Thanks.
There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for it. I've personally tried genetic programming, which worked reasonably well, but I still prefer to tune matching manually.
I have a few references to research papers on this subject. StackOverflow doesn't want too many links, but here is bibliographic info that should be sufficient with Google:
Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d'Aquin, Enrico Motta
A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu
Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer
That's all research, though. If you're looking for a practical solution to your problem, I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene and then searches for matches before doing more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see link above) to create a setup for you. There's also someone who wants to make an ElasticSearch plugin for Duke (see thread), but nothing's been done so far.
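To give a feel for the block-then-compare approach (group candidates by a cheap key, then compare only within each group), here's a naive Java sketch with invented field names; a real setup like Duke's would use proper similarity metrics and tuned thresholds:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical customer record shape.
record Customer(long id, String name, String email, String phone) {}

class NaiveDeduplicator {
    // Blocking: group candidates by a cheap key (normalized email here),
    // so we only compare records inside the same block, not all pairs.
    // Note: shared placeholder emails will over-group, which is exactly why
    // the detailed comparison step below is still needed.
    static Map<String, List<Customer>> blockByEmail(List<Customer> customers) {
        Map<String, List<Customer>> blocks = new HashMap<>();
        for (Customer c : customers) {
            String key = c.email() == null ? "" : c.email().toLowerCase(Locale.ROOT).trim();
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(c);
        }
        return blocks;
    }

    // Detailed comparison inside a block; a real system would score
    // several fields and learn the weights and threshold.
    static boolean probableDuplicate(Customer a, Customer b) {
        return normalize(a.name()).equals(normalize(b.name()))
            || (a.phone() != null && a.phone().equals(b.phone()));
    }

    private static String normalize(String s) {
        return s == null ? "" : s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }
}
```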
Anyway, that's the approach I'd take in your case.
I just came across a similar problem, so I did a bit of Googling and found a library called "Dedupe Python Library":
https://dedupe.io/developers/library/en/latest/
The documentation for this library details common problems and solutions when de-duplicating entries, as well as papers in the de-dupe field. So even if you don't end up using it, the documentation is still worth reading.
I'm interested in learning how to implement a News Feed / Activity Feed on a web app for multiple models like Books, Authors, Comments, etc...
Any recommendations from the group? Gems/Plugins, or experience personally or from others on the best/smartest way to proceed?
Thanks!
You don't need any Gem.
Create a new model, e.g. Activity, to store activity details. The model should store at least the activity timestamp, the event (e.g. created, destroyed, published, ...), and the id of the related record (you can even use a polymorphic association if you want)
Create a method which takes as input a record with additional metadata and creates a new Activity record
In your controllers, call the method each time you want to keep track of an action, passing the modified record as a parameter
You'll then have a list of Activity records you can easily fetch to display the latest events.
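The steps above are Rails-flavoured, but the shape is language-agnostic; here's a rough sketch of the same idea in Java, with all names invented for illustration:

```java
import java.time.Instant;

// Hypothetical activity record: timestamp, event name, and a polymorphic
// reference to the subject (its type plus its id).
record Activity(Instant occurredAt, String event,
                String subjectType, long subjectId) {}

// Stand-in persistence interface for the sketch.
interface ActivityStore {
    void save(Activity activity);
}

// Hypothetical tracker: controllers call this after each action they
// want to appear in the feed.
class ActivityTracker {
    private final ActivityStore store;

    ActivityTracker(ActivityStore store) {
        this.store = store;
    }

    void track(String event, String subjectType, long subjectId) {
        store.save(new Activity(Instant.now(), event, subjectType, subjectId));
    }
}
```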
First and foremost, I’d like to be open and say that I am an employee of Stream, an API for building scalable news and activity feeds – much like you would see on Facebook, Instagram, and other social media applications.
From my extensive experience as a developer and consultant, and from continued research and self-education, Stream's technology stack is extremely effective and competitive. You can get a news or activity feed up and running in a fraction of the time it would take you to build out your own infrastructure (Cassandra clusters, queuing mechanisms, etc.).
That being said, I highly recommend checking out Stream. What it really comes down to is buy vs. build: you can spend months building a custom solution, or rely on a proven and scalable platform such as Stream that offers everything you need to get up and going in a fraction of the time.
If you're skeptical, check out the 5 minute tutorial at https://getstream.io/get_started/.
Best of luck!