Let's say I'm building an app like Uber and I want to predict a user's most likely destination based on their trip history, current latitude/longitude, and the current date and time.
We have millions of users, but each user's needs will probably be too unique to generalize: everyone's commute is so different that what you learn from one user probably won't apply to others.
It seems, then, that I have to build millions of models. How can I merge these models for a simpler deployment process? If that isn't possible, what are the best practices for deploying millions of models?
We have millions of users, but each user's needs will probably be too unique to generalize.
You don't need to build millions of models. Create one model and personalize it for each user or segment of users. For example, Google personalizes its applications, such as Search, YouTube, and the Play Store, based on user behavior. Personalization can target not only a single user but also a group of users. For instance, male and female users of a fitness application may need different advice; in that case you effectively have two kinds of users.
What counts as behavior? Your search history, the information in your Gmail account, your sessions, and many other signals are usually considered user behavior.
Suggesting personalized information to users is a well-known research problem, usually studied under the name of recommender systems.
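To make the "one model, personalized through features" idea concrete, here is a minimal sketch using scikit-learn. The feature names, the simplified binary label ("did the user go to their usual destination?"), and the choice of GradientBoostingClassifier are all assumptions for illustration, not a prescription.

```python
# A single shared model: each row is one historical trip, and the user's
# identity/history enters through features rather than through a separate model.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical columns -- replace with whatever your trip log actually contains.
trips = pd.DataFrame({
    "user_trip_count":         [120, 3, 45, 200],    # how much history this user has
    "user_top_dest_freq":      [0.7, 0.3, 0.5, 0.9], # share of trips to their most common destination
    "pickup_lat":              [40.71, 40.75, 40.70, 40.73],
    "pickup_lng":              [-74.00, -73.98, -74.01, -73.99],
    "hour_of_day":             [8, 22, 8, 18],
    "day_of_week":             [1, 5, 1, 3],
    "went_to_top_destination": [1, 0, 1, 1],          # label: did they go to their usual place?
})

X = trips.drop(columns="went_to_top_destination")
y = trips["went_to_top_destination"]

# One model trained on everyone's trips; per-user behavior lives in the features.
model = GradientBoostingClassifier().fit(X, y)
```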
I have to build millions of models. How can I merge these models for a simpler deployment process?
Merging models is another area of research called ensemble learning. We usually ensemble 2 to 10 models, not millions.
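For completeness, this is roughly what a small ensemble looks like in practice, e.g. with scikit-learn's VotingClassifier; the synthetic data and the two base models below are just placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# A typical ensemble combines a handful of diverse models, not millions.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100)),
    ],
    voting="soft",   # average the predicted probabilities
)
ensemble.fit(X, y)
```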
I am trying to answer the questions below for the business (the business generates revenue from multiple apps through a customer-pay model). The business is interested in:
new users (trend with respect to previous months)
daily active users
Day 1 retention
I came up with the following dimensional model (DM):
Dimensions: users, app, deviceid, useractions, plan, date
Fact: fact_activity(userid, appid, deviceid, actionid)
Actions could be: app installed, app launched, registered, completed purchase, posted comment, played game, etc.
The questions I have are:
Should the fact table contain action_type instead of actionid (to avoid a join with useractions)?
Definition of Day 1 retention: number of app installs followed by an app launch the next day. How do I avoid counting a single user multiple times when they use multiple devices?
Would it be advisable to have device details in the user dimension, or in a separate dimension?
If I need to measure average session duration, should I use another fact table at the session level or tweak the activity fact?
Your questions are really unanswerable without significantly more information about your business processes, data definitions, etc. In effect, you are asking someone to design a dimensional model for you before they can answer your questions, which is obviously not going to happen.
However, I can give you some very generic pointers that may help you:
Dimensions
A dimension describes an entity, so if attributes can't be described as belonging to the same entity, they shouldn't be in the same dimension. In your case, I assume a Device and a User are not the same thing, and therefore they need to be separate dimensions.
Facts
You need to define your measures, i.e. precisely which things you are going to want to aggregate (count, sum, avg, etc.) and how they are defined/calculated.
For each measure, you also need to define its grain, i.e. the minimum set of dimensions that uniquely identifies it. Once you have the grain defined, measures that share the same grain can be held in the same fact table; measures that don't cannot.
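To make this concrete, here is a minimal sketch (using Python's sqlite3 purely for illustration) assuming the grain of fact_activity is one row per user, app, device, action and date. Any table or column names beyond the ones in your post are my own guesses, not a prescription; the final query also shows how counting distinct users avoids double-counting someone who uses multiple devices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One dimension per distinct entity: users and devices kept separate.
CREATE TABLE dim_user   (user_id   INTEGER PRIMARY KEY, signup_date TEXT, plan TEXT);
CREATE TABLE dim_app    (app_id    INTEGER PRIMARY KEY, app_name TEXT);
CREATE TABLE dim_device (device_id INTEGER PRIMARY KEY, platform TEXT, model TEXT);
CREATE TABLE dim_action (action_id INTEGER PRIMARY KEY, action_type TEXT);  -- 'install', 'launch', ...
CREATE TABLE dim_date   (date_id   INTEGER PRIMARY KEY, full_date TEXT, month TEXT);

-- Grain: one row per user, app, device, action and date.
CREATE TABLE fact_activity (
    user_id      INTEGER REFERENCES dim_user(user_id),
    app_id       INTEGER REFERENCES dim_app(app_id),
    device_id    INTEGER REFERENCES dim_device(device_id),
    action_id    INTEGER REFERENCES dim_action(action_id),
    date_id      INTEGER REFERENCES dim_date(date_id),
    action_count INTEGER DEFAULT 1
);
""")

# Daily active users: COUNT(DISTINCT user_id) keeps a user on two devices as one user.
dau_sql = """
SELECT d.full_date, COUNT(DISTINCT f.user_id) AS daily_active_users
FROM fact_activity f
JOIN dim_date d ON d.date_id = f.date_id
GROUP BY d.full_date;
"""
print(conn.execute(dau_sql).fetchall())  # empty here; real rows would come from your ETL
```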
I am building a master database to store all relevant information about our customers. I am using Neo4j.
Below is a sample of our model. We have a Person, who can be registered in three of our mobile applications (App.01, App.02, App.03; we use the CPF as a key, which is like an SSN). In those apps the user can be registered with an email, represented by the Email entity. Those users can have multiple addresses, represented by the Address entity.
The question is:
As I am building master data, in my opinion, if someone queries the MDM database asking for the "best" information about a person, I would return, for example:
Name: John
Best email: email2 (because it has two apps using it)
Best address: addr1 (because it has two apps using it)
So I am going to build some heuristics to define what the "best" email and address are.
For this purpose, I have some options:
I could create an edge from John to email2 and to addr1, so it would be easy for a user of the MDM to get John's "best" address/email.
I could build a REST API endpoint and apply this heuristic at query time.
Does anyone have experience using graph databases or designing MDM databases?
Is this a good approach?
This question is a complement to the question: Using Neo4j to build a Master Data Management
The graph data model is good for storing your master data; however, your master data will most likely co-exist with operational and reference data in the form of dimensions.
If you decide to go with a graph model for your MDM, make sure that you have a well-defined semantic model for the core dimensions in MDM, usually:
Products
Customers
Employees
Assets
Locations
These core dimensions become attributes of your nodes.
Also, decide which MDM architecture style you are going to adopt; some popular ones are:
The Registry - a graph fits this style very well because your master data remains in the system of record (SOR) and the references can be represented in the graph very nicely.
Master Data Hub - extra transformations are required to transpose your system of record from tabular form to the graph.
Master-Master - this style fits well with your MDM in the graph if you do not have too many legacy apps that depend on your MDM.
Approach 1 would add a lot of essentially redundant information (about 2N extra relationships, where N is the number of people), and also require more complex coding to handle changes to a person's apps. And, as always when information is stored redundantly, you would have to be especially careful that inconsistencies do not creep in. But, it should be faster when querying for the "best" contact info.
Approach 2 keeps the DB the same size, but requires a more complex and slower query to get the "best" contact info. However, changing a person's apps and contact info is straightforward.
To decide which approach to use, you should consider whether DB size is an issue, and also look at your use cases and how frequently they will be performed.
Here is a simple heuristic if DB size is not an issue. Suppose G is the frequency at which you need to get a person's "best" contact info, and M is the frequency at which you need to modify a person's apps or contact info. You would pick approach 1 if G/M exceeds some threshold value, K, that you would have to decide on, taking the points above into account.
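If you lean toward approach 2, the "most apps use it wins" heuristic can be evaluated at query time. Below is a minimal sketch using the Neo4j Python driver; the labels, relationship types (REGISTERED_IN, USES_EMAIL) and property names are assumptions about your model rather than anything you've stated.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Pick the email referenced by the largest number of the person's apps.
# Relationship types and property names below are only illustrative.
BEST_EMAIL = """
MATCH (p:Person {cpf: $cpf})-[:REGISTERED_IN]->(app)-[:USES_EMAIL]->(e:Email)
RETURN e.address AS email, count(DISTINCT app) AS apps_using_it
ORDER BY apps_using_it DESC
LIMIT 1
"""

with driver.session() as session:
    record = session.run(BEST_EMAIL, cpf="123.456.789-00").single()
    if record:
        print(record["email"], record["apps_using_it"])
```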
Until now I've worked on a web app for keeping records of different products from different warehouses with regard to inventories, transactions, etc.
I was asked to build an e-commerce front end for selling products from these warehouses, and I would like to know how I should approach this problem.
The warehouse web app has a lot of logic, a lot of products and details, and I don't know whether to use the same database(s) for the second app, mingling the data for user management, sales orders, etc.
I've tried doing my homework, but for the love of the internet I don't even know what to search for; if I'm pointed in the right direction I shall retreat to my cave and study.
I'm not very experienced in this matter, and I would like some help deciding how to approach the problem: go for a unified database or separate, one-way-linked databases, and how hard would the second approach be to maintain?
Speaking of warehouses, I believe that is what you should do with your data, i.e. roll each and every disparate data source into a common set of classes/objects that your eCommerce store consumes and deals with.
To that end, here are some rough pointers:
Abstract the logic currently within your inventory app into a middle-tier WCF service that both your inventory app and your eCommerce app can consume. You don't want your inventory app to be the bottleneck here.
Warehouse your data, e.g. consolidate all of these different data sources into your own classes/data structures that you control. You will need to do this to create an effective MVC pattern that is maintainable and sustainable. You don't want those disparate domain model inventories to control your view model design.
You also don't want to execute all of that disparate logic every time you want to show a product to the end user, so cache the data in a suitable, well-indexed table as described above for high availability, which you can access using Entity Framework or similar. Agree with the business on an acceptable delay and kick off your import/update processes on a schedule (a rough sketch of such a scheduled refresh follows this list).
Use Net.Tcp bindings on your services to move your data around internally. It's quick, it's efficient and there is very little overhead compared to SOAP when dealing in larger data movements.
Depending on the scale required, you may also want to consider implementing a WCF service purely for the back end of your eCommerce store that deals only with customer interactions with the underlying warehoused data sources; this could eventually warrant its own server if the store becomes popular. You could also factor in messaging between your SOA components later down the line.
Profit. No, seriously!
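As promised above, here is a rough, language-agnostic sketch (shown in Python only for brevity) of the scheduled import/refresh loop. The function names, the interval, and where the cache actually lives are all assumptions; in a .NET shop this would more likely be a scheduled task or Windows service calling your WCF layer.

```python
import time

REFRESH_INTERVAL_SECONDS = 30 * 60  # the "acceptable delay" agreed with the business

def pull_from_inventory_sources():
    """Call each warehouse system/service and return normalised product records."""
    return []  # placeholder -- wire up your actual service calls here

def rebuild_product_cache(records):
    """Upsert the normalised records into the well-indexed table the store reads from."""
    pass       # placeholder -- e.g. a bulk upsert through your data access layer

if __name__ == "__main__":
    while True:
        rebuild_product_cache(pull_from_inventory_sources())
        time.sleep(REFRESH_INTERVAL_SECONDS)
```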
I hope this helps. Good luck!
I am the primary developer on a Rails application that allows customers to manage artist portfolio websites. There are four very different types of behavior that the system entails:
Admin behavior, which allows logged-in users to manage the content of their portfolios
Portfolio Display behavior, which renders users' portfolios to general visitors
Conversion Funnel behavior, which displays information about the service and entices new users to sign up
Super Admin behavior, which displays statistics about the other three behaviors to the service owners
Currently, these four sets of behaviors are broken up into namespaced controllers—but they all share the same models. I am wondering, since the four parts of the system share virtually no behavior, if there would be any benefit to splitting the application up into four separate applications or engines (and extracting any shared behavior into gems.)
As examples, the part of the system that deals with user statistics doesn't need to "know about" rendering YouTube embeds in portfolios, the part of the system that deals with displaying portfolios doesn't need to "know about" A/B testing, and the part of the system that deals with signing up new users doesn't need to "know about" much else besides signing up new users.
Additionally, a specific problem that I'd like to address is that I'd like an inexperienced team member to be able to contribute a little bit of code to the part of the site that deals with signing up new users. It wouldn't be the end of the world if a bug or two got pushed to production in that part of the site, but it's extremely important not to allow bugs into production in the part of the system that displays users' portfolios.
So, in terms of the maintainability and legibility of the codebase, would it make sense to separate these four components into separate applications? To what extent would doing so simply entail pushing complexity into a different layer of the system without eliminating it? Is this type of separation better accomplished by creating more cleanly decoupled classes and modules?
Thanks much!
A semi-isolated engine would achieve your goal of allowing siloed code work, and you can easily share resources such as stylesheets (which is likely important since you still offer one service).
A mountable engine also achieves your goals, but sharing is more difficult, since it is meant to stay isolated.
A service-oriented architecture with a number of functionally-oriented apps delivering APIs to one or more front-end apps may also work. It'd be more work upfront, but it could pay off in the long run if you expect the total service to get more complex over time.
Fun choices!
I am building an ad analytics tool which assumes a data structure like this:
Account
Campaign
Keyword
Conversion
I have a lot of information about individual conversion events, which can be tied back to the cost data of each campaign, keyword, ad group, etc. In SQL, you could consider each property a sort of foreign key (text-based) to the campaign, keyword or ad in a particular account, but that's inefficient and slow. It doesn't sound like a great idea to make campaign_id, keyword_id, etc. fields and populate them either, because I want the analytics to be available in near-real time.
What would be a good way to model this with MongoDB?
Assuming a very high volume of conversion events (millions per day or more), a storage engine alone (MongoDB or anything else) won't help you. What you need is the ability to run map-reduce jobs on the data in order to calculate the analytics. You can scale out your cluster as necessary to achieve near-real-time performance.
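To illustrate what such a map-reduce job computes, here is a toy, single-process sketch in Python that aggregates conversions and cost per account/campaign/keyword. In production the same map and reduce steps would run distributed (e.g. on Hadoop), and the event fields shown are assumptions about your data.

```python
from collections import defaultdict

# Toy conversion events -- in production these would be read from HDFS/HBase.
events = [
    {"account": "acct1", "campaign": "spring", "keyword": "shoes", "cost": 0.42},
    {"account": "acct1", "campaign": "spring", "keyword": "shoes", "cost": 0.40},
    {"account": "acct1", "campaign": "spring", "keyword": "boots", "cost": 0.55},
]

# "Map": emit (key, value) pairs per event.
mapped = (((e["account"], e["campaign"], e["keyword"]), (1, e["cost"])) for e in events)

# "Reduce": sum conversion counts and cost per key.
totals = defaultdict(lambda: [0, 0.0])
for key, (count, cost) in mapped:
    totals[key][0] += count
    totals[key][1] += cost

for (account, campaign, keyword), (conversions, cost) in totals.items():
    print(account, campaign, keyword, conversions, round(cost, 2))
```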
The free/open-source options that I can suggest are Hadoop (and probably HBase and Hive) or Riak.
There are other options - I'm only suggesting these two because I have personal experience with them in a high-scale production environment. We're currently using Hadoop to power an analytics system processing billions of events per day.
If you're not into rolling your own and are able and willing to pay (a lot!) then look at GreenPlum and Vertica.
I'll be happy to share more information on potential solution designs - but I'll need more data on what you're trying to achieve - scale, use cases etc.
I'm not sure that MongoDB is really the right choice for something like this, since MongoDB is more about storing less well-structured (or more complex) documents than hierarchical records like these. However, if you do go the MongoDB route, you can just use the account, campaign and keyword tags directly. There is no substantive benefit to abstracting these into meaningless keys in MongoDB. You can index these fields directly in MongoDB.
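For example, a minimal pymongo sketch of that approach, where the hierarchy lives directly on each conversion document and a compound index keeps per-account/campaign/keyword rollups fast; the field and database names are just placeholders.

```python
from pymongo import ASCENDING, MongoClient

db = MongoClient()["ad_analytics"]  # database name is only an example

# Store the hierarchy directly on each conversion document -- no surrogate keys.
db.conversions.insert_one({
    "account":  "acct1",
    "campaign": "spring_sale",
    "ad_group": "running",
    "keyword":  "running shoes",
    "value":    12.50,
    "ts":       "2013-04-01T09:30:00Z",
})

# Compound index so aggregations by account/campaign/keyword stay fast.
db.conversions.create_index(
    [("account", ASCENDING), ("campaign", ASCENDING), ("keyword", ASCENDING)]
)
```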
I don't know what your volumes are going to be or what other factors are affecting your technology choices. However, assuming that your accounts, campaigns and keywords don't change that frequently, you could do this with a plain old RDBMS (SQL Server, Oracle, etc.) using lookup tables for these determinants, where the foreign keys are meaningless integers. If you're doing live analytics you could adopt a star schema and keep all of the numeric FKs on the base fact table (Conversion) so that you aren't joining a chain of four tables to get the whole picture; instead you'd be doing three one-hop joins. This would allow you to summarize at any level with only a single join.