Using RapidMiner to handle a one-to-many classification - machine-learning

I am relatively new to the area and I am trying to solve a classification problem using RapidMiner. I was given a dataset of visits to the doctor, and I have to detect readmission cases. However, since the data originally came from a table with some one-to-many relationships, I have several rows for the same visit, one for each of the medicaments prescribed.
Example:
Consult_ID | Patient_ID | Medic_ID | MedicamentPrescribed | Readmission
133        | 9893       | 23       | Med_X                | YES
133        | 9893       | 23       | Med_Y                | YES
The format given is out of my hands, unfortunately, so I have to work with it. I want to know if there is any standard solution (maybe contained in RapidMiner itself) to a problem like this?
The only way I can think of is creating a new table with one row per visit, adding each possible medicament as a new feature, and then indicating whether it was prescribed or not. I am not happy with this, though, since it would mean an absurdly high number of features (518 distinct medicaments) full of NULL fields.
I could also concatenate all the medicaments into a single column (Med_X,Med_Y), but I would lose a lot of information in the process, since the new string would be treated as a medicament of its own.
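For reference, the first option - one indicator column per medicament - can be prototyped in a few lines of pandas (column and value names taken from the example above); the indicator columns then hold 0/1 rather than NULL:

import pandas as pd

# Toy data mirroring the example rows above.
df = pd.DataFrame({
    "Consult_ID": [133, 133],
    "Patient_ID": [9893, 9893],
    "Medic_ID": [23, 23],
    "MedicamentPrescribed": ["Med_X", "Med_Y"],
    "Readmission": ["YES", "YES"],
})

# One 0/1 indicator column per medicament, one row per visit.
indicators = pd.crosstab(df["Consult_ID"], df["MedicamentPrescribed"]).gt(0).astype(int)

# Collapse the visit-level attributes to one row per visit and attach the indicators.
visits = (df.drop(columns="MedicamentPrescribed")
            .drop_duplicates("Consult_ID")
            .set_index("Consult_ID")
            .join(indicators)
            .reset_index())
print(visits)  # columns: Consult_ID, Patient_ID, Medic_ID, Readmission, Med_X, Med_Y

With 518 distinct medicaments this is still a wide, sparse table, so grouping medicaments into classes or applying feature selection afterwards may be worth exploring.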

Related

Entity relationship with 1-to-1 relation on both sides

When modeling with an entity-relationship diagram, I came across this relationship, which I found weird; I wonder if it is allowed in this model.
+--------+           =-----=           +-------------+
| Driver |---1,1-----= HAS =----1,1----| Performance |
+--------+           =-----=           +-------------+
Two entities, Driver and Performance, and the relationship HAS.
So a Driver must have one and only one Performance, and vice versa.
I wonder if I must merge these two entities into one, but semantically this seems wrong: a Driver is not a Performance.
At the application level, Performance is used to assign points to a Driver in order to rank Drivers.
This is vaguely similar to a bridge table, which would be used to represent a many-to-many relationship:
drivers ----- many / many ----- performance
transformed into:
driver ----- 1 / many ----- bridge table ----- many / 1 ----- performance
but that is not quite what is happening here.
In this case, the middle table seems to play no useful role and should probably be removed. Assuming that performance is a child table of driver (so performance records include a foreign key that references driver), it would likely be fine to remove performance completely and add those columns directly to driver.
If there are other tables that reference performance such that it would be burdensome to remove that table, you could keep it as a separate table, but just make the references point directly to driver.
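As a concrete sketch of the merged design (plain SQL run through Python's sqlite3; the column names here are assumptions, not taken from the question):

import sqlite3

# The 1,1-to-1,1 cardinality means each driver has exactly one performance
# record, so the performance columns can live directly on driver.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE driver (
        driver_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        points    INTEGER NOT NULL DEFAULT 0  -- formerly performance.points
    )
""")
con.execute("INSERT INTO driver (name, points) VALUES (?, ?)", ("Alice", 42))

# Ranking drivers no longer needs a join:
for name, points in con.execute("SELECT name, points FROM driver ORDER BY points DESC"):
    print(name, points)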

Generate an ID (non-editable by user) for a new row created in decision table

1. Is there a way to generate an ID in a decision table when a new row is added during rules authoring?
Say a Decision Table has 2 Offers configured.
Offer_Name | Offer_id | Offer_expiration_date | offer_type | offer_group
Offer1     | 1        | 12-31-2019            | DOLLAR     | DISCOUNT
Offer2     | 2        | 12-31-2030            | DOLLAR     | DISCOUNT
If a business user adds a new row to the decision table, the new row should appear with its Offer_id already populated with the next value, 3.
2. And can this value/column be made non-editable by the user?
Re: 1
This is not a standard feature that ODM supports. The purpose of a Decision Table is to filter a set of existing objects based on the values specified in the columns of the Decision Table, then to apply some actions to either update the resulting objects or perhaps create other objects. In either case, it needs a list of existing objects to work from.

Many (!) users of ODM would like ODM to provide what I call a Data Table, whose purpose is to specify and create a set of objects with the values specified in the columns of the Data Table. Alas, ODM does not provide such a feature, and in the past has intentionally refused to consider such a feature. Your question does not differentiate between Condition Columns and Action Columns, which leads me to believe you are hoping for a Data Table (which does not exist).
Usually, it is possible to re-think your requirement into Condition-Action terms. In the worst case, all rows can share a trivial condition (true = true) and everything else can happen in the actions (such as creating instances). If you are using a Java XOM (and you should be!), you can implement the offer_id functionality behind the scenes in Java.
Re: 2
Older versions of ODM supported Decision Table templates, which allowed a developer to lock certain aspects of the Decision Table from the rule author. That feature is now deprecated (since 8.9, I believe), and there is no replacement for it.

DynamoDB avoiding SCAN for time-series dataset

I'm interested in counting user interactions with uniquely identifiable resources between two points in time.
My use cases are:
Retrieve the total count for an individual resourceId (between time x and time y)
Produce a list of the top resourceIds ordered by count (between time x and time y)
Ideally I'd like to achieve this using DynamoDB. Sorting time-series data in DynamoDB looks to have its challenges, and I'm running into some anti-best-practices while attempting to model the data.
Data model so far
A downsampled table could look like this, where count is the number of interactions with a resourceId within the bounds of a timebin.
| resourceId      | timebin    | count |
|-----------------|------------|-------|
| (Partition Key) | (Sort Key) |       |
The total interaction count for each resource is the sum of the count attribute in each of the items with the same resourceId. As an unbounded "all time" count is of interest, older events never become obsolete, but they can be further downsampled and rolled into larger timebins.
With the above schema, use case 1 is fulfilled by querying a resource using its hash key and enforcing time constraints using the sort key. The total count can then be calculated application-side.
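A minimal boto3 sketch of that query (the table name and the timebin encoding are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("interaction_counts")  # hypothetical name

# Use case 1: one partition, a range of timebins, summed application-side.
resp = table.query(
    KeyConditionExpression=Key("resourceId").eq("1234")
                           & Key("timebin").between("2019-01-01", "2019-02-01")
)
total = sum(item["count"] for item in resp["Items"])
# For long time ranges, keep calling query() with ExclusiveStartKey set to
# LastEvaluatedKey until no more pages remain.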
For use case 2, I'm looking to achieve the equivalent of SQL's GROUP BY resourceId, SUM(count). To do this, the database needs to return all of the items that match the provided timebin constraints, regardless of resourceId. Grouping and summing of counts can then be performed application-side.
Problem: With the above schema a full table scan is required to do this.
This is obviously something I would like to avoid.
Possible solutions
Heavily cache the query for use case 2, so that scan is still used, but only rarely (e.g. once a day).
Maintain an aggregate table with, for example, predefined timeRanges as the Partition Key and the corresponding count as the Sort Key.
i.e.
| resourceId | timeRange (partition) | count (sort) |
|------------|-----------------------|--------------|
| 1234       | "all_time"            | 9999         |
| 1234       | "past_day"            | 533          |
Here, "all_time" has a fixed FROM date, so could be incremented each time a resourceId event is received. "past_day", however, has a moving FROM date so would need to be regularly re-aggregated using updated FROM and TO markers.
My Question
Is there a more efficient way to model this data?
Based on your description of the table, with resourceId being the hash key, aggregations within a single hash key can be accomplished with a query. Additionally, if timebin, the range key, can be compared using greater-than and less-than operators, you will be able to get directly to the records that you want with an efficient query and then sum up the counts on the application side.
However, this will not accomplish your second point, so additional work will be required to meet both requirements.
Maintaining an aggregate table seems like the logical approach for a global leader board. I'd recommend using DynamoDB Streams with AWS Lambda to maintain that aggregate table in near-real-time. This follows the AWS best practices.
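A rough shape for that stream consumer - a Python Lambda fed by the table's stream with NEW_AND_OLD_IMAGES enabled; every name below is an assumption:

import boto3

aggregates = boto3.resource("dynamodb").Table("resource_aggregates")  # hypothetical

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        images = record["dynamodb"]
        new = int(images["NewImage"]["count"]["N"])
        old = int(images.get("OldImage", {}).get("count", {"N": "0"})["N"])
        delta = new - old
        if delta == 0:
            continue
        aggregates.update_item(
            Key={"resourceId": images["NewImage"]["resourceId"]["S"],
                 "timeRange": "all_time"},
            UpdateExpression="ADD #c :d",
            ExpressionAttributeNames={"#c": "count"},  # COUNT is a reserved word
            ExpressionAttributeValues={":d": delta},
        )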
The periodic scan-and-aggregate approach is also valid and, depending on your table size, may be more practical since it is more straightforward to implement, but there are a number of things to watch out for...
Make sure the process that scans is separate from your main application execution logic. Populating this cache in real time would not be practical. Table scans are only practical for real time requests if the number of items in the table is just a few hundred or less.
Make sure you rate limit your scan so that this process doesn't consume all of the IOPS. Alternatively, you could substantially raise the IOPS during this time period and then lower them back once the process completes. Another alternative would be to make a GSI that is as narrow as possible to scan; dedicating the GSI to this process would avoid the need to rate limit, as it could consume all the IOPS it wants without impacting other users of the table. A paced-scan sketch follows below.
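One crude way to pace such a scan (table name, page size, and pause are illustrative knobs, not recommendations):

import time
import boto3

table = boto3.resource("dynamodb").Table("interaction_counts")  # hypothetical name

def slow_scan(page_size=100, pause_s=0.5):
    """Yield every item, sleeping between pages to spare read capacity."""
    start_key = None
    while True:
        kwargs = {"Limit": page_size}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**kwargs)
        yield from page["Items"]
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return
        time.sleep(pause_s)  # crude rate limiting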

Core Data multiple instances of same Entity - entities that share attributes

I am trying to wrap my head around how to have multiple instances of the same Core Data entity. It does not seem possible, so I must be approaching this wrong.
Basically, say I have a shopping cart that can be full of multiple balloons, but each balloon can have a different color. If I edit the template for the balloon, all the balloons will update to reflect the change. So if I change the template's name to 'bacon', all the balloons' names will change to 'bacon' as well.
How would I go about achieving this with Core Data?
EDIT
As requested I will try to clarify what I am trying to do.
Maybe this example will be more clear.
Say you are creating a model for exercises. So you have Ab Roller, Shoulder Press, etc.
In a workout, you may have multiple instances of each. So in one workout you will have, say
Ab Roller
Shoulder Press
Ab Roller
And each instance of Ab Roller would have its own relationship to Sets, which would of course be different for each.
Maybe not the best example but should give a clearer understanding of repeating instances.
I was thinking of having a template entity and an instance entity, with a relationship between them - when the template entity's name is updated, all the instance entities' names update through KVO. Or I could place all the shared attributes (i.e. name) in the relationship (so the instance entity's name attribute returns its template's name attribute) so that they reflect changes to the template. What is the best way to go about it?
I'm going to answer this from a database design point of view, given the agreement in the comments that this is more a generic database design question. If that doesn't clear up all of your questions then hopefully someone who knows the ins and outs of Core Data can clear that side of things up for you.
You're looking at holding some configuration data for your system, and then also holding data for the various instances of entities which use that configuration data. The general pattern you've come up with of having a template entity (I've also seen this called a definition or configuration entity) and an instance entity is certainly one I've come across before, and I don't see a problem with that.
The rules of database normalization tell you to avoid data replication in your database. So if your template entity has a name field, and every single instance entity should have the same name, then you should just leave the name in the template entity and have a reference to that entity via a foreign key. Otherwise, when the name changes you'd have to update every single row in your instance table to match - and that's going to be an expensive operation. Or worse, it doesn't get updated and you end up with mismatched data in your system - this is known as an update anomaly.
So working on the idea of a shopping cart and stock for some sort of e-commerce solution (so like your first example), you might have a BasketItem entity, and an ItemTemplate entity:
ItemTemplate:
* ItemTemplateId
* Name
BasketItem:
* BasketItemId
* ItemTemplateId
* Color
Then your balloon template data and the data for your balloon instances would look like this in the database:
ItemTemplate:
| ItemTemplateId | Name    |
| 7              | Balloon |
BasketItem:
| BasketItemId | ItemTemplateId | Color |
| 582          | 7              | Blue  |
| 583          | 7              | Green |
(This is obviously massively simplified, only looking at that one specific example and ignoring all of the mechanics of the basket and item, so don't take that as an actual design suggestion.)
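The same idea as a runnable sketch, using Python's sqlite3 with the values from the tables above; note that renaming the template once is immediately reflected by every basket item:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ItemTemplate (
        ItemTemplateId INTEGER PRIMARY KEY,
        Name           TEXT NOT NULL
    );
    CREATE TABLE BasketItem (
        BasketItemId   INTEGER PRIMARY KEY,
        ItemTemplateId INTEGER NOT NULL REFERENCES ItemTemplate(ItemTemplateId),
        Color          TEXT NOT NULL
    );
""")
con.execute("INSERT INTO ItemTemplate VALUES (7, 'Balloon')")
con.executemany("INSERT INTO BasketItem VALUES (?, ?, ?)",
                [(582, 7, "Blue"), (583, 7, "Green")])

# One UPDATE against the template, without touching any BasketItem rows...
con.execute("UPDATE ItemTemplate SET Name = 'Party Balloon' WHERE ItemTemplateId = 7")

# ...and every basket item already reports the new name via the join:
for row in con.execute("""
    SELECT b.BasketItemId, t.Name, b.Color
    FROM BasketItem b JOIN ItemTemplate t USING (ItemTemplateId)
"""):
    print(row)  # (582, 'Party Balloon', 'Blue'), (583, 'Party Balloon', 'Green')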
Also, you might want to hold more configuration data, and this could drastically change the design: for instance, you might want to hold configuration data about the available colors for different products. The same concepts used above apply elsewhere - if you realise you're holding "Blue" over and over, and that you might in future want to change it to "Dark blue" because you now stock multiple shades of blue balloon, then it makes sense to store the color only once and have a foreign key pointing to wherever it is stored, so that you're not carrying out a massive update to your entire BasketItem table to change every instance of "Blue" to "Dark blue."
I would highly recommend doing some reading on database design and database normalization. It will help answer any questions you have along these lines. You might find that in some situations you need to break the rules of normalization - perhaps so that an ORM tool works elegantly, or for performance reasons - but it's far better to make those decisions in an informed manner, knowing what problems you might cause and taking further steps to prevent them from happening.

Handling A Search With No Results

There is a great pattern for handling searches with few or no results due to a user having constrained the search with too many filters. It involves showing all the components of a search query with the numbers of hits for that component alone, or for each combination of components.
For example, if I was searching a database of music and I built my search query from the following criteria:
Producer - Martin Hannett
Genre - Electronica
Year - 1977
I would get no matches, as the three don't coincide. However, removing a component does allow for matches.
So rather than just displaying 'No Results', a much better handling of this situation would be to present the number of hits for each combination of components of the query:
Martin Hannett | Electronica (2 Results)
Electronica | 1977 (33 Results)
Given that a search query might have multiple components, how would this be done efficiently in terms of queries? To get numbers for each component, a separate query would need to be performed for each, which seems inefficient. I'm using Rails with Postgres and PGSearch, but I think this question is much more general.
You can take advantage of a key-value store such as Redis in front of your PostgreSQL database to return the count of search results from Redis. Redis is optimized for fast random reads and writes.
Ryan Bates did an episode on autocomplete search to prevent multiple queries to the main database. This case is similar.
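A minimal sketch of that caching layer with redis-py; the key scheme and the compute_counts callback (which would run the actual COUNT queries against Postgres) are hypothetical:

import json
import redis

r = redis.Redis()

def component_counts(components, compute_counts, ttl_s=300):
    """Return per-combination hit counts, serving from Redis when possible."""
    key = "search:counts:" + json.dumps(components, sort_keys=True)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    counts = compute_counts(components)  # expensive Postgres work happens here
    r.setex(key, ttl_s, json.dumps(counts))  # cache briefly
    return counts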
