Machine learning - Feature contains a list of values

I have a dataset that contains many features. One feature contains a list of values in a single data point. It can look like this:
A B C
1 2 [3,4,5]
How can I handle feature C for a recommendation system? I know about one-hot encoding, but feature C does not have a finite set of values: C contains the ID numbers of other items, so it can grow larger and larger over time. Is there any solution for dealing with this type of feature?

From what you described, and since you mentioned a recommendation system, I would read your dataset as an example of the following:
each row is a user; features A and B are, say, the user's personal information; and feature C is the items they bought. Naturally, feature C does not contain the same number of items in each row, and it can expand.
I would build two different recommendation models and combine them afterward: one for features A and B, and another for feature C.
Since feature C evolves over time, you can rebuild the model on a regular schedule (taking a snapshot of feature C) or whenever some 'event' triggers the building process. Feature C, in my example, becomes the user-item matrix.
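As a minimal sketch of turning the list-valued column into a user-item matrix, assuming pandas and scikit-learn are available (the data and variable names are made up for illustration):

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    # Toy snapshot: A and B are user attributes, C is a variable-length
    # list of item IDs per user.
    df = pd.DataFrame({
        "A": [1, 7],
        "B": [2, 3],
        "C": [[3, 4, 5], [4, 9]],
    })

    # Binarize the item lists into a sparse user-item matrix. Item IDs
    # that appear later require refitting, which matches the idea of
    # rebuilding the model on a schedule or on a trigger event.
    mlb = MultiLabelBinarizer(sparse_output=True)
    user_item = mlb.fit_transform(df["C"])

    print(mlb.classes_)          # item IDs observed in this snapshot
    print(user_item.toarray())   # one row per user, one column per item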

Should I change my object variables to integers or create dummy variables?

I am trying to create a model to predict whether or not someone is at risk of a stroke. My data contains some "object" variables that could easily be coded to 0 and 1 (like sex). However, I have some object variables with 4+ categories (e.g. type of job).
I'm trying to encode these objects into integers so that my models can ingest them. I've come across two methods to do so:
Create dummy variables for each feature, which creates more columns and encodes them as 0 and 1
Convert the object into an integer using LabelEncoder, which assigns values to each category like 0, 1, 2, 3, and so on within the same column.
Is there a difference between these two methods? If so, what is the recommended best path forward?
Yes, these two are different. The first method creates more columns, i.e. more features for the model to fit. The second creates only one feature for the model to fit. In machine learning, both approaches have their own pros and cons.
Which path to recommend depends on the ML algorithm you use, feature importance, etc.
Go the dummy variable route.
Say you have a column that consists of 5 job types: construction worker, data scientist, retail associate, machine learning engineer, and bartender. If you use a label encoder (0-4) to keep your data narrow, your model is going to interpret the job title of "data scientist" as 1 greater than the job title of "construction worker". It would also interpret the job title of "bartender" as 4 greater than "construction worker".
The problem is that these job types have no ordinal relation to each other; they are purely categorical variables. If you dummy out the column, it does widen your data, but you get a far more faithful representation of what the data actually means.
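A short illustration of the two encodings, assuming pandas and scikit-learn (the job titles follow the example above):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    jobs = pd.Series(["construction worker", "data scientist",
                      "retail associate", "machine learning engineer",
                      "bartender"], name="job")

    # Method 1: dummy variables -- one 0/1 column per category,
    # no artificial ordering between job types.
    print(pd.get_dummies(jobs, prefix="job"))

    # Method 2: label encoding -- one integer column, which implicitly
    # imposes an order on the categories (alphabetical here).
    print(LabelEncoder().fit_transform(jobs))  # [1 2 4 3 0]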
Use dummy variables, thereby creating more columns/features for fitting your data. As your data will be scaled beforehand, the extra columns will not create problems later.
Overall, a model's accuracy depends in part on the features involved; the more informative features we have, the more accurately we can predict.

Protege Ontology - creating individuals

I'm building an ontology in Protege for the first time and have never worked with it before.
I have a manufacturing process with two robots, a machine tool, two storages (S1 and S2), a working table, a computer vision system, a conveyor, and 6 types of pieces (A, B, C, D, E, F). I have some goals set (e.g., storage S2 must hold a piece of type A at position (row, column) = (1, 4) with orientation orientation1). I thought to create a class Robot with the following properties: hasState (the robot can be free or can hold a piece), hasPosition (the robot can be in one of four predefined positions), and hasPiece.
My question is the following: when I create the individuals for the two robots, what should I set in the hasPiece property? I need to create the ontology in Protege and afterwards write a CLIPS program that will solve the problem (move the pieces from storage S1 to storage S2 into the desired positions). Will the individuals be the initial facts? I have only seen example ontologies for pizzas and countries, and those didn't have properties that get modified while a CLIPS program runs.
Will the individuals be the initial facts?
I would assume so from your description.
Individuals and properties are created the same way regardless of how they will subsequently be modified. I would assume that all you need to change from the pizza example are the names of the properties, classes, and individuals required.

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches to building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible shortcomings of this method, but for my specific use case: (1) knowing that a new sample contains (or lacks) features shared by the majority of past data suffices to make a quick decision; (2) I'm interested in the insights such a method will offer about the data.
So, here is the problem:
Consider a large dataset with M data points, where each data point may include any number of {key: value} features. I model the training dataset by collecting all the features observed in the data (the set of all unique keys) and using that as the model's feature space. I define each sample by its values for the keys it has, and None for the features it does not include.
Given this training dataset, I want to determine which features recur in the data, and, for those recurring features, whether they mostly share a single value.
My question:
A simple solution would be to count everything: for each of the N features, calculate the distribution of its values. However, as M and N are potentially large, I wonder if there is a more compact way to represent the data, or a more sophisticated method for making claims about feature frequencies.
Am I reinventing an existing wheel? If there is an online approach for accomplishing this task, even better.
If I understand your question correctly:
you need to go over all the data anyway, so why not use hashing?
Use two hash tables, in fact:
An outer hash table keyed by feature, recording feature existence.
An inner hash table per feature, holding the distribution of that feature's values.
This way, the counts accumulated in a feature's inner hash table indicate how common the feature is in your data, while the number of distinct keys in it indicates how much its values differ from one another. Another thing to notice is that you go over your data only once, and the time complexity of (almost) every hash-table operation, if you allocate enough space from the beginning, is O(1).
Hope it helps
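A minimal sketch of this two-level counting in Python (the sample data and names are illustrative):

    from collections import defaultdict, Counter

    # samples: iterable of {key: value} feature dicts (toy data)
    samples = [
        {"color": "red", "size": 3},
        {"color": "red", "shape": "round"},
        {"color": "blue", "size": 3},
    ]

    # Outer table: feature key -> inner Counter over that feature's values.
    stats = defaultdict(Counter)
    for sample in samples:             # single pass over the data
        for key, value in sample.items():
            stats[key][value] += 1     # expected O(1) per update

    for key, values in stats.items():
        occurrences = sum(values.values())  # how common the feature is
        distinct = len(values)              # how varied its values are
        print(key, occurrences, distinct, values.most_common(1))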

Finding similar users based on string properties

I'm a software engineering student, new to data mining. I want to implement a solution for finding similar users based on their interests and skills (sets of strings).
I think I cannot use k-nearest neighbors with an edit distance (Levenshtein or similar).
Could someone help with that, please?
The first thing you should do is convert your data into some reasonable representation, so that you will have a well-defined notion of distance between suitably represented users.
I would recommend converting all strings into some canonical form, then sorting all n distinct skill and interest strings into a dictionary D. Now, for each user u, construct a vector v(u) with n components, whose i-th component is set to 1 if the property in dictionary entry i is present and 0 otherwise. Essentially, we represent each user by the characteristic vector of her interests/skills.
Now you can compare users with the Jaccard index (that's just an example; you'll have to figure out what works best for you). With a notion of distance in hand, you can start trying out various approaches (a code sketch of the representation follows the list). Here are some that spring to mind:
apply hierarchical clustering if the number of users is sufficiently small;
apply association rule learning (I'll leave you to think out the details);
etc.
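A minimal sketch of the canonicalization, characteristic vectors, and Jaccard index described above (pure Python; the user data is illustrative):

    # Canonicalize strings, build a sorted dictionary of all distinct
    # properties, represent each user as a 0/1 characteristic vector,
    # and compare pairs of users with the Jaccard index.
    users = {
        "alice": {"Python", "machine learning", "chess"},
        "bob": {"python ", "Chess", "cooking"},
    }

    def canonical(s):
        return s.strip().lower()

    profiles = {u: {canonical(p) for p in props} for u, props in users.items()}
    dictionary = sorted(set().union(*profiles.values()))  # the n entries of D

    def vector(user):
        # v(user): i-th entry is 1 iff the user has dictionary entry i.
        return [1 if p in profiles[user] else 0 for p in dictionary]

    def jaccard(u, v):
        a, b = profiles[u], profiles[v]
        return len(a & b) / len(a | b)

    print(dictionary)
    print(vector("alice"))
    print(jaccard("alice", "bob"))  # 2 shared / 4 total = 0.5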

Circular Linked list concatenation complexity

Suppose you have two circular linked lists, one of size M and the other of size N, with M < N. If you don't know which list is of size M, what is the worst-case complexity of concatenating the two lists into a single list?
I was thinking O(M) but that is not correct. And no, I guess there is no specific place to concatenate at.
If there are no further restrictions, and your lists are mutable (like normal linked lists in languages such as C, C#, Java, ...), just split the two lists open at whatever nodes you have handles to and join them together (this touches at most four nodes). Since it's homework, I'll leave working out the complexity to you, but it should be easy; there's a strong hint in the preceding sentence.
If the lists are immutable, as would normally be the case in a pure functional language, you'd have to copy a number of nodes and would get a different complexity. What complexity exactly depends on the restrictions on the result (e.g., does it have to be a circular linked list?).
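A minimal sketch of the mutable splice in Python (the Node class and the sample data are illustrative assumptions):

    class Node:
        def __init__(self, value):
            self.value = value
            self.next = self  # a lone node is a circular list of size 1

    def concat(a, b):
        # Splice two circular lists, given one node of each. Only the
        # two known nodes and their successors are rewired, with no
        # traversal of either list.
        a.next, b.next = b.next, a.next
        return a

    def make(values):
        # Build a circular list from a non-empty sequence of values.
        head = Node(values[0])
        tail = head
        for v in values[1:]:
            node = Node(v)
            tail.next, node.next = node, head
            tail = node
        return head

    joined = concat(make([1, 2, 3]), make([4, 5]))
    node, out = joined, []
    for _ in range(5):
        out.append(node.value)
        node = node.next
    print(out)  # [1, 5, 4, 2, 3] -- all five nodes in one ring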
