My dataset contains features that, if present, can have other features associated with them. For example:
Feature A: 0/1
Feature B: doesn't exist if A = 0, else: 1/-1
Feature C: doesn't exist if A = 0, else: 1/-1
Those features are not missing; they simply don't make sense if Feature A is set to 0, so I can't really use data imputation. What is the best way to integrate these features into my dataset? The information is valuable and, if possible, I would like not to discard it.
If you are working with a linear model (like a linear SVM), then simply put "0" for these features. While values of -1 and +1 cause the model to apply the corresponding weight, using "0" means that the weight is effectively ignored. It becomes much more complex once you consider kernel spaces, and I do not think there is an easy solution to the problem in that case.
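For instance, a minimal sketch of that encoding (the encode helper and the sample values are purely illustrative):
import numpy as np

def encode(a, b=0, c=0):
    """a in {0, 1}; b and c in {-1, +1}, only meaningful when a == 1."""
    if a == 0:
        return np.array([0, 0, 0])   # B and C zeroed out: their weights contribute nothing
    return np.array([1, b, c])

X = np.array([encode(1, b=-1, c=1), encode(0), encode(1, b=1, c=1)])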
I am doing this as part of my university assignment, but I can't find any resources online on how to implement it correctly.
I have read tons of material on the metrics that define an optimal split (like entropy, Gini and others), so I understand how to choose an optimal feature value to split the learning set into left and right nodes.
However, what I totally don't get is the complexity of the implementation: since we also have to choose the optimal feature, computing the optimal split at each node takes O(n^2), which is bad considering that real ML datasets have shapes around 10^2 x 10^6; this is really big in terms of computational cost.
Am I missing some kind of approach that could be used here to help reduce the complexity?
I currently have this baseline implementation for choosing the best feature and value to split on, but I really want to make it better:
threshold, feature_idx, tr_G = None, None, float("inf")
for f_idx in range(X_subset.shape[1]):
    sorted_values = X_subset.iloc[:, f_idx].sort_values()
    # only consider thresholds that leave at least min_samples_split items on each side
    for v in sorted_values.iloc[self.min_samples_split - 1 : -self.min_samples_split + 1]:
        y_left, y_right = self.make_split_only_y(f_idx, v, X_subset, y_subset)
        G = calc_g(y_subset, y_left, y_right)  # split criterion; lower is better
        if G < tr_G:  # keep the best (lowest-criterion) split seen so far
            threshold = v
            feature_idx = f_idx
            tr_G = G
return feature_idx, threshold
So, since no one answered, here is some stuff I found out.
Firstly, yes, this task is very computationally intensive. However, several tricks may be used to reduce the number of splits you need to evaluate to "grow a tree".
This is especially important, since you don't really want a giant overfitted tree - it just doesn't have any value; what is more important is to get a weak model, which can be combined with others in some sort of ensembling technique.
As for the regularization tricks, here are a couple I used myself (a scikit-learn illustration follows this list):
limit the maximum depth of the tree
limit the minimal number of items in a node
limit the maximum number of leaves in the tree
limit the minimum quality change in the split criterion required to perform a split
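For reference, here is a minimal sketch (my own, not part of the original assignment) of how these four tricks map onto scikit-learn's DecisionTreeClassifier hyperparameters:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=8,                 # limit the maximum depth of the tree
    min_samples_split=20,        # limit the minimal number of items needed to split a node
    max_leaf_nodes=64,           # limit the maximum number of leaves in the tree
    min_impurity_decrease=1e-4,  # require a minimum quality change to perform a split
)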
For the algorithmic part, there is a way to build a tree in a smart way. If you do it as in the code I posted earlier, the time complexity will be around O(h * N^2 * D), where h is the height of the tree. To work around this, there are several approaches, which I didn't code personally, but have read about:
Use dynamic programming to accumulate statistics per feature, so you don't have to recalculate them for every split
Use data binning and bucket sort for O(n) sorting (a sketch follows below)
Source of info: https://ml-handbook.ru/chapters/decision_tree/intro
(use Google Translate, since the website is in Russian)
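Here is a minimal sketch (my own, not from the handbook above) of the binning idea for a single feature with binary labels: class counts are accumulated per bin, and prefix sums make each candidate threshold O(1) to score instead of re-partitioning the data:
import numpy as np

def best_split_binned(x, y, n_bins=32):
    """x: 1-D feature values, y: binary labels in {0, 1}. Returns (threshold, weighted Gini)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])  # n_bins - 1 bin boundaries
    bin_idx = np.searchsorted(edges, x)                          # bin index of every sample
    counts = np.zeros((n_bins, 2))
    np.add.at(counts, (bin_idx, y), 1)                           # class counts per bin
    left = np.cumsum(counts, axis=0)                             # class counts left of each boundary
    right = left[-1] - left                                      # and to the right of it
    best_threshold, best_g = None, np.inf
    for b in range(n_bins - 1):                                  # candidate split after bin b
        nl, nr = left[b].sum(), right[b].sum()
        if nl == 0 or nr == 0:
            continue
        gini_l = 1 - ((left[b] / nl) ** 2).sum()
        gini_r = 1 - ((right[b] / nr) ** 2).sum()
        g = (nl * gini_l + nr * gini_r) / (nl + nr)              # weighted Gini impurity
        if g < best_g:
            best_threshold, best_g = edges[b], g
    return best_threshold, best_g

x = np.random.randn(10_000)
y = (x + 0.3 * np.random.randn(10_000) > 0).astype(int)
print(best_split_binned(x, y))   # threshold should land near 0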
Data: I have N rows of data of the form (x, y, z) where logically f(x, y) = z, that is, z depends on x and y; in my case this is (setting1, setting2, signal). Different x's and y's can lead to the same z, but those z's wouldn't mean the same thing.
There are 30 unique setting1 values, 30 unique setting2 values and 1 signal for each (setting1, setting2) pairing, hence 900 signal values.
Data set: These [900, 3] data points are considered one data set. I have many samples of these data sets.
I want to do classification based on these data sets, but I need to flatten the data (turn each data set into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times each, i.e. I will have a row with 3 x 900 columns.
Question:
Is it correct to keep all the duplicate setting1/setting2 values in the data set? Or should I remove them and only include the unique values a single time, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on a sample where each observation is [900, 3].
You are flattening it and getting an input layer of 3 * 900.
Some of those values are the result of a function of the others.
It matters which function that is; if it is a linear function, the NN might not work:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk the NN being biased towards those variables.
E.g. if you are running LMS on [x1, x2, x3, average(x1, x2)] to predict y, you are basically assigning a higher weight to the x1 and x2 variables.
Unless you have a reason to believe that those weights should be higher, don't include the function of them.
I was not able to find a link to support this, but my intuition is that you might want to shrink your input layer in addition to omitting the dependent values:
From Professor A. Ng's ML course I remember that the input should be the minimal set of values that is 'reasonable' for making the prediction.
"Reasonable" is vague, but I understand it like this: if you try to predict the price of a house, include footage, area quality and distance from a major hub; do not include average sunspot activity during the open-house day, even though you have that data.
I would remove the duplicates, and I would also look for any other data that can be omitted; maybe run PCA over the full set of N x [900, 3].
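As an illustration, here is a minimal sketch (my own; it assumes each sample is stored with columns (setting1, setting2, signal) in a consistent row order) of building the 30 + 30 + 900 row rather than the 3 * 900 one:
import numpy as np

def flatten_sample(sample):
    """sample: array of shape (900, 3), rows in a fixed (setting1, setting2) order."""
    setting1 = np.unique(sample[:, 0])                    # the 30 unique setting1 values
    setting2 = np.unique(sample[:, 1])                    # the 30 unique setting2 values
    signal = sample[:, 2]                                 # all 900 signal values, in the fixed order
    return np.concatenate([setting1, setting2, signal])   # 960 features instead of 2700

# Stacking flatten_sample(s) for every data set gives X of shape (n_samples, 960);
# PCA (e.g. sklearn.decomposition.PCA) can then be run on X to drop further redundancy.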
I am attempting to use reinforcement learning to choose the closest point to the origin out of a given set of points repeatedly, until a complex (and irrelevant) end condition is reached. (This is a simplification of my main problem.)
A 2D array containing possible points is passed to the reinforcement learning algorithm, which makes a choice as to which point it thinks is the most ideal.
A [1, 10]
B [100, 0]
C [30, 30]
D [5, 7]
E [20, 50]
In this case, D would be the true best choice. (The algorithm should ideally output 3, from the range 0 to 4.)
However, whenever I train the algorithm, it seems not to learn the underlying concept, but instead just that choosing, say, C is usually the best choice, so it always chooses that.
import numpy as np
import rl.core as krl

class FindOriginEnv(krl.Env):

    def observe(self):
        return np.array([
            [np.random.randint(100), np.random.randint(100)] for _ in range(5)
        ])

    def step(self, action):
        observation = self.observe()
        done = np.random.rand() < 0.01  # eventually
        reward = 1 if done else 0
        return observation, reward, done, {}

    # ...
What should I modify about my algorithm such that it will actually learn about the goal it is trying to accomplish?
Observation shape?
Reward function?
Action choices?
Keras code would be appreciated, but is not required; a purely algorithmic explanation would also be extremely helpful.
Sketching out the MDP from your description, there are a few issues:
Your observation function appears to be returning 5 points, which means a state can be any configuration of 10 integers in [0, 99]. That's 100^10 possible states! Your state space needs to be much smaller. As written, observe appears to be generating possible actions, not state observations.
You suggest that you're picking actions from [0, 4], where each action is essentially an index into an array of points available to the agent. This definition of the action space doesn't give the agent enough information to discriminate the way you'd like it to (a smaller-magnitude point is better), because you only act based on the point's index! If you wanted to tweak the formulation a bit to make this work, you would define an action as selecting a 2D point with each dimension in [0, 99]. This would mean you would have 100^2 total possible actions, but to maintain the multiple-choice aspect, you would restrict the agent to selecting among a subset (5 possible actions) at a given step based on its current state.
Finally, the reward function that gives zero reward until termination means that you're allowing a large number of possible optimal policies. Essentially, any policy that terminates, regardless of how long the episode took, is optimal! If you want to encourage policies that terminate quickly, you should penalize the agent with a small negative reward at each step.
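A hypothetical reworking of step() along those lines (self.current_points is an assumed attribute holding the 5 candidate points offered at the current step; this is a sketch, not a drop-in fix):
import numpy as np

def step(self, action):
    chosen = self.current_points[action]                      # the 2D point picked by index
    dists = np.linalg.norm(self.current_points, axis=1)
    picked_best = np.isclose(np.linalg.norm(chosen), dists.min())
    reward = (1.0 if picked_best else 0.0) - 0.01             # bonus for the best pick, small step cost
    self.current_points = self.observe()                      # next set of candidate points
    done = np.random.rand() < 0.01                            # placeholder termination as in the question
    return self.current_points, reward, done, {}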
I have a stream of data (e.g. 3D positions) generated by a system; it looks like:
(pos1, time1) (pos2, time2) (pos3, time3) ...
I want to use a machine learning technique to estimate the likelihood of (or detect) a particular event from the given stream of data.
What I have done:
I've tagged my data at every frame with YES if the event occurred at that frame; otherwise it is set to NO.
(pos1, time1, NO) (pos2, time2, YES) (pos3, time3, NO) ... (posK, timeK, YES) ...
I set a window length L and train the model on L consecutive frames, with the corresponding tag set to the tag of the last element in that window:
(pos1, Pos2, pos3, NO)
(pos2, Pos3, pos4, NO)
(pos3, Pos4, pos5, NO)
...
(posK-2, PosK-1, posK, YES)
...
Finally, I trained my model on this set of data.
For testing, I concatenate L consecutive frames and ask the model to find the corresponding tag for this set of data (i.e. YES or NO).
I realize that occurrences of "NO" are a lot more frequent than "YES", simply because the system is mostly in an idle state and there is no event. So this imbalance affects the training.
Could you give me some hints:
1) What type of machine learning model is the best fit for this problem?
2) At the moment I am classifying the output as either "YES" or "NO", but I would like to have the probability of the event occurring at any time. What kind of model do you suggest?
Thanks
I think there are actually two questions here: how to build the dataset, and which predictor to use.
For building the dataset, at some time point i, make sure to choose the ℓ instances happening before i (the phrasing in your question made it seem that you're choosing the one including i). The label of the outcome should be the one at i, though. After all, you're attempting to predict the future based on the present, no? Predicting the present based on the present is rather easy.
Another point is how to choose ℓ, or even whether to choose a single ℓ. Note that if you choose a number of different values of ℓ, then you get a multivariate model.
Finally, the question you directly asked is which predictor to use. This is too wide to answer without knowing your dataset (and playing with it). You might want to read about the bias-variance tradeoff to see why there is no "best" predictor for some problem.
Having said that, I'd suggest that you start with logistic regression, which is a simple and robust classifier that also outputs probabilities (as you asked).
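A minimal sketch of both suggestions (with dummy data and assumed array shapes): the ℓ frames strictly before time i form the features, the label at i is the target, and logistic regression with class weighting copes with the rare YES class while returning probabilities:
import numpy as np
from sklearn.linear_model import LogisticRegression

T = 1000
positions = np.random.randn(T, 3)                      # placeholder stream of 3D positions
labels = (np.random.rand(T) < 0.05).astype(int)        # rare YES events (1) among many NOs (0)

def make_windows(positions, labels, ell=5):
    X, y = [], []
    for i in range(ell, len(positions)):
        X.append(positions[i - ell:i].ravel())         # the ell frames before i
        y.append(labels[i])                            # the label at i itself
    return np.array(X), np.array(y)

X, y = make_windows(positions, labels, ell=5)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
event_probability = clf.predict_proba(X)[:, 1]         # probability of the event at each step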
Any binary one-hot encoding is aware only of values seen in training, so features not encountered during fitting will be silently ignored. For real-time use, where you have millions of records per second and features have very high cardinality, you need to keep your hasher/mapper updated with the data.
How can we do an incremental update to the hasher (rather than recomputing the entire fit() every time we encounter a new feature-value pair)? What is the suggested approach to tackle this?
It depends on the learning algorithm that you are using. If you are using a method designed for sparse data sets (FTRL, FFM, linear SVM), one possible approach is the following (note that it will introduce collisions between features and a lot of constant columns).
First, allocate for each element of your sample an (as large as possible) vector V, of length D.
For each categorical variable, evaluate hash(var_name + "_" + var_value) % D. This gives you an integer i, and you can store V[i] = 1.
Therefore, V never grows as new features appear. However, as soon as the number of features is large enough, some features will collide (i.e. be written to the same place), and this may result in an increased error rate.
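A minimal sketch of that scheme (the helper name is my own):
import hashlib
import numpy as np

def hash_features(sample, D=2**20):
    """sample: dict of categorical variables, e.g. {"country": "FR", "device": "mobile"}."""
    V = np.zeros(D, dtype=np.float32)
    for var_name, var_value in sample.items():
        key = f"{var_name}_{var_value}".encode()
        i = int(hashlib.md5(key).hexdigest(), 16) % D   # stable hash into [0, D)
        V[i] = 1.0                                      # collisions possible once features are numerous
    return V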
Edit. You can write your own vectorizer to avoid collisions. First, call L the current number of features. Prepare the same vector V, of length 2L (the factor 2 will allow you to avoid collisions as new features arrive - at least for some time, depending on the arrival rate of new features).
Starting with an empty dictionary<input_type, int>, associate an integer with each feature. If you have already seen the feature, return the int corresponding to it. If not, create a new entry with an integer corresponding to the new index. I think (but I am not sure) this is what LabelEncoder does for you.
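A hypothetical sketch of such a vectorizer, which assigns each new (var_name, var_value) pair the next free column index instead of hashing it:
class IncrementalEncoder:
    def __init__(self):
        self.index = {}                          # (var_name, var_value) -> column index

    def encode(self, var_name, var_value):
        key = (var_name, var_value)
        if key not in self.index:                # first time we see this feature-value pair
            self.index[key] = len(self.index)    # assign the next free index
        return self.index[key]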