machine learning model different inputs - machine-learning

I have a dataset consisting of: Date, ID (the id of the event), number_of_activities, and running_sum (the running sum of activities by id).
This is a part of my data:
date       | id  | number_of_activities | running_sum
2017-01-06 | 156 | 1                    | 1
2017-04-26 | 156 | 1                    | 2
2017-07-04 | 156 | 2                    | 4
2017-01-19 | 175 | 1                    | 1
2017-03-17 | 175 | 3                    | 4
2017-04-27 | 221 | 3                    | 3
2017-05-05 | 221 | 7                    | 10
2017-05-09 | 221 | 10                   | 20
2017-05-19 | 221 | 1                    | 21
2017-09-03 | 221 | 2                    | 23
My goal is to predict the future number of activities for a given event. My question: can I train my model on the whole dataset (all the events) to predict the next value, and if so, how? I ask because the number of inputs is unequal (the number of rows differs per event). And is it possible to exploit the date data as well?

Sure you can. But a lot more information is needed, which you yourself know best.
I guess we are talking about time series here, as you want to predict the future.
You might want to have a look at recurrent neural nets and LSTMs:
A recurrent layer takes a time series as input and outputs a vector that contains the compressed information about the whole series. So let's take event 156: it is your feature sequence, with 3 timesteps.
Each event has a different number of timesteps (rows of activities). To solve this, just take the maximum number of timesteps that occurs and add a padding value (most often simply zero) so all sequences have the same length. Then you have a shape that is suitable for a recurrent neural net (where LSTMs are currently a good choice).
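Below is a minimal sketch of that padding step, assuming each event's rows are turned into a sequence of (number_of_activities, running_sum) pairs; the shapes are illustrative, not prescriptive.

import numpy as np

# one variable-length sequence per event, built from the sample rows above
events = {
    156: [[1, 1], [1, 2], [2, 4]],
    175: [[1, 1], [3, 4]],
    221: [[3, 3], [7, 10], [10, 20], [1, 21], [2, 23]],
}

max_len = max(len(seq) for seq in events.values())
X = np.zeros((len(events), max_len, 2))   # shape: (events, timesteps, features)
for i, seq in enumerate(events.values()):
    X[i, :len(seq), :] = seq              # remaining timesteps stay zero-padded

print(X.shape)                            # (3, 5, 2), ready for a recurrent layer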
Update
You said in the comments that using padding is not an option for you; let me try to convince you. LSTMs are good in situations where the sequence lengths differ. However, for this to work you also need longer sequences from which the model can learn its patterns. What I want to say is: when some of your sequences have only a few timesteps, like 3, but others have 50 or more, the model might have difficulties predicting these correctly, as you have to specify which timestep you want to use. So either you prepare your data differently for a clearer question, or you dig deeper into the topic using sequence-to-sequence learning, which is very good at handling sequences of different lengths. For this you will need to set up an encoder-decoder network.
The encoder squashes the whole sequence into one vector, whatever its length. This vector is compressed in such a way that it contains the information of the entire sequence.
The decoder then learns to use this vector to predict the next outputs of the sequence. This is a well-known technique for machine translation, but it is suitable for any kind of sequence-to-sequence task. So I would recommend you create such an encoder-decoder network, which will for sure improve your results. Have a look at this tutorial, which might help you further.
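For concreteness, here is a minimal encoder-decoder sketch in Keras, under the assumption of zero-padded inputs of shape (samples, timesteps, features) and a fixed prediction horizon; the layer sizes and horizon are placeholders, not tuned values.

from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

max_timesteps, n_features, horizon = 5, 2, 1

inputs = Input(shape=(max_timesteps, n_features))
state = LSTM(32)(inputs)                         # encoder: squash the sequence into one vector
repeated = RepeatVector(horizon)(state)          # feed that vector to the decoder at each output step
decoded = LSTM(32, return_sequences=True)(repeated)
outputs = TimeDistributed(Dense(1))(decoded)     # one predicted activity count per future step

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
# model.fit(X_padded, y_future, epochs=..., batch_size=...)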

Related

Can logistic regression be used for variables containing lists?

I'm pretty new to Machine Learning and I was wondering if certain algorithms/models (i.e. logistic regression) can handle lists as a value for their variables. Until now I've always used pretty standard datasets, where you have a couple of variables, associated values, and then a classification for that set of values (see example 1). However, I now have a similar dataset but with lists for some of the variables (see example 2). Is this something logistic regression models can handle, or would I have to do some kind of feature extraction to transform this dataset into just a normal dataset like example 1?
Example 1 (normal):
+---+------+------+------+-----------------+
| | var1 | var2 | var3 | classification |
+---+------+------+------+-----------------+
| 1 | 5 | 2 | 526 | 0 |
| 2 | 6 | 1 | 686 | 0 |
| 3 | 1 | 9 | 121 | 1 |
| 4 | 3 | 11 | 99 | 0 |
+---+------+------+------+-----------------+
Example 2 (lists):
+-----+-------+--------+---------------------+-----------------+--------+
| | width | height | hlines | vlines | class |
+-----+-------+--------+---------------------+-----------------+--------+
| 1 | 115 | 280 | [125, 263, 699] | [125, 263, 699] | 1 |
| 2 | 563 | 390 | [11, 211] | [156, 253, 399] | 0 |
| 3 | 523 | 489 | [125, 255, 698] | [356] | 1 |
| 4 | 289 | 365 | [127, 698, 11, 136] | [458, 698] | 0 |
| ... | ... | ... | ... | ... | ... |
+-----+-------+--------+---------------------+-----------------+--------+
To provide some additional context on my specific problem: I'm attempting to represent drawings. Drawings have a width and height (regular variables), but drawings also have, for example, a set of horizontal and vertical lines (represented as a list of their coordinates on their respective axis). This is what you see in example 2. The actual dataset I'm using is even bigger, also containing variables which hold lists of the thicknesses for each line, lists of the extension for each line, lists of the colors of the spaces between the lines, etc. In the end I would like my logistic regression to pick up on what results in nice drawings. For example, if there are too many lines too close together, the drawing is not nice. The model should pick up on these 'characteristics' of what makes a nice or a bad drawing by itself.
I didn't include these, as the way this data is set up is a bit confusing to explain, and if I can solve my question for the above dataset I feel like I can apply the principle of that solution to the remaining dataset as well. However, if you need additional (full) details, feel free to ask!
Thanks in advance!
No, it cannot directly handle that kind of input structure. The input must be a homogeneous 2D array. What you can do is come up with new features that capture some of the relevant information contained in the lists. For instance, for the lists that contain the coordinates of the lines along an axis, such features (other than the actual values themselves) could be the spacing between lines, the total number of lines, or statistics such as the mean location.
So the way to deal with this is through feature engineering. This is, in fact, something that has to be dealt with in most cases. In many ML problems you may not only have variables which describe a unique aspect or feature of each data sample, but also many that are aggregates of other features or sample groups, which might be the only way to go if you want to incorporate certain data sources.
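As a rough sketch of that idea (the helper name list_features and the exact summary statistics are just illustrative choices), the hlines/vlines columns from example 2 could be replaced by fixed-size summaries like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "width":  [115, 563],
    "height": [280, 390],
    "hlines": [[125, 263, 699], [11, 211]],
    "vlines": [[125, 263, 699], [156, 253, 399]],
    "class":  [1, 0],
})

def list_features(coords):
    # turn a variable-length coordinate list into a fixed set of summary features
    coords = np.sort(np.asarray(coords, dtype=float))
    spacing = np.diff(coords).mean() if len(coords) > 1 else 0.0
    return pd.Series({"count": len(coords), "mean": coords.mean(), "spacing": spacing})

for col in ["hlines", "vlines"]:
    feats = df[col].apply(list_features).add_prefix(col + "_")
    df = pd.concat([df.drop(columns=col), feats], axis=1)

print(df)   # homogeneous 2D data, usable by logistic regression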
Wow, great question. I had never considered this, but when I saw other people's responses I would have to concur, 100%. Convert the lists into a data frame and run your code on that object.
import pandas as pd

# the first row holds the column names; pop it off and build the DataFrame from the rest
data = [["col1", "col2", "col3"], [0, 1, 2], [3, 4, 5]]
column_names = data.pop(0)
df = pd.DataFrame(data, columns=column_names)
print(df)
Result:
   col1  col2  col3
0     0     1     2
1     3     4     5
You can easily run any multiple regression on the fields/features of the data frame and you'll get what you need. See the link below for some ideas on how to get started.
https://pythonfordatascience.org/logistic-regression-python/
Post back if you have additional questions related to this. Or, start a new post if you have similar, but unrelated, questions.

How to preprocess with a fixed-length list?

I want to train my regression model using sklearn on the following data, and use it to predict the revenue given the other parameters:
But I ran into a problem when I tried to fit my model.
from sklearn import linear_model
import numpy as np

model = linear_model.LinearRegression()
train_x = np.array([
    [['Tom', 'Adam'], '005', 50],
    [['Tom'], '001', 100],
    [['Tom', 'Adam', 'Alex'], '001', 150]
])
train_y = np.array([50, 80, 90])
model.fit(train_x, train_y)
>>> ValueError: setting an array element with a sequence.
I have done some searching; the problem is that train_x does not have the same number of elements in all of its arrays (staff_id).
I think maybe I should add some additional elements to some of the arrays to make the lengths consistent, but I have no idea how to do this step exactly. Is this called "vectorizing"?
Machine learning models can't take such lists as inputs. The model would treat your lists as lists of lists of chars (because each list contains strings, and each string is a sequence of chars) and would probably not learn anything.
Usually, arrays are used as inputs for models that deal with time-series data; for example in NLP, each record is a timestep containing a list of words to be processed.
Instead of padding the arrays to the same size (as you suggested), you should "explode" your lists into separate columns.
Create 3 more columns - one for each staff name: Tom, Adam, and Alex. The value for their cells would be 1 if the name appears in the list or 0 otherwise.
So your table should look like this:
-------------------------------------------------------------------
staff_Tom | staff_Adam | staff_Alex | Manager_id | Budget | Revenue
-------------------------------------------------------------------
    1     |     1      |     0      |     5      |   50   |   50
    1     |     0      |     0      |     1      |  100   |   80
    1     |     1      |     1      |     1      |  150   |   90
   ...
    1     |     0      |     1      |     1      |   75   |    ?
Your model will then easily identify each staff member and will converge to a solution much more quickly.
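A small sketch of this encoding step, using scikit-learn's MultiLabelBinarizer (the column names simply mirror the table above, and the data is the question's toy example):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "staff":      [["Tom", "Adam"], ["Tom"], ["Tom", "Adam", "Alex"]],
    "manager_id": [5, 1, 1],
    "budget":     [50, 100, 150],
    "revenue":    [50, 80, 90],
})

# one indicator column per staff name: 1 if the name appears in the list, else 0
mlb = MultiLabelBinarizer()
staff_cols = pd.DataFrame(mlb.fit_transform(df["staff"]),
                          columns=["staff_" + n for n in mlb.classes_])

X = pd.concat([staff_cols, df[["manager_id", "budget"]]], axis=1)
y = df["revenue"]

model = LinearRegression().fit(X, y)
print(model.predict(X))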

prepare clickstream for k-means clustering

I'm new to machine learning algorithms and I'm trying to do user segmentation based on the users' clickstreams on a news website. I have prepared the clickstreams so that I know which user ID read which news category and how many times.
So my table looks something like this:
-------------------------------------------------------
| UserID | Category 1 | Category 2 | ... | Category 20
-------------------------------------------------------
| 123 | 4 | 0 | ... | 2
-------------------------------------------------------
| 124 | 0 | 10 | ... | 12
-------------------------------------------------------
I'm wondering whether k-means works well for so many categories. Would it be better to use percentages instead of whole numbers for the read articles?
So e.g. user 123 read 6 articles overall; 4 of 6 were category 1, so that is a 66.6% interest in category 1.
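A tiny sketch of that conversion, assuming the counts table above lives in a pandas DataFrame (the category names and values are just the sample rows):

import pandas as pd

counts = pd.DataFrame(
    {"Category 1": [4, 0], "Category 2": [0, 10], "Category 20": [2, 12]},
    index=[123, 124],
)
# divide each row by the user's total reads so rows sum to 1 (i.e. percentages)
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions.round(3))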
Another idea would be to pick the 3 most-read categories of each user and transform the table into something like the following, whereby Interest 1: 12 means that the user is most interested in Category 12.
-------------------------------------------------------
| UserID | Interest 1 | Interest 2 | Interest 3
-------------------------------------------------------
| 123 | 1 | 12 | 7
-------------------------------------------------------
| 124 | 12 | 13 | 20
-------------------------------------------------------
K-means will not work well, for two main reasons:
- It is made for continuous, dense data. Your data is discrete.
- It is not robust to outliers, and you probably have a lot of noisy data.
Well, the number of users is not defined because it's a theoretical approach, but since it's a news website let's assume there are millions of users...
Would there be another, better algorithm for clustering user groups based on their category interests? And if I prepare the data of the first table so that I have each user's interest per category as a percentage, wouldn't the data be continuous rather than discrete, or am I wrong?

SVM Machine Learning: Feature representation in LibSVM

I'm working with LibSVM to classify written text (gender classification).
I'm having problems understanding how to create LibSVM training data with multiple features.
Training data in LibSVM is built like this:
label index1:value1 index2:value2
Let's say I want these features:
Top_k words: the k most used words per label
Top_k bigrams: the k most used bigrams
So for example the counts would look like this:
Word count:
index | text  | +1 | -1
  1   | this  |  3 |  3
  2   | forum |  1 |  0
  3   | is    | 10 | 12
  ... | ...   | .. | ..

Bigram count:
index | text | +1 | -1
  4   | bi   |  6 |  2
  5   | gr   | 10 |  3
  6   | am   |  8 | 10
  ... | ...  | .. | ..
Let's say k = 2. Is this how a training instance would look? (The counts are not related to the table above.)
Label Top_kWords1:33 Top_kWords2:27 Top_kBigrams1:30 Top_kBigrams2:25
Or does it look like this (does it matter if the features are mixed up)?
Label Top_kWords1:33 Top_kBigrams1:30 Top_kWords2:27 Top_kBigrams2:25
I just want to know what the feature vector looks like with multiple, different features and how to build it.
EDIT:
With the updated table above, is this training data correct?
Example
1 1:3 2:1 3:10 4:6 5:10 6:8
-1 1:3 2:0 3:12 4:2 5:3 6:10
libSVM representation is purely numeric, so
label index1:value1 index2:value2
means that each "label", "index", and "value" has to be a number. In your case you have to enumerate your features, for example:
1 1:23 2:47 3:0 4:1
If one of the features has value 0, you can omit it:
1 1:23 2:47 4:1
Remember to keep the feature indices in increasing order.
In general, libSVM is not designed to work with texts, and I would not recommend that you do so; rather, use an existing library that makes working with text easy and wraps around libsvm (such as NLTK or scikit-learn).
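As a hedged illustration of that recommendation (not LibSVM's own input format), scikit-learn can build the word/bigram count features and train a linear SVM in a few lines; the example texts and labels are made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts  = ["this forum is great", "this post is about football and beer"]
labels = [1, -1]   # e.g. the +1 / -1 gender labels from the question

# ngram_range=(1, 2) produces both word (unigram) and bigram count features
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["beer and football"]))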
Whatever k most-used words/bigrams you use for training may not be the most popular in your test set. If you want to use the most popular words in the English language you will end up with the, and, and so on. Maybe beer and football are more suitable for classifying males even if they are less popular. This processing step is called feature selection and has got nothing to do with SVM. When you have found selective features (beer, botox, ...) you enumerate them and feed them into SVM training.
For bigrams you could maybe omit feature selection, as there are at most 26*26 = 676 bigrams, making 676 features. But again, I assume bigrams like be are not selective, as the selective match in beer is completely buried in lots of matches in to be. But that is speculation; you have to learn the quality of your features.
Also, if you use word/bigram counts you should normalize them, i.e. divide by the overall word/bigram count of your document. Otherwise shorter documents in your training set will have less weight than longer ones.

Backpropagation: when to update weights?

Could you please help me with a neural network?
If I have an arbitrary dataset:
+---+---------+---------+--------------+--------------+--------------+--------------+
| i | Input 1 | Input 2 | Exp.Output 1 | Exp.Output 2 | Act.output 1 | Act.output 2 |
+---+---------+---------+--------------+--------------+--------------+--------------+
| 1 | 0.1 | 0.2 | 1 | 2 | 2 | 4 |
| 2 | 0.3 | 0.8 | 3 | 5 | 8 | 10 |
+---+---------+---------+--------------+--------------+--------------+--------------+
Let's say I have x hidden layers with different numbers of neurons and different types of activation functions each.
When running backpropagation (especially iRprop+), when do I update the weights? Do I update them after calculating each line from the dataset?
I've read that batch learning is often not as efficient as "on-line" training. That means that it is better to update the weights after each line, right?
And do I understand it correctly: an epoch is when you have looped through each line in the input dataset? If so, that would mean that in one epoch, the weights will be updated twice?
Then, where does the total network error (see below) come into play?
[image: total network error formula]
tl;dr:
Please help me understand how backprop works.
Typically, you would update the weights after each example in the data set (I assume that's what you mean by each line). So, for each example, you would see what the neural network thinks the output should be (storing the outputs of each neuron in the process) and then back propagate the error. So, starting with the final output, compare the ANN's output with the actual output (what the data set says it should be) and update the weights according to a learning rate.
The learning rate should be a small constant, since you are correcting weights for each and every example. And an epoch is one iteration through every example in the data set.
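To make the per-example ("online") update and the epoch terminology concrete, here is a minimal numpy sketch of stochastic gradient descent on a single linear layer; it only illustrates when the weights change, not the iRprop+ rule or the multi-layer network from the question:

import numpy as np

X = np.array([[0.1, 0.2], [0.3, 0.8]])   # the two input rows from the table
Y = np.array([[1.0, 2.0], [3.0, 5.0]])   # the expected outputs
W = np.zeros((2, 2))                     # weights of one linear layer
lr = 0.1                                 # small constant learning rate

for epoch in range(3):                   # one epoch = one pass over every row
    for x, y in zip(X, Y):
        pred = x @ W                     # forward pass for this single example
        error = pred - y
        W -= lr * np.outer(x, error)     # weights are updated after each example
    total_error = 0.5 * ((X @ W - Y) ** 2).sum()
    print(f"epoch {epoch}: total network error = {total_error:.4f}")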

Resources