I recently started a machine learning tutorial, and the very first lesson was supervised learning (spam vs. ham). I started by implementing it.
My implementation:
---------total spam count-------------
hi   free   offers   for   you   and   the   !   ....
5    3      9        4     4     6     8     6

---------total ham count-------------
hi   free   offers   for   you   and   the   !   ....
3    5      3        7     3     4     6     2

mail_1 : hi! how are you here are some free offers for you !!!

hi   how   are   you   here   are   some   free   offers   for   you   !!!
1    1     2     1     1      2     1      1      1        1     1     4
s[T] = c_spam(T) / ( c_spam(T) + c_ham(T) )

s[T]      = how spammy the word T is
c_spam(T) = how many spam messages contain the word T
c_ham(T)  = how many non-spam messages contain the word T
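For concreteness, here is the same score as a small Python sketch (the dictionaries just hard-code the counts from the tables above):

    # Per-word document counts, taken from the tables above.
    c_spam = {"hi": 5, "free": 3, "offers": 9, "for": 4,
              "you": 4, "and": 6, "the": 8, "!": 6}
    c_ham = {"hi": 3, "free": 5, "offers": 3, "for": 7,
             "you": 3, "and": 4, "the": 6, "!": 2}

    def spamminess(token):
        # s[T] = c_spam(T) / (c_spam(T) + c_ham(T))
        spam, ham = c_spam.get(token, 0), c_ham.get(token, 0)
        # Fall back to a neutral 0.5 for words never seen in either class.
        return spam / (spam + ham) if spam + ham else 0.5

    print(spamminess("offers"))  # 9 / (9 + 3) = 0.75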
Now I have two questions:
1) Is this implementation correct?
2) After this classifier flags a new mail as spam, would I need to update the old spam model with that mail?
I have a dataset that came from NLP processing of technical documents.
My dataset has 60,000 records.
There are 30,000 features in the dataset,
and each value is the number of times that word/feature appeared.
Here is a sample of the dataset:
RowID             Microsoft  Internet  PCI  Laptop  Google  AWS  iPhone  Chrome
1                         8         2    0       0       5    1       0       0
2                         0         1    0       1       1    4       1       0
3                         0         0    0       7       1    0       5       0
4                         1         0    0       1       6    7       5       0
5                         5         1    0       0       5    0       3       1
6                         1         5    0       8       0    1       0       0
-------------------------------------------------------------------------------
Total Appearance      9,470       821    5     107   4,605  719      25       8
There are some words that appear fewer than 10 times in the whole dataset.
The technique is to select only the words/features that appear in the dataset more than a certain number of times (say 100).
What is this technique called? The one that keeps only features whose total count exceeds a certain threshold.
This technique for feature selection is rather trivial, so I don't believe it has a particular name beyond something intuitive like "low-frequency feature filtering", "k-occurrence feature filtering", or "top-k-occurrence feature selection" in the machine learning sense, and "term-frequency filtering" or "rare word removal" in the Natural Language Processing (NLP) sense.
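As a minimal sketch (assuming the data is in a pandas DataFrame shaped like the sample above), the filtering itself is one line:

    import pandas as pd

    # A few rows and columns from the sample above; the real data has
    # 60,000 rows and 30,000 columns.
    df = pd.DataFrame({
        "Microsoft": [8, 0, 0, 1, 5, 1],
        "Internet":  [2, 1, 0, 0, 1, 5],
        "PCI":       [0, 0, 0, 0, 0, 0],
        "Laptop":    [0, 1, 7, 1, 0, 8],
    })

    threshold = 10  # 100 in your case
    # Keep only the columns whose total count exceeds the threshold.
    filtered = df.loc[:, df.sum(axis=0) > threshold]  # keeps Microsoft, Laptop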
If you'd like to use more sophisticated means of feature selection, I'd recommend looking into the various supervised and unsupervised methods available. Cai et al. [1] provide a comprehensive survey; if you can't access the article, this page by JavaTPoint covers some of the supervised methods. A quick web search for supervised/unsupervised feature selection also yields many good blog posts, most of which make use of the SciPy and sklearn Python libraries.
References
[1] Cai, J., Luo, J., Wang, S. and Yang, S., 2018. Feature selection in machine learning: A new perspective. Neurocomputing, 300, pp.70-79.
I have a dataset where each ID has visited a website and recorded their risk level, which is coded 0-3. They have then returned to the website at a future date and recorded their risk level again. I want to calculate the difference between each of an ID's risk levels and their first recorded risk level.
For example my dataset looks like this:
ID Timestamp RiskLevel
1 20-Jan-21 2
1 04-Apr-21 2
2 05-Feb-21 1
2 12-Mar-21 2
2 07-May-21 3
3 09-Feb-21 2
3 14-Mar-21 1
3 18-Jun-21 0
And I would like it to look like this:
ID Timestamp RiskLevel DifFromFirstRiskLevel
1 20-Jan-21 2 .
1 04-Apr-21 2 0
2 05-Feb-21 1 .
2 12-Mar-21 2 1
2 07-May-21 3 2
3 09-Feb-21 2 .
3 14-Mar-21 1 -1
3 18-Jun-21 0 -2
What should I do?
One way to approach this is with the strategy in my answer here, but below I will use a different approach:
sort cases by ID timestamp.
* Start each case's firstRisk at its own risk level.
compute firstRisk=risklevel.
* For later cases of the same ID, carry the first case's value forward.
if $casenum>1 and ID=lag(ID) firstRisk=lag(firstRisk).
execute.
compute DifFromFirstRiskLevel=risklevel-firstRisk.
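If you happen to be working outside SPSS, a pandas sketch of the equivalent computation (hypothetical column names matching your example, and blanking each ID's first record to match the "." in your desired output) could look like:

    import pandas as pd

    df = pd.DataFrame({
        "ID":        [1, 1, 2, 2, 2, 3, 3, 3],
        "Timestamp": pd.to_datetime(["2021-01-20", "2021-04-04", "2021-02-05",
                                     "2021-03-12", "2021-05-07", "2021-02-09",
                                     "2021-03-14", "2021-06-18"]),
        "RiskLevel": [2, 2, 1, 2, 3, 2, 1, 0],
    })

    df = df.sort_values(["ID", "Timestamp"])
    # Subtract each ID's first RiskLevel from every one of its records.
    diff = df["RiskLevel"] - df.groupby("ID")["RiskLevel"].transform("first")
    # Blank out each ID's first record to match the "." in the desired output.
    df["DifFromFirstRiskLevel"] = diff.astype(float).mask(
        df.groupby("ID").cumcount() == 0
    )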
I have the following columns in Google Sheets:

Equipments    Amount     Equipment 1   Equipment 2
-----------   -------    -----------   -----------
Equipment 1   2          Process 1     Process 3
Equipment 2   3          Process 2     Process 4
                                       Process 5
I need to produce Equipment 1 x2 and Equipment 2 x3.
When the equipment is produced, Process 1 is executed 2 times, Process 2 2 times, Process 3 3 times, Process 4 3 times, and Process 5 3 times.
So I need to generate such list:
Process 1
Process 1
Process 2
Process 2
Process 3
Process 3
Process 3
Process 4
Process 4
Process 4
Process 5
Process 5
Process 5
Of course, I want a formula that is dynamic (e.g. I can add another piece of equipment, or change the processes of a particular piece of equipment).
Single list, using REPT:
=TRANSPOSE(SPLIT(JOIN(",",FILTER(REPT(C2:C&",",B2),C2:C<>"")),","))
Multi-list, using REPT:
=TRANSPOSE(SPLIT(JOIN(",",FILTER(REPT(C2:C&",",VLOOKUP(D2:D,A:B,2,)),C2:C<>"")),","))
There is no easy way to solve your problem with formulas.
I would strongly suggest you write a script. It's easier than you think: you can even record an action and then look at the code needed to reproduce that action.
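To illustrate how simple that script's logic is, here is a minimal Python sketch of it (the data layout is hypothetical; an Apps Script version would follow the same shape):

    # Amounts per equipment and the processes each equipment needs,
    # mirroring the two tables in the question.
    amounts = {"Equipment 1": 2, "Equipment 2": 3}
    processes = {
        "Equipment 1": ["Process 1", "Process 2"],
        "Equipment 2": ["Process 3", "Process 4", "Process 5"],
    }

    result = []
    for name, amount in amounts.items():
        for process in processes[name]:
            # Each process is executed once per unit of equipment produced.
            result.extend([process] * amount)

    print(result)  # Process 1 x2, Process 2 x2, Process 3 x3, Process 4 x3, ...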
I have a time-series dataframe like the one below (the numbers in it are meaningless), and I am having some problems applying an LSTM to it.
I have seen some demos of LSTMs; most use this pattern: [y_{t-2}, y_{t-1}, y_{t}] to predict [y_{t+1}]. But, as the dataframe below shows, I also have featureA, featureB, and featureC, so my question is: how do I use multiple inputs or multiple features with an LSTM?
time featureA featureB featureC target
1 2 5 6 1
2 4 1 7 3
3 6 2 1 5
4 2 4 0 7
5 7 6 1 5
6 9 3 2 8
7 1 2 3 5
8 2 9 5 10
9 1 10 7 6
10 3 2 2 11
For an RNN/LSTM, it is more like this: [..., y_{t-2}(x_{t-2}), y_{t-1}(x_{t-1})] to predict [y_{t}(x_{t})]
Or more succinctly:
y_{t} = f(y_{t-1}, x_{t})
So in the feed-forward pass you still use your inputs x_{t} (i.e. your features), plus the outputs from previous timesteps, to make the prediction at the current timestep.
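As a minimal sketch (assuming Keras/TensorFlow; the shapes and hyperparameters are illustrative, not tuned), multiple features simply become the last dimension of the input tensor:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, LSTM, Dense

    window, n_features = 3, 4   # 3 past timesteps; featureA/B/C plus past target

    # X has shape (samples, timesteps, features); y has shape (samples,).
    X = np.random.rand(100, window, n_features)
    y = np.random.rand(100)

    model = Sequential([
        Input(shape=(window, n_features)),  # features live in the last axis
        LSTM(32),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=10, verbose=0)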
I just entered the space of data mining, machine learning, and clustering. I have a specific problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects, or whatever) on a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (an observation, a 1D vector, ...) and m represents a column (the variable index within each vector). n could be a very large number, and 0 < m < 100. A key point is that the same observation (row) cannot contain duplicate values (within the 1st row, any value can appear only once).
So, I want to somehow perform clustering where I put observations in the same cluster based on the number of values that the rows/observations have in common.
If there are two rows like:
1
1 2 3 4 5
They should be clustered into the same cluster; if there is no match, then definitely not. Also, the number of rows in one cluster should not exceed 100.
A sick problem..? If not, just for information: I didn't mention the time dimension. But let's skip that for now.
So, any directions from you guys,
Thanks and best regards,
JDK
It's hard to recommend anything, since your problem is totally vague and we have no information on the data. Data mining (and in particular exploratory techniques like clustering) is all about understanding the data, so we cannot provide the ultimate answer.
Two things for you to consider:
1. if the data indicates presence of species or traits, Jaccard similarity (and other set-based metrics) are worth a try (see the sketch below).
2. if absence is less informative, maybe you should be mining association rules, not clusters
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks getting the best useless result!
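For reference, a minimal sketch of Jaccard similarity between two rows treated as sets:

    def jaccard(a, b):
        # |intersection| / |union| over the rows' values treated as sets.
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Two rows from the question: 3 shared values out of 7 distinct ones.
    print(jaccard([1, 2, 3, 4, 5, 6], [1, 3, 5, 7]))  # 0.428...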
Could your problem be treated as a Bag-of-Words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, say we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Before you give more information, and going by your current description, I think you do not need a clustering algorithm but a connected-components structure. In the first round you process the dataset to build the connected-components information, and in the second round you check which connected component each row belongs to. Taking the example above, the first round gives:
1 2 3       : 1 <- 1, 1 <- 2, 1 <- 3  (all points are linked to the smallest
              point, to record that they belong to that point's cluster)
2 3 4       : 2 <- 4  (2 and 3 are already linked to 1, which is <= 2, so they
              do not need to change)
2 3 4 5     : 2 <- 5
1 2 3 4     : 1 <- 4  (in fact this change is not essential, because we already
              have 1 <- 2 <- 4, but making it can speed up the second round)
3 4 6       : 3 <- 6
6 7 8       : 6 <- 7, 6 <- 8
9 10        : 9 <- 9, 9 <- 10
9 11        : 9 <- 11
10 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure representing the connected components of the points. In the second round you can easily pick one point from each row (the smallest one is best) and trace it to its root in the forest. The rows that have the same root are in the same, in your words, cluster. For example:
1 2 3       : 1 <- 1, cluster root 1
2 3 4 5     : 1 <- 1 <- 2, cluster root 1
6 7 8       : 1 <- 1 <- 3 <- 6, cluster root 1
9 10        : 9 <- 9, cluster root 9
10 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of points, and O(nm + nh) time, where h is the height of the forest structure; typically h << m.
I am not sure if this is the result you want.
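For what it's worth, here is a minimal Python sketch of this two-pass idea, using a standard union-find (with path halving) instead of hand-maintained "smallest point" links:

    def find(parent, x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def cluster_rows(rows):
        parent = {}
        # First round: union all values that co-occur in a row.
        for row in rows:
            for v in row:
                parent.setdefault(v, v)
            for v in row[1:]:
                parent[find(parent, row[0])] = find(parent, v)
        # Second round: a row's cluster is the root of any of its values.
        return [find(parent, row[0]) for row in rows]

    rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4], [3, 4, 6],
            [6, 7, 8], [9, 10], [9, 11], [10, 12, 13, 14]]
    print(cluster_rows(rows))  # two distinct labels, one per connected component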