Clustering List of Strings - machine-learning

Let's say we have a long list like this:
list = ['computer','computers 12','old computer','laptop','lap top','laptops']
Is there any method to group similar strings into clusters? For longer texts, algorithms like LDA operate on words, not characters. There is also fuzzywuzzy, a library for checking similarity, but it only scores pairs of strings. By the way, I want to get this kind of output:
output = [1,1,1,2,2,2]
where 1 stands for the computer-like strings and 2 for the laptop-like ones, based on their similar spellings.

After some tedious algorithmic work, I wrote a solution like this. It simply assigns a representative name to each cluster rather than a number.
from WordCluster import WordCluster  # the author's own helper class

words = ['computer', 'computers 12', 'old computer', 'laptop', 'lap top', 'laptops']
wcu = WordCluster(words, 4)
cluster_categories = wcu.find_cluster_categories()
categorized_words = wcu.categorize_all_words(cluster_categories)
print(categorized_words)
The result looks like this. The cluster names are generated automatically from the initial list: 'oldcomputer', for example, is the last word of the computer-like series, and 'laptop' likewise names the laptop series.
[{'word': 'computer', 'cluster': 'oldcomputer'},
{'word': 'computeri', 'cluster': 'oldcomputer'},
{'word': 'oldcomputer', 'cluster': 'oldcomputer'},
{'word': 'laptop', 'cluster': 'laptop'},
{'word': 'lap top', 'cluster': 'laptop'},
{'word': 'laptops', 'cluster': 'laptop'}]
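For comparison, here is a minimal character-level sketch with scikit-learn that produces numeric labels like the output asked for above. It assumes the number of clusters is known up front, which the original question does not guarantee:

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

words = ['computer', 'computers 12', 'old computer', 'laptop', 'lap top', 'laptops']

# Vectorize on character n-grams so 'lap top' and 'laptop' look alike.
X = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3)).fit_transform(words)

# Cosine distance with average linkage; n_clusters=2 is an assumption here.
# Note: metric= was called affinity= in scikit-learn versions before 1.2.
clusterer = AgglomerativeClustering(n_clusters=2, metric='cosine', linkage='average')
labels = clusterer.fit_predict(X.toarray())
print(labels)  # e.g. [0 0 0 1 1 1]; the label values themselves are arbitrary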

Related

What is the difference between makelist() and create_list() in MAXIMA?

I have seen that there are two similar functions for creating lists in Maxima: create_list() and makelist(). In both cases, the arguments can be
(<an expression>, <a variable>, <the initial value>, <the final value>, <the step>) or
(<an expression>, <a variable>, <a list of values for the variable>).
What is the difference between these two functions? I have tried a couple of examples and they seem to work in the same way:
makelist(i^i,i,1,3); -> [1,4,27]
create_list(i^i,i,1,3); -> [1,4,27]
makelist(i^i,i,[1,2,3]); -> [1,4,27]
create_list(i^i,i,[1,2,3]); -> [1,4,27]
If you wish, you can create your own function, with its own syntax, in Maxima.
For example, there is no operator "..", but this makes it happen:
infix("..",80,80,expr,expr,expr);
You can then define the semantics; here I just call a function named range:
(a..b):= range(a,b)
This doesn't provide for all the embroidery that you might like.
I think a superior technique for syntax and semantics is to enhance the "for" loop, as in this example:
for i:1 thru 5 do collect i;
which returns [1,2,3,4,5].
All the varied mechanisms of "for" (step size, limit, iterating through sets, etc.) can then be included in computing a list explicitly comprising a range. The code for this is about 7 lines of Lisp, inserted into the source code for "parse-$do".
I also allow
for i in [a,b] summing f(i);
which returns f(b)+f(a).
This enhancement is redundant for the (few) people who are comfortable with map, cons, lambda, apply, append, etc. in Maxima.
The code, which can be read into any Maxima, is here:
https://people.eecs.berkeley.edu/~fateman/lisp/doparsesum.lisp

Splitting a string variable delimited list into individual binary variables in SPSS

I have a string variable created from a checkbox question (Which of the following assets do you own?).
I am trying to create individual binary variables for each type of asset based on whether that asset's number is present in the string list.
The syntax I am using cannot differentiate between 1 and 11.
do repeat wrd="1," ",2," ",3," ",4," ",5," ",6," ",7,"/NewVar= W3_CG_asset_TV_1 W3_CG_asset_radio_2 W3_CG_asset_payTV_3 W3_CG_asset_tel_4 W3_CG_asset_cellphone_5
W3_CG_asset_fridge_6 W3_CG_asset_freezer_7.
compute NewVar=char.index(W3_CG_HouseExpen1, wrd)>0.
end repeat.
do repeat wrd= ",8," ",9," ",10," ",11," ",12," ",13," ",14," ",15," ",16," ",17," ",18," ",19," /NewVar= W3_CG_asset_electricstove_8 W3_CG_asset_primusstove_9
W3_CG_asset_gasstove_10 W3_CG_asset_electrickettle_11 W3_CG_asset_microwave_12 W3_CG_asset_computer_13 W3_CG_asset_electricity_14 W3_CG_asset_geyser_15
W3_CG_asset_washingmachine_16 W3_CG_asset_workingvehicle_17 W3_CG_asset_bicycle_18 W3_CG_asset_donkeyhorse_19.
compute NewVar=char.index(W3_CG_HouseExpen1, wrd)>0.
end repeat.
I have tested this on SPSS 28.
Make sure the column W3_CG_HouseExpen1 is a string and that its length is long enough to hold the data.
Then I added EXECUTE:
data list list/W3_CG_HouseExpen1 (a50).
begin data
"1,2,11,12,"
"2,12,"
"1,2,"
"1,11,12,"
end data.
do repeat
wrd="1," ",2," ",3," ",4," ",5," ",6," ",7," ",8," ",9," ",10," ",11," ",12," ",13," ",14," ",15," ",16," ",17," ",18," ",19,"
/NewVar = W3_CG_asset_TV_1 W3_CG_asset_radio_2 W3_CG_asset_payTV_3 W3_CG_asset_tel_4 W3_CG_asset_cellphone_5 W3_CG_asset_fridge_6 W3_CG_asset_freezer_7
W3_CG_asset_electricstove_8 W3_CG_asset_primusstove_9 W3_CG_asset_gasstove_10 W3_CG_asset_electrickettle_11 W3_CG_asset_microwave_12 W3_CG_asset_computer_13 W3_CG_asset_electricity_14 W3_CG_asset_geyser_15
W3_CG_asset_washingmachine_16 W3_CG_asset_workingvehicle_17 W3_CG_asset_bicycle_18 W3_CG_asset_donkeyhorse_19.
compute NewVar=char.index(W3_CG_HouseExpen1, wrd)>0.
end repeat.
EXECUTE.
My suggestion is to run through this in reverse, erasing the values you've already recognized. So if you've found "11" and erased it, then when you later search for "1" you won't find it inside an "11".
I recreated a tiny example dataset to demonstrate on (EDIT: improved example):
data list list/W3_CG_HouseExpen1 (a50) .
begin data
"1,2,11,12,"
"11,12,"
"2,11,"
end data.
Now I do the whole process on a copy of the original W3_CG_HouseExpen1 variable, so I can eat away at it without damaging the original data:
string #temp(a50).
compute #temp=W3_CG_HouseExpen1.
do repeat wrd="12," "11," "2," "1," /NewVar= W3_12 W3_11 W3_2 W3_1.
compute NewVar=char.index(#temp, wrd)>0.
compute #temp=replace(#temp, wrd, ""). /*deleting the search string from the full string.
end repeat.
exe.
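Outside SPSS, the same longest-token-first idea can be sketched in Python; the function and variable names here are hypothetical, chosen just to mirror the logic above:

# Search the longest codes first and erase each hit, so "1," can no
# longer match inside an already-consumed "11,".
def asset_flags(value, codes=("12", "11", "2", "1")):
    temp = value
    flags = {}
    for code in codes:                  # descending length order matters
        token = code + ","
        flags[code] = token in temp
        temp = temp.replace(token, "")  # erase the recognized value
    return flags

print(asset_flags("1,2,11,12,"))  # {'12': True, '11': True, '2': True, '1': True}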

Calculated metrics: create metrics expression

I want to create a calculated metrics expression for the same logical expression as this Java code:
if (KPI <= 95 & FailedCount != 0) {
    STATUS = 1;
} else {
    STATUS = 0;
}
In SiteScope I wrote this expression:
((<<KPI>><=95)&(<<FailedCount>>!=0))
But I do not like the result:
when KPI=0 and FailedCount=0, STATUS=0;
but when KPI=100 and FailedCount=0, STATUS='n/a'.
How do I resolve this problem?
P.S. I have also asked this question on the HP Community.
There's a ternary operator you can use:
(Boolean Expression)? resultIfExpressionIsTrue: resultIfExpressionIsFalse
In your case you could try to use something like:
((<<KPI>><=95)&(<<FailedCount>>!=0))? 1: 0
You may also need to consider whether you want the result to be 0 and 1 as integers (as above) or as strings, in which case they should be wrapped in double quotes. This matters if you want to apply numeric or string thresholds to the resulting calculated metric, and it determines whether the result is seen as a numeric or string value in other places, like OMi or Service Health.

Scikit-learn: How to extract features from the text?

Assume I have an array of Strings:
['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']
I'd like to extract from this description features like:
item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...
Should I prepare the pre-defined known features first? Like
brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']
I am not sure that I need CountVectorizer or TfidfVectorizer here; DictVectorizer seems more appropriate, but how can I build the dicts, extracting the values for each key from the entire string?
Is this possible with scikit-learn's feature extraction, or should I write my own .fit() and .transform() methods?
UPDATE:
@sergzach, please review whether I understood you right:
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...', ...]

for d in data:
    for brand in brands:
        if brand in d:
            pass  # ok, brand is found
    for model in models:
        if model in d:
            pass  # ok, model is found
So I'd be creating N loops, one per feature? This might work, but I am not sure it is the right or most flexible approach.
Yes, something like the following. Excuse me; you will probably need to correct the code below.
import re

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...', ...]

features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m',
            r'intel\s+pentium', r'intel\s+core\s+duo']
    # and other features
}

cat_data = []           # your categories, converted into numbers
not_found_columns = []  # (column, line) pairs where no feature matched

for line in data:
    line_cats = {}
    for col, patterns in features.items():  # don't shadow the features dict
        found = False
        for i, pattern in enumerate(patterns):
            if re.search(pattern, line.lower(), flags=re.UNICODE):
                line_cats[col] = i + 1  # numeric category: e.g. dell is 2, acer is 5
                found = True
                break  # the category is determined by the first occurrence
        if not found:
            # the cycle ended without a match: use 0 as the "not found" category
            line_cats[col] = 0
            not_found_columns.append((col, line))
    cat_data.append(line_cats)

# cat_data now maps each column to a categorical value (index + 1)
# when a feature was found, otherwise 0.
Now you have the column names and lines (not_found_columns) for which nothing was found. Review them; you probably forgot some features.
We can also write strings (instead of numbers) as categories and then use DictVectorizer. The two approaches end up equivalent.
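To illustrate that DictVectorizer variant, a minimal sketch (the dict contents here are assumed, not taken from the post above):

from sklearn.feature_extraction import DictVectorizer

# String categories instead of numeric indices (hypothetical values).
cat_data = [{'brand': 'apple', 'cpu': 'core i7'},
            {'brand': 'dell',  'cpu': 'core i5'}]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(cat_data)     # one-hot encodes the string values
print(dv.get_feature_names_out())  # e.g. ['brand=apple', 'brand=dell', ...]
print(X)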
Scikit-learn's vectorizers will convert an array of strings into an inverted-index matrix (a 2D array with a column for each term/word found). Each row (1st dimension) in the original array maps to a row in the output matrix. Each cell holds a count or a weight, depending on which kind of vectorizer you use and its parameters.
I am not sure this is what you need, based on your code. Could you tell us where you intend to use these features? Do you intend to train a classifier? For what purpose?
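As a quick illustration of that matrix shape (the example strings are assumed):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['Laptop Apple Macbook Air', 'Laptop Dell Latitude']
vec = CountVectorizer()
X = vec.fit_transform(docs)

# get_feature_names_out requires scikit-learn >= 1.0 (older: get_feature_names)
print(vec.get_feature_names_out())  # one column per term found
print(X.toarray())                  # one row per input string, cells are counts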

Erlang: What is most-wrong with this trie implementation?

Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.
At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.
This is the trie:add implementation:
add([], Trie) ->
    dict:store(stop, true, Trie);
add([Ch|Rest], Trie) ->
    % setdefault(Key, Default, Dict) ->
    %     case dict:find(Key, Dict) of
    %         { ok, Val } -> { Dict, Val };
    %         error -> { dict:new(), Default }
    %     end.
    { NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
    NewSubTrie = add(Rest, SubTrie),
    dict:store(Ch, NewSubTrie, NewTrie).
And you can see the rest, along with an example of how it's used (at the bottom), here:
http://gist.github.com/263513
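For reference, the setdefault in the comment mirrors Python's dict.setdefault, and the structure is the familiar nested-dict trie. A rough Python sketch of the same shape (names are assumed, not from the gist):

def add(word, trie):
    # Walk/extend the nested dicts, one level per character.
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node['stop'] = True  # mark the end of a complete word
    return trie

trie = {}
for w in ('zed', 'zebra'):
    add(w, trie)
print('stop' in trie['z']['e']['d'])  # True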
Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.
So, what am I doing most-wrong? And how might I make it a bit less-wrong?
You could implement this functionality by simply storing the words in an ets table:
% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).
> ets:insert(words, [{"zebra"}]).
% check if word exists
> ets:lookup(words, "zed").
[{"zed"}]
% check if "ze" has a continuation among the words
> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]
If a trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.
If you want to stay functional, you might save some memory by using structures that take less space, for example proplists, or records such as -record(node, {a,b,...,x,y,z}).
I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely functional dict representation would take maybe 4 machine words per association; it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that's 8*4*2.7e6 bytes, or about 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.
Update: dict.erl represents dicts with a hash table, which implies a lot of overhead when you have many very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.
An alternative way to solve the problem is to go through the word list and check whether each word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code (optimizing it, adding concurrency).
Look into DAWGs. They're much more compact than tries.
I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts:
http://www.erlang.org/doc/man/digraph.html
If all words are in English and case doesn't matter, all characters can be encoded as the numbers 1 to 26 (in fact, in Erlang they are the numbers 97 to 122), reserving 0 for stop. So you could use the array module as well.
