Machine Learning nominal data

I'm working on machine learning with an SVM. I'm trying to feed my SVM with data, but my data is nominal and I have no idea how to transform it.
My data looks like:
Item | Productname         | Label name  | Packaging | etc...
-----|---------------------|-------------|-----------|-------
1    | Battery Micro 4     | Batt. Micro | Folding   | ...
2    | Battery Micro 8     | Batt. Micro | Blister   | ...
3    | button cell Battery | btn Batt.   | Blister   | ...
I want to train my SVM to recognize that "Battery Micro 4" belongs in the column "Productname", "Batt. Micro" in the column "Label name", "Folding" in the column "Packaging", and so on.
Methods like one-hot encoding do not seem to be a good fit for my case, since the number of items will increase over time.
Does anyone know a method to transform this data into numerical values with little information loss?
Thanks.

Since your data has no natural ordering, plain integer encoding would be of no use. The next option would be one-hot encoding, but since, as you said, the number of items can increase, we can discard that as well. The next option is to take the value counts of all the discrete values you have, sort them, and then apply an integer encoding going from smallest to largest. While doing this, you should also take care of the cardinality of the discrete values: if a discrete value has a cardinality of less than 1%, it is better to create a special category for such values and assign all of them to it. That way, any new category that arrives at test time can be assigned to this category, since its cardinality will definitely be very low.
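A minimal sketch of that count-based encoding in Python with pandas (the Packaging values and the 20% rarity threshold are made up for the toy data; the answer suggests <1% in practice):

import pandas as pd

# Toy training column of nominal values (hypothetical data).
train = pd.Series(["Blister", "Folding", "Blister", "Box", "Blister", "Folding"])

counts = train.value_counts()
rare = set(counts[counts / len(train) < 0.20].index)   # low-cardinality bucket
# Integer codes ordered from the least frequent to the most frequent value.
mapping = {cat: rank for rank, cat in enumerate(sorted(counts.index, key=counts.get))}
RARE_CODE = -1                                         # shared code for rare/unseen values

def encode(value):
    # Integer-encode by frequency rank; rare or unseen values share one bucket.
    if value in rare or value not in mapping:
        return RARE_CODE
    return mapping[value]

test = pd.Series(["Blister", "Shrinkwrap"])   # "Shrinkwrap" was never seen in training
print(test.map(encode).tolist())              # e.g. [2, -1]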

Related

I think I've gone crazy: Google Sheet Lookup

I'm trying to do a really simple lookup on a data validation field:
I have 4 simple values:
risk free
low
intermediate
high
and next to those I have the numeric values for them:
0
0.1
0.2
0.5
I use the lookup formula as follows:
=LOOKUP(J11,H7:H10,I7:I10)
When I then change the value to, let's say, low or risk free, it always shows me 0.5.
But when I change the words to this:
abc
def
ghi
jkl
it delivers the right values.
I tried several different sheets, browsers, and Google accounts using different languages (English & German).
Can somebody please explain this to me :D
Drop the LOOKUP and use:
=VLOOKUP(J11, H7:I10, 2, 0)
LOOKUP assumes the lookup range is sorted in ascending order; "risk free, low, intermediate, high" is not alphabetically sorted (while "abc, def, ghi, jkl" happens to be), so LOOKUP's approximate matching lands on the wrong row. The final argument 0 (FALSE) tells VLOOKUP to do an exact match, which does not depend on sort order.

Can Machine Learning help classify data

I have a data set as below,
Code | Description
AB123 | Cell Phone
B467A | Mobile Phone
12345 | Telephone
WP9876 | Wireless Phone
SP7654 | Satellite Phone
SV7608 | Sedan Vehicle
CC6543 | Car Coupe
I need to create an automated grouping based on the Code and Description. Let's assume I already have a lot of such data classified into groups 0-99. Whenever a new record comes in with a Code and Description, the machine learning algorithm needs to classify it automatically based on the previously available data.
Code | Description | Group
AB123 | Cell Phone | 1
B467A | Mobile Phone | 1
12345 | Telephone | 1
WP9876 | Wireless Phone | 1
SP7654 | Satellite Phone | 1
SV7608 | Sedan Vehicle | 2
CC6543 | Car Coupe | 3
Can this be achieved with some level of accuracy? Currently this process is entirely manual. If there are any ideas or references for this, please share them.
Try reading up on Supervised Learning. You need to provide labels for your training data so that the algorithms know what the correct answers are and can generate appropriate models for you.
Then you can "predict" the output classes for your new incoming data using the generated model(s).
Finally, you may wish to circle back and check the accuracy of the predicted results. If you then enter the correct labels for the newly received and predicted data, that data can be used for further training of your model(s).
Yes, it's possible with supervised learning. You pick a model and "train" it with the data you already have. The model/algorithm then "generalizes" from the known data to previously unseen data.
What you call a group would be called a class or "label", which needs to be predicted based on two input features (code/description). Whether you input these features directly or preprocess them into more abstract features that suit the algorithm better depends on which algorithm you choose.
If you have no experience with machine learning, you might start by learning some basics while testing already-implemented algorithms in tools such as RapidMiner, Weka or Orange.
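For a concrete feel, here is a minimal sketch of the supervised route in Python with scikit-learn (the tiny training set is copied from the question; the Naive Bayes model and character n-gram features are just one reasonable assumption):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

descriptions = ["Cell Phone", "Mobile Phone", "Telephone",
                "Wireless Phone", "Satellite Phone",
                "Sedan Vehicle", "Car Coupe"]
groups = [1, 1, 1, 1, 1, 2, 3]

# TF-IDF turns each description into a numeric vector of character
# n-gram weights; Naive Bayes learns which n-grams indicate which group.
model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      MultinomialNB())
model.fit(descriptions, groups)

print(model.predict(["Smart Phone", "Hatchback Vehicle"]))  # e.g. [1 2]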
I don't think machine learning methods are the most appropriate solution to this problem, because text-based machine learning algorithms tend to be quite complicated, and from the examples you provided I'm not sure how much the Code field would even contribute.
I think the simplest way of solving, or attempting to solve, this problem is the following, which can be implemented in many free programming languages, such as Python. Each description can be stored as a string. You could store all the substrings of all the strings that belong to a particular group in a list (i.e. if 'Phone' is your string, the substrings are 'P', 'h', 'Ph', ..., 'e'; see the question "Substrings of a string using Python" for how to implement this). Then, for each substring stored, check which ones are unique to a certain group, and select substrings over a certain length (say 3 characters, to get rid of random letter concatenations) as your classification criteria. When new data comes in, check whether its description contains a substring that is unique to a certain group. With this, for instance, you would be able to classify all objects in group 1 based on whether their description contains the word 'phone'.
It's hard to provide concrete code that solves your problem without knowing which languages you are familiar with or are feasible to use. I hope this helps anyway. Yves
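A rough sketch of that substring idea in Python (the training data and the 3-character minimum are illustrative assumptions):

from collections import defaultdict

training = {1: ["Cell Phone", "Mobile Phone", "Telephone"],
            2: ["Sedan Vehicle"],
            3: ["Car Coupe"]}

def substrings(s, min_len=3):
    # All substrings of s with at least min_len characters, lowercased.
    s = s.lower()
    return {s[i:j] for i in range(len(s))
            for j in range(i + min_len, len(s) + 1)}

# For every substring, record which groups it occurs in...
seen_in = defaultdict(set)
for group, descs in training.items():
    for d in descs:
        for sub in substrings(d):
            seen_in[sub].add(group)
# ...and keep, per group, the substrings that are unique to that group.
unique = {g: {s for s, gs in seen_in.items() if gs == {g}} for g in training}

def classify(description):
    subs = substrings(description)
    for group, markers in unique.items():
        if subs & markers:
            return group
    return None  # no unique marker matched

print(classify("Satellite Phone"))  # e.g. 1, via the marker "phone"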

Normalize company names using a long set of rules

We have a large table (>30M rows) containing company names and other characteristics.
Data:
Company_id  Type  Name               Address (more...)
497651684   8     Big mall Toys'rUs  BigMall adress
468468486   1     McDonnnals         WhateverStreet
161684314   8     Toys R Us          Another street
546846846   1     BgKing             BigMall2 adress
484984988   5     IKEA store103      Other Mall
489616848   5     Mss Duty           Addrs
484984984   5     Pull&Bear          Adrss
468484867   5     Zara store         Adress2
(...)
From that table, we have identified ~300 company groups whose names could be normalized easily with something along the lines of:
if type in (8, 10, 85, 2)
   and (name contains ("toys", "us")
        or stringDistance(name, "toys R us") < X)
then new name is "Toys 'R us"

if type in (1, 90, 7)
   and (name contains ("donalds")
        or stringDistance(name, "mcdonalds") < X)
then new name is "Mc donalds"

(...)
Honestly, I'm not sure what the best approach would be. We previously used an ad-hoc approach on a much smaller set with simpler logic, as a quick solution. But this time I would love to know what the ideal way would be.
While string edit distance, e.g. stringDistance(name, "toys R us") < X, is a good approach, I would also recommend trying clustering, especially hierarchical clustering, here.
All the names falling into the same cluster should get the same standard company name. For this to work, you will have to cut the dendrogram (http://en.wikipedia.org/wiki/Dendrogram) of the hierarchy pretty close to the leaves. You will have to try different features (the ones used in calculating the distance or similarity) for your clustering to arrive at a suitable set. One example is representing each company name as a vector of characters and then using cosine similarity to measure distances. By the way, cosine similarity works great for text.
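A hedged sketch of that suggestion in Python (the names below come from the question, but the trigram features, average linkage, and the 0.5 cut height are assumptions you would need to tune):

from sklearn.feature_extraction.text import CountVectorizer
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

names = ["Toys R Us", "Big mall Toys'rUs", "McDonnnals",
         "BgKing", "Pull&Bear", "Zara store"]

# Represent each name as a vector of character trigram counts.
vectors = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)) \
    .fit_transform(names).toarray()

# Average-linkage hierarchical clustering on cosine distances,
# cut close to the leaves so only near-duplicate names merge.
tree = linkage(pdist(vectors, metric="cosine"), method="average")
labels = fcluster(tree, t=0.5, criterion="distance")

for label, name in sorted(zip(labels, names)):
    print(label, name)   # names sharing a label get one normalized name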

Automatic people counting + twittering

I want to develop a system that accurately counts people going through a normal 1-2 m wide door, tweets whenever someone goes in or out, and reports how many people remain inside.
Now, the Twitter part is easy, but counting people is difficult. There are some existing counting solutions, but they do not quite fit my needs.
My idea/algorithm:
Should I mount an infrared camera on top of my door for constant monitoring, divide the camera image into a grid, and track people entering and leaving?
Can you give me some suggestions and a starting point?
How about having two sensors about 6 inches apart? They could be those little beam sensors (you know, the ones that chime when you walk into some shops) placed on either side of the door jamb. We'll call the sensors S1 and S2.
If they are triggered in the order S1 then S2, a person came in.
If they are triggered in the order S2 then S1, a person left. (A small state machine for this is sketched below the diagram.)
-----------------------------------------------------------
|  sensor  |             door jamb             |  sensor  |
-----------------------------------------------------------
     |                                             |
     |                                             |
     |                                             |
     |                                             |
     S1                                            S2      <-- this is inside the store
     |                                             |
     |                                             |
     |                                             |
     |                                             |
-----------------------------------------------------------
|  sensor  |             door jamb             |  sensor  |
-----------------------------------------------------------
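A toy sketch of that counting logic in Python (the event feed is hypothetical; real hardware would deliver beam-break interrupts):

def count_people(events):
    # events: iterable of "S1"/"S2" beam-break events, in order.
    inside = 0
    previous = None
    for sensor in events:
        if previous == "S1" and sensor == "S2":    # S1 then S2: someone came in
            inside += 1
            previous = None
        elif previous == "S2" and sensor == "S1":  # S2 then S1: someone left
            inside -= 1
            previous = None
        else:
            previous = sensor
    return inside

print(count_people(["S1", "S2", "S1", "S2", "S2", "S1"]))  # 2 in, 1 out -> 1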
If you would like to have the people filmed by a camera, you can try to segment the people in the image and track them using a particle filter for multi-object tracking.
http://portal.acm.org/citation.cfm?id=1561072&preflayout=flat
This is a paper by one of my professors. Maybe you wanna have a look at it.
If your camera is mounted and doesn't move, you can use a subtraction method to segment the moving people (basically, just subtract two consecutive images, and what remains is the things that moved). Then do some morphological operations on the result so that only big parts (people) remain. You could even identify them by checking for rectangularity, so you only keep "standing" objects.
Then use a particle filter to track the people in the scene automatically... and each new object would increase the counter...
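A rough OpenCV sketch of that frame-differencing step in Python (the camera index, threshold, and kernel size are assumptions to tune for a real setup; the particle-filter tracking is not included):

import cv2

cap = cv2.VideoCapture(0)            # fixed, non-moving camera
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Subtract consecutive frames: only moving things survive.
    diff = cv2.absdiff(gray, prev)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    # Morphological opening/closing keeps only large blobs (people).
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    cv2.imshow("moving objects", mask)
    prev = gray
    if cv2.waitKey(30) & 0xFF == 27:  # Esc to quit
        break
cap.release()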
If you want, I could maybe send you a presentation I gave a while ago (unfortunately it's in German, but you can translate it).
Hope that helps...

Precision of reals through writeln/readln in Delphi

My client's application exports and imports quite a few variables of type real through a text file, using writeln and readln. I've tried to increase the width of the fields written, so the code looks like:
writeln(file, exportRealvalue:30); //using excess width of field
....
readln(file, importRealvalue);
When I export, then import, then export again and compare the files, I get a difference in the last two digits, e.g. (I might be off on the actual number of digits here, but you get the idea):
-1.23456789012E-0002
-1.23456789034E-0002
This actually makes a difference in the app, so the client wants to know what I can do about it. Now, I'm not sure it's only the write/read that does it, but I thought I'd throw a quick question out there before I dive into the haystack again. Do I need to go binary on this?
This is not an app dealing with currency or anything like that; I just write and read the values to/from file. I know floating-point numbers are a bit strange sometimes, and I thought one of the routines (writeln/readln) might have some funny business going on.
You might try switching to Extended for greater precision. As was pointed out, though, floating-point numbers only have so many significant digits of precision, so it is still possible to display more digits than are accurately stored, which could result in the behavior you described.
From the Delphi help:
Fundamental Win32 real types

Type     | Range                             | Significant digits | Size in bytes
---------+-----------------------------------+--------------------+--------------
Real     | -5.0 x 10^-324 .. 1.7 x 10^308    | 15-16              | 8
Real48   | -2.9 x 10^-39  .. 1.7 x 10^38     | 11-12              | 6
Single   | -1.5 x 10^-45  .. 3.4 x 10^38     | 7-8                | 4
Double   | -5.0 x 10^-324 .. 1.7 x 10^308    | 15-16              | 8
Extended | -3.6 x 10^-4951 .. 1.1 x 10^4932  | 10-20              | 10
Comp     | -2^63+1 .. 2^63-1                 | 10-20              | 8
Currency | -922337203685477.5808 ..          | 10-20              | 8
         |  922337203685477.5807             |                    |
Note: The six-byte Real48 type was called Real in earlier versions of Object Pascal. If you are recompiling code that uses the older, six-byte Real type in Delphi, you may want to change it to Real48. You can also use the {$REALCOMPATIBILITY ON} compiler directive to turn Real back into the six-byte type. The following remarks apply to fundamental real types.
Real48 is maintained for backward compatibility. Since its storage format is not native to the Intel processor architecture, it results in slower performance than other floating-point types.
Extended offers greater precision than other real types but is less portable. Be careful using Extended if you are creating data files to share across platforms.
Notice that the range is greater than the significant digits, so you can have a number larger than can be accurately stored. I would recommend rounding to the significant digits to prevent that from happening.
If you want to specify the precision of a real with WriteLn, use the following:
WriteLn(RealVar:12:3);
This outputs the value of RealVar with a total width of at least 12 characters and a precision of 3 decimal places.
When using floating-point types, you should be aware of the precision limitations of the specified types. A 4-byte IEEE-754 type, for instance, has only about 7.5 significant digits of precision; an 8-byte IEEE-754 type has roughly double that. Apparently, the Delphi real type has a precision of around 11 significant digits. The result is that any extra digits of formatting you specify are likely to be noise resulting from the conversions between base-10 formatted values and base-2 floating-point values.
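To illustrate the round-trip point in a language-neutral way, here is a small Python sketch (Python is just used as a calculator here; an 8-byte IEEE-754 double needs 17 significant digits to survive a text round-trip exactly):

x = 1 / 3                         # a double whose decimal expansion never ends

for decimals in (11, 16):
    text = f"{x:.{decimals}e}"    # 12 and 17 significant digits respectively
    back = float(text)            # "readln" the value back in
    print(text, back == x)
# 3.33333333333e-01      False  (too few digits: the value drifts on re-import)
# 3.3333333333333331e-01 True   (17 significant digits round-trip exactly)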
First of all, I would try to see if I could get any help from using Str with different arguments or from increasing the precision of the types in your app. (Have you tried using Extended?)
As a last resort (warning: workaround!), I'd try saving the customer's string representation along with the binary representation in a sorted list. Before writing back a floating-point value, I'd check whether there is already a matching value in the table whose string representation is known and can be used instead. To make this lookup quick, you can sort the list on the numeric value and use binary search to find the best match.
Depending on how much processing you need to do, an alternative could be to keep the numbers in BCD format to retain original accuracy.
It's hard to answer this without knowing what type your ExportRealValue and ImportRealValue are. As others have mentioned, the real types all have different precisions.
It's worth noting that, contrary to some expectations, Extended is not always higher precision: Extended is 10-20 significant figures, where Double is 15-16. As you are having trouble around the tenth significant figure, perhaps you are using Extended already.
To get more control over the reading and writing, you can convert the numbers to and from strings yourself and write them to a file stream. At least that way you don't have to worry whether readln and writeln are up to no good behind your back.
