How to quantify these features so they can be analysed using Logistic Regression? - machine-learning

I have a small question which has been baffling me for a while. I have a dataset with interesting features, but some of them are dimensionless quantities; I've tried using z-scores on them, but that only made things worse. These are:
Timestamps (like YYYYMMDDHHMMSSMis). I am currently taking the last 9 characters from this.
User IDs (in a hash-like form). How do I extract meaning from them?
IP addresses (you know what those are). I only extract the first 3 characters.
City (has an ID like 1, 15, 72). How do I extract meaning from this?
Region (same as city). Should I extract meaning from this or just leave it?
The rest of the features are prices, widths and heights, which I understand. Any help or insight would be much appreciated. Thank you.

Timestamps can be transformed into Unix timestamps, which are reasonable natural numbers.
User IDs/Cities/Regions are nominal values, which have to be encoded somehow. The most common approach is to create as many "dummy" dimensions as there are possible values. So if you have 100 cities, then you create 100 dimensions and put a "1" only in the one representing a particular city (and 0 in the others).
IPs should rather be removed, or transformed into some small set of groups (based on network identification, followed by the same nominal-to-dummy transformation as above).
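A minimal sketch of the two transformations above, assuming the data sits in a pandas DataFrame; the column names (timestamp, city_id, region_id, price) are hypothetical:

import pandas as pd

# Hypothetical frame with columns like the ones described in the question.
df = pd.DataFrame({
    "timestamp": ["20240105093015123", "20240106110000456"],
    "city_id": [1, 15],
    "region_id": [3, 3],
    "price": [9.99, 14.50],
})

# Timestamps -> Unix time (seconds since the epoch), a plain numeric feature.
df["unix_ts"] = (
    pd.to_datetime(df["timestamp"].str[:14], format="%Y%m%d%H%M%S")
      .astype("int64") // 10**9
)

# Nominal IDs -> one dummy (0/1) column per observed value.
df = pd.get_dummies(df, columns=["city_id", "region_id"], prefix=["city", "region"])
print(df.head())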

Related

Ordinal Encoding or One-Hot-Encoding

If we are not sure about the nature of categorical features, i.e. whether they are nominal or ordinal, which encoding should we use: Ordinal-Encoding or One-Hot-Encoding?
Is there a clearly defined rule on this topic?
I see a lot of people using Ordinal-Encoding on categorical data that doesn't have a direction.
Suppose a frequency table:
some_data[some_col].value_counts()
[OUTPUT]
color_white 11413
color_green 4544
color_black 1419
color_orang 3
Name: shirt_colors, dtype: int64
A lot of people prefer to do Ordinal-Encoding on this column, but I am hell-bent on going with One-Hot-Encoding.
My view is that Ordinal Encoding would allot these colors ordered numbers, which would imply a ranking. And there is no ranking in the first place. In other words, my model should not think of color_white as 4 and color_orang as 0 or 1 or 2.
Keep in mind that there is no hint of any ranking or order in the data description either.
I have the following understanding of this topic:
Values that have neither a direction nor a magnitude are nominal variables. For example, fruit_list = ['apple', 'orange', 'banana']. Unless there is a specific context, this set would be called nominal, and for such variables we should perform either get_dummies or one-hot encoding.
Ordinal variables, on the other hand, have a direction. For example, shirt_sizes_list = ['large', 'medium', 'small']. If the same fruit list had a context behind it, like price or nutritional value, i.e. something that could give the fruits in fruit_list some ranking or order, we would call it an ordinal variable. For ordinal variables, we perform ordinal encoding.
Is my understanding correct?
Kindly provide your feedback
This topic has turned into a nightmare
Thank you!
You're right. The one thing to consider when choosing between OrdinalEncoder and OneHotEncoder is whether the order of the data matters.
Most ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases e.g., for ordered categories such as:
quality = ["bad", "average", "good", "excellent"] or
shirt_size = ["large", "medium", "small"]
but it is obviously not the case for the:
color = ["white","orange","black","green"]
column (except in cases where you need to consider a spectrum, say from white to black; note that in that case the white category should be encoded as 0 and black as the highest number in your categories), or when, say, categories 0 and 4 may be more similar than categories 0 and 1. To fix this issue, a common solution is to create one binary attribute per category (one-hot encoding).
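A small sketch of the two encoders with scikit-learn, using the example columns above (the explicit category lists are illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["white"], ["orange"], ["black"], ["green"]])
sizes = np.array([["small"], ["medium"], ["large"]])

# No inherent order -> one binary column per category.
onehot = OneHotEncoder().fit_transform(colors).toarray()

# A real order -> a single integer column, with the order given explicitly.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(sizes)

print(onehot)   # 4 rows x 4 columns of 0/1
print(ordinal)  # [[0.], [1.], [2.]]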

How to check whether an input string contains a street address or not?

We want to identify the address fields in a document. To identify the address fields, we converted the document to OCR output using Tesseract. From the Tesseract output we want to check whether a string contains an address field or not. What is the right strategy to solve this problem?
It is not possible to solve this with regex alone, because address fields differ across documents and countries.
I tried NLTK for classifying the words, but it does not work well for address fields.
Required output
I am staying at 234 23 Philadelphia - Contains address field <234 23 Philadelphia>
I am looking for a place to stay - Does not contain an address
Please provide your suggestions for solving this problem.
As in many ML problems, there are multiple possible solutions, and the important part (the one that commonly has the greater impact) is not which algorithm or model you use, but feature engineering, data preprocessing, standardization, and things like that. The first solution that comes to my mind (and it's just an idea; I would test it and see how it performs) is:
Get your training set examples and list the N most commonly used words across all examples (that's your vocabulary). Every one of those N words is represented by a number (its index in the list).
Transform your training examples: read every training example and change its representation, replacing every word by its number in the vocabulary.
Finally, for every training example create a feature vector the same size as the vocabulary, where each entry is 0 (the corresponding word doesn't exist in the example) or 1 (it exists), or the count of how many times the word appears (again, this is feature engineering) - see the sketch after this answer.
Train multiple classifiers, varying algorithms, parameters, training set sizes, etc., and do cross-validation to choose your best model.
And from there keep the standard ML workflow...
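A minimal sketch of that bag-of-words pipeline with scikit-learn; the texts and labels are made up just to show the shape of the approach described in the steps above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: 1 = contains an address-like phrase, 0 = does not.
texts = [
    "I am staying at 234 23 Philadelphia",
    "Send the parcel to 10 Downing Street London",
    "I am looking for a place to stay",
    "Let's meet for coffee tomorrow",
]
labels = [1, 1, 0, 0]

# CountVectorizer builds the vocabulary and the count feature vectors described
# above; max_features caps the vocabulary at the N most frequent words.
model = make_pipeline(CountVectorizer(max_features=1000), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["She lives at 42 Baker Street"]))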
If you are interested in just checking YES or NO, and not extracting the complete address, one simple solution is NER (named-entity recognition).
You can check whether the text contains a location entity or not.
For example:
import nltk

# Requires these NLTK resources:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
#   nltk.download("maxent_ne_chunker"); nltk.download("words")

def check_location(text):
    # Tokenize, POS-tag, then chunk named entities; look for geo-political
    # (GPE) or geo-socio-political (GSP) entities.
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))):
        if hasattr(chunk, "label"):
            if chunk.label() == "GPE" or chunk.label() == "GSP":
                return "True"
    return "False"

text = "I am staying at 234 23 Philadelphia."
print(text + " - " + check_location(text))

text = "I am looking for a place to stay."
print(text + " - " + check_location(text))
Output:
# I am staying at 234 23 Philadelphia. - True
# I am looking for a place to stay. - False
If you want to extract complete address as well, you will need to train your own model.
You can check: NER with NLTK , CRF++.
You're right. Using regex to find an address in a string is messy.
There are APIs that will attempt to extract addresses for you. These APIs are not always guaranteed to extract addresses from strings, but they will do their best. One example of a street address extraction API is from SmartyStreets. Documentation here and demo here.
Something to consider is that even your example (I am staying at 234 23 Philadelphia) doesn't contain a full address. It's missing a state or ZIP code field. This makes it very difficult to programmatically determine whether there is an address. Once a state or ZIP code is added to that sample string (I am staying at 234 23 Philadelphia PA), it becomes much easier to programmatically determine that the string contains an address.
Disclaimer: I work for SmartyStreets
A better way to approach this task could be as follows:
Train your own custom NER model (extending a pre-trained SpaCy model, or building your own CRF++ / CRF-biLSTM model if you have annotated data), or use pre-trained models like SpaCy's large model, geopandas, etc.
Define a weighted scoring mechanism based on your problem statement.
For example, let's assume every complete address block has 3 important components: an address, a telephone number and an email id.
Text that has all three of them would get a score of 33.33% + 33.33% + 33.33% = 100%.
For identifying whether it's an address field or not, you may take into account the percentage of SpaCy's location tags (GPE, FAC, LOC, etc.) out of the total tokens in the text, which gives a good estimate of how many location tags are present. Then run a regex for postal codes, and match the found city names against the 3-4 words just before the found postal code; if there's an overlap, you have correctly identified a postal code and hence an address field (got your 33.33% score!).
For telephone numbers, certain checks and regexes can do it, but an important criterion is to perform these phone checks only if an address field has been located in the surrounding text.
For emails/web addresses you could again perform the usual regex checks, and finally add all these 3 scores into a cumulative value.
An ideal address would get a score of 100, while missing fields would yield 66%, etc. The rest of the text would get a score of 0.
Hope it helped! :)
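A rough sketch of the location-tag part of that scoring idea, using SpaCy's small English model and a US-style ZIP regex; the scaling factor is an arbitrary assumption, not part of the answer above:

import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model has been downloaded

def address_score(text):
    # Share of tokens covered by location-like entities, plus a postal-code bonus.
    doc = nlp(text)
    loc_tokens = sum(len(ent) for ent in doc.ents if ent.label_ in {"GPE", "LOC", "FAC"})
    score = 33.33 * min(1.0, 5 * loc_tokens / max(len(doc), 1))  # arbitrary scaling
    if re.search(r"\b\d{5}(?:-\d{4})?\b", text):                 # US-style ZIP code
        score += 33.33
    return score

print(address_score("I am staying at 234 23 Philadelphia PA 19104"))
print(address_score("I am looking for a place to stay"))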
Why do you say regular expressions won't work?
Basically, define all the different forms of address you might encounter in the form of regular expressions. Then, just match the expressions.
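For what it's worth, here is what one such pattern might look like; it is a deliberately narrow, hypothetical example (house number, capitalized street name, common suffix), and it also shows why this approach struggles with strings like "234 23 Philadelphia":

import re

# Hypothetical pattern: house number, one to three capitalized words, then a common street suffix.
STREET = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s){1,3}(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b"
)

print(bool(STREET.search("She lives at 1600 Pennsylvania Avenue")))  # True
print(bool(STREET.search("I am staying at 234 23 Philadelphia")))    # False: no street suffix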

Finding features for classifying document into printable or non-printable

I would like to perform a binary classification of documents (.txt, .pdf, .jpeg, .img, etc.) into two categories: printable and non-printable. Essentially our school runs a free printing service for clubs, but the reality is that many clubs abuse the free printing and end up printing their homework, papers, etc., which amounts to thousands of dollars in ink and paper. Thus we would like to take some unsupervised methods to help limit this by determining whether a document is with high probability not club related (e.g. Biophysics paper, there is no biophysics club!).
So this is a very simple binary classification problem. I am not looking for low-level implementation details or which ML algorithms I should use, but rather how I should discover the relevant features that will then be fed to the training, etc.
My first idea was to gather all the documents that students print in the library. The idea is that if you have actual club printing, you'll do it for free at the club printing center rather than pay for it at the library. That would be a massive dataset, assuming every document printed at the library is assigned the non-printable/club material category. Unfortunately, the school is very liberal and opposed to allowing this due to privacy concerns, so it is not really an option without legal risks.
A similar-minded option would be to collect documents that are tied to courses / school work, e.g. course syllabi, available course documents online (homeworks, papers, etc.) and do feature extraction / selection on these. The assumption is that students would be abusing the printing to generally print material relevant to their studies.
While for .pdf and .txt based documents this approach should have reasonable performance, I am at a loss as to how to classify image-based documents, besides perhaps using the title of the document and other metadata. A clever violator could simply convert all their text documents to image format to circumvent this system. However, that is outside the scope of this question and should be saved for a future question / research. For now the scope is just text-based documents.
Note that there are previous questions on topics similar to this, but mine is very specific and I believe it may pose challenges that something like movie review classification might not have to face.
I just wanted to leave a comment, but it ended up way longer than I imagined.
While this is an interesting problem, I'm not sure ML will get you what you need easily.
Firstly, your classification problem is of the type "A vs. the world", and A isn't strictly defined. Unless you know exactly what kind of material the clubs print, you can't really say whether new material belongs to that class or not.
This will prove particularly difficult when you need to assemble a training set large enough to cover whatever can or cannot be printed. Such a task will be extremely tedious, and as you said, you won't have access to what the clubs usually print, so at best you will have a large class imbalance in your training set.
As the goal is to make the system automated (if there is human interaction anyway, it's faster to check what will be printed than to build an ML algorithm that produces a score a human has to investigate anyway), the number of false positives and false negatives will also be problematic. There will be cases where the clubs won't be able to print things they have the right to.
As you said, you could greatly simplify the problem by classifying Course Material vs. Not Course Material. For that I would look towards BoW (bag of words), because some words are more present than others in papers or course material (anything remotely technical). The number of words as well as the overall size of the file seem like sensible things to extract. The structure is often also particular: it might be a good idea to extract things like "number of lines with less than x words", "number of lines per page", "number of pictures" (if that's something you can extract from the file), and so on.
For pictures, the major thing to check would be whether the document is a scan of something (often they will scan and print course-related things, I guess); for that, the format of the image is already a good indication, but I don't see other features that would be particularly "course related".
So for me: if you can't really define one of your two classes precisely, don't go with classification, or reduce the problem to something you can really define (course-related material).
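A small sketch of what extracting the layout-style features mentioned above might look like for a plain-text document; page and image counts would need a PDF library and are left out here:

def structural_features(text, short_line_words=5):
    # A few of the structural signals suggested above, computed from raw text.
    lines = text.splitlines()
    words = text.split()
    return {
        "n_words": len(words),
        "n_lines": len(lines),
        "n_chars": len(text),
        "short_lines": sum(1 for ln in lines if 0 < len(ln.split()) < short_line_words),
        "avg_words_per_line": len(words) / max(len(lines), 1),
    }

print(structural_features("Problem Set 3\n\nDue Friday\n1. Prove that ..."))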
If you are able to compile a "black list" of documents students are not allowed to print, you can then implement a multi-layer rejection mechanism.
I would suggest these 3 levels:
compare the md5 of the file they want to print against a database of the md5s of all black-listed documents.
if 1) is passed, repeat 1) at the page level rather than the document level (perhaps they want to print just a few pages rather than the entire document).
if 2) is passed, compare the page they want to print with the pages of the black-listed documents using an image similarity method, like SSIM. If you get a high score between the page they want to print and one of the black-listed pages, do not print, and update your md5 database accordingly.
if 3) is passed: print!
A few words about SSIM: this method is quite robust to noise, so even a smart student who added some sort of noise to the image will be caught.
However:
you have to find a proper way to extract a region of interest (ROI) from the page and from the documents in the db (if the two ROIs are in two different areas of the page, SSIM will be negative)
SSIM might be slow! A C implementation is definitely needed here.
I think SSIM is not rotation invariant, hence the check will fail if they print the page upside down (unless you have a smart way to rotate the page).
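A rough sketch of levels 1 and 3 of that pipeline, using hashlib for the md5 check and scikit-image's structural_similarity for the page comparison; the function names, the pre-rendered page images, and the 0.9 threshold are all assumptions:

import hashlib

from skimage.io import imread
from skimage.metrics import structural_similarity as ssim

def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def allowed_to_print(doc_path, page_image_paths, blacklist_md5s, blacklist_page_paths):
    # Level 1: whole-document hash against the black list (level 2, per-page hashes, is analogous).
    if md5_of(doc_path) in blacklist_md5s:
        return False
    # Level 3: compare each rendered page with black-listed pages via SSIM.
    for page in page_image_paths:
        img = imread(page, as_gray=True)
        for banned in blacklist_page_paths:
            banned_img = imread(banned, as_gray=True)
            if img.shape == banned_img.shape and ssim(img, banned_img, data_range=1.0) > 0.9:
                return False
    return True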

Redis optimal hash set entry size

I have some questions regarding the optimal entry size setting for Redis hash sets.
In this example memory-optimization they use 100 hash entries
per key but use hash-max-zipmap-entries 256 ? Why not
hash-max-zipmap-entries 100 or 128?
On the redis website (above link) they used a max hash entry size of
100, but in this post about Instagram, they mention 1000 entries. So
does this mean the optimal setting is a function of the product of
hash-max-zipmap-entries & hash-max-zipmap-value? (i.e. in this case
Instagram has smaller hash values than the memory-optimization example?)
Your comments/clarifications are much appreciated.
The key is, from here:
manipulating the compact versions of these [ziplist] structures can become slow as they grow longer
and
[as ziplists grow longer] fetching/updating individual fields of a HASH, Redis will have to decode many individual entries, and CPU caches won’t be as effective
So to your questions
This page just shows an example, and I doubt the author gave much thought to the exact values. In real life, if you wanted to take advantage of ziplists, and you knew your number of entries per hash was < 100, then setting it at 100, 128 or 256 would make no difference. hash-max-zipmap-entries is only the LIMIT above which you're telling Redis to change the encoding from ziplist to hash.
There may be some truth in your "product of hash-max-zipmap-entries & hash-max-zipmap-value" idea, but I'm speculating. More importantly, you first have to define "optimal" based on what you want to do. If you do lots of HSET/HGETs of single fields in a large ziplist, it will be slower than if you used a hash. But if you never get/update single fields and only ever do HMSET/HGETALL on a key, large ziplists won't slow you down. The Instagram 1000 was THEIR optimal number based on THEIR specific data, use cases, and Redis function call frequencies.
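A quick way to see that limit in action is to watch a hash's encoding flip once it grows past the threshold. A sketch with redis-py against a local instance; note that newer Redis versions name the setting hash-max-listpack-entries, keeping the ziplist name as an alias:

import redis

r = redis.Redis()  # assumes a local Redis instance

# Set the limits this discussion is about.
r.config_set("hash-max-ziplist-entries", 128)
r.config_set("hash-max-ziplist-value", 64)

r.delete("user:1")
r.hset("user:1", mapping={f"field{i}": "x" for i in range(100)})
print(r.object("encoding", "user:1"))  # compact encoding (ziplist/listpack) while under the limits

r.hset("user:1", mapping={f"field{i}": "x" for i in range(200)})
print(r.object("encoding", "user:1"))  # b'hashtable' once the entry limit is exceeded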
You encouraged me to read both links, and it seems that you are asking for a "default value for hash table size".
I don't think it's possible to say that one number is universal for all possibilities. The described mechanism is similar to standard hash mapping. Look at http://en.wikipedia.org/wiki/Hash_table
If you have a small hash table, many different hash values point to the same array position, and the equals method is used to find the item.
On the other hand, a large hash table means a large memory allocation with many empty fields. But this scales well, as the algorithm is O(1) and there is no equals-based search for the item.
In general, the size of the table IMHO depends on the overall count of elements you expect to put into the table, and on the diversity of the keys. I mean, if every hash starts with "0001", not even size=100000 would help you.

Is there a cleverer Ruby algorithm than brute-force for finding correlation in multidimensional data?

My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
MATCHES = {}
for each (PERSON in (people except USER)) do
  for each (RATING that PERSON has made) do
    if (USER has rated the item that RATING refers to) do
      MATCHES[PERSON's id] += difference between PERSON's rating and USER's rating
    end
  end
end
lowest values in MATCHES are the best matches for USER
The problem here is that as the number of items, ratings, and people increases, this code will take a very significant time to run; and ignoring caching for now, this is code that has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!
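To make the KD-Tree idea concrete, here is a minimal sketch in Python (for illustration only, since the thread is about Ruby) using scipy.spatial.KDTree on made-up rating vectors; in practice, missing ratings would need to be imputed or handled separately:

import numpy as np
from scipy.spatial import KDTree

# Each row: one person's ratings for the same d items (values 0-5).
ratings = np.array([
    [3, 4, 1, 5],
    [2, 4, 1, 5],
    [5, 0, 3, 2],
    [3, 3, 2, 5],
])

tree = KDTree(ratings)                      # build the spatial index once
user = np.array([3, 4, 2, 5])               # the person we want matches for
distances, indices = tree.query(user, k=3)  # 3 nearest neighbours
print(indices, distances)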
Technically your task is matching long strings made out of characters of a 5 letter alphabet. This kind of stuff is researched extensively in the area of computational biology. (Typically with 4 letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
Is your data sparse? With ratings, most of the time not every user rates every object.
Naively comparing each object to every other one is O(n*n*d), where d is the number of dimensions. However, a key trick of all the Hadoop solutions is to transpose the matrix and work only on the non-zero values in the columns. Assuming your sparsity is s = 0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 in 100, your computation will theoretically be 10000 times faster.
Note that the resulting data will still be an O(n*n) distance matrix, so strictly speaking the problem is still quadratic.
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched, AFAICT.
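A small sketch of the column-wise (transposed) trick described above, using scipy.sparse; the tiny ratings matrix is made up, and only user pairs that co-rated an item are ever touched:

import numpy as np
from collections import defaultdict
from scipy.sparse import csr_matrix

# Rows = users, columns = items, 0 = "not rated".
ratings = csr_matrix(np.array([
    [5, 0, 3, 0],
    [4, 2, 0, 0],
    [0, 2, 3, 1],
]))

# Walk item by item (column by column) so only users who rated that item are compared.
diffs = defaultdict(int)  # (user_a, user_b) -> sum of absolute rating differences
cols = ratings.tocsc()
for j in range(cols.shape[1]):
    start, end = cols.indptr[j], cols.indptr[j + 1]
    users, values = cols.indices[start:end], cols.data[start:end]
    for a in range(len(users)):
        for b in range(a + 1, len(users)):
            diffs[(users[a], users[b])] += abs(int(values[a]) - int(values[b]))

print(sorted(diffs.items(), key=lambda kv: kv[1]))  # lowest totals = closest pairs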
