How to do K-Means clustering of PDF raw data - machine-learning

I want to cluster PDF documents based on their structure, not only their text content.
The main problem with the text-only approach is that it loses the information about whether a document has a PDF form structure, is just a plain document, or contains pictures.
For our further processing this information is most important.
My main goal is to be able to classify a document mainly by its structure, not only by its text content.
The documents to classify are stored in a SQL database as byte[] (varbinary), so my idea is to use this raw data for classification, without prior text conversion.
If I look at the hex output of this data, I can see repeating structures which seem to correspond to the different document classes I want to separate.
You can see some similar byte patterns as a first impression in my attached screenshot.
So my idea is to train a K-Means model with e.g. a hex output string.
In the next step I would try to find the best number of clusters with the elbow method, which should be around 350 - 500.
The size of the PDF data varies between 20 kB and 5 MB, mostly around 150 kB. To train the model I have more than 30,000 documents.
When I researched this, the results were sparse. I only found this article, which makes me unsure about the best way to solve my task:
https://www.ibm.com/support/pages/clustering-binary-data-k-means-should-be-avoided
My questions are:
Is K-Means the best algorithm for my goal?
What method would you recommend?
How to normalize or transform the data for the best results?
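For reference, a minimal sketch of what the raw-byte idea from the question could look like: normalize each document into a 256-bin byte-frequency histogram (so document length does not dominate) and scan candidate cluster counts for the elbow. Everything here is illustrative, and as the answer below explains, structural features ended up working better.

```python
# Minimal sketch: vectorize raw PDF bytes as normalized byte-frequency
# histograms and scan k with the elbow method. Purely illustrative;
# the blobs are assumed to come from the varbinary column.
import numpy as np
from sklearn.cluster import KMeans

def byte_histogram(blob: bytes) -> np.ndarray:
    """256-bin byte-frequency histogram, normalized by document length."""
    counts = np.bincount(np.frombuffer(blob, dtype=np.uint8), minlength=256)
    return counts / max(len(blob), 1)

def elbow_curve(blobs, k_values):
    X = np.vstack([byte_histogram(b) for b in blobs])
    inertias = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)
    return inertias  # plot against k_values and look for the "elbow"
```

Note that byte histograms throw away ordering, so they mostly capture how much of a file is compressed streams versus plain text, which may or may not separate the document classes you care about.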

As Ian said in the comments, using the raw data seems like a bad idea.
With further research I found that the best solution is to first read the structure of the PDF file, e.g. with an approach like this:
https://github.com/Uzi-Granot/PdfFileAnaylyzer
I normalized and clustered the data based on this structural information, which gave me good results.
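A minimal sketch of what clustering on structural features can look like, assuming you have already extracted per-document counts (pages, images, form fields, fonts, annotations) with PdfFileAnaylyzer or any other PDF parser; the feature columns and numbers below are purely illustrative:

```python
# Illustrative only: cluster documents on structural features
# (counts extracted beforehand), standardized so no single feature dominates.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Each row: one document; columns are hypothetical structural counts,
# e.g. [n_pages, n_images, n_form_fields, n_fonts, n_annotations]
features = np.array([
    [ 2,  0, 12,  3,  5],   # form-like document
    [10,  8,  0,  6,  0],   # image-heavy document
    [ 4,  0,  0,  2,  0],   # plain text document
    # ... one row per document
], dtype=float)

X = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Standardizing the counts keeps a single large feature (such as page count) from dominating the Euclidean distances that K-Means uses.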

Related

Best way to encode categorical data (URLs) in a large dataset in Machine Learning?

I have a large dataset where one of the features is a categorical (nominal) column named URL, which contains different URLs, for example www.google.com, www.facebook.com, www.youtube.com, www.yahoo.com, www.amazon.com, etc. There are more than 500 different URLs in a million rows.
What is the best way to encode this categorical feature so that I can pass the encoded feature to a Logistic Regression model?
I have tried label encoding from sklearn, but it didn't work well, as just labeling the URLs with 1, 2, 3, ... doesn't form any relation between them.
I thought of using one-hot encoding, but it would create 500+ new features for my model and unnecessarily increase its complexity.
The code and data are confidential, so I can't provide them.
Label encoding didn't work well and one-hot encoding will make the model too complex.
I would first ask whether this variable is completely necessary. Is it something that can be dropped?
If it cannot be dropped, I would make a frequency plot of the websites that appear. The websites you mention might show up significantly more often than some other, more obscure websites. I would use the histogram to pick maybe the top 10 or 12 and group everything else, as sketched below.
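A minimal pandas sketch of that idea, keeping the k most frequent URLs and bucketing the rest into an "other" category before one-hot encoding; the column name and value of k are illustrative:

```python
# Illustrative: keep the top-k most frequent URLs, bucket the rest as "other",
# then one-hot encode the much smaller set of categories.
import pandas as pd

def encode_top_k_urls(df: pd.DataFrame, col: str = "url", k: int = 12) -> pd.DataFrame:
    top = df[col].value_counts().nlargest(k).index
    bucketed = df[col].where(df[col].isin(top), other="other")
    return pd.get_dummies(bucketed, prefix=col)

# Example usage:
df = pd.DataFrame({"url": ["www.google.com", "www.yahoo.com", "www.rare-site.com"]})
print(encode_top_k_urls(df, k=2))
```

This keeps the feature count small for logistic regression while still representing the frequent categories explicitly.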

Deep learning - Find patterns combining images and bios data

I was wondering whether it is possible to combine images and some "bios" data to find patterns. For example, if I want to know whether an image is a cat or a dog and I have:
Enough image data to train my model
Enough "bios" data like:
size of the animal
size of the tail
weight
height
Thanks!
Are you looking for a simple yes or no answer? In that case, yes. You are in complete control over building your models which includes what data you make them process and what predictions you get.
If you actually wanted to ask how to do it, it will depend on the specific datasets and application, but one way would be to have two models: one specialized in determining the output label (cat or dog) from the image, so perhaps some kind of simple CNN, and another that processes the tabular data and finds patterns in it. Then at the end you could either have a non-AI evaluator that naively combines these two predictions into one, or feed both models' outputs into a simple neural network that learns a pattern from them.
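A minimal sketch of a closely related variant in Keras, where an image branch and a "bios" branch are merged inside a single model instead of being trained separately; the input shapes, layer sizes, and feature count are all illustrative assumptions:

```python
# Illustrative multi-input model: a small CNN branch for the image and a
# dense branch for the numeric features, merged before the final prediction.
from tensorflow.keras import layers, Model

image_in = layers.Input(shape=(64, 64, 3), name="image")
x = layers.Conv2D(16, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

bios_in = layers.Input(shape=(4,), name="bios")  # size, tail, weight, height
y = layers.Dense(16, activation="relu")(bios_in)

merged = layers.concatenate([x, y])
out = layers.Dense(1, activation="sigmoid", name="cat_or_dog")(merged)

model = Model(inputs=[image_in, bios_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```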
That is just one way to possibly do it though and, as I said, the exact implementation will depend on a lot of other factors. How are both of the datasets labeled? Are the data connected to each other? Meaning, for each picture, do you have some tabular data for that specific image, or do you just have a separate dataset of pictures and a separate dataset of biological information?
There is also the consideration that you'll probably want to make about the necessity of this approach. Current models can predict categories from images with super-human precision. Unless this is an exercise in creating a more complex model, this seems like overkill.
PS: I wouldn't use the term "bios" in this context; I believe it is not very common usage, and here on SO it will mostly confuse people into thinking you mean the actual BIOS.

Machine learning data modeling

I am a beginner in machine learning. I have seen videos that teach machine learning, but my question is: how can we model our data?
Mostly we get unstructured data. How can I convert that unstructured data into a structured format, in the best way, so that we can find the most useful information in the data?
Any help with books or links would be much appreciated.
As a machine learning engineer, you will be responsible for preprocessing your data in a way that makes it acceptable to the model.
There is no single best way to do this; it depends on what type of data you have, such as CSV datasets, text datasets, or files (image & audio).
In the real world the data will not all be in a structured form. When we get the data, the very first things to find out are:
1. What the data is all about.
2. What its features are and what its output is. For example, in a dataset to predict the height of a person you might have information such as country, weight, gender, hair color, etc.; these are what we usually call features in machine learning.
3. What form the features take, such as text or numerical data. We need to pre-process the data before we do any analysis. For example, if a feature is a review text, you need to remove all the special characters and build a corpus from your data.
4. How the model accepts the data, what parameters the model has, and how we can improve the data (we can do some feature engineering to improve the models, etc.).
There is no hard and fast rule that you need to do it in exactly this way.
First, you need to learn about preprocessing and feature extraction. If you build a model in Python, then libraries like Pandas and scikit-learn are very useful. As a first step, try to create sentences like "when x occurs, then my output y becomes ...".
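As an illustration of that kind of preprocessing, here is a minimal scikit-learn pipeline that imputes and scales a numeric column and one-hot encodes a categorical one; the toy columns are made up for the example:

```python
# Illustrative preprocessing: impute and scale numeric columns,
# one-hot encode categorical columns, all in a single pipeline.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "weight": [70.0, None, 55.0],
    "country": ["DE", "US", "DE"],
})

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["weight"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

X = preprocess.fit_transform(df)
print(X)
```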
Before modeling, the data has to be cleaned. There are several methods to clean the data. Go through the link on how to convert data from unstructured data to structured data.
https://www.geeksforgeeks.org/how-to-convert-unstructured-data-to-structured-data-using-python/

Encoding large numbers of categorical variables as input data

One hot encoding doesn't sound like a great idea when you're dealing with hundreds of categories e.g. a data set where one of the columns is "first name". What's the best approach to go about encoding this sort of data?
I recommend the hashing trick:
https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick
It's cheap to compute, easy to use, allows you to specify the dimensionality, and often serves as a very good basis for classification.
For your specific application, I would hash feature-value pairs, like ('FirstName','John'), then increment the bucket for the hashed value.
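A minimal sketch of that with scikit-learn's FeatureHasher, encoding each row's feature-value pair as a "name=value" string; the dimensionality and names are illustrative:

```python
# Illustrative: hash "FirstName=<value>" strings into a fixed-size vector.
from sklearn.feature_extraction import FeatureHasher

rows = [["FirstName=John"], ["FirstName=Mary"], ["FirstName=John"]]
hasher = FeatureHasher(n_features=64, input_type="string")
X = hasher.transform(rows)   # sparse matrix, one row per sample
print(X.shape)               # (3, 64)
```

The number of output features is fixed up front, so adding new names later never changes the input dimensionality of the downstream classifier.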
If you have a large number of categories, a classification algorithm does not work well. Instead, there is a better approach: apply a regression algorithm to the data and then train an offset on those outputs. It would give you better results.
A sample code can be found here.

How to include datetimes and other priority information for clustering?

I want to cluster text. I kinda understand the concept of clustering text-only content from Mahout in Action:
make a mapping (int -> term) of all terms in the input and store into a dictionary
convert all input documents into a normalized sparse vector
do clustering
I want to cluster text as well as other information like date-time, location, people I was with. For example, I want documents made in a 10-day visit to a distant place to be placed into a distinct cluster.
I know I must write my own tool for making vectors from date-time, location, tags and (natural) text. How do I approach this? Should I use built-in tools to vectorize the text and then integrate that output into my own vectors? What about weighting the dimensions?
I can't give you full implementation details, as I'm not sure myself, but I can help you out with a piece of the puzzle. You will almost certainly need some context analysis to extract entities (such as location, time/date, and person names).
For this, take a look at OpenNLP:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
In particular, look at the POS tagger and the name finder.
Once you have extracted the relevant entities, you may be able to do something with them using Mahout classification (once you have extracted enough entities to train your model), but of this I am not sure.
Good luck.
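On the weighting question in the post above, here is a minimal sketch (using scikit-learn rather than Mahout, purely as an illustration) of concatenating TF-IDF text vectors with a scaled date feature, with a hand-chosen weight per block, before clustering:

```python
# Illustrative: combine weighted TF-IDF text features with scaled
# numeric metadata (e.g. a timestamp) into one matrix for K-Means.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

docs = ["trip to the mountains", "notes from the office", "mountain photos"]
timestamps = np.array([[1.60e9], [1.61e9], [1.60e9]])  # e.g. unix time

text_vecs = TfidfVectorizer().fit_transform(docs)
time_vecs = StandardScaler().fit_transform(timestamps)

w_text, w_time = 1.0, 2.0            # hand-tuned block weights
X = hstack([w_text * text_vecs, w_time * time_vecs])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Raising the time weight pulls documents from the same period (say, a 10-day trip) into the same cluster even when their wording differs.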
