Developing a web bot crawler system after clustering the bots - machine-learning

I am trying to identify high-hitting IPs over a period of time.
I have performed clustering on certain features and got 12 clusters, of which 8 were bots and 4 were humans, judging by the centroid values of each cluster.
Now, what technique can I use to analyze the data within each cluster, so as to know whether the data points have been assigned to the right clusters?
In other words, are there any statistical methods to check the quality of the clusters?
What I can think of is this: if I take a data point at the boundary of a cluster and measure its distance from the other centroids as well as from its own centroid, can I tell how close the two clusters are to my point, and perhaps how well my data are divided into clusters?
Kindly guide me on how to measure the quality of my clusters with respect to the data points, and what the standard techniques are for doing so.
Thanks in advance!
Cheers!

With k-means, the chances are that you already have a big heap of garbage. It is an incredibly crude heuristic, and unless you were extremely careful in designing your features (at which point you would already know how to check the quality of a cluster assignment), the result is barely better than picking a few centroids at random. K-means is also very sensitive to the scale of your features: the results are very unreliable if you have features of different types and scales (e.g. height, shoe size, body mass, BMI; k-means on such variables is statistical nonsense).
Do not just dump your data into a clustering algorithm and expect to get something useful. Clustering follows the GIGO principle: garbage in, garbage out. Instead, you need to proceed as follows:
identify what a good cluster is in your domain; this is very data- and problem-dependent.
choose a clustering algorithm with a very similar objective.
find a data transformation, distance function or modification of the clustering algorithm that aligns with your objective.
carefully double-check the result for trivial, unwanted, biased and random solutions.
For example, if you blindly threw customer data into a clustering algorithm, the chances are it would decide the best answer is 2 clusters, corresponding to the attributes "gender=m" and "gender=f", simply because this is the most extreme factor in your data. But because this is a known attribute, the result is entirely useless.
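That said, the check the question sketches (comparing a point's distance to its own centroid with its distance to the other centroids) is essentially the silhouette coefficient. Below is a minimal sketch with scikit-learn, assuming your features are already in a numeric matrix X (the data here is a random placeholder, and 12 clusters mirrors the question):

```python
# A minimal sketch of cluster-quality checks, assuming a numeric feature
# matrix X (rows = IPs, columns = engineered features). Feature scaling
# matters a lot for k-means, hence the StandardScaler step.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.rand(500, 6)                     # placeholder for your real feature matrix

X_scaled = StandardScaler().fit_transform(X)   # put all features on one scale
km = KMeans(n_clusters=12, n_init=10, random_state=0).fit(X_scaled)

# Overall silhouette: close to 1 = well separated, near 0 = overlapping,
# negative = many points likely sit in the wrong cluster.
print("mean silhouette:", silhouette_score(X_scaled, km.labels_))

# Per-point silhouettes let you inspect the boundary points the question
# mentions: values near or below 0 are candidates for misassignment.
sil = silhouette_samples(X_scaled, km.labels_)
for c in range(12):
    print(c, "worst point silhouette:", sil[km.labels_ == c].min())
```

Other standard checks along the same lines include the Davies-Bouldin index and simply inspecting cluster sizes for degenerate or trivial solutions.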

Related

How can I further analyze high frequency data from discrete wavelet transform?

I applied a discrete wavelet transform to horizontal wind speed data to obtain the plot below. I'm basically trying to use the information from the detail coefficient (the turbulent flow) for further analysis, but I'm not sure of the best direction to go in. I don't have much experience with wavelet transforms, so forgive me if there are obvious options, but the examples I've seen usually discard the higher-frequency information since it's the noise of the signal. Is there anything further I can do with this discrete wavelet transform, such as statistical analysis or forecasting?
The path to pursue really depends on the question that you are trying to answer.
First of all, I would suggest double checking that your DWT is actually doing what you expect it to do. The plot that you shared suggests that it is successful in separating the low frequency coherent (laminar?) flow from the high frequency turbulent flow, but it would be helpful to figure out which frequencies are present in the high frequency component in order to confirm that the processing parameters (e.g. decomposition level) were properly chosen.
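A rough sketch of that check (the wavelet name, decomposition level and sampling rate below are assumptions, not values from the question): reconstruct the detail part of the signal and look at its spectrum.

```python
# Hypothetical check of which frequencies survive in the detail (high-pass)
# part of a DWT. Wavelet, level and sampling rate are assumptions.
import numpy as np
import pywt

fs = 1.0                                   # samples per second (assumed)
wind = np.random.randn(8192)               # placeholder for the wind-speed series

coeffs = pywt.wavedec(wind, 'db4', level=4)
# Zero the approximation coefficients, keep the details, and reconstruct:
coeffs_detail = [np.zeros_like(coeffs[0])] + coeffs[1:]
detail_signal = pywt.waverec(coeffs_detail, 'db4')[:len(wind)]

# Inspect the spectrum of the high-frequency component.
spectrum = np.abs(np.fft.rfft(detail_signal))
freqs = np.fft.rfftfreq(len(detail_signal), d=1.0 / fs)
print("frequency of largest detail component:", freqs[spectrum.argmax()])
```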
Once convinced that your wavelet decomposition provides you with useful information about the turbulent flow, what should you do with these high pass filtered data?
I suggest computing their variance over 1 hour long intervals. This is a measure of the "energy" of the signal over the chosen interval. If you are dealing with large amounts of data this would allow you to boil down your time series into a single sample per hour. Maybe you will be able to spot diurnal variations in the turbulent flow (e.g. maybe turbulent flow is higher at dawn). If you have multiple stations it would be interesting to study if the turbulence variations share the same behavior.
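A minimal sketch of that hourly-variance idea with pandas (the timestamps and sampling cadence below are placeholders):

```python
# Hypothetical hourly "energy" summary of the high-pass filtered wind signal.
import numpy as np
import pandas as pd

timestamps = pd.date_range("2023-01-01", periods=8192, freq="10min")  # assumed cadence
detail_signal = np.random.randn(len(timestamps))                      # placeholder

series = pd.Series(detail_signal, index=timestamps)
hourly_variance = series.resample("60min").var()   # one value per hour

# Look for diurnal structure, e.g. average variance by hour of day:
print(hourly_variance.groupby(hourly_variance.index.hour).mean())
```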
Before venturing into time series forecasting, I would really take a closer look at your data and try to identify trends or nail down possible outliers.
Last but not least, I would suggest posting your question on Physics Stack Exchange (e.g. https://physics.stackexchange.com/) rather than on SO.

What is the definition of the terms SPATIAL and TEMPORAL in terms of statistics, data science or machine learning

What is the exact definition of spatial and temporal? I saw in many places people use these two terms, e.g., spatial vector, temporal vector, temporal factor, spatial location.
I was searching on Stack Overflow and found this one: what's the difference between spatial and temporal characterization in terms of image processing?
What I understood so far is that the term spatial is related to space and the term temporal is related to time. Still, it is quite abstract to me. Again, I am also not sure about the uses of these two. So, like the person who asked in the linked question, I want to ask: what do these two terms mean, and why do we care about them?
Spatial data have to do with location-aware information, in other words, data that have coordinates (x, y). A typical example of spatial data is latitude and longitude in geographic datasets. Spatial analyses are the techniques involved in analyzing spatial data. This is a significant component of GIS (Geographic Information Systems/Science).
Temporal data is time-series data. In other words, this is data that is collected as time progresses. Temporal analysis is also known as Time-Series analysis. These are the techniques for analyzing data units that change with time.
I hope this makes these concepts less abstract and more concrete.
Adding to Ekaba's answer, spatial data doesn't necessarily need to be two dimensional either. I'm going to take an example from a medical domain which would have both spatial and temporal elements of data.
If you consider magnetic resonance imaging, it is essentially a 3D volumetric view of an organ (let's say the brain, for clarity). So if you analyse a traditional MRI, it is spatial analysis, and you have 3 dimensions since the data is 3D. There is another MRI modality called DCE-MRI, which is essentially a sequence of MRI volumes captured over time; this is a typical example of a temporal sequence. Let's say a DCE-MRI sequence has 40 MRI volumes captured 20 s apart from each other. If you consider just one volume out of these 40 and analyse it, you are analyzing it spatially, whereas if you consider all 40 (or a subset) of these volumes at the same time, you are analyzing it spatially as well as temporally.
Hope that clarifies things.
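A toy illustration of the shapes involved (the array sizes below are made up, not from any real scanner):

```python
# Hypothetical DCE-MRI sequence: 40 time points, each a 3D volume.
import numpy as np

dce_mri = np.zeros((40, 64, 128, 128))      # (time, z, y, x) - sizes are assumptions

single_volume = dce_mri[0]                  # purely spatial analysis: one 3D volume
voxel_time_course = dce_mri[:, 30, 64, 64]  # purely temporal: one voxel over time
# Using the full 4D array at once is spatio-temporal analysis.
```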
Another similar medical example is ultrasound imaging of a beating heart (2D echocardiography), where the ultrasound image shows the opening and closing movement of the heart valves in real time and the volumetric movement of the heart chambers. With high temporal resolution (around 30 frames per second) it is easy to follow the valves opening and closing accurately. With high spatial resolution it is also easy to differentiate the borders of the heart chambers to provide accurate volumetric blood flow data.

Compare Images By Color Using a Genetic Algorithm

I am in a serious bind right now: I need to compare images of flowers (carnations) using a genetic algorithm, and the program must determine which variety the flower belongs to (so far I am using 15 different varieties). The thing is, I am having difficulty constructing the chromosome. Right now I am only analysing the HSV of each image: I take every channel and calculate its mean (n=255), and then calculate the correlations between H-S, H-V and S-V.
I expected that the means would be enough to place any new flower next to the cluster of flowers of the variety it belongs to (by the way, I have a database of all the flowers used for training purposes) by calculating the distance between the flower's mean and the centroid of each cluster, probably using the correlations for adjustment; but that distance is usually much smaller to a different variety than to the one it should belong to.
Is there a way to classify these flowers using ONLY colours (I've read about applications that use texture, but that's way out of my league), especially using a genetic algorithm? (I know neural networks are more appropriate for this kind of analysis, but that's what the teacher asked for.) Thank you very much. By the way, I am working with OpenCV; I don't know if that's relevant. PS: Excuse my English if I made any mistakes; it is not my native language.
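For reference, here is a rough sketch of the colour features described above (per-channel HSV means plus pairwise channel correlations), using OpenCV and NumPy; the file name is a placeholder, and this is only the feature extraction, not a classifier:

```python
# Sketch of the HSV colour features described in the question:
# per-channel means plus pairwise channel correlations.
import cv2
import numpy as np

def colour_features(path):
    bgr = cv2.imread(path)                       # path is a placeholder
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    h = h.ravel().astype(float)
    s = s.ravel().astype(float)
    v = v.ravel().astype(float)
    means = [h.mean(), s.mean(), v.mean()]
    corrs = [np.corrcoef(h, s)[0, 1],
             np.corrcoef(h, v)[0, 1],
             np.corrcoef(s, v)[0, 1]]
    return np.array(means + corrs)

# A genetic algorithm could then evolve per-feature weights so that the
# weighted distance to the correct variety's centroid is minimised on the
# training database.
features = colour_features("carnation.jpg")      # hypothetical file name
print(features)
```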

need advice on SIFT features - is there such a thing as a good feature?

I am trying out vlfeat; I got a huge number of features from an image database, and I am testing against the ground truth for mean average precision (mAP). Overall, I got roughly 40%. I see that some papers get a higher mAP while using techniques very similar to mine: the standard bag of words.
I am currently looking for a way to obtain a higher mAP with the standard bag-of-words technique. While I see that there are other implementations such as SURF and whatnot, let's stick to the standard Lowe's SIFT and the standard bag of words in this question.
So the thing is this: I see that vl_sift has thresholding to let you be more strict about feature selection. I understand that a higher threshold might net you a smaller, more meaningful list of "good" features and possibly reduce some noisy features. By "good" features I mean that, given the same image with different variations, very similar features are also detected in the other images.
However, how high should we go with this threshold? Sometimes an image returns no features at all with a higher threshold. At first I was thinking of adjusting the threshold until I get a better mAP, but it seems like a bad idea to keep tuning just to find the best mAP for a particular database. So my questions are:
While adjusting the threshold may decrease the number of features, does increasing the threshold always return fewer but better features?
Are there better approaches to obtain the good features?
What are other factors that can increase the rate of obtaining good features?
Have a look at some of the papers put out in response to the Pascal challenge in recent years. The impression they give me is that standard 'feature detection' methods don't work very well with the Bag of Words technique. This makes sense when you think about it: BoW works by pulling together lots of weak, often unrelated features. It's less about detecting a specific object and more about recognizing classes of objects and scenes. As such, putting too much emphasis on normal 'key features' can harm more than help.
That is why we see people using dense grids and even random points as their features. From experience, using one of these methods instead of Harris corners, LoG, SIFT, MSER, or the like has a great positive impact on performance.
To answer your questions directly:
Yes. From the SIFT api:
Keypoints are further refined by eliminating those that are likely to be unstable, either because they are selected near an image edge rather than an image blob, or because they are found on image structures with low contrast. Filtering is controlled by the following:
Peak threshold. This is the minimum amount of contrast to accept a keypoint. It is set by configuring the SIFT filter object by vl_sift_set_peak_thresh().
Edge threshold. This is the edge rejection threshold. It is set by configuring the SIFT filter object by vl_sift_set_edge_thresh().
You can see examples of the two thresholds in action in the 'Detector parameters' section here.
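If it helps to experiment outside of MATLAB, OpenCV's SIFT exposes analogous knobs: contrastThreshold roughly plays the role of the peak threshold and edgeThreshold the edge rejection threshold. This is an OpenCV analogue for illustration, not vlfeat's API, and the threshold values and file name are made up:

```python
# Rough OpenCV analogue of vlfeat's peak/edge thresholds (not the vlfeat API).
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder image

# Higher contrastThreshold / lower edgeThreshold -> fewer, more stable keypoints.
sift = cv2.SIFT_create(contrastThreshold=0.08, edgeThreshold=8)
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), "keypoints kept")
```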
Research suggests features densely selected from the scene yield more descriptive 'words' than those selected using more 'intelligent' methods (eg: SIFT, Harris, MSER). Try your Bag of Words pipeline with vl_feat's DSIFT or PHOW implementation. You should see a great improvement in performance (assuming your 'word' selection and classification steps are tuned well).
After a dense set of feature points, the biggest breakthrough in this field seems to have been the 'Spatial Pyramid' approach. This increases the number of words produced for an image, but adds a location aspect to the features - something inherently lacking in Bag of Words. After that, make sure your parameters are well tuned: which feature descriptor you're using (SIFT, HOG, SURF, etc.), how many words are in your vocabulary, which classifier you're using, etc. Then... you're in active research land. Enjoy =)
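For concreteness, here is a minimal Python sketch of the dense-sampling idea: an OpenCV stand-in for vlfeat's DSIFT/PHOW rather than their actual implementation, with a made-up grid step, vocabulary size and file name.

```python
# Dense keypoint grid + SIFT descriptors + k-means vocabulary: a sketch of a
# bag-of-words pipeline. Grid step and vocabulary size are assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(gray, step=8, size=8):
    # Place keypoints on a regular grid instead of using a detector.
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(step, gray.shape[0] - step, step)
                 for x in range(step, gray.shape[1] - step, step)]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(gray, keypoints)
    return descriptors

gray = cv2.imread("train_image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder
descs = dense_sift(gray)

# Build a visual vocabulary from descriptors pooled over the training set,
# then describe each image as a normalized histogram of word assignments.
vocab = KMeans(n_clusters=200, n_init=4, random_state=0).fit(descs)
words = vocab.predict(descs)
bow_vector = np.bincount(words, minlength=200).astype(float)
bow_vector /= bow_vector.sum()
print(bow_vector.shape)
```

In practice the vocabulary would be clustered over descriptors from the whole training set, not a single image; the single-image version here just keeps the sketch self-contained.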

machine learning - svm feature fusion technique

For my final thesis I am trying to build a 3D face recognition system by combining color and depth information. The first step I did is to realign the data head to a given model head using the iterative closest point algorithm. For the detection step I was thinking about using libsvm, but I don't understand how to combine the depth and the color information into one feature vector. They are dependent pieces of information (each point consists of color (RGB), depth information and also scan quality). What do you suggest doing? Something like weighting?
Edit:
Last night I read an article about SURF/SIFT features and I would like to use them! Could it work? The concept would be the following: extract these features from the color image and the depth image (range image), and use each feature as a single feature vector for the SVM?
Concatenation is indeed a possibility. However, as you are working on 3D face recognition, you should have a strategy for how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application you will very likely have to employ some multi-class strategy (one-against-all or one-against-one).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It may sound too simple, but you can just concatenate the two vectors into one. Many researchers do this.
What you have arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example, you can concatenate and then do PCA (or some non-linear dimensionality reduction method). But it is hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. This alone might be unreasonable for vision data that may have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
SVM is characterized by the normal to the maximum-margin hyperplane, where its components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus SVM assigns weights to each feature all on its own.
In order for this to work, obviously you would have to normalize all the attributes to have the same scale (say, transform all features to be in the range [-1, 1] or [0, 1]).
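A minimal sketch of that concatenate-normalize-train recipe with scikit-learn (whose SVC wraps libsvm and handles multi-class via one-vs-one internally); the feature dimensions and labels below are placeholders:

```python
# Hedged sketch: concatenate colour and depth feature vectors, scale them to a
# common range, and train a multi-class SVM. Shapes and labels are made up.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

n_samples = 200
color_features = np.random.rand(n_samples, 64)   # placeholder RGB-based features
depth_features = np.random.rand(n_samples, 32)   # placeholder range-image features
labels = np.random.randint(0, 5, n_samples)      # placeholder identities

X = np.hstack([color_features, depth_features])            # simple concatenation
X = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)    # same scale for every feature

# SVC uses libsvm under the hood and applies one-vs-one for multi-class problems.
clf = SVC(kernel="rbf", C=1.0).fit(X, labels)
print(clf.predict(X[:5]))
```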
