Text clustering within a log file - machine-learning

I am working on a problem of finding similar content in a log file. Let's say I have a log file which looks like this:
show version
Operating System (OS) Software
Software
BIOS: version 1.0.10
loader: version N/A
kickstart: version 4.2(7b)
system: version 4.2(7b)
BIOS compile time: 01/08/09
kickstart image file is: bootflash:/m9500-sf2ek9-kickstart-mz.4.2.7b.bin
kickstart compile time: 8/16/2010 13:00:00 [09/29/2010 23:10:48]
system image file is: bootflash:/m9500-sf2ek9-mz.4.2.7b.bin
system compile time: 8/16/2010 13:00:00 [09/30/2010 00:46:36]`
Hardware
xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")
xxxxxxx, xxxx with 1033100 kB of memory.
Processor Board ID xxxx
Device name: xxx-xxx-1
bootflash: 1000440 kB
slot0: 0 kB (expansion flash)
For a human eye, it can easily be understood that "Software" and the data below is a section and "Hardware" and the data below is another section. Is there a way I can model using machine learning or some other technique to cluster similar sections based on a pattern? Also, I have shown 2 similar kinds of pattern but the patterns between sections might vary and hence should identify as different section. I have tried to find similarity using cosine similarity but it doesn't help much because the words aren't similar but the pattern is.

I see actually two separate machine learning problems:
1) If I understood you correctly the first problem you want to solve is the problem to split each log into distinct section, so one for Hardware, one for Software etc.
In order to achieve this one approach could be try to extract heading which mark the beginning of a new section. In order to do so you could manually label a set of different logs and label each row as heading=true, heading= false
No you could try to train a classifier which takes your labeled data as an input and the result could be a model.
2) Now that you have this different sections, you can split each log into those section and treat each section as a separate document.
Now I would first try a straigt-forward document clustering using a standard nlp pipeline:
Tokenize your document to get the tokens
Normalize them (maybe stemming is not the best idea for logs)
Create for each document a tf-idf vector
Start with a simple clustering algorithm like k-means to try to cluster the different section
After the clustering you should have the section similar to each other in the same cluster
I hope this helped, I think especially the first task is quit hard and maybe hand-tailored patterns will perform better.

Related

Applying MACHINE learning in biological text data

I am trying to solve the following question - Given a text file containing a bunch of biological information, find out the one gene which is {up/down}regulated. Now, for this I have many such (60K) files and have annotated some (1000) of them as to which gene is {up/down}regulated.
Conditions -
Many sentences in the file have some gene name mention and some of them also have neighboring text that can help one decide if this is indeed the gene being modulated.
Some files also have NO gene modulated. But these still have gene mentions.
Given this, I wanted to ask (having absolutely no background in ML), what sequence learning algorithm/tool do I use that can take in my annotated (training) data (after probably converting the text to vectors somehow!) and can build a good model on which I can then test more files?
Example data -
Title: Assessment of Thermotolerance in preshocked hsp70(-/-) and
(+/+) cells
Organism: Mus musculus
Experiment type: Expression profiling by array
Summary: From preliminary experiments, HSP70 deficient MEF cells display moderate thermotolerance to a severe heatshock of 45.5 degrees after a mild preshock at 43 degrees, even in the absence of hsp70 protein. We would like to determine which genes in these cells are being activated to account for this thermotolerance. AQP has also been reported to be important.
Keywords: thermal stress, heat shock response, knockout, cell culture, hsp70
Overall design: Two cell lines are analyzed - hsp70 knockout and hsp70 rescue cells. 6 microarrays from the (-/-)knockout cells are analyzed (3 Pretreated vs 3 unheated controls). For the (+/+) rescue cells, 4 microarrays are used (2 pretreated and 2 unheated controls). Cells were plated at 3k/well in a 96 well plate, covered with a gas permeable sealer and heat shocked at 43degrees for 30 minutes at the 20 hr time point. The RNA was harvested at 3hrs after heat treatment
Here my main gene is hsp70 and it is down-regulated (deducible from hsp(-/-) or HSP70 deficient). Many other gene names are also there like AQP.
There could be another file with no gene modified at all. In fact, more files have no actual gene modulation than those who do, and all contain gene name mentions.
Any idea would be great!!
If you have no background in ML I suggest buying a product like this one, this one or this one. These products where in development for decades with team budgets in millions.
What you are trying to do is not that simple. For example a lot of papers contain negative statements by first citing the original statement from another paper and then negating it. In your example how are you going to handle this:
AQP has also been reported to be important by Doe et al. However, this study suggest that this might not be the case.
Also, if you are looking into large corpus of biomedical research papers, or for this matter any corpus of research papers. You will find tons of papers that suggest something for example gene being up-regulated or not, and then there is one paper published in Cell magazine that all previous research has been mistaken.
To make matters worse, gene/protein names are not that stable. Besides few famous ones like P53. There is a bunch of run of the mill ones that are initially thought that they are one gene, but later it turns out that these are two different things. When this happen there are two ways community handles it. Either both of the genes get new names (usually with some designator at the end) or if the split is uneven the larger class retains original name and the second one gets the new name. To compound this problem, after this split happens not all researchers get the memo at instantly, so there is still stream of publications using old publication.
These are just two simple problems, there are 100s of these.
If you are doing this for personal enrichment. Here are some suggestions:
Build a language model on biomedical papers. Existing language models are usually built from news-wire sources or from social media data. All three of the corpora claim to be written in English language. But in reality these are three different languages with their own grammar and vocabulary
Look into things like embeddings and word2vec.
Look into Kaggle competitions, this is somewhat popular topic there.
Subscribe to KDD and BIBM magazines or find them in nearby library. There are 100s of papers on this subject.

Multi-variable Recommender System

I went through tutorials on implementing Recommender System and most of them takes one variable (rank).
I want to implement an Item-Based Recommender System which takes multiple variables.
Eg : Let's say an Item (bar) has following varables (values ranging from -10 to +10, to express opposite polarities)
- price (cheap to expensive)
- environment (casual to fine)
- age range (young to adults)
Now I want to recommend items (bar) looking at the list of bars registered in User's history.
Is this kind of a "multi dimensional recommender system" possible to implement using Mahout or any other framework ?
You want the multi-modal, multi-indicator, multi-variable, how every you want to describe it—Universal Recommender. It can handle all this data. We've tested it on real datasets and get significant boost in precision test because of what we call "secondary indicators".
Good intuition. Give the UR a look: blog.actionml.com, check out the slides in one post. Code here: https://github.com/actionml/template-scala-parallel-universal-recommendation/tree/v0.3.0 Built on the new Spark version of Mahout: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html

In Weka make a arff file from text file

In naive byes classifier i want to find out the accuracy from my train and test. But my train set is like
Happy: absolution abundance abundant accolade accompaniment accomplish accomplished achieve achievement acrobat admirable admiration adorable adoration adore advance advent advocacy aesthetics affection affluence alive allure aloha
Sad: abandon abandoned abandonment abduction abortion abortive abscess absence absent absentee abuse abysmal abyss accident accursed ache aching adder adrift adultery adverse adversity afflict affliction affront aftermath aggravating
Angry: abandoned abandonment abhor abhorrent abolish abomination abuse accursed accusation accused accuser accusing actionable adder adversary adverse adversity advocacy affront aftermath aggravated aggravating aggravation aggression aggressive aggressor agitated agitation agony alcoholism alienate alienation
For test set
data: Dec 7, 2014 ... This well-known nursery rhyme helps children practice emotions, like happy, sad, scared, tired and angry. If You're Happy and You Know It is ...
Now the problem is how do i convert them into arff file
Your training set is not appropriate for training a model for Weka however these information can be used in feature extraction.
Your Test set can be converted into an arff file. From every message extract these basics features like
1. Any form of the word 'Happy' is present or not
2. Any form of the word 'Sad' is present or not
3. Any form of the word 'Angry' is present or not
4. TF-IDF
etc.
then for some messages (say 70%) you should assign one class {Happy, Sad, Angry} manually and for remaining 30% you can test through your model.
More about arff file is given here:
http://www.cs.waikato.ac.nz/ml/weka/arff.html
Where to start ;).
As written before your "training data" is not real training data. Training data should be texts similar to the data you are using for Testing. However, in your example it is merely a list of words. My gut feeling is that you would be better of to avoid using weka, count the number of occurrence in each category, and the take the one with most matches.
In case you want use Weka I'd recommend to use the toolbox https://www.knime.org which nicely integrates with weka.
You should then convert your data into a bag of words representation. This is basically you have the number of times each word occurs in each of the texts as features.
Also for this Knime has nice package. http://www.tech.knime.org/files/KNIME-TextProcessing-HowTo.pdf

Speaker adaptation with HTK

I am trying to adapt a monophone-based recogniser to a specific speaker. I am using the recipe given in HTKBook 3.4.1 section 3.6.2. I am getting stuck on the HHEd part which I am invoking like sp:
HHEd -A -D -T 1 -H hmm15/hmmdefs -H hmm15/macros -M classes regtree.hed monophones1eng
The error I end up with is as follows:
ERROR [+999] Components missing from Base Class list (2413 3375)
ERROR [+999] BaseClass check failed
The folder classes contains the file global which has the following contents:
~b ‘‘global’’
<MMFIDMASK> *
<PARAMETERS> MIXBASE
<NUMCLASSES> 1
<CLASS> 1 {*.state[2-4].mix[1-25]}
The hmmdefs file within hmm15 had some mixture components (I am using 25 mixture components per state of each phone) missing. I tried to "fill in the blanks" by giving in mixture components with random mean and variance values but zero weigths. This too has had no effect.
The hmms are left-right hmms with 5 states (3 emitting), each state modelled by a 25 component mixture. Each component in turn is modelled by an MFCC with EDA components. There are 46 phones in all.
My questions are:
1. Is the way I am invoking HHEd correct? Can it be invoked in the above manner for monophones?
2. I know that the base class list (rtree.base must contain every single mixture component, but where do I find these missing mixture components?
NOTE: Please let me know in case more information is needed.
Edit 1: The file regtree.hed contains the following:
RN "models"
LS "stats_engOnly_3_4"
RC 32 "rtree"
Thanks,
Sriram
They way you invoke HHEd looks fine. The components are missing as they have become defunct. To deal with defunct components read HTKBook-3.4.1 Section 8.4 page 137.
Questions:
- What does regtree.hed contain?
- How much data (in hours) are you using? 25 mixtures might be excessive.
You might want to use a more gradual increase in mixtures - MU +1 or MU +2 and limit the number of mixtures (a guess: 3-8 depending on training data amount).

How to use libsvm for text classification?

I'd like to write a spam filter program with SVM and I choose libsvm as the tool.
I got 1000 good mails and 1000 spam mails, then I classify them into :
700 good_train mails 700 spam_train mails
300 good_test mails 300 spam_test mails
Then I wrote a program to count the time of each words occur in each file, got result like:
good_train_1.txt:
today 3
hello 7
help 5
...
I learned that libsvm needs format like:
1 1:3 2:1 3:0
2 1:3 2:3 3:1
1 1:7 3:9
as its input. I know that 1, 2, 1 is the label, but what does 1:3 mean?
How could I transfer what I've got to this format?
Likely, the format is
classLabel attribute1:count1 ... attributeN:countN
N is the total number of different words in your text corpus. You will have to check the documentation for the tool you are using(or its sources), to see if you can use a sparser format by not including the attributes having count 0.
How could I transfer what I've got to this format?
Here's how I would do this. I would use the script you've got to compute the count of words for each mail in the training set. Then, use another script and transfer that data into the LIBSVM format that you've shown earlier. (This can be done in a variety of ways, but it should be reasonable to write with an easy input/output language like Python) I would batch all "good-mail" data into one file, and label that class as "1". Then, I would do the same process with the "spam-mail" data and label that class "-1". As nologin said, LIBSVM requires the class label to precede the features, but the features themselves can be any number as long as they are in ascending order, e.g. 2:5 3:6 5:9 is allowed, but not 3:23 1:3 7:343.
If you're concerned that your data is not in the correct format, use their script
checkdata.py
before training and it should report any possible errors.
Once you have two separate files with data in the correct format, you can call
cat file_good file_spam > file_training
and generate a training file that contains data on both good and spam mail. Then, do the same process with the testing set. One psychological advantage with forming the data this way is that you know the top 700 (or 300) mail in the training (or testing) set is good mail, and the remaining are spam mail. This makes it easier to create other scripts you may want to act on the data, such as a precision/recall code.
If you have other questions, the FAQ at http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html should be able to answer a few, as well as the various README files that come with installation. (I personally found the READMEs in the "Tools" and "Python" directories to be a great boon.) Sadly, the FAQ does not touch much on what nologin said, about data being in a sparse format.
On a final note, I doubt that you need to keep counts of every possible word that could appear in mail. I would recommend counting only the most common words you would suspect to appear in spam mail. Other potential features include total word count, average word length, average sentence length, and other possible data that you feel may be helpful.

Resources