Speaker adaptation with HTK

I am trying to adapt a monophone-based recogniser to a specific speaker. I am using the recipe given in HTKBook 3.4.1 section 3.6.2. I am getting stuck on the HHEd part, which I am invoking like so:
HHEd -A -D -T 1 -H hmm15/hmmdefs -H hmm15/macros -M classes regtree.hed monophones1eng
The error I end up with is as follows:
ERROR [+999] Components missing from Base Class list (2413 3375)
ERROR [+999] BaseClass check failed
The folder classes contains the file global which has the following contents:
~b "global"
<MMFIDMASK> *
<PARAMETERS> MIXBASE
<NUMCLASSES> 1
<CLASS> 1 {*.state[2-4].mix[1-25]}
The hmmdefs file within hmm15 had some mixture components missing (I am using 25 mixture components per state of each phone). I tried to "fill in the blanks" by adding mixture components with random mean and variance values but zero weights. This too has had no effect.
The HMMs are left-right HMMs with 5 states (3 emitting), each state modelled by a 25-component mixture. The features are MFCCs with energy, delta and acceleration (E, D, A) components. There are 46 phones in all.
My questions are:
1. Is the way I am invoking HHEd correct? Can it be invoked in the above manner for monophones?
2. I know that the base class list (rtree.base) must contain every single mixture component, but where do I find these missing mixture components?
NOTE: Please let me know in case more information is needed.
Edit 1: The file regtree.hed contains the following:
RN "models"
LS "stats_engOnly_3_4"
RC 32 "rtree"
Thanks,
Sriram

The way you invoke HHEd looks fine. The components are missing because they have become defunct. To deal with defunct components, read HTKBook 3.4.1 Section 8.4, page 137.
Questions:
- What does regtree.hed contain?
- How much data (in hours) are you using? 25 mixtures might be excessive.
You might want to use a more gradual increase in mixtures - MU +1 or MU +2 - and limit the number of mixtures (a guess: 3-8, depending on the amount of training data); a sketch of one such step is below.
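For example, one incremental step could look like this (mix_up.hed and the hmm16 output directory are made-up names here). Put a single MU command in a small edit script, reusing the same item list as your base class:
MU +2 {*.state[2-4].mix}
then apply it with HHEd and re-estimate with HERest a few times before the next increase:
HHEd -A -D -T 1 -H hmm15/hmmdefs -H hmm15/macros -M hmm16 mix_up.hed monophones1eng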

Renaming a Telegraf measurement with a wildcard

I'm collecting measurements with a Telegraf client. Unfortunately, the measurement name isn't static. Rather, it encodes a timestamp (terrible design choice, but out of my hands) as part of its name.
For example, the following 3 lines represent 3 instances of the same measurement, but have different names:
info.quorum.2902864.agree: 6
info.quorum.2902865.agree: 6
info.quorum.2902866.agree: 5
...
Is there a way to transform these measurement names into one static name? In other words, I'd like to transform the entries above to:
info.quorum.hello.agree: 6
info.quorum.hello.agree: 6
info.quorum.hello.agree: 5
I saw the rename processor (https://github.com/influxdata/telegraf/tree/master/plugins/processors/rename) - but that doesn't support wildcard.
I also saw the regex processor (https://github.com/influxdata/telegraf/tree/master/plugins/processors/regex) but that doesn't support measurement names.
Any ideas on how to get this done?
EDIT: Some background: the measurements are collected with the http input, using a GJSON path like a.b.*.c
EDIT2: Here's what I'm trying to parse. The problem is with the key '2931747', which grows on each subsequent reading:
"quorum" : {
"2931747" : {
"agree" : 8,
"disagree" : 0,
}
So they put the actual value there as the key... Let's just say that's not a smart design.
And I won't blame the writers of the JSON parser for not handling that situation.
So, the answer is: in its current form, with the HTTP plugin and the available parsers & processors, there's no way to shape it into any proper form (unless you can drop that damned number key completely - then it's trivial).
I would suggest you push on the data providers to make them stop this stupidity.
If that is not an option - you need to write your own processor for that, alas.
It could either be fully standalone (poll the HTTP endpoint, parse the response, form a batch of line protocol records, send them to InfluxDB) - or it could stop at producing Influx line protocol lines as its output and be executed via the Exec input plugin.
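For illustration, a minimal standalone sketch in Python (the URL, measurement name and field handling are assumptions - adapt them to your endpoint) that prints Influx line protocol on stdout, suitable for the Exec input plugin:
import json
import time
import urllib.request

URL = "http://localhost:8080/info"   # placeholder for the real endpoint you currently poll

def main():
    with urllib.request.urlopen(URL) as resp:
        doc = json.loads(resp.read())
    ts = time.time_ns()
    # Fold every changing numeric key into one static measurement name.
    # The question shows a document rooted at "quorum"; adjust if it is nested deeper.
    for _, fields in doc.get("quorum", {}).items():
        agree = int(fields.get("agree", 0))
        disagree = int(fields.get("disagree", 0))
        print(f"info.quorum.hello agree={agree}i,disagree={disagree}i {ts}")

if __name__ == "__main__":
    main()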

How to change data length parameter in maxima software?

I need to use maxima software to deal with data. I try to read data from a text file constructed as
1 2 3
11 22 33
etc.
The following commands load the data successfully:
load(numericalio);
read_matrix("path to the file");
The problem arises when I apply them to a more realistic (larger) data set. In this case the message "Expression longer than allowed by the configuration setting" appears.
How can I overcome this problem? I cannot see any relevant option in the configuration menu. I would be grateful for advice.
I ran into the same error message today, and it seems to be related to the size of the output that wxMaxima receives from the Maxima executable.
If you wish to display the output regardless, you can change it in the configuration here:
Edit>Configure>Worksheet>Show long expressions
Note that showing a massive expression or amount of data may dramatically slow the program down, so consider hiding the output (use a $ instead of a ; at the end of your lines) if you don't need to visualize the data.
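For example, the same commands as above with their output suppressed:
load(numericalio)$
M : read_matrix("path to the file")$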

Text clustering within a log file

I am working on a problem of finding similar content in a log file. Let's say I have a log file which looks like this:
show version
Operating System (OS) Software
Software
BIOS: version 1.0.10
loader: version N/A
kickstart: version 4.2(7b)
system: version 4.2(7b)
BIOS compile time: 01/08/09
kickstart image file is: bootflash:/m9500-sf2ek9-kickstart-mz.4.2.7b.bin
kickstart compile time: 8/16/2010 13:00:00 [09/29/2010 23:10:48]
system image file is: bootflash:/m9500-sf2ek9-mz.4.2.7b.bin
system compile time: 8/16/2010 13:00:00 [09/30/2010 00:46:36]
Hardware
xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")
xxxxxxx, xxxx with 1033100 kB of memory.
Processor Board ID xxxx
Device name: xxx-xxx-1
bootflash: 1000440 kB
slot0: 0 kB (expansion flash)
For a human eye, it is easy to see that "Software" and the data below it form one section and "Hardware" and the data below it form another. Is there a way I can model this using machine learning or some other technique to cluster similar sections based on a pattern? Also, I have shown 2 similar kinds of patterns, but the patterns between sections might vary and hence should be identified as different sections. I have tried to find similarity using cosine similarity, but it doesn't help much because the words aren't similar, although the pattern is.
I actually see two separate machine learning problems:
1) If I understood you correctly, the first problem you want to solve is splitting each log into distinct sections, one for Hardware, one for Software, etc.
One approach to achieve this could be to extract headings which mark the beginning of a new section. To do so, you could manually label a set of different logs, marking each row as heading=true or heading=false.
Then you could train a classifier which takes your labeled data as input and produces a model; a rough sketch of this idea follows.
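A rough sketch of that idea in Python (scikit-learn is my choice of library here, and the labeled lines are tiny made-up examples taken from the log above - a real training set would need many more labeled rows):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labeled rows: 1 = heading, 0 = ordinary log line
lines = ["Software", "Hardware", "BIOS: version 1.0.10", "bootflash: 1000440 kB"]
labels = [1, 1, 0, 0]

# Character n-grams capture the "shape" of a line better than whole words
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(lines, labels)
print(model.predict(["Processor Board ID xxxx", "Memory"]))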
2) Now that you have these different sections, you can split each log into them and treat each section as a separate document.
I would then first try a straightforward document clustering using a standard NLP pipeline:
Tokenize your document to get the tokens
Normalize them (maybe stemming is not the best idea for logs)
Create a tf-idf vector for each document
Start with a simple clustering algorithm like k-means to try to cluster the different sections
After clustering, the sections that are similar to each other should end up in the same cluster; a sketch of this pipeline follows.
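A minimal sketch with scikit-learn (again my choice of library; the section strings stand in for the sections extracted in step 1):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Each entry stands for one extracted section treated as a document
sections = [
    "Software BIOS: version 1.0.10 loader: version N/A",
    "system: version 4.2(7b) kickstart: version 4.2(7b)",
    "Hardware xxxx MDS 9509 (9 Slot) Chassis",
    "Processor Board ID xxxx bootflash: 1000440 kB",
]

vectors = TfidfVectorizer().fit_transform(sections)      # tokenize + tf-idf
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(labels)   # sections with the same label ended up in the same cluster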
I hope this helps. I think the first task especially is quite hard, and maybe hand-tailored patterns will perform better.

F# - Organisation of algorithms in a file

I cannot find a good way to organize various algorithms. Today the file looks like this:
1/ Extraction of values from Excel
2/ First algorithm based on these values (extracted from Excel) starting with
"let matriceAlgo1 ="
3/ Second algorithm starting from the same values
"let matriceAlgo2 ="
4/ Synthesis algorithm, doing a weighted average (depending on several values) of the 2/ and 3/ and selecting the result to be shown.
"let matriceSynthesis ="
My question is the following: what should I put before the different parts of this file in order to just call them by their name? I have seen answers explaining that a module could be the answer, but I don't know how to apply it in my case (or anything else if that is not the right answer). In the end, I would like to be able to write something like this:
"launch Extraction
launch First Algorithm
launch Second Algorithm
Launch Synthesis"
The way I usually organize files is to have some clear visual separator between different sections of a file (see for example Crawler.fsx on GitHub) and then have one "main" section at the end that calls functions declared previously.
I don't really use modules unless I have a large number of functions with clashing names. It would be a good idea to use modules if your algorithm consists of more functions (e.g. Alg1.initialize, Alg1.run, etc.). Then you could easily switch between different algorithms using a module alias:
module Alg = Alg1 // or Alg2
let a = Alg.initialize
Alg.run a
If the file is getting longer, you could also move sections to separate files and use #load "File.fs" to load algorithms or functions from a file. In that case, you probably need to use modules, but you can always open the module after loading the file; a small sketch is below.
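For example, a minimal sketch (the file name Algorithms.fs and its module name are made up for illustration):
#load "Algorithms.fs"
open Algorithms            // assuming Algorithms.fs starts with: module Algorithms
let result = matriceSynthesis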

How to use libsvm for text classification?

I'd like to write a spam filter program with SVM and I choose libsvm as the tool.
I got 1000 good mails and 1000 spam mails, then I split them into:
700 good_train mails 700 spam_train mails
300 good_test mails 300 spam_test mails
Then I wrote a program to count the number of times each word occurs in each file, and got results like:
good_train_1.txt:
today 3
hello 7
help 5
...
I learned that libsvm needs a format like:
1 1:3 2:1 3:0
2 1:3 2:3 3:1
1 1:7 3:9
as its input. I know that 1, 2, 1 are the labels, but what does 1:3 mean?
How could I transfer what I've got to this format?
Likely, the format is
classLabel attribute1:count1 ... attributeN:countN
N is the total number of different words in your text corpus. You will have to check the documentation for the tool you are using (or its sources) to see if you can use a sparser format by not including the attributes having count 0.
How could I transfer what I've got to this format?
Here's how I would do this. I would use the script you've got to compute the count of words for each mail in the training set. Then, use another script to transfer that data into the LIBSVM format you've shown earlier. (This can be done in a variety of ways, but it should be reasonable to write with an easy input/output language like Python; a rough sketch is below.) I would batch all "good-mail" data into one file, and label that class as "1". Then, I would do the same process with the "spam-mail" data and label that class "-1". As nologin said, LIBSVM requires the class label to precede the features, but the features themselves can be any number as long as they are in ascending order, e.g. 2:5 3:6 5:9 is allowed, but not 3:23 1:3 7:343.
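For example, a rough Python sketch of the conversion (the spam file naming, the output file names and the vocabulary handling are just illustrative):
# Convert per-mail word-count files ("word count" per line) into LIBSVM lines.
def to_libsvm_line(label, counts_file, vocab):
    indexed = {}
    with open(counts_file) as f:
        for line in f:
            word, count = line.split()
            # Assign each new word the next attribute index (1-based)
            idx = vocab.setdefault(word, len(vocab) + 1)
            indexed[idx] = int(count)
    features = " ".join(f"{i}:{c}" for i, c in sorted(indexed.items()))
    return f"{label} {features}"

vocab = {}  # shared word -> attribute index map across all mails
with open("file_good", "w") as out:
    for i in range(1, 701):
        out.write(to_libsvm_line(1, f"good_train_{i}.txt", vocab) + "\n")
with open("file_spam", "w") as out:
    for i in range(1, 701):
        out.write(to_libsvm_line(-1, f"spam_train_{i}.txt", vocab) + "\n")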
If you're concerned that your data is not in the correct format, use their script
checkdata.py
before training and it should report any possible errors.
Once you have two separate files with data in the correct format, you can call
cat file_good file_spam > file_training
and generate a training file that contains data on both good and spam mail. Then, do the same process with the testing set. One psychological advantage of forming the data this way is that you know the first 700 (or 300) mails in the training (or testing) file are good mail, and the remaining are spam. This makes it easier to create other scripts you may want to run on the data, such as precision/recall code.
If you have other questions, the FAQ at http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html should be able to answer a few, as well as the various README files that come with installation. (I personally found the READMEs in the "Tools" and "Python" directories to be a great boon.) Sadly, the FAQ does not touch much on what nologin said, about data being in a sparse format.
On a final note, I doubt that you need to keep counts of every possible word that could appear in mail. I would recommend counting only the most common words you would suspect to appear in spam mail. Other potential features include total word count, average word length, average sentence length, and other possible data that you feel may be helpful.
