Exception in thread "main" java.lang.IllegalArgumentException: Wrong number of attributes in the string (Mahout)

I am trying to create a file descriptor using the command:
$ MAHOUT_HOME/core/target/mahout-core--job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
The command is from the link
https://mahout.apache.org/users/classification/partial-implementation.html and I am running it on my own data file. But whichever data file I take, and however I change the attribute descriptor string N 3 C 2 N C 4 N C 8 N 2 C 19 N L,
I get the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong number of attributes in the string
Please help!

There are a few reasons why you might get an error like that:
Wrong descriptor: Putting this here for the sake of completeness; you have probably already checked it. The descriptor you gave simply does not match the data. Re-check the number and types of the columns and describe them correctly.
Bad separator: Re-check the delimiter used in the data. A wrongly placed delimiter in some records can also cause trouble, so make sure the delimiters are consistent.
Special characters: In my experiments I have noticed that Mahout does not cope well with certain special characters, or with data containing characters from languages other than English (unless, of course, you tweak the code). So make sure you have a way of handling them, and you should be good to go.
Anyway, all of this effort is just so you can create a descriptor of the data. All the best.
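A quick way to sanity-check the last two points, offered only as a sketch in plain Python (the file path is a placeholder): scan the file and flag rows whose field count differs from the first row's, as well as rows containing non-ASCII characters.
def check_data_file(path, sep=","):
    expected = None
    with open(path, encoding="utf-8", errors="replace") as f:
        for line_no, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split(sep)
            if expected is None:
                expected = len(fields)  # use the first row as the reference
            if len(fields) != expected:
                print("line %d: %d fields, expected %d" % (line_no, len(fields), expected))
            if not line.isascii():
                print("line %d: contains non-ASCII characters" % line_no)
check_data_file("testdata/KDDTrain+.txt")  # placeholder path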

Old question, but I found a more specific answer after landing here with the same problem.
In this particular case, the problem I found was that the format of the data file (from http://nsl.cs.unb.ca/NSL-KDD/) seems to have changed since the example listed on the Mahout Random Forest example page.
The example lists a line format with the specifier
N 3 C 2 N C 4 N C 8 N 2 C 19 N L
but there's an extra element at the end of the lines; for example:
13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,guess_passwd,2
which has one more field. Adding another numeric field (N) to the end of the specifier,
N 3 C 2 N C 4 N C 8 N 2 C 19 N L N
resolved the error for me.
I also had luck using just the plain .txt file format instead of the .arff file format.
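To make the "one more field" point concrete, here is a rough check in plain Python (not part of Mahout), assuming the documented meaning of the descriptor where a number multiplies the attribute type that follows it: expand the descriptor into one letter per column and compare its length with the number of comma-separated fields in a data line.
def expand_descriptor(desc):
    out, count = [], 1
    for token in desc.split():
        if token.isdigit():
            count = int(token)           # multiplier for the next attribute type
        else:
            out.extend([token] * count)  # N, C or L, repeated 'count' times
            count = 1
    return out
descriptor = "N 3 C 2 N C 4 N C 8 N 2 C 19 N L N"
line = "13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,guess_passwd,2"
print(len(expand_descriptor(descriptor)), len(line.split(",")))  # both come out to 43 here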

Computing Variable using SYSMIS

I'm running into a problem that I've never encountered before and I cannot quite figure out what's going on.
I am trying to compute a variable as follows:
COMPUTE A=$SYSMIS.
IF B=$SYSMIS A=$SYSMIS.
IF C > SUM(B, 1) A= 1.
IF C = SUM(B, 1) A= 2.
IF C = B A= 2.
IF C < B A=3.
This runs fine except for the fact that when B=$SYSMIS there are very clear examples of A that are not in fact missing.
I tested it using:
TEMP.
SELECT IF B=$SYSMIS.
FREQ A.
It tells me that "No cases were input to this procedure. Either there are none in the working data file or all of them have been filtered out."
Meaning, the code worked correctly.
But... I found over 1,000 cases that do not fit this logic.
TEMP.
SELECT IF ID=102.
FREQ A B.
This shows a specific ID that has A=$SYSMIS and B=2.
A, B and C are all numeric.
Thanks in advance for any insight! (:
First, instead of IF B=$SYSMIS you should use IF MISSING(B) - for computing, for analysis, and for selecting.
Another probable reason for your results is in commands like this:
IF C > SUM(B, 1) A=1.
If B is missing, the result of SUM(B, 1) is 1. Therefore if C>1 A gets the value 1, in spite of B being missing.
There are two ways to overcome this.
First, using X+Y instead of sum(X,Y) will result in a missing value when X or Y is missing:
IF C > (B + 1) A=1.
Second option: put a conditional reset like IF MISSING(B) A=$SYSMIS. at the end of the syntax (rather than relying on the unconditional COMPUTE A=$SYSMIS. at the beginning), so any values entered into A while B is missing are replaced with a missing value.
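This is not SPSS, just a small Python model of the difference described above (SUM() ignores missing arguments, while + propagates them; None stands in for $SYSMIS):
def spss_sum(*args):
    valid = [a for a in args if a is not None]
    return sum(valid) if valid else None   # SUM of only-missing arguments is missing
def plus(a, b):
    return None if a is None or b is None else a + b
B = None                # B is system-missing
print(spss_sum(B, 1))   # 1    -> so "IF C > SUM(B, 1)" can still fire
print(plus(B, 1))       # None -> "IF C > (B + 1)" stays missing, so A is left alone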

variations in huffman encoding codewords

I'm trying to solve some Huffman coding problems, but I always get different values for the codewords (values, not lengths).
For example, if the expected codeword for the character 'c' is 100, my solution gives 101.
Here is an example:
Character  Frequency  Codeword  My solution
A 22 00 10
B 12 100 010
C 24 01 11
D 6 1010 0110
E 27 11 00
F 9 1011 0111
Both solutions have the same length for codewords, and there is no codeword that is prefix of another codeword.
Does this make my solution valid? Or do there have to be only two solutions: the optimal one and the one obtained by flipping the bits of the optimal one?
There are 96 possible ways to assign the 0's and 1's to that set of lengths, and all would be perfectly valid, optimal, prefix codes. You have shown two of them.
There exist conventions to define "canonical" Huffman codes which resolve the ambiguity. The value of defining canonical codes is in the transmission of the code from the compressor to the decompressor. As long as both sides know and agree on how to unambiguously assign the 0's and 1's, then only the code length for each symbol needs to be transmitted -- not the codes themselves.
The deflate format starts with zero for the shortest code, and increments up. Within each code length, the codes are ordered by the symbol values, i.e. sorting by symbol. So for your code that canonical Huffman code would be:
A - 00
C - 01
E - 10
B - 110
D - 1110
F - 1111
So there the two-bit codes are assigned in the symbol order A, C, E, and similarly the four-bit codes are assigned in the order D, F. Shorter codes are assigned before longer codes.
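As a minimal sketch (Python, assuming exactly the ordering just described: shorter lengths first, ties broken by symbol, codes counting up from zero), the canonical codes can be derived from the lengths alone:
def canonical_codes(code_lengths):
    # code_lengths maps symbol -> bit length; codes are assigned in (length, symbol) order
    symbols = sorted(code_lengths, key=lambda s: (code_lengths[s], s))
    codes, code, prev_len = {}, 0, code_lengths[symbols[0]]
    for sym in symbols:
        length = code_lengths[sym]
        code <<= (length - prev_len)   # append zeros when moving to a longer length
        codes[sym] = format(code, "0%db" % length)
        code += 1
        prev_len = length
    return codes
print(canonical_codes({"A": 2, "B": 3, "C": 2, "D": 4, "E": 2, "F": 4}))
# {'A': '00', 'C': '01', 'E': '10', 'B': '110', 'D': '1110', 'F': '1111'}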
There is a different and interesting ambiguity that arises in finding the code lengths. Depending on the order of combination of equal frequency nodes, i.e. when you have a choice of more than two lowest frequency nodes, you can actually end up with different sets of code lengths that are exactly equally optimal. Even though the code lengths are different, when you multiply the lengths by the frequencies and add them up, you get exactly the same number of bits for the two different codes.
There again, the different codes are all optimal and equally valid. There are ways to resolve that ambiguity as well at the time the nodes to combine are chosen, where the benefit can be minimizing the depth of the tree. That can reduce the table size for table-driven Huffman decoding.
For example, consider the frequencies A: 2, B: 2, C: 1, D: 1. You first combine C and D to get 2. Then you have A, B, and C+D all with frequency 2. Now you can choose to combine either A and B, or C+D with A or B. This gives two different sets of bit lengths. If you combine A and B, you get lengths: A-2, B-2, C-2, and D-2. If you combine C+D with B, you get A-1, B-2, C-3, D-3. Both are optimal codes, since 2x2 + 2x2 + 1x2 + 1x2 = 2x1 + 2x2 + 1x3 + 1x3 = 12, so both codes use 12 bits to represent those symbols that many times.
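A quick check of that arithmetic in Python:
freqs  = {"A": 2, "B": 2, "C": 1, "D": 1}
lens_1 = {"A": 2, "B": 2, "C": 2, "D": 2}   # combine A with B
lens_2 = {"A": 1, "B": 2, "C": 3, "D": 3}   # combine C+D with B
total_bits = lambda lens: sum(freqs[s] * lens[s] for s in freqs)
print(total_bits(lens_1), total_bits(lens_2))   # 12 12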
The problem is that there is no problem.
Your Huffman tree is valid, and it gives exactly the same results after encoding and decoding. Just think of building a Huffman tree by hand: there are often several ways to combine items of equal (or nearly equal) frequency. E.g. if you have A, B, C (each with frequency 1), you can first combine A and B and then combine the result with C, or first combine B and C and then combine the result with A.
You see, there is more than one correct way.
Edit: Even with only one possible way to combine the items by frequency, you can still get different results, because you can assign 1 to either the left or the right branch at each node, which gives different (but still correct) codes.

Multiset Partition Using Linear Arithmetic and Z3

I have to partition a multiset into two sets whose sums are equal. For example, given the multiset:
1 3 5 1 3 -1 2 0
I would output the two sets:
1) 1 3 3
2) 5 -1 2 1 0
both of which sum to 7.
I need to do this using Z3 (smt2 input format) and "Linear Arithmetic Logic", which is defined as:
formula : formula /\ formula | (formula) | atom
atom : sum op sum
op : = | <= | <
sum : term | sum + term
term : identifier | constant | constant identifier
I honestly don't know where to begin with this and any advice at all would be appreciated.
Regards.
Here is an idea:
1- Create a 0-1 integer variable c_i for each element. The idea is that c_i is zero if the element is in the first set, and one if it is in the second set. You can accomplish that by saying that 0 <= c_i and c_i <= 1.
2- The sum of the elements in the first set can then be written as 1*(1 - c_1) + 3*(1 - c_2) + ...
3- The sum of the elements in the second set can be written as 1*c_1 + 3*c_2 + ...
4- Finally, assert that the two sums are equal.
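If it helps to see the encoding written out, here is a sketch using Z3's Python API (z3py) rather than raw SMT-Lib2; the c_i variable names, the example multiset from the question, and the final equality constraint are just one way to spell out the idea above.
from z3 import Int, Solver, Sum, sat
elems = [1, 3, 5, 1, 3, -1, 2, 0]
c = [Int("c_%d" % i) for i in range(len(elems))]
s = Solver()
for ci in c:
    s.add(0 <= ci, ci <= 1)                                  # each c_i is 0 or 1
sum_first  = Sum([e * (1 - ci) for e, ci in zip(elems, c)])  # c_i = 0 -> first set
sum_second = Sum([e * ci for e, ci in zip(elems, c)])        # c_i = 1 -> second set
s.add(sum_first == sum_second)                               # the two sums must be equal
if s.check() == sat:
    m = s.model()
    first  = [e for e, ci in zip(elems, c) if m.eval(ci).as_long() == 0]
    second = [e for e, ci in zip(elems, c) if m.eval(ci).as_long() == 1]
    print(first, second)                                     # e.g. one 7/7 split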
While SMT-Lib2 is quite expressive, it's not the easiest language to program in. Unless you have a hard requirement to code directly in SMT-Lib2, I'd recommend looking into other languages that have higher-level bindings to SMT solvers. For instance, both Haskell and Scala have libraries that allow you to script SMT solvers at a much higher level. Here's how to solve your problem using Haskell, for instance: https://gist.github.com/1701881.
The idea is that these libraries allow you to code at a much higher level, and they perform the necessary translation and querying of the SMT solver for you behind the scenes. (If you really need to get your hands on the SMT-Lib encoding of your problem, you can use these libraries for that as well, as they typically come with an API to dump the SMT-Lib they generate before querying the solver.)
While these libraries may not offer everything that Z3 gives you access to via SMTLib, they are much easier to use for most practical problems of interest.

Find all possible pairs between the subsets of N sets with Erlang

I have a set S. It contains N subsets (which in turn contain some sub-subsets of various lengths):
1. [[a,b],[c,d],[*]]
2. [[c],[d],[e,f],[*]]
3. [[d,e],[f],[f,*]]
N. ...
I also have a list L of 'unique' elements that are contained in the set S:
a, b, c, d, e, f, *
I need to find all possible combinations of the sub-subsets, taking one from each subset, such that each resulting combination contains at most one occurrence of each element from the list L, but any number of occurrences of the element [*] (a wildcard element).
So, the result of the needed function for the above-mentioned set S should look something like this (not 100% accurate):
- [a,b],[c],[d,e],[f];
- [a,b],[c],[*],[d,e],[f];
- [a,b],[c],[d,e],[f],[*];
- [a,b],[c],[d,e],[f,*],[*];
So, basically I need an algorithm that does the following:
1. take a sub-subset from subset 1;
2. add one more sub-subset from subset 2, maintaining the list of 'unique' elements acquired so far (the check against the 'unique' list is skipped if the sub-subset contains the * element);
3. repeat step 2 until subset N is reached.
In other words, I need to generate all possible 'chains' (pairs if N == 2, triples if N == 3, and so on), where each 'chain' contains at most one occurrence of each element from the list L, except the wildcard element *, which can occur many times in each generated chain.
I know how to do this when N == 2 (it is simple pair generation), but I do not know how to extend the algorithm to work with arbitrary values of N.
Maybe Stirling numbers of the second kind could help here, but I do not know how to apply them to get the desired result.
Note: The type of data structure to be used here is not important for me.
Note: This question has grown out from my previous similar question.
Here are some pointers (not complete code) that can probably take you in the right direction:
I don't think you will need any advanced data structures here (make use of Erlang list comprehensions). You should also explore Erlang's sets and lists modules. Since you are dealing with sets and lists of sub-sets, they seem like an ideal fit.
Here is how things can get solved easily with list comprehensions: [{X,Y} || X <- [[c],[d],[e,f]], Y <- [[a,b],[c,d]]]. Here I am simply generating a list of {X,Y} 2-tuples, but for your use case you will have to put the real logic here (including your star case).
Further note that with list comprehensions, you can use the output of one generator as the input of a later generator, e.g. [{X,Y} || X1 <- [[c],[d],[e,f]], X <- X1, Y1 <- [[a,b],[c,d]], Y <- Y1].
Also, for removing duplicates from a list of things L = ["a", "b", "a"]., you can simply do sets:to_list(sets:from_list(L)).
With the above tools you can easily generate all possible chains and also enforce your logic as these chains get generated.
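As a language-agnostic illustration of that generate-and-filter idea (Python here, with itertools.product playing the role of the nested generators), assuming the rule is that no element other than the wildcard * may appear more than once across the chosen sub-subsets:
from itertools import product
subsets = [
    [["a", "b"], ["c", "d"], ["*"]],
    [["c"], ["d"], ["e", "f"], ["*"]],
    [["d", "e"], ["f"], ["f", "*"]],
]
def valid(chain):
    seen = set()
    for part in chain:
        for elem in part:
            if elem == "*":
                continue          # the wildcard may occur any number of times
            if elem in seen:
                return False      # every other element at most once
            seen.add(elem)
    return True
for chain in product(*subsets):   # one sub-subset from each subset, in order
    if valid(chain):
        print(list(chain))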

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with, but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
processed_q2
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of responses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this. Thus my advice is more general – just feel free to clarify and I will try to give a more concrete response.
1) It sounds like you are coding the survey on your own, which is great because it means you have influence over your .csv file. I would NEVER use different kinds of separators in the same .csv file. Just do the naming from the very beginning, as you suggested in your second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers and the next person checks only 1 (i.e. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having empty fields (which turn into NA in R) that only need to be recoded.
2) Another important question is whether you intend to do a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many other things) would be a good reason to think about saving your data to a MySQL database instead of a .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data, since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal opinion and experience, here's some (less biased) literature to get you started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys, but gives a lot of R code and examples, which should be a practical start.
To avoid re-inventing the wheel you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for running surveys. Besides, the TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (I have only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0/1). I see that you have the values 3 5 99 for the Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame full of 0s, so what you need to do is put 1s in the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in the z variable
dtf_ins[1, z] <- 1
And that's pretty much it... Basically, you will probably want to reconsider how you create the data.frame with the _M variables, so that you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix of logicals and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame and m is the matrix of logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
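For comparison only, here is the same indicator-column idea sketched in Python (the Q4_M_<code> column names mirror the desired output in the question; this is not meant as the R solution):
responses = ["3 5 99", "1 2", ""]   # e.g. the Q4_M column, one string per respondent
codes = sorted({c for r in responses for c in r.split()}, key=int)
header = ["Q4_M_%s" % c for c in codes]
rows = [[1 if c in r.split() else 0 for c in codes] for r in responses]
print(header)    # ['Q4_M_1', 'Q4_M_2', 'Q4_M_3', 'Q4_M_5', 'Q4_M_99']
for row in rows:
    print(row)   # [0, 0, 1, 1, 1], [1, 1, 0, 0, 0], [0, 0, 0, 0, 0]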
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M", colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
  vec <- as.vector(dat[, i])                    # turn the column into a vector
  b <- lapply(strsplit(vec, " "), as.numeric)   # split up and turn into numeric
  vals <- sort(unique(unlist(b)))               # which values were used
  newcolnames <- paste(colnames(dat)[i], "_", vals, sep = "")   # new column names
  e <- matrix(nrow = responses, ncol = length(vals))   # new matrix for indicators
  colnames(e) <- newcolnames
  # next loop looks for responses and puts indicators in the correct places
  for (j in 1:responses) {
    e[j, ] <- ifelse(vals %in% b[[j]], 1, 0)
  }
  dat <- cbind(dat, e)
}
Suggestions for improvement are welcome.
