Description of the SVM file format - machine-learning

I have a data set in SVM format.
Here is a sample of one row:
-1 4:0.0788382 5:0.124138 6:0.117647 11:0.428571 16:0.1 17:0.749633 18:0.843029 19:0.197344 21:0.142856 22:0.142857 23:0.142857 28:1 33:0.0555556 41:0.1 54:1 56:1 64:1 70:1 72:1 74:1 76:1 82:1 84:1 86:1 88:1 90:1 92:1 94:1 96:1 1
Could somebody give a description of this file format, please? How do I read it?
Thank you!

This isn't specific to SVM models; it's the sparse text format used by SVMlight and LIBSVM.
The first entry (-1 in this example) is the label for the observation.
The other entries are pairs of feature_number : value entries.
In your given observation, the label (classification) is -1 (likely "bad event"). Features 0-3 do not appear; in a sparse format like this, missing features are implicitly zero. Features 4-6 have the indicated values, 7-10 are missing, and this continues through the end of the line. I'm not sure what the trailing 1 value means; that syntax is new to me.
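The description above can be turned into a short reader. Here is a minimal Python sketch (tokens without a colon, such as the trailing 1 in the sample row, are simply skipped, since their meaning is unclear):

```python
def parse_line(line):
    # "<label> <index>:<value> <index>:<value> ..." per line;
    # any index that does not appear is implicitly zero.
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        if ":" in token:  # skip stray tokens like the trailing "1"
            index, value = token.split(":")
            features[int(index)] = float(value)
    return label, features

label, features = parse_line("-1 4:0.0788382 5:0.124138 28:1 1")
assert label == -1.0
assert features[4] == 0.0788382
assert features.get(7, 0.0) == 0.0  # absent index -> implicit zero
```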

Related

Modify values programmatically SPSS

I have a file with more than 250 variables and more than 100 cases. Some of these variables have a misplaced decimal point (20445.12 should be 2.044512).
I want to modify these data programmatically. I found a possible way in the Visual Basic editor provided by SPSS (screenshot below), but I have no experience with it.
How can I select a range of cells in this language?
How can I store the cell once modified its data?
--- EDIT: NEW DATA ---
Thank you for your fast reply.
The problem now is the number of digits each number has. For example, erroneous data could have the following formats:
Case A) 43998 (five digits) ---> 4.3998 as correct value.
Case B) 4399 (four digits) ---> 4.3990 as the correct value, but it is parsed as 0.4399 because the trailing 0 was removed when the file was created.
Is there any way, like:
IF (NUM < 10000) THEN NUM = NUM / 1000 ELSE NUM = NUM / 10000
Or something like IF (Number_of_digits(NUM)) THEN ...
Thank you.
There's no need for a VB script; do it this way instead:
Open a syntax window and paste the following code:
do repeat vr=var1 var2 var3 var4.
compute vr=vr/10000.
end repeat.
save outfile="filepath\My corrected data.sav".
exe.
Replace var1 var2 var3 var4 with the names of the actual variables you need to change. For variables that are contiguous in the file you may use var1 to var4.
Replace vr=vr/10000 with whatever mathematical calculation you would like to use to correct the data.
Replace "filepath\My corrected data.sav" with your path and file name.
WARNING: this syntax will change the data in your file. You should make sure to create a backup of your original in addition to saving the corrected data to a new file.
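For the edited question, the digit-dependent rule the asker sketched can be checked in isolation before translating it into SPSS syntax. A Python illustration of that logic (the 10000 threshold and divisors are the asker's own assumption, not something verified against the data):

```python
def correct(num):
    # The asker's rule: five-digit values were scaled up by 10000,
    # four-digit values (which lost a trailing zero) by only 1000.
    return num / 1000 if num < 10000 else num / 10000

assert correct(43998) == 4.3998  # Case A: five digits
assert correct(4399) == 4.399    # Case B: four digits, trailing zero lost
```

In SPSS itself the same branch could be expressed with a conditional transformation, but as written above it is only a sketch of the arithmetic.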

How do I transform text into TF-IDF format using Weka in Java?

Suppose I have the following sample ARFF file with two attributes:
(1) sentiment: positive [1] or negative [-1]
(2) tweet: text
@relation sentiment_analysis
@attribute sentiment {1, -1}
@attribute tweet string
@data
-1,'is upset that he can\'t update his Facebook by texting it... and might cry as a result School today also. Blah!'
-1,'#Kenichan I dived many times for the ball. Managed to save 50\% The rest go out of bounds'
-1,'my whole body feels itchy and like its on fire '
-1,'#nationwideclass no, it\'s not behaving at all. i\'m mad. why am i here? because I can\'t see you all over there. '
-1,'#Kwesidei not the whole crew '
-1,'Need a hug '
1,'#Cliff_Forster Yeah, that does work better than just waiting for it In the end I just wonder if I have time to keep up a good blog.'
1,'Just woke up. Having no school is the best feeling ever '
1,'TheWDB.com - Very cool to hear old Walt interviews! ? http://blip.fm/~8bmta'
1,'Are you ready for your MoJo Makeover? Ask me for details '
1,'Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur '
1,'happy #charitytuesday #theNSPCC #SparksCharity #SpeakingUpH4H '
I want to convert the values of the second attribute into equivalent TF-IDF values.
By the way, I tried the following code, but its output ARFF file doesn't contain the first (class) attribute for instances with the positive (1) value.
// Set the tokenizer
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(1);
tokenizer.setNGramMaxSize(1);
tokenizer.setDelimiters("\\W");
// Set the filter
StringToWordVector filter = new StringToWordVector();
filter.setAttributeIndicesArray(new int[]{1});
filter.setOutputWordCounts(true);
filter.setTokenizer(tokenizer);
filter.setWordsToKeep(1000000);
filter.setDoNotOperateOnPerClassBasis(true);
filter.setLowerCaseTokens(true);
filter.setTFTransform(true);
filter.setIDFTransform(true);
// Set the input format last, after all options, so none of them are ignored
filter.setInputFormat(inputInstances);
// Filter the input instances into the output ones
outputInstances = Filter.useFilter(inputInstances, filter);
Sample output ARFF file:
@data
{0 -1,320 1,367 1,374 1,397 1,482 1,537 1,553 1,681 1,831 1,1002 1,1033 1,1112 1,1119 1,1291 1,1582 1,1618 1,1787 1,1810 1,1816 1,1855 1,1939 1,1941 1}
{0 -1,72 1,194 1,436 1,502 1,740 1,891 1,935 1,1075 1,1256 1,1260 1,1388 1,1415 1,1579 1,1611 1,1818 2,1849 1,1853 1}
{0 -1,374 1,491 1,854 1,873 1,1120 1,1121 1,1197 1,1337 1,1399 1,2019 1}
{0 -1,240 1,359 2,369 1,407 1,447 1,454 1,553 1,1019 1,1075 3,1119 1,1240 1,1244 1,1373 1,1379 1,1417 1,1599 1,1628 1,1787 1,1824 1,2021 1,2075 1}
{0 -1,198 1,677 1,1379 1,1818 1,2019 1}
{0 -1,320 1,1070 1,1353 1}
{0 -1,210 1,320 2,477 2,867 1,1020 1,1067 1,1075 1,1212 1,1213 1,1240 1,1373 1,1404 1,1542 1,1599 1,1628 1,1815 1,1847 1,2067 1,2075 1}
{179 1,1815 1}
{298 1,504 1,662 1,713 1,752 1,1163 1,1275 1,1488 1,1787 1,2011 1,2075 1}
{144 1,785 1,1274 1}
{19 1,256 1,390 1,808 1,1314 1,1350 1,1442 1,1464 1,1532 1,1786 1,1823 1,1864 1,1908 1,1924 1}
{84 1,186 1,320 1,459 1,564 1,636 1,673 1,810 1,811 1,966 1,997 1,1094 1,1163 1,1207 1,1592 1,1593 1,1714 1,1836 1,1853 1,1964 1,1984 1,1997 2,2058 1}
{9 1,1173 1,1768 1,1818 1}
{86 1,935 1,1112 1,1337 1,1348 1,1482 1,1549 1,1783 1,1853 1}
As you can see, the first few instances are okay (they contain the -1 class along with other features), but the remaining instances don't contain the positive class attribute (1).
I mean, there should be {0 1,...} as the very first entry of those last instances in the output ARFF file, but it is missing.
You have to specify your class attribute explicitly in the Java program, because when you apply the StringToWordVector filter your input gets divided among the specified n-grams; the class attribute's position therefore changes once StringToWordVector vectorizes the input. You can then use the Reorder filter, which places the class attribute last, and Weka will pick the last attribute as the class attribute.
More info about reordering in Weka can be found at http://weka.sourceforge.net/doc.stable-3-8/weka/filters/unsupervised/attribute/Reorder.html. Also, example 5 at http://www.programcreek.com/java-api-examples/index.php?api=weka.filters.unsupervised.attribute.Reorder may help you with the reordering.
Hope it helps.
Your process for obtaining TF-IDF seems correct.
This is actually how Weka's sparse ARFF format works: any value equal to the first declared value of an attribute (internal index 0) is omitted, not missing.
In your case, the class attribute is declared as {1, -1}, so index 0 corresponds to class 1. Weka therefore shows the label explicitly in records with class -1, while in records with class 1 it is omitted (implied).
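To see why the 1-labelled rows look truncated, here is a Python sketch that recovers the class value from a sparse data line, assuming the attribute declaration {1, -1} from the question (the helper name is hypothetical):

```python
def sparse_class(line, nominal=("1", "-1")):
    # In Weka's sparse format, any attribute whose value equals the first
    # declared nominal value (internal index 0) is omitted from the line,
    # so a missing "0 ..." entry means the class is nominal[0].
    body = line.strip().lstrip("{").rstrip("}")
    pairs = dict(p.strip().split(None, 1) for p in body.split(",") if p.strip())
    return pairs.get("0", nominal[0])

assert sparse_class("{0 -1,320 1,367 1}") == "-1"  # label shown explicitly
assert sparse_class("{179 1,1815 1}") == "1"       # label omitted -> implied
```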

vb6 - Greater/Less Than statements giving incorrect output

I have a VB6 form with text boxes for minimum and maximum values. The text boxes have a MaxLength of 4, and I have code in the KeyPress event to limit them to numeric entry. The code checks to make sure that max > min; however, it is behaving very strangely. It seems to be comparing the values in scientific notation or something. For example, it evaluates 30 > 200 as true, and 100 > 20 as false. However, if I change the entries to 030 > 200 and 100 > 020, then it gives me the correct answer. Does anyone know why it would act this way?
My code is below, I am using control arrays for the minimum and maximum text boxes.
For cnt = 0 To 6
    If ParameterMin(cnt) > ParameterMax(cnt) Then
        MsgBox ("Default, Min, or Max values out of range. Line not updated.")
        Exit Sub
    End If
Next cnt
That is how text comparison behaves for numbers represented as variable-length text (in general, not just in VB6).
Either pad with zeros to a fixed length and continue comparing as text (as you noted),
OR
(preferably) convert to integers and then compare.
If I understood correctly, you can alter the code to
If Val(ParameterMin(cnt)) > Val(ParameterMax(cnt)) Then
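The pitfall itself is easy to reproduce outside VB6 too; for example in Python, where comparing strings is likewise lexicographic:

```python
# String comparison works character by character, not numerically:
assert ("30" > "200") is True     # '3' > '2', so "30" wins regardless of length
assert (int("30") > int("200")) is False   # numeric comparison, as Val() gives
assert ("030" > "200") is False   # zero-padding to equal length restores order
```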
One more piece of advice (IMHO): if possible, avoid checking data during KeyPress/KeyUp/KeyDown events.
Could you change the GUI to include a "Submit" button and validate the form there?
Hope I helped.

adding a big offset to an os.time{} value

I'm writing a Wireshark dissector in Lua and trying to decode a time-based protocol field.
I have two components:
1)
local ref_time = os.time{year=2000, month=1, day=1, hour=0, sec=0}
and 2)
local offset_time = tvbuffer(0,5):bytes()
This is a 5-byte ByteArray() (larger than the uint32 range) containing the number of milliseconds (in network byte order) since ref_time. Now I'm looking for a human-readable date. I didn't know this would be so hard, but first, it seems I cannot simply add an offset to an os.time value, and second, the offset exceeds the Int32 range ...and most functions I tested seem to truncate values beyond it.
Any ideas on how I get the date from ref_time and offset_time?
Thank you very much!
Since ref_time is in seconds and offset_time is in milliseconds, just try:
os.date("%c",ref_time+offset_time/1000)
I assume that offset_time is a number. If not, just reconstruct it using arithmetic. Keep in mind that Lua uses doubles for numbers and so a 5-byte integer fits just fine.
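If offset_time still needs to be reconstructed from the raw bytes, the same arithmetic can be sketched in Python (the byte values here are made up for illustration; 0x2710 is 10000 ms):

```python
from datetime import datetime, timedelta, timezone

# Reconstruct the 5-byte big-endian (network order) millisecond count,
# then add it to the 2000-01-01 reference time from the question.
ref_time = datetime(2000, 1, 1, tzinfo=timezone.utc)
offset_bytes = bytes([0x00, 0x00, 0x00, 0x27, 0x10])  # example value: 10000 ms

offset_ms = int.from_bytes(offset_bytes, byteorder="big")
readable = ref_time + timedelta(milliseconds=offset_ms)
assert str(readable) == "2000-01-01 00:00:10+00:00"
```

As the answer notes, a 5-byte integer also fits losslessly in a double, so the equivalent Lua arithmetic does not truncate.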

Delta row compression in PCLXL

Is there a difference in the implementation of delta row compression between PCLXL and PCL5?
I was using delta row compression in PCL5, but when I used the same method in PCLXL, the file is not valid. I checked the output using EscapeE, and it says that the image data size is incorrect.
Could anyone point me to some material explaining how delta row compression is implemented in PCLXL?
Thanks,
kreb
Hmm, I found this, and it is indeed different.
from http://www.tek-tips.com/viewthread.cfm?qid=1577259&page=1 user guptadeepak03
Actually, I did some research on that too. I found out the hard way that there are a few differences between the formats in PCL-XL and PCL-5. To quote from the reference manual provided by HP (PCL-XL ver 2.1):
The PCL XL implementation follows the PCL5 implementation except in the following:
1) the seed row is initialized to zeroes and contains the number of bytes defined by SourceWidth in the BeginImage operator.
2) the delta row is preceded by a 2-byte byte count which indicates the number of bytes to follow for the delta row. The byte count is expected to be in LSB MSB order.
3) to repeat the last row, use the 2-byte byte count of 00 00.
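Difference (2) can be illustrated with a short Python sketch (the helper name is hypothetical; it only shows the 2-byte framing, not the row compression itself):

```python
import struct

# In PCL XL, each compressed delta row is preceded by a 2-byte byte count
# in LSB MSB (little-endian) order; a count of 00 00 repeats the last row.
def frame_delta_row(delta_row: bytes) -> bytes:
    return struct.pack("<H", len(delta_row)) + delta_row

assert frame_delta_row(b"\x12\x34") == b"\x02\x00\x12\x34"
assert frame_delta_row(b"") == b"\x00\x00"  # repeat the previous row
```

This framing is what PCL5's delta row compression lacks, which would explain EscapeE reporting an incorrect image data size when PCL5-style rows are emitted unframed.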
I will mark this answered as soon as I can. Thanks!