With pybrain, it's not possible to use letters as input in a dataset. For example, if I do this:
from pybrain.datasets import ClassificationDataSet
ds = ClassificationDataSet(2)
ds.addSample(('a','b'),1)
I get:
ValueError: could not convert string to float: a
Does it make sense to convert each letter to an integer and make those integers be the features for pybrain? For example, the letter a would be 1 and the letter z would be 26.
My concern is that there is no ordinal relation between letters, and I'm not sure whether the neural network would incorrectly treat the number replacing each letter as a greater or lesser quantity of some feature.
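One common way to avoid implying an order between letters is one-hot encoding: each letter becomes a vector with a single 1.0 and zeros elsewhere. The sketch below is plain Python and does not call pybrain; the names ALPHABET, one_hot and encode are made up for illustration, and it assumes the inputs are lowercase ASCII letters. With two letters per sample, the dataset would then need 52 input features instead of 2.

```python
import string

# Assumption: inputs are lowercase ASCII letters only.
ALPHABET = string.ascii_lowercase


def one_hot(letter):
    """Return a 26-element vector with 1.0 at the letter's position."""
    vec = [0.0] * len(ALPHABET)
    vec[ALPHABET.index(letter)] = 1.0
    return vec


def encode(letters):
    """Concatenate the one-hot vectors of every letter in the sample."""
    features = []
    for letter in letters:
        features.extend(one_hot(letter))
    return features


features = encode(("a", "b"))
print(len(features))  # 2 letters * 26 positions
```

No position in the vector is "larger" than another, so the network cannot read 'z' as 26 times more of something than 'a'.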
I have a model which has a column named code, which is a combination of the model's name column and its ID with leading zeros.
name = 'Rocky'
id = 16
I have an after_create callback which runs and generates the code:
update(code: "#{self.name[0..2].upcase}%.4d" % self.id)
The generated code will be:
"ROC0016"
The code is working.
I found the "%.4d" % self.id part in another project, but I don't know how it works.
How does it determine the number of leading zeros to add, based on the integer passed in?
You’re using a "format specifier". There are many specifiers, but the one you’re using, "%d", is the decimal integer specifier:
The first % starts the specifier. .4 is the precision; for an integer it means the output should always use at least four digits, so if the number has only two digits, it gets padded with leading 0s to fill in the rest. The second % (the one outside the string) is the String#% method, which substitutes whatever comes after it for the specifier. So in your case, %.4d is replaced with "0016".
sprintf has more information about format specifiers.
You can read more about String#% in the documentation also.
After the percent sign ("%") comes a decimal point (".") and a number. That number is the minimum number of digits in the result. If the value has fewer digits than this, leading zeros are added.
Thus, in the first example below, the value is 34 but the precision is 4, so the result gets two leading zeros to fill it out to four digits.
"This is test string %.4d" % 34
result => "This is test string 0034"
"I want more zeroes in my code %.7d" % 34
result => "I want more zeroes in my code 0000034"
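Python's % string operator follows the same printf-style rules, so the behaviour can be checked there as well. This is only a sketch in a different language, not the Ruby code from the question; the variable names are made up. It also shows one difference between a precision (%.4d) and a zero-padded width (%04d): with a negative number the precision counts digits only, while the width counts the sign too.

```python
# Precision on an integer: at least four digits, padded with leading zeros.
padded = "%.4d" % 16

# Rebuild the code from the question: first three letters, upper-cased, plus the padded id.
code = "%s%.4d" % ("Rocky"[:3].upper(), 16)

# Precision counts digits only; a zero-padded width counts the minus sign as well.
negative_precision = "%.4d" % -34
negative_width = "%04d" % -34

print(padded, code, negative_precision, negative_width)
```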
I'm creating a Lua script which will calculate a temperature value then format this value as a 4 digit hex number which must always be 4 digits. Having the answer as a string is fine.
Previously in C I have been able to use
data_hex=string.format('%h04x', -21)
which would return ffeb
however the 'h' string formatter is not available to me in Lua
Dropping the 'h' doesn't cater for negative values, i.e.
data_hex=string.format('%04x', -21)
print(data_hex)
which returns ffffffeb
data_hex=string.format('%04x', 21)
print(data_hex)
which returns 0015
Is there a convenient and portable equivalent to the 'h' string formatter?
I suggest you try using a bitwise AND to truncate any leading hex digits for the value being printed.
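If you have a variable temp that you are going to print, then you would use something like data_hex=string.format("%04x",temp & 0xffff), which removes the leading hex digits and leaves only the least significant 4 hex digits. Note that the & operator is only available from Lua 5.3 onwards; in Lua 5.2 the equivalent is bit32.band(temp, 0xffff).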
I like this approach as there is less string manipulation and it is congruent with the actual data type of a signed 16 bit number. Whether reducing string manipulation is a concern would depend on the rate at which the temperature is polled.
For further information on the format function see The String Library article.
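The masking idea is independent of Lua. As a rough sketch in Python (to_hex16 is a made-up helper name), AND-ing with 0xFFFF keeps the low 16 bits, which for a negative number is its 16-bit two's complement representation:

```python
def to_hex16(value):
    """Format value as exactly four hex digits, keeping only the low 16 bits."""
    return "%04x" % (value & 0xFFFF)


print(to_hex16(-21))
print(to_hex16(21))
```

Values outside the signed 16-bit range are silently truncated by the mask, so a range check beforehand may be worth adding if the temperature can exceed it.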
The Lua manual in section 6.4.1 on Lua Patterns states
A character class is used to represent a set of characters. The
following combinations are allowed in describing a character class:
x: (where x is not one of the magic characters ^$()%.[]*+-?) represents the character x itself.
.: (a dot) represents all characters.
%a: represents all letters.
%c: represents all control characters.
%d: represents all digits.
%g: represents all printable characters except space.
%l: represents all lowercase letters.
%p: represents all punctuation characters.
%s: represents all space characters.
%u: represents all uppercase letters.
%w: represents all alphanumeric characters.
%x: represents all hexadecimal digits.
%x: (where x is any non-alphanumeric character) represents the character x. This is the standard way to escape the magic characters.
Any non-alphanumeric character (including all punctuation characters,
even the non-magical) can be preceded by a % when used to represent
itself in a pattern.
[set]: represents the class which is the union of all characters in set. A range of characters can be specified by separating the end
characters of the range, in ascending order, with a -. All classes
%x described above can also be used as components in set. All other
characters in set represent themselves. For example, [%w_] (or
[_%w]) represents all alphanumeric characters plus the underscore,
[0-7] represents the octal digits, and [0-7%l%-] represents the
octal digits plus the lowercase letters plus the - character.
You can put a closing square bracket in a set by positioning it as the
first character in the set. You can put a hyphen in a set by
positioning it as the first or the last character in the set. (You can
also use an escape for both cases.)
The interaction between ranges and classes is not defined. Therefore, patterns like [%a-z] or [a-%%] have no meaning.
[^set]: represents the complement of set, where set is interpreted
as above.
For all classes represented by single letters (%a, %c, etc.), the
corresponding uppercase letter represents the complement of the class.
For instance, %S represents all non-space characters.
The definitions of letter, space, and other character groups depend on
the current locale. In particular, the class [a-z] may not be
equivalent to %l.
(Highlighting and some formatting added by me)
So, since the "interaction between ranges and classes is not defined", how do you create a character class set that starts and/or ends with a (magic) character that needs to be escaped?
For example,
[%%-c]
does not define a character class that ranges from % to c and includes all characters in-between but a set that consists only of the three characters %, -, and c.
The interaction between ranges and classes is not defined.
Obviously, this is not a hard and fast rule (of regex character sets in general) but a Lua implementation decision. While using shorthand character classes in character sets/ranges works in some (most) regex flavors, it does not in all (Python's re module, for example).
However, the second example is misleading:
Therefore, patterns like [%a-z] or [a-%%] have no meaning.
The first example is fair: %a is a shorthand class (representing all letters), so [%a-z] mixes a class with a range, is undefined, and will return nil if matched against a string.
Escaped range characters in a [set]
In the second example, [a-%%], %% simply defines an escaped % sign and not a shorthand character class. The superficial problem is that the range is defined upside down, from high to low (by the US-ASCII values of the characters: a is 97, or 0x61, and % is 37, or 0x25), like the erroneous Lua pattern [f-a]. If the set is defined in the opposite order, [%%-a], it seems to work, but all it does is match the three individual characters instead of the range of characters between % and a (credit: cyclaminist).
This could be considered a bug and, indeed, means it is not possible to create a range of characters in a [set] if one of the defining range characters needs to be escaped.
Possible Solution
Start the character range from the next character that does not need to be escaped - and then add the remaining escaped characters individually, e.g.
[%%&-a]
Sample:
for w in string.gmatch("%&*()-0Aa", "[%%&-a]") do
print(w)
end
This is the answer I have found. Still, maybe somebody else has something better.
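To see which characters the workaround set is meant to cover, here is a small cross-check in Python's re module. This is a different pattern flavor from Lua's (% is not an escape character there), so it only illustrates the intended set "% plus everything from & to a"; it says nothing about how Lua itself parses the pattern. The variable names are made up.

```python
import re

# In Python's re, % needs no escaping; &-a is an ordinary range.
pattern = r"[%&-a]"

matches = re.findall(pattern, "%&*()-0Aa")

# Every ASCII character the set accepts, versus the contiguous range % .. a.
covered = {chr(c) for c in range(128) if re.fullmatch(pattern, chr(c))}
expected = {chr(c) for c in range(ord("%"), ord("a") + 1)}

print(matches)
print(covered == expected)
```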
How can I test if a certain character of a string variable is a digit in SPSS (and then apply some operations, depending on the result)?
So let's for example say, I have a variable that reflects the street number. Some street numbers have additional character at the end e.g. "12b". Now let's further assume that I extracted the last character (that could be a digit, or the additional letter) into a string variable. After that I'd like to check if this character is a digit or a letter. How can this be done?
I managed to do this with the MAX function, where "mychar" is the character variable to be checked:
COMPUTE digitcheck = (MAX(mychar,"9")="9").
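If the content of "mychar" is a digit [0-9], the result of the MAX function will be "9"; otherwise the MAX function will return the letter and the equality test fails. Note that any character that sorts at or below "9" (a space or most punctuation, for example) also passes this test, so it only separates digits from letters.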
In this way you can also check if a whole string variable contains a letter or not. It looks pretty ugly though, because you have to compare every single character of your string variable.
compute justdigits = (MAX(CHAR.SUBSTR(mystr,1,1), CHAR.SUBSTR(mystr,2,1), CHAR.SUBSTR(mystr,3,1), ..., CHAR.SUBSTR(mystr,n,1),"9")="9").
If you try to turn a letter into a number then it becomes a missing value. Therefore, to test whether a character is a digit, you can do this:
if not missing(number(YourCharacter,f1)) .....
The same test can determine whether a string has only a number in it or not:
compute OnlyNumber=(not missing(number(YourString,f10))).
Note: using the NUMBER function on strings that are not numeric will produce warning messages, which you can of course ignore.
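Both ideas can be sketched outside SPSS to see how they behave. The Python below is only an analogy: passes_max_check mirrors the MAX comparison and is_only_number mirrors "conversion failed means not a number". Both function names are made up, and Python's int() does not follow the same parsing rules as NUMBER with an F format.

```python
def passes_max_check(ch):
    """Mirror MAX(mychar, "9") = "9": true when ch sorts at or below "9"."""
    return max(ch, "9") == "9"


def is_only_number(text):
    """Mirror NOT MISSING(NUMBER(...)): a failed conversion means not a number."""
    try:
        int(text)
    except ValueError:
        return False
    return True


print(passes_max_check("5"), passes_max_check("b"), passes_max_check("!"))
print(is_only_number("12"), is_only_number("12b"))
```

The "!" case shows why the comparison trick is safe only when the character can be nothing but a digit or a letter.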
So I have entered my second semester of college and they have me doing a course called Advanced COBOL. As one of my assignments I have to make a program that tests certain things in a file to make sure the input has no errors. I get the general idea, but there are a few things I don't understand, and my teacher is one of those people who will give you an assignment and make you figure it out yourself with little or no help. So here is what I need help with.
I have a field where the first 5 columns have to be digits, the 6th column a capital letter, and the last 2 columns a number in the range 01-68 or 78-99.
One of my fields has to be a string of digits with a dash in it, like 00000-000, but some have more than one dash. How can I count the dashes to identify that there is a problem?
Here are a few hints...
Use a hierarchical record structure to view the data in different ways. For example:
01  ITEM-REC.
    05  ITEM-CODE.
        10  ITEM-NUM-CODE   PIC 9(3).
        10  ITEM-CHAR-CODE  PIC A(3).
            88  ITEM-TYPE-A  VALUE 'AAA' THRU 'AZZ'.
            88  ITEM-TYPE-B  VALUE 'BAA' THRU 'BZZ'.
    05  QUANTITY            PIC 9(4).
ITEM-CODE is a 6 character group field, the first part of which is numeric (ITEM-NUM-CODE) and the last part
is alphabetic (ITEM-CHAR-CODE). You can refer to any one of these three variables in your program. When you
refer to ITEM-CODE, or any other group item, COBOL
treats the variable as if it were declared as PIC X. This means you can
MOVE just about anything into it without raising an error. For example:
MOVE 'ABCdef' TO ITEM-CODE
or
MOVE 'ABCdef0005' TO ITEM-REC
Neither one would cause an error even though the elementary data item ITEM-NUM-CODE is definitely not a number.
To verify the validity
of your data after a group move you should validate each elementary data item separately (unless
you know for certain no data type errors could have occurred). There are a variety of ways to do this. For
example if the data item has to be numeric the following would work:
IF ITEM-NUM-CODE IS NUMERIC
    CONTINUE
ELSE
    DISPLAY 'ITEM-NUM-CODE IS NOT NUMERIC'
    PERFORM BIG-BAD-ERROR
END-IF
COBOL provides various class tests which can be applied against a data item. For
example: NUMERIC, ALPHABETIC, ALPHABETIC-UPPER and ALPHABETIC-LOWER are commonly used.
Another common way to test for ranges of values is by defining various 88 levels - but exercise
caution. In the above
example ITEM-TYPE-A is an 88 level that defines a data range from 'AAA' through 'AZZ' based on
the collating sequence currently in effect. To verify that ITEM-CHAR-CODE contains only alphabetic
characters and the first letter is an 'A' or a 'B', you could do something like:
IF ITEM-CHAR-CODE ALPHABETIC
    DISPLAY 'ITEM-CHAR-CODE is alphabetic.'
    EVALUATE TRUE
        WHEN ITEM-TYPE-A
            DISPLAY 'ITEM-CHAR-CODE is in range AAA through AZZ'
        WHEN ITEM-TYPE-B
            DISPLAY 'ITEM-CHAR-CODE is in range BAA through BZZ'
        WHEN OTHER
            DISPLAY 'ITEM-CHAR-CODE is in some other range'
    END-EVALUATE
ELSE
    DISPLAY 'ITEM-CHAR-CODE is not alphabetic'
END-IF
Note the separate test for ALPHABETIC above. Why do that when the 88 level tests
could have done the job? Actually the 88's are not sufficient because they
cover the entire range from AAA through AZZ based on the collating sequence currently
in effect. In
an EBCDIC-based environment (a very large number of COBOL shops use EBCDIC) this captures
values such as A}\. The close-brace and backslash characters are non-alphabetic but
fall into the middle of
the range 'A' through 'Z' (what the #*#! is that all about?). Also note that a value such
as 'aaa' would not satisfy the ITEM-TYPE-A condition because lower case letters fall outside
the defined range. Maybe time to check out an EBCDIC character table.
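If no EBCDIC table is at hand, the collating-sequence point can be checked with Python's built-in cp037 codec (one common EBCDIC code page; the code page at your shop may differ). This is only a sketch, with in_range as a made-up helper that imitates an 88-level THRU test by comparing the encoded bytes.

```python
def in_range(value, low, high, encoding):
    """Imitate an 88-level 'low THRU high' test under the given character encoding."""
    return low.encode(encoding) <= value.encode(encoding) <= high.encode(encoding)


# '}' and '\' sit between 'A' and 'Z' in EBCDIC (cp037) but not in ASCII.
print(in_range("A}\\", "AAA", "AZZ", "cp037"))
print(in_range("A}\\", "AAA", "AZZ", "ascii"))

# Lowercase letters collate below uppercase in EBCDIC, so 'aaa' is out of range.
print(in_range("aaa", "AAA", "AZZ", "cp037"))

# The digits 0 through 9 are contiguous in EBCDIC: 0xF0 .. 0xF9.
digit_codes = [d.encode("cp037")[0] for d in "0123456789"]
print([hex(c) for c in digit_codes])
```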
Finally, you can count the number of occurrences of a character, or string of characters, in
a variable with the INSPECT verb as follows:
INSPECT ITEM-CODE TALLYING DASH-COUNT FOR ALL '-'
DASH-COUNT needs to be a numeric item and should be set to zero beforehand, because TALLYING adds to whatever value it already holds. Afterwards it will contain the number of dash characters in ITEM-CODE. The INSPECT
verb is not so useful if you want to count the number of digits. For this you would need a separate ALL phrase for each digit.
It might be easier to just code a loop something like:
PERFORM VARYING I FROM 1 BY 1
        UNTIL I > LENGTH OF ITEM-CODE
    EVALUATE ITEM-CODE(I:1)
        WHEN '-'
            COMPUTE DASH-COUNT = DASH-COUNT + 1
        WHEN '0' THRU '9'
            COMPUTE DIGIT-COUNT = DIGIT-COUNT + 1
        WHEN OTHER
            COMPUTE OTHER-COUNT = OTHER-COUNT + 1
    END-EVALUATE
END-PERFORM
Now ask yourself: why was I comfortable using a '0' THRU '9' range check? Hint: look at the collating sequence.
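For comparison, here is roughly the same logic sketched in Python rather than COBOL. count_classes follows the counting loop above, and is_valid_field checks the rule from the question (5 digits, one capital letter, then 01-68 or 78-99). The function names and the regular expression are my own, not part of the assignment.

```python
import re

# 5 digits, one capital letter, then a two-digit number in 01-68 or 78-99.
FIELD_PATTERN = re.compile(
    r"[0-9]{5}[A-Z](0[1-9]|[1-5][0-9]|6[0-8]|7[89]|[89][0-9])"
)


def is_valid_field(field):
    """True when the whole 8-character field matches the rule."""
    return FIELD_PATTERN.fullmatch(field) is not None


def count_classes(text):
    """Count dashes, digits and everything else, one character at a time."""
    counts = {"dash": 0, "digit": 0, "other": 0}
    for ch in text:
        if ch == "-":
            counts["dash"] += 1
        elif "0" <= ch <= "9":
            counts["digit"] += 1
        else:
            counts["other"] += 1
    return counts


print(is_valid_field("12345A68"), is_valid_field("12345A69"))
print(count_classes("00000-0-0"))
```

A field such as 00000-0-0 would then be flagged because its dash count is not exactly 1.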
Hope this helps.