Is there an alphabetical character check in prolog? - character-encoding

Greetings,
Is there a test or a predicate I can use in Prolog to verify that a given character is alphabetical? Right now, what I'm doing is:
% Disallowed characters: \n -> 10, space -> 32, ! -> 33, . -> 46, , -> 44, : -> 58, ; -> 59
% ? -> 63, - -> 45, " -> 34, ' -> 39
\+member(Ch,[10, 32, 33, 34, 39, 44, 45, 46, 58, 59, 63 ]), %Checking for line return (\n), space, punctuations
Those are only a few of the characters I need to check for. Having a test such as letter(Ch) would save me a great deal of time and, above all, be a far more defensive approach.
Thank you

is_alpha/1
There are also other predicates such as is_lower/1 etc.

In SWI-Prolog, this is done with char_type/2, for example:
% X is either a single-character atom or a character code
alphabetical(X) :- char_type(X, alpha).
SWI-Prolog also offers the ctypes library which provides is_alpha, etc.
:- use_module(library(ctypes)).
alphabetical(X) :- is_alpha(X).

Related

Ruby calling a Python script causes encoding error with German characters

A Ruby on Rails application launches a Python script to get German word lemmas. The Python script exits with the following error:
  File "/PATHTOSCRIPT/script.py", line 15, in <module>
    for l in sys.stdin:
  File "/PATHTOPYTHON/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Ruby on Rails:
require 'open3'
@in, @out, stderr = Open3.popen3("/PATHTOSCRIPT/script.py") if ['de'].include? lang
a = "übervölkerung"
@in.write "#{a}\n"
logger.info(@treetagger_out.read.nil?)
logger.info(stderr.read)
Python:
import sys
import os
sys.stdin = os.fdopen(sys.stdin.fileno(), 'r', buffering=1)
for l in sys.stdin:
    l = l.strip()
I've detected that in Ruby and Python there are different character counts:
Ruby:
2.2.3 :006 > a="übervölkerung"
=> "übervölkerung"
2.2.3 :007 > print a.bytes
[195, 188, 98, 101, 114, 118, 195, 182, 108, 107, 101, 114, 117, 110, 103] => nil
Python:
>>> a="übervölkerung"
>>> print(list(map(ord, a)))
[252, 98, 101, 114, 118, 246, 108, 107, 101, 114, 117, 110, 103]
The input to your Python script is apparently text encoded with UTF-8.
If you encode your test string "übervölkerung" with UTF-8, then the first byte is C3, which is found in the traceback in the beginning of your post.
This means you need to read STDIN with a text stream that decodes UTF-8, not ASCII.
You already have a line that creates a wrapper around sys.stdin:
sys.stdin = os.fdopen(sys.stdin.fileno(), 'r', buffering=1)
This replaces the default text stream reader (an io.TextIOWrapper instance) with a new one.
But you don't specify the input encoding, so the default encoding is used – which is determined by the environment (based on OS-specific environment variables).
In your case the encoding apparently defaults to ASCII, which is not what you need.
You need UTF-8, so write:
sys.stdin = os.fdopen(sys.stdin.fileno(), 'r', encoding='UTF-8')
(Of course you can leave the buffering=1 parameter there if you think you need it.)
Also, os.fdopen is just a more restricted version of the built-in open function. So you can just use that one, without losing anything:
sys.stdin = open(sys.stdin.fileno(), 'r', encoding='UTF-8')
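The effect of naming the encoding explicitly can be tried without a real pipe. The following is a minimal Python 3 sketch, not the original script: the helper name read_utf8_lines and the in-memory stream standing in for the pipe from Ruby are assumptions made for illustration.

```python
import io


def read_utf8_lines(raw_stream):
    """Decode a binary stream as UTF-8 and yield each line without its newline."""
    text_stream = io.TextIOWrapper(raw_stream, encoding="utf-8")
    for line in text_stream:
        yield line.strip()


# Stand-in for the bytes that Ruby writes to the pipe (hypothetical test data).
fake_stdin = io.BytesIO("übervölkerung\n".encode("utf-8"))
lines = list(read_utf8_lines(fake_stdin))
print(lines)
```

In a real script the same helper could be fed sys.stdin.buffer, the binary stream underneath sys.stdin, so the decoding no longer depends on environment variables.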
By the way, the difference in character count you see between Ruby and Python comes from the fact that you are looking at different things.
In the Ruby code, you look at the bytes of the UTF-8 encoded text, while in Python you look at the (Unicode) code points.
In the second case, each number corresponds to a single character, while multiple numbers correspond to a character in the first case.
To see the byte values in Python, do:
>>> a = "übervölkerung"
>>> list(a.encode('utf8'))
[195, 188, 98, 101, 114, 118, 195, 182, 108, 107, 101, 114, 117, 110, 103]
I don't know how to see code points in Ruby, though.
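The two lists from the question describe the same string. As a small check in Python 3 (the variable names are illustrative), decoding the byte list printed by Ruby as UTF-8 gives back exactly the code points printed by Python:

```python
# Byte values printed by Ruby's a.bytes in the question.
ruby_bytes = [195, 188, 98, 101, 114, 118, 195, 182,
              108, 107, 101, 114, 117, 110, 103]

# Decoding the bytes as UTF-8 recovers the text.
text = bytes(ruby_bytes).decode("utf-8")

# One number per character: the Unicode code points.
code_points = [ord(ch) for ch in text]

print(text)
print(code_points)
print(len(ruby_bytes), len(code_points))
```

The byte list is two entries longer because ü and ö each take two bytes in UTF-8.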

How does the control sequence ~s work?

The following output is as I expected:
125> [97, 98, 99].
"abc"
126> [97, 98, 0].
[97,98,0]
But the output using ~s is not what I expected:
127> io:format("~s~n", [[97, 98, 0]]).
ab^@
ok
How do I interpret that output?
The ~s control sequence expects to get a string, a binary or an atom, and prints it "with string syntax". As Erlang strings are just lists of integers, it tries to print [97, 98, 0] in this example as a string as well. The shell, on the other hand, tries to guess whether this list of integers is meant to be a string.
^@ represents the NUL character. You might be familiar with the caret notation, where ^A represents byte 1, since A is the 1st letter in the alphabet; in other words, it represents the byte whose value is 64 less than the ASCII value of the character, since A is 65 in ASCII. Extrapolate it to the 0 byte, and you'll find @, which is 64 in ASCII.
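The "64 less" rule is easy to check with a few lines of code. This is a minimal Python 3 sketch of caret notation in general, not of what Erlang or any particular terminal does internally; the function name caret_notation is made up for the example.

```python
def caret_notation(byte_value):
    """Render control bytes as ^X, where X is the character 64 positions higher."""
    if 0 <= byte_value < 32:
        return "^" + chr(byte_value + 64)
    if byte_value == 127:
        return "^?"  # DEL is conventionally written this way
    return chr(byte_value)


rendered = "".join(caret_notation(b) for b in [97, 98, 0])
print(rendered)
```

Byte 0 maps to the character with code 64, which is @, and byte 1 maps to A.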

How can I create a bag of words for latex strings?

I have a set of input paragraphs in latex formats. I want to create a bag of words from them.
Taking a set of paragraphs that look like this:
"Some guy did something with \emph{ yikes } $ \epsilon $"
I want to output a dictionary:
{
"Some": 40,
...
"yikes": 10,
"epsilon (or unicode for it)": 3
}
That is, I need a dictionary whose keys are the words/symbols/equations (I'll call all of these words for brevity) across all paragraphs, and whose values are the counts of their occurrences across all paragraphs.
From there, given an ordered k-tuple of words, I need a k-element array for each paragraph, where the ith element is the count of the ith word of the tuple in that paragraph.
So, say, (Some, dunk, yikes, epsilon) will give me
[1, 0, 1, 1] for the stated example.
I've tried this by using a lexer to get the tokens out and processing the tokens directly. This is difficult and error-prone, not to mention slow. Is there a better strategy or tool that can do this?
There are some corner cases to consider with special characters:
G\"odel => Gödel
for example. I'd like to preserve these.
Also, I'd like to drop equations altogether or keep them as one word. Equations occur in between $ ... $ signs.
If I understand correctly, you are trying to do the following:
Split the sentence into words:
s = "Some guy did something with \emph{ yikes } \epsilon"
words = s.split()
print words
Output:
['Some', 'guy', 'did', 'something', 'with', '\\emph{', 'yikes', '}', '\\epsilon']
Count the number of occurrences:
from collections import Counter
dictionary = Counter(words)
print dictionary
Output:
Counter({'did': 1, '}': 1, '\\epsilon': 1, 'Some': 1, 'yikes': 1, 'something': 1, 'guy': 1, 'with': 1, '\\emph{': 1})
Access words and their corresponding numbers as separate lists:
print dictionary.keys()
print dictionary.values()
Output:
['did', '}', '\\epsilon', 'Some', 'yikes', 'something', 'guy', 'with', '\\emph{']
[1, 1, 1, 1, 1, 1, 1, 1, 1]
Note that I didn't process any word, yet. You might want to strip brackets or backslashes. But this can be easily done by traversing the dictionary (or lists) with a for-loop and handling each entry individually.
Converting LaTeX umlauts to Unicode characters is a separate problem. There are several Stack Overflow questions and answers on this topic. Maybe you just need to find/replace them in the initial string (246 is the code point of ö):
s = s.replace('\\"o', unichr(246))
(Note that depending on your command line encoding you might not see umlauts with print s. But they are not lost, as can be shown using print repr(s).)
To preserve equations you can split the string using a regular expression rather than split (here s is the example string including the $ \epsilon $ part):
import re
print re.findall('\$.+\$|[\w]+', s)
Output:
['Some', 'guy', 'did', 'something', 'with', 'emph', 'yikes', '$ \\epsilon $']
Please see my answer to another question for a similar example and a more detailed explanation.
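The per-paragraph count array asked for in the question can be built on top of the same Counter idea. The following is a rough Python 3 sketch under simplifying assumptions: every run of word characters counts as a word, so command names such as emph become words too and equations are not kept as a unit. The names bag_of_words and count_vector are made up for the example.

```python
import re
from collections import Counter


def bag_of_words(paragraph):
    """Count every run of word characters in the paragraph."""
    return Counter(re.findall(r"\w+", paragraph))


def count_vector(paragraph, vocabulary):
    """Return the count of each vocabulary word, in vocabulary order."""
    counts = bag_of_words(paragraph)
    return [counts[word] for word in vocabulary]


paragraph = r"Some guy did something with \emph{ yikes } $ \epsilon $"
vector = count_vector(paragraph, ("Some", "dunk", "yikes", "epsilon"))
print(vector)
```

A Counter returns 0 for missing keys, so words that never occur in a paragraph need no special handling.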

Why is [9] returned as "\t" in Erlang?

I am working through some Erlang tutorials and noticed when I enter
[8].
the VM returns "\b"
or if I enter
[9].
the VM returns "\t"
I am confused on why this is happening. Other numbers are returned as a list of that number:
[3].
is returned as [3]
[4].
is returned as [4], etc.
I guess the question is: why does the Erlang VM return it this way? Perhaps an explanation of the difference between the list [65] and the string "A" would help.
Another related item is confusing as well:
Type conversion, converting a list to an integer is done as:
list_to_integer("3").
Not
list_to_integer([3]).
Which returns an error
In Erlang there are no real strings. Strings are lists of integers. So if you give a list of integers that all represent characters, it will be displayed as a string.
1> [72, 101, 108, 108, 111].
"Hello"
If you specify a list with at least one element that does not have a character counterpart, then the list will be displayed as a list.
2> [72, 101, 108, 108, 111, 1].
[72,101,108,108,111,1]
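The guess the shell makes can be approximated in a few lines. This is a rough Python 3 sketch, not Erlang's actual implementation: it assumes the printable set is the Latin-1 ranges 32..126 and 160..255 plus the control characters that have escapes (\b, \t, \n, \v, \f, \r, \e), and the function names are made up for the example.

```python
# Control characters that have a backslash escape: \b \t \n \v \f \r \e
ESCAPED_CONTROLS = {8, 9, 10, 11, 12, 13, 27}


def is_printable(code):
    """Assumed Latin-1 printable ranges plus the escaped control characters."""
    return 32 <= code <= 126 or 160 <= code <= 255 or code in ESCAPED_CONTROLS


def shell_shows_as_string(items):
    """True if every element of a non-empty list looks like a character."""
    return len(items) > 0 and all(is_printable(c) for c in items)


for sample in ([8], [9], [3], [72, 101, 108, 108, 111], [72, 101, 108, 108, 111, 1]):
    print(sample, shell_shows_as_string(sample))
```

Under these assumptions [8] and [9] count as strings because 8 and 9 are backspace and tab, while 3 and 4 have no printable form, so [3] and [4] stay lists.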
In Erlang strings are lists and the notation is exactly the same.
[97,98,99].
returns "abc"
The following excerpt is taken directly from "Learn You Some Erlang for Great Good!", Fred Hébert, (C)2013, No Starch Press. p. 18.
This is one of the most disliked things in Erlang: strings. Strings are lists, and the notation is exactly the same. Why do people dislike it?
Because of this:
3> [97,98,99,4,5,6].
[97,98,99,4,5,6]
4> [233].
"é"
Erlang will print lists of numbers as numbers only when at least one of them could not also represent a letter. There is no such thing as a real string in Erlang!
"Learn You Some Erlang for Great Good!" is also available online at: http://learnyousomeerlang.com/
kadaj answered your first question. Regarding the second one about list_to_integer: if you look at the documentation, most list_to_XXX functions (all except list_to_binary, list_to_bitstring, and list_to_tuple) treat their argument as a string. Calling them string_to_XXX could be clearer, but changing the name would break a lot of code.

Poker hand range parser ... how do I write the grammar?

I'd like to build a poker hand-range parser, whereby I can provide a string such as the following (assume a standard 52-card deck, ranks 2-A, s = suited, o = offsuit):
"22+,A2s+,AKo-ATo,7d6d"
The parser should be able to produce the following combinations:
6 combinations for each of 22, 33, 44, 55, 66, 77, 88, 99, TT, JJ, QQ, KK, AA
4 combinations for each of A2s, A3s, A4s, A5s, A6s, A7s, A8s, A9s, ATs, AJs, AQs, AKs
12 combinations for each of ATo, AJo, AQo, AKo
1 combination of 7(diamonds)6(diamonds)
I think I know parts of the grammar, but not all of it:
NM+ --> NM, N[M+1], ... ,N[N-1]
NN+ --> NN, [N+1][N+1], ... ,TT where T is the top rank of the deck (e.g. Ace)
NP - NM --> NM, N[M+1], ... ,NP
MM - NN --> NN, [N+1][N+1], ..., MM
I don't know the expression for the grammar for dealing with suitedness.
I'm a programming newbie, so forgive this basic question: is this a grammar induction problem or a parsing problem?
Thanks,
Mike
Well, you should probably look at EBNF to express your grammar in a widely accepted notation.
I think it would look something like this:
S = Combination { ',' Combination } .
Combination = Hand ['+' | '-' Hand] .
Hand = Card Card ["s" | "o"] .
Card = rank [ color ] .
Where {} means 0 or more occurrences, [] means 0 or 1 occurrence and | means either what's left of | or what's right of |.
So basically what this comes down to is a start symbol (S) that says that the parser has to handle one or more combinations, all separated by a ",".
These combinations consist of a description of a hand and then either a "+", a "-" and another hand description, or nothing.
A card description consists of rank and optionally a color (spades, hearts, etc.). The fact that rank and color aren't capitalized shows that they can't be further divided into subparts (making them a terminal class).
My example doesn't provide the offsuit/suited possibility, and that is mainly because in your examples the o/s comes at the very end one time ("AK-ATo") and in the middle another time ("A2s+").
Are these examples your own creation, or are they given to you from an external source (read: you can't change them)?
If you can change them, I would strongly recommend placing those at one specified position of a combination (for example at the end) to make creating the grammar, and ultimately the parsing, a lot easier.
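To make the expansion rules from the question concrete, here is a small Python 3 sketch that handles the four item shapes in the example string with regular expressions rather than a generated parser. It is one possible reading of the notation, not a full grammar: it assumes the higher rank is written first, that both ends of a "-" range share the high card and suitedness, and that a non-pair without s or o means all 16 combinations. All names are made up for the example.

```python
import re

RANKS = "23456789TJQKA"
HAND = r"([2-9TJQKA])([2-9TJQKA])([so]?)"
# A hand, optionally followed by "+" or by "-" and a second hand.
TOKEN = re.compile(rf"^{HAND}(?:(\+)|-{HAND})?$")
# Two fully specified cards such as "7d6d".
SPECIFIC = re.compile(r"^[2-9TJQKA][cdhs][2-9TJQKA][cdhs]$")


def expand(token):
    """Expand one comma-separated item into a list of hand classes."""
    if SPECIFIC.match(token):
        return [token]
    match = TOKEN.match(token)
    if match is None:
        raise ValueError(f"cannot parse {token!r}")
    high, low, kind, plus, high2, low2, _kind2 = match.groups()
    if high == low:
        # Pairs: "22+" runs up to aces, "TT-77" covers the span between.
        first = last = RANKS.index(high)
        if plus:
            last = len(RANKS) - 1
        elif high2:
            last = RANKS.index(high2)
        first, last = sorted((first, last))
        return [RANKS[i] * 2 for i in range(first, last + 1)]
    # Non-pairs: the high card stays fixed and the kicker varies.
    first = last = RANKS.index(low)
    if plus:
        last = RANKS.index(high) - 1
    elif low2:
        last = RANKS.index(low2)  # assumes both ends share the high card
    first, last = sorted((first, last))
    return [high + RANKS[i] + kind for i in range(first, last + 1)]


def combinations(hand):
    """Number of two-card combinations behind one hand class."""
    if len(hand) == 4:
        return 1   # exact cards
    if hand[0] == hand[1]:
        return 6   # pocket pair
    return {"s": 4, "o": 12}.get(hand[2:], 16)


def parse_range(text):
    return [hand for token in text.split(",") for hand in expand(token.strip())]


hands = parse_range("22+,A2s+,AKo-ATo,7d6d")
total = sum(combinations(hand) for hand in hands)
print(len(hands), total)
```

For a notation this small, one regular expression per item is enough; a grammar-driven parser becomes worthwhile once items can nest or the suffix positions vary.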
