Due to a technical problem, all the spaces in my sentences have been removed (except the ones after full stops):
mystring='thisisonlyatest. andhereisanothersentense'
Is there any way in Python to get readable output like this?
"this is only a test. and here is another sentense."
If you have a list of valid common words (lists like this can be found on the internet for different languages), you can take every prefix of the text, check whether it is a valid word, and recursively repeat with the rest of the sentence. Use memoization to prevent redundant computation on the same suffixes.
Here is an example in Python. The lru_cache decorator adds memoization to the function, so that the sentences for each suffix are computed only once, independently of how the first part has been split. Note that words is a set, for O(1) lookup; a prefix tree (trie) would work very well, too.
words = {"this", "his", "is", "only", "a", "at", "ate", "test",
"and", "here", "her", "is", "an", "other", "another",
"sent", "sentense", "tense", "and", "thousands", "more"}
max_len = max(map(len, words))
import functools
functools.lru_cache(None)
def find_sentences(text):
if len(text) == 0:
yield []
else:
for i in range(min(max_len, len(text)) + 1):
prefix, suffix = text[:i], text[i:]
if prefix in words:
for rest in find_sentences(suffix):
yield [prefix] + rest
mystring = 'thisisonlyatest. andhereisanothersentense'
for text in mystring.split(". "):
print(repr(text))
for sentence in find_sentences(text):
print(sentence)
This will give you a list of valid (but possibly nonsensical) ways to split the sentence into words. They may be few enough that you can pick the right one by hand; otherwise you might have to add a post-processing step, e.g. part-of-speech analysis with a proper NLP framework.
I have read the RFC on the ABNF specification and am having difficulty understanding how a set of ABNF rules could be used to reliably extract tokens from an input string that matches the grammar. The specification never mentions tokens or ASTs, so it may not concern itself with that, but I believe that would be the ultimate goal of applying any BNF grammar, unless I am mistaken.
In the specification, they list example rules for parsing a postal-address:
postal-address = name-part street zip-part
name-part = *(personal-part SP) last-name [SP suffix] CRLF
name-part =/ personal-part CRLF
personal-part = first-name / (initial ".")
first-name = *ALPHA
initial = ALPHA
last-name = *ALPHA
suffix = ("Jr." / "Sr." / 1*("I" / "V" / "X"))
street = [apt SP] house-num SP street-name CRLF
apt = 1*4DIGIT
house-num = 1*8(DIGIT / ALPHA)
street-name = 1*VCHAR
zip-part = town-name "," SP state 1*2SP zip-code CRLF
town-name = 1*(ALPHA / SP)
state = 2ALPHA
zip-code = 5DIGIT ["-" 4DIGIT]
There is also a list of core rules, which I won't post here, describing common constructs such as ALPHA, DIGIT, SP, CRLF and VCHAR.
Ultimately, what I would like to do is figure out the rules necessary for taking the input
John H. Doe
12345 Fakestreet
Springfield, IL 55555
and generating what I believe would be the correct token sequence which is:
["John", " ", "H", ".", "Doe", "\r\n",
"12345", " ", "Fakestreet", "\r\n",
"Springfield", ",", " ", "IL", " ", "55555", "\r\n"]
(I believe the spaces and CRLFs need to be returned as "tokens" because they are specified as requirements in certain rules)
Some problems I am considering:
It makes sense that "Fakestreet" should be its own token, but according to the definition it is a variable repetition of the visible-character core rule. Ideally I would not like to read out each letter as its own token ("F", "a", "k", and so on), so (assuming core-rules can be treated as terminals?) any potential token string would need to be checked against the entire, theoretically infinite, rule definition 1*VCHAR to see if it is a match. And some rules are more complicated than that, like zip-code's 5DIGIT ["-" 4DIGIT], but any potential token needs to be checked against this rule as well ("12345" and "12345-6789" are both valid tokens). So it seems like entire rule element concatenations need to be checked completely as well, unless "12345-6789" should rather be tokenized as ["12345", "-", "6789"] which... may be correct?
I'd assume we would not want to fully check rules that reference other rules; otherwise we might end up tokenizing the entire postal-address as a single token of type "postal-address". Maybe rules that reference other rules shouldn't be checked at all? Or maybe there is such a thing as a "terminal rule" that contains no rule references (excluding core rules)?
Occasionally, terminal values are combined with rule references: for instance, the definition of "personal-part" contains the literal ".". So while we may not want to match a potential token string against the entire "personal-part" rule, it seems we do want to match it against the literal ".", because it is a required token for parsing a personal-part. Maybe in non-terminal rules, only the terminal values listed there should be considered?
I realize this is a lengthy question, but BNF supersets like EBNF and ABNF seem to be used for exactly this kind of thing, yet I cannot find a standard specification of how to tokenize from an ABNF grammar.
(Sorry for my broken English)
What I'm trying to do is match either a word (with or without numbers and special characters) or whitespace characters (spaces, tabs, optional newlines) in a string in Lua.
For example:
local my_string = "foo bar"
my_string:match(regex) --> should return 'foo', ' ', 'bar'
my_string = " 123!#." -- note: three whitespaces before '123!#.'
my_string:match(regex) --> should return ' ', ' ', ' ', '123!#.'
Where regex is the Lua regular expression pattern I'm asking for.
Of course I've done some research on Google, but I couldn't find anything useful. What I've got so far is [%s%S]+ and [%s+%S+], but neither seems to work.
Any solution using the standard library, e.g. string.find, string.gmatch etc., is OK.
match returns either the captures or the whole match, and your patterns don't define any captures. [%s%S]+ matches "(space or non-space) one or more times", which is basically everything. [%s+%S+] is plain wrong: the character class [ ] is a set of single-character members, so it does not treat sequences of characters specially ([cat] matches "c", "a" or "t"), nor does it care about +. [%s+%S+] therefore means "a single character that is a space, a plus, or a non-space".
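That character classes match exactly one character is true in most pattern flavors, not just Lua's; here is a quick illustration with Python's re module (\s and \S are Python's analogues of Lua's %s and %S):

import re

# Inside [ ], '+' is a literal plus, and the class matches exactly ONE character.
print(bool(re.fullmatch(r'[\s+\S+]', 'a')))    # True: any single character
print(bool(re.fullmatch(r'[\s+\S+]', 'ab')))   # False: two characters
print(bool(re.fullmatch(r'[\s\S]+', 'a b')))   # True: "anything", repeated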
The first example 'foo', ' ', 'bar' could be solved by:
regex="(%S+)(%s)(%S+)"
If you want a variable number of captures you are going to need the gmatch iterator:
local capt = {}
for q, w, e in my_string:gmatch("(%s*)(%S+)(%s*)") do
    if q and #q > 0 then
        table.insert(capt, q)
    end
    table.insert(capt, w)
    if e and #e > 0 then
        table.insert(capt, e)
    end
end
This will not, however, split a run of whitespace into single spaces (a run such as "   " comes back as one capture), and it will not match a string consisting only of whitespace; you'll need to handle those cases when processing the match results.
Lua's standard patterns are simplistic; if you need more intricate matching, have a look at the Lua LPeg library.
Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
This means: match anything that is not one of the alternatives inside the parentheses. Now, I know that in flex I can negate character classes (e.g. [^ab]), but some of the rules I want to negate are more complicated than a single character, so I don't think I can use character classes for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure what the way to do that in flex would be.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
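At the automaton level the construction is mechanical: make the DFA's transition function total, then swap the accepting and non-accepting states; it's converting the result back into a regular expression that gets ugly. A minimal illustration of the complement step in Python (the DFA encoding here is invented just for this sketch):

# A DFA as a total transition table; complementing it just swaps
# which states accept.
def accepts(delta, start, accepting, s):
    state = start
    for ch in s:
        state = delta[(state, ch)]
    return state in accepting

# Toy DFA over {'a', 'b'} that accepts strings containing "ab".
delta = {(0, 'a'): 1, (0, 'b'): 0,
         (1, 'a'): 1, (1, 'b'): 2,
         (2, 'a'): 2, (2, 'b'): 2}
states, accepting = {0, 1, 2}, {2}

print(accepts(delta, 0, accepting, 'aab'))           # True: contains "ab"
print(accepts(delta, 0, states - accepting, 'aab'))  # False: the complement DFA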
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that: from the accepting start state 0, a " moves to state 1 and a second " moves to state 2; any other character returns to state 0, and a third " from state 2 would complete the terminator.
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for tracks how much of END has just been seen: state 0 (accepting), state 1 after an E, state 2 after EN; a D in state 2 would complete the forbidden END,
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E in state 1:
(E|NE)*: stay in state 1
[^EN]: back to state 0
N[^ED]: back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
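For this example, at least, the result is easy to machine-check; here is a small sketch with Python's re module, using the starred pattern above:

import re

# "Does not contain END, and does not stop part-way through a terminator."
pat = re.compile(r'(?:[^E]|E(?:E|NE)*(?:[^EN]|N[^ED]))*')

print(pat.fullmatch('stuff') is not None)       # True
print(pat.fullmatch('ENENDstuff') is not None)  # False: contains END
print(pat.fullmatch('stuffE') is not None)      # False: ends mid-terminator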
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ

start \"\"\"
end \"\"\"

%%

{start} { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }

<TRIPLEQ>.|\n { /* Append the next token to yytext instead of
                 * replacing yytext with the next token
                 */
                yymore();
                /* No return yet, flex continues */
              }

<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 return STRING;
               }
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)
Suppose I want a function that takes a number and returns it as a string, exactly as it was given. The following doesn't work:
SetAttributes[foo, HoldAllComplete];
foo[x_] := ToString[Unevaluated@x]
The output for foo[.2] and foo[.20] is identical.
The reason I want to do this is that I want a function that can understand dates with dots as delimiters, e.g., f[2009.10.20]. I realize that's a bizarre abuse of Mathematica, but I'm making a domain-specific language and want to use Mathematica as its parser by just doing an eval (ToExpression). I can actually make this work if I can rely on double-digit days and months, like 2009.01.02, but I also want to allow 2009.1.2, and that ends up boiling down to the above question.
I suspect the only answer is to pass the thing in as a string and then parse it, but perhaps there's some trick I don't know. Note that this is related to this question: Mathematica: Unevaluated vs Defer vs Hold vs HoldForm vs HoldAllComplete vs etc etc
I wouldn't rely on Mathematica's float-parsing. Instead I'd define rules on MakeExpression for foo. This allows you to intercept the input, as boxes, prior to it being parsed into floats. This pair of rules should be a good starting place, at least for StandardForm:
MakeExpression[RowBox[{"foo", "[", dateString_, "]"}], StandardForm] :=
With[{args = Sequence ## Riffle[StringSplit[dateString, "."], ","]},
MakeExpression[RowBox[{"foo", "[", "{", args, "}", "]"}], StandardForm]]
MakeExpression[RowBox[{"foo", "[", RowBox[{yearMonth_, day_}], "]"}],
StandardForm] :=
With[{args =
Sequence ## Riffle[Append[StringSplit[yearMonth, "."], day], ","]},
MakeExpression[RowBox[{"foo", "[", "{", args, "}", "]"}], StandardForm]]
I needed the second rule because the notebook interface will "helpfully" insert a space if you try to put a second decimal place in a number.
EDIT: In order to use this from the kernel, you'll need to use a front end, but that's often pretty easy in version 7. If you can get your expression as a string, use UsingFrontEnd in conjunction with ToExpression:
UsingFrontEnd[ToExpression["foo[2009.09.20]", StandardForm]]
EDIT 2: There are a lot of possibilities if you want to play with $PreRead, which allows you to apply special processing to the input, as strings, before it is parsed.
$PreRead = If[$FrontEnd =!= Null, #1,
    StringReplace[#, x : NumberString /; StringMatchQ[x, "*.*0"] :>
      StringJoin[x, "`", ToString[
        StringLength[StringReplace[x, "-" -> ""]] -
          Switch[StringTake[StringReplace[x, "-" -> ""], 1],
            "0", 2, ".", 1, _, 1]]]]] &;
will display foo[.20] as foo[0.20]. The InputForm of it will be
foo[0.2`2.]
I find parsing and displaying number formats in Mathematica more difficult than it should be...
Floats are, IIRC, parsed by Mathematica into actual floats, so there's no real way to do what you want.
Is there anything better than string.scan(/(\w|-)+/).size (the - is so, e.g., "one-way street" counts as 2 words instead of 3)?
string.split.size
Edited to explain multiple spaces
From the Ruby String Documentation page
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array
of these substrings.
If pattern is a String, then its contents are used as the delimiter
when splitting str. If pattern is a single space, str is split on
whitespace, with leading whitespace and runs of contiguous whitespace
characters ignored.
If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern contains groups, the respective
matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is
the default), str is split on whitespace as if ' ' were specified.
If the limit parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
" now's the time".split #=> ["now's", "the", "time"]
While that documentation is current as of this edit, this also worked back on Ruby 1.7 (IIRC), where I learned it; I just tested it on 1.8.3.
I know this is an old question, but this might be useful to someone else looking for something more sophisticated than string.split. I wrote the words_counted gem to solve this particular problem, since defining words is pretty tricky.
The gem lets you define your own custom criteria or use the out-of-the-box regexp, which is pretty handy for most use cases. You can pre-filter words with a variety of options, including a string, a lambda, an array, or another regexp.
counter = WordsCounted::Counter.new("Hello, Renée! 123")
counter.word_count #=> 2
counter.words #=> ["Hello", "Renée"]
# filter the word "hello"
counter = WordsCounted::Counter.new("Hello, Renée!", reject: "Hello")
counter.word_count #=> 1
counter.words #=> ["Renée"]
# Count numbers only
counter = WordsCounted::Counter.new("Hello, Renée! 123", rexexp: /[0-9]/)
counter.word_count #=> 1
counter.words #=> ["123"]
The gem provides a bunch more useful methods.
If the 'word' in this case can be described as a sequence of letters which may include '-', then the following solution may be appropriate (assuming that everything that doesn't match the 'word' pattern is a separator):
>> 'one-way street'.split(/[^-a-zA-Z]/).size
=> 2
>> 'one-way street'.split(/[^-a-zA-Z]/).each { |m| puts m }
one-way
street
=> ["one-way", "street"]
However, there are some other symbols that can be included in the regex, for example the apostrophe, to support words like "it's".
This is pretty simplistic, but it does the job if you are typing words with spaces in between. It ends up counting numbers as well, but I'm sure you could edit the code to skip them.
puts "enter a sentence to find its word length: "
word = gets
word = word.chomp
splits = word.split(" ")
target = splits.length.to_s
puts "your sentence is " + target + " words long"
The best way to do this is to use the split method.
split divides a string into substrings based on a delimiter, returning an array of the substrings.
split takes two parameters: pattern and limit.
pattern is the delimiter over which the string is to be split into an array.
limit specifies the number of elements in the resulting array.
For more details, refer to the Ruby String documentation.
str = "This is a string"
str.split(' ').size
#output: 4
The above code splits the string wherever it finds a space, so the number of words in the string is simply the size of the resulting array.
The above solution fails when the string contains punctuation; consider:
"one-way, street"
You will get
["one-way","", "street"]
Use
'one-way street'.gsub(/[^-a-zA-Z]/, ' ').split.size
This splits words only on ASCII whitespace characters:
p " some word\nother\tword|word".strip.split(/\s+/).size #=> 4