I am working on a word game where the user creates words from an ever-changing grid of letters. Validating the user's selection is easy enough to do using a word list.
Since the playing grid is randomly generated, and previously played tiles are removed and replaced with new letters, I need to be able to efficiently check for possible valid plays between each user submission, so that if there are no possible valid words I can reset the grid or something to that effect. The solution only needs to detect that there is at least one valid 3-7 letter word within the current set of tiles; it does not need to find every possible combination. A user can start on any tile and build a word by moving one tile in any direction from the currently selected letter.
Important: The solution can't slow gameplay down. As soon as the user submits the current word and the new tiles appear, they can start a new selection without delay.
Any direction would be greatly appreciated, as I have not been able to find what I think I'm looking for with any Google searches so far.
Building with Swift for iOS8+
As #jamesp mentioned, a trie is your best bet. You have a couple of advantages here, the first being that you can bail out the second you have found a word. You don't have to scan the whole grid for all possible words; just find one and be done with it. Start with any random letter and look at the ones around it, matching them against the trie. If you find a match, continue with the letters around that one, and so on, until you have found a word or hit a dead end. If the current start tile doesn't lead to a full word, move on to the next tile in the grid and try again from there.
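To make that concrete, here is a minimal trie sketch (in Python for brevity; it ports directly to Swift, and all of the names here are my own, illustrative choices):

# A bare-bones trie: each node maps a letter to a child node and records
# whether the path from the root spells a complete word.
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True
    return root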
It's going to take quite a bit of processing time to get through a problem like that, especially since the words can twist and turn.
There are a couple of ways to approach that, but if you have a fast dictionary lookup, you'll probably want to step through your puzzle, starting at the upper-left tile, and look at the letter there. Say it is "S". Your dictionary will provide you with a list of acceptable "S" words. You can step through those words, looking at the second letter of each and checking whether there is a tile adjacent to the current one with that letter. If not, you're done: move to the next tile. If so, you can repeat that exact same process recursively for words starting with "S" plus the next letter.
For instance, say the current tile is "S". You'd look up your list of "S" words, and start looping through them. One of them is "Syzygy". If there is no adjacent "Y" tile, you're done - move to the next word. If there is a "Y" tile adjacent, look around that tile for a "Z". If there is none, you're done. Otherwise, move to the second "Y", and so on. (If tiles can only be used once for a word, you may have to remember tiles to exclude from later letters, so that the player can't use the same "Y" three times to spell that "Syzygy".)
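A rough sketch of that recursive walk, using a trie (as in the sketch above) for the prefix checks rather than per-letter word lists; the grid representation, the neighbors helper, and the 3-7 letter bounds are assumptions taken from the question:

def neighbors(pos, grid):
    # Yield the up-to-eight cells adjacent to pos that exist in the grid.
    r, c = pos
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0) and (r + dr, c + dc) in grid:
                yield (r + dr, c + dc)

def search(pos, node, grid, used, length):
    child = node.children.get(grid[pos])
    if child is None:
        return False   # dead end: no word continues through this tile
    if child.is_word and length >= 3:
        return True    # bail out as soon as any 3+ letter word appears
    if length == 7:
        return False   # the game only needs 3-7 letter words
    used.add(pos)      # remember this tile so it can't be reused
    found = any(search(n, child, grid, used, length + 1)
                for n in neighbors(pos, grid) if n not in used)
    used.discard(pos)  # backtrack
    return found

def has_valid_word(grid, trie):
    # grid is assumed to be a dict mapping (row, col) -> letter.
    return any(search(pos, trie, grid, set(), 1) for pos in grid)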
I'd wager that the vast majority of your tiles would be excluded quickly with this approach, but it could still take a long time to process, especially if your grid is large. You can address this by running the check in the background, letting the player continue to play while it runs, and then showing an alert when you finally ascertain that there are no valid plays.
Keep in mind that just because there is one valid word in the puzzle, that doesn't mean it's really solvable by the end user. This sort of puzzle isn't like your typical match-three game, where if you just look long enough, you'll find the match. Most "valid word" lists are going to contain many words that most people aren't going to know. Words like "propale" and "helctic" and "syzygy". (And you can't exclude words like that, because then when someone finds one, instead of the intense satisfaction of finding an obscure word, they get the intense frustration of "But that's a real word, dang it!")
So, probably what you want is an assessment of how obscure the existing words are. If "dog" is the only word available, that's still probably pretty solvable for most people, whereas if the only words available are "propale" and "helctic" and "syzygy", that's probably impossible for most people, even though there are more words available.
To do that, you'll need to rank your dictionary words by how common they are, and then add up the ubiquity scores of the existing words to make that sort of assessment, calling the puzzle "unsolvable" if it doesn't reach a certain threshold. Same algorithm, but you'll be adding a score for each word you find. You can't just quit when you find the first word, but you can quit when you reach that threshold.
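A sketch of that scoring variant, assuming the search above is modified to yield every word it finds instead of stopping at the first; word_scores and the threshold are placeholders you would tune:

def solvable_enough(found_words, word_scores, threshold):
    total = 0
    for word in found_words:
        total += word_scores.get(word, 0)  # obscure words contribute little
        if total >= threshold:
            return True                    # common enough: stop searching
    return False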
If this sounds daunting, that's because you've designed yourself into a corner. A better approach might be letting the user swap some tiles if they're stumped, or letting them add a wildcard letter, etc.: things that they manage, so that even if there are no real words in the puzzle, they still have strategic options. Then you don't even have to solve this problem, and it solves the deeper problem of knowing whether a puzzle is practically solvable rather than just technically solvable.
Why not construct your grid by first putting a randomly selected valid word somewhere on it, and then filling in the blank spaces with random letters?
Edit
So if you can't do that, one way that might be quick enough is to organise your words in a trie. This might be enough to make the search fast enough. The idea would be to iterate through the letters in the grid and, for each one, select the matching first letter in the trie. This narrows the search for each allowed neighbour, and so on, until either your trie runs out or you find a word.
I would also select the random letters in a distribution that mirrors the distribution of letters in your dictionary. One way to do this is to:
count the number of each letter in your dictionary to give each one a weight
generate a random number between 0 and the total of all the weights
iterate through the letters, subtracting each one's weight from your random number
when the subtraction gets below zero, the letter you are on is the one you want.
You can speed the above up by using a binary search, but there are only 26 letters, so the extra complication isn't worth it.
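In Python, random.choices does this weighted pick directly, but spelled out, the loop looks something like this (letter_counts is assumed to map each letter to its count in your dictionary):

import random

def random_letter(letter_counts):
    total = sum(letter_counts.values())
    r = random.uniform(0, total)
    for letter, weight in letter_counts.items():
        r -= weight           # subtract each weight from the random number
        if r < 0:
            return letter     # below zero: this is the letter we want
    return letter             # guard against floating-point edge cases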
For those who are not familiar with what a homophone is, I provide the following examples:
our & are
hi & high
to & too & two
While using the Speech API included with iOS, I am encountering situations where a user may say one of these words, but it will not always return the word I want.
I looked into the alternativeSubstrings property, wondering if this would help, but in my testing of the above words, it always comes back empty.
I also looked into the Natural Language API, but could not find anything in there that looked useful.
I understand that as a user adds more words, the Speech API can begin to infer context and correct for these, but my use case will not work well with this, since it will often involve only one or two words at most, limiting the effectiveness of context.
An example of contextual processing:
Using the words above on their own, I get these results:
are
hi
to
However, if I put those words together into the following sentence, you can see that the single-word results above would all have been wrong:
I am too high for our ladder
Ideally, I would either get a list back containing [are, our], [to, too, two], [hi, high] for each transcription segment, or would have a way to compare a string against a function that supports homophones.
An example of this would be:
if myDetectedWord == "to" then { ... }
Where myDetectedWord can be [to, too, two], and this function would return true for each of these.
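A sketch of what such a function could look like (in Python for brevity; the homophone table is hand-rolled, as I am not aware of any built-in API that provides one):

HOMOPHONES = [
    {"our", "are"},
    {"hi", "high"},
    {"to", "too", "two"},
]

def matches_homophone(detected, expected):
    # True if the words are identical or share a homophone group.
    if detected == expected:
        return True
    return any(detected in group and expected in group
               for group in HOMOPHONES)

print(matches_homophone("two", "to"))   # True
print(matches_homophone("two", "our"))  # False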
This is a common NLP dilemma, and I'm not sure what your desired output should be in this application. However, you may want to bypass this problem in your design/architecture process, if you can. Otherwise, this problem turns into a real challenge.
That being said, if you wish to really get into it, I like this idea of yours:
string against a function
This might be more efficient and performance-friendly.
One way I'd like to solve this problem is through regex processing, instead of endless loops and arrays. You could prototype with loops and arrays to begin with and see how it works, then switch to regular expressions to gain performance.
You could, for instance, define fixed alternation groups in regular expressions and quickly check them against your string (word by word, maybe using back-referencing), and you can add as many boundaries to your expressions as you wish.
Your fixed groups can also be designed around the probability of certain words occurring in certain parts of a string. For instance,
^I
vs
^eye
The probability of I being the first word is much higher than that of eye.
The probability of I appearing in any part of a string is also higher than that of eye.
You might want to weight words based on that.
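As a tiny sketch of that idea (Python regex syntax; the groups and sample sentence are just the examples from this thread):

import re

I_GROUP = re.compile(r"^(i|eye)\b", re.IGNORECASE)        # anchored: first word only
TO_GROUP = re.compile(r"\b(to|too|two)\b", re.IGNORECASE)

text = "I am too high for our ladder"
print(bool(I_GROUP.match(text)))    # True: the sentence starts with the I/eye group
print(bool(TO_GROUP.search(text)))  # True: it contains the to/too/two group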
I'd say the key is to narrow down your desired outputs as tightly as possible to increase accuracy (maybe even to just 100 words, if possible) if you want a good, working application.
Good project, though; I hope you enjoy the challenge.
So I am working on an artist classification project that utilizes hip-hop lyrics from genius.com. The problem is that these lyrics are user-generated, so the same word can be spelled in various different ways, especially if it is slang, which is a very common case in hip-hop.
I looked into spell correction using hunspell/pyhunspell, but the problem is that it doesn't fix slang misspellings. I could technically make a mini dictionary with a bunch of misspelled variations, but that is effectively useless, because there could be a dozen variations of the same word across my (growing) 6,000-song corpus.
Any suggestions?
You could try to stem your words (more information on stemming here). This would help group together words with close spelling variations.
A popular stemming scheme is the Porter Stemmer, implementations of which can be found in most NLP packages, e.g. NLTK.
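For example, with NLTK (assuming it is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "ran", "runnin"]:
    print(word, "->", stemmer.stem(word))
# "running" and "runs" both reduce to "run"; "ran" and "runnin" do not,
# which shows the kind of variation stemming will and won't catch.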
I would discard, if possible, short words or contracted words, which are too hard to correct automatically (conditioned on checking that this won't affect your final result).
For longer words, you may want to use metrics like Levenshtein distance or Jaro similarity. The first is the minimum number of insertions, deletions, or substitutions needed to convert one candidate word into another. The second provides a similar result, between 0 and 1; its Jaro-Winkler variant puts extra emphasis on the first characters of a word.
If you have access to the correct version of your slang word, you could convert the closest candidates to the correct one, while of course trying not to apply it to words that are already correct.
If you're working with Python, some implementations are provided here.
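For illustration, here is a minimal pure-Python Levenshtein distance plus the "convert the closest candidate" step described above; max_distance is an arbitrary cutoff to tune, and a library implementation will be faster in practice:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def canonicalize(word, vocabulary, max_distance=2):
    # Map a slang spelling to the closest known word, if it is close enough.
    # vocabulary is assumed to be a non-empty collection of correct words.
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else word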
I have a list of approximately 100,000 names I need to process. Some are business names, some are people's names. Unfortunately, some are lowercase, some are uppercase, and some are mixed. I am looking for a routine to convert them to proper case (sometimes called mixed or title case). I realize I can just loop through the string and capitalize every character that starts a new word, but that would be an incredibly simplistic approach. For businesses, short words should be lowercase (of, with, for, ...). For last names, if it starts with Mc, the third letter should be capitalized (McDermot, McDonald, etc.). Roman numerals should always be capitalized (John Smith II), etc.
I have not been able to find any routines for this, built into Delphi or otherwise. Surely this is out there. Where can I find it?
Thanks
As others have already said, making a fully automated routine for this is nearly impossible due to the many special variations, so leaving out human interaction completely is not realistic.
What you can do instead is make this much easier for a human to solve. How? Build a dictionary of all the name variations in lowercase and present it to them.
Before presenting the names, you can make sure that the first letter of each name is already capitalized.
Once all the name corrections have been made in the dictionary, you can automatically replace all the names in the original database.
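A rough sketch of that first-pass capitalization, with the special cases from the question bolted on (in Python for brevity; the word lists are tiny placeholders to extend, and a human still needs to review the output):

import re

LOWER = {"of", "with", "for", "and", "the"}   # short words to keep lowercase
ROMAN = re.compile(r"^(i|ii|iii|iv|v|vi|vii|viii|ix|x)$", re.IGNORECASE)

def proper_case(name):
    out = []
    for i, word in enumerate(name.lower().split()):
        if ROMAN.match(word):
            out.append(word.upper())                  # John Smith II
        elif word.startswith("mc") and len(word) > 2:
            out.append("Mc" + word[2:].capitalize())  # McDermot, McDonald
        elif i > 0 and word in LOWER:
            out.append(word)                          # short words stay lowercase
        else:
            out.append(word.capitalize())
    return " ".join(out)

# proper_case("john mcdermot ii") -> "John McDermot II"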
Summary
I am trying to design a heuristic for matching up sentences in a translation (from the original language to the translated language) and would like guidance and tips. Perhaps there is a heuristic that already does something similar? Given two text files, I would like to be able to match up the sentences (so I can pick out a sentence and say this is the translation of that sentence).
Details
The input text would be translated novels. So I do not expect the translations to be literal, although, using something like google translate might be a good way to test the accuracy of the heuristic.
To help me, I have a library that will gloss the contents of the translated text and give me the definitions of the words in the sentence. Other things I know:
Chapters and order are preserved; I know that the first sentence in chapter three will match with the first sentence in chapter three of the translation (Note, this is not strictly true; the first sentence might match up with the first two sentences, or even the second sentence)
I can calculate the overall size (characters, sentences, paragraphs), which could give me an idea of the average difference in sentence length (for example, the translation might be 30% longer).
Looking at some books I have, the translated version has about 30% more sentences than the original text.
Implementation
(if it matters)
I am planning to do this in Java, but I am not that fussed; any language will do.
I am not greatly concerned about speed.
I guess that, to be sure of the matches, some user feedback might be required, like saying "Yes, this sentence definitely matches with that sentence." This would give the heuristic some more ground to stand on. It would mean that the user needs a little proficiency in the languages.
Background
(for those interested)
The reason I want to make this is that I want it to assist with my foreign-language study. I am studying Japanese and find it hard to find "good" material (where "good" is defined by what I like). There are already tools that do something similar with subtitles from videos (an easier task, using the timing information of the video), but nothing, as far as I know, for texts.
There are tools called "sentence aligners" used in NLP research that do exactly what you want.
I recommend hunalign:
http://mokk.bme.hu/resources/hunalign/
and MS sentence aligner:
http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/
Both are quite OK, but remember that nothing is perfect. Sentences that are too hard to align will be dropped, and some sentences may be wrongly aligned.
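If you want a feel for what such tools do internally, here is a toy length-based aligner in the spirit of Gale & Church: dynamic programming over sentence lengths, allowing 1-1, 1-2, and 2-1 matches. Real aligners like hunalign add lexical evidence on top of this, and the penalty function here is a crude placeholder:

def align(src_lens, tgt_lens):
    # src_lens/tgt_lens are sentence lengths (e.g. in characters).
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):   # allowed match shapes
                if i + di <= n and j + dj <= m:
                    penalty = abs(sum(src_lens[i:i + di]) -
                                  sum(tgt_lens[j:j + dj]))
                    if cost[i][j] + penalty < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = cost[i][j] + penalty
                        back[i + di][j + dj] = (i, j)
    # Walk backwards to recover which sentence indices align with which.
    pairs, i, j = [], n, m
    while back[i][j] is not None:
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return list(reversed(pairs))

# align([50, 62, 40], [55, 45, 44]) -> [([0], [0]), ([1], [1]), ([2], [2])]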
I've done some Google searching but couldn't find what I was looking for.
I'm developing a Scrabble-type word game in Rails, and was wondering if there is a simple way to validate that what the player inputs in the game is actually a word. They'd be typing the word out.
Is validating against some sort of English-language dictionary database loaded within the app the best way to solve this problem? If so, are there any libraries that offer this kind of functionality? If not, what would you suggest?
Thanks for your help!
You need two things:
a word list
some code
The word list is the tricky part. On most Unix systems there's a word list at /usr/share/dict/words or /usr/dict/words -- see http://en.wikipedia.org/wiki/Words_(Unix) for more details. The one on my Mac has 234,936 words in it. But they're not all valid Scrabble words. So you'd have to somehow acquire a Scrabble dictionary, make sure you have the right license to use it, and process it so it's a text file.
(Update: The word list for LetterPress is now open source, and available on GitHub.)
The code is no problem in the simple case. Here's a script I whipped up just now:
# Load every word into a hash so lookups are constant-time.
words = {}
File.open("/usr/share/dict/words") do |file|
  file.each do |line|
    words[line.strip] = true  # strip the trailing newline
  end
end

p words["magic"]
p words["saldkaj"]
This will output
true
nil
I leave it as an exercise for the reader to make it into a proper Words object. (Technically it's not a Dictionary since it has no definitions.) Or to use a DAWG instead of a hash, even though a hash is probably fine for your needs.
A piece of language-agnostic advice here: if you only care about the existence of a word (which, in this case, you do), and you are planning to load the entire database into the application (which your question suggests you're considering), then a DAWG will let you check for existence in O(n) time, where n is the length of the word; dictionary size has no effect, so overall the lookup is essentially O(1). At the same time, it is a relatively minimal structure in terms of memory. Indeed, some insertions will actually reduce the size of the structure: a DAWG for "top, tap, taps, tops" has fewer nodes than one for "tops, tap".