I'm trying to figure out how to take a phrase and split it up into a list of separate strings based on the occurrence of certain words.
Examples are probably the easiest way to explain what I'm hoping to achieve:
List splitters = ['ABOVE', 'AT', 'NEAR', 'IN'];
INPUT: "ALFALFA DITCH IN ECKERT CO";
OUTPUT: ["ALFALFA DITCH", "IN ECKERT CO"];
INPUT: 'ANIMAS RIVER AT DURANGO, CO';
OUTPUT: ['ANIMAS RIVER', 'AT DURANGO, CO'];
INPUT: 'ALAMOSA RIVER ABOVE WILSON CREEK IN JASPER, CO';
OUTPUT: ['ALAMOSA RIVER', 'ABOVE WILSON CREEK IN JASPER, CO'];
Notice in the third example, when there are multiple occurrences of splitters in the input phrase, I only want to use the first one.
To my knowledge, the split() method doesn't support multiple strings, and I can't find a single example of this in Dart. I would think there is a simple solution?
I'd use a RegExp then:
var splitters = ['ABOVE', 'AT', 'NEAR', 'IN'];
var s = "ALFALFA DITCH IN ECKERT CO";
var splitterRE = RegExp(splitters.join('|'));
var match = splitterRE.firstMatch(s);
if (match != null) {
  var partOne = s.substring(0, match.start).trimRight();
  var partTwo = s.substring(match.start);
}
That does what you ask for, but it's slightly unsafe.
It will find "IN" in "BEHIND" if given "BEHIND THE FARM IN ALABAMA".
You likely want to match only complete words. In that case, RegExps are even more helpful, since they can do that too. Change the line to:
var splitterRE = RegExp(r'\b(?:' + splitters.join('|') + r')\b');
then it will only match entire words.
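For comparison, here is a minimal sketch of the same idea in Python (the question is about Dart, so treat this as an illustration only). The function name split_at_first_splitter is made up for this example; it relies on re.search returning the leftmost match and on \b to restrict matches to whole words.

```python
import re

SPLITTERS = ["ABOVE", "AT", "NEAR", "IN"]

# \b...\b makes the pattern match whole words only, so "IN" inside "BEHIND" is skipped.
SPLITTER_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, SPLITTERS)) + r")\b")


def split_at_first_splitter(phrase):
    """Split phrase in two at the first whole-word splitter, if there is one."""
    match = SPLITTER_RE.search(phrase)
    if match is None:
        return [phrase]  # no splitter found: keep the phrase as a single piece
    return [phrase[:match.start()].rstrip(), phrase[match.start():]]


print(split_at_first_splitter("ALAMOSA RIVER ABOVE WILSON CREEK IN JASPER, CO"))
# ['ALAMOSA RIVER', 'ABOVE WILSON CREEK IN JASPER, CO']
```

Only the first match is used, so later splitters such as the "IN" in the third example stay in the second piece.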
Due to some technical problem all the spaces in all sentences are removed. (except fullstops)
mystring='thisisonlyatest. andhereisanothersentense'
Is there any way in python to get the readable output like this...
"this is only a test. and here is another sentense."
If you have a list of valid common words (can be found on the internet for different languages), you can get all the prefixes, check whether they are a valid word, and recursively repeat with the rest of the sentence. Use memoization to prevent redundant computations on same suffixes.
Here is an example in Python. The lru_cache decorator adds memoization to the function so that the sentences for each suffix are calculated only once, independently of how the first part has been split. Note that words is a set for O(1) lookup. A prefix tree (trie) would work very well, too.
import functools

words = {"this", "his", "is", "only", "a", "at", "ate", "test",
         "and", "here", "her", "is", "an", "other", "another",
         "sent", "sentense", "tense", "and", "thousands", "more"}
max_len = max(map(len, words))

@functools.lru_cache(None)
def find_sentences(text):
    # Return a list rather than a generator: a cached generator would already
    # be exhausted the second time the same suffix is requested.
    if len(text) == 0:
        return [[]]
    sentences = []
    for i in range(1, min(max_len, len(text)) + 1):
        prefix, suffix = text[:i], text[i:]
        if prefix in words:
            for rest in find_sentences(suffix):
                sentences.append([prefix] + rest)
    return sentences

mystring = 'thisisonlyatest. andhereisanothersentense'
for text in mystring.split(". "):
    print(repr(text))
    for sentence in find_sentences(text):
        print(sentence)
This will give you a list of valid (but possibly nonsensical) ways to split the sentence into words. Those may be few enough that you can pick the right one by hand; otherwise you might have to add another post-processing step, e.g. using part-of-speech analysis with a proper NLP framework.
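As a much cruder post-processing step than part-of-speech analysis, one could simply prefer the split with the fewest words. This is only a heuristic of my own, not part of the approach above, and the helper name pick_fewest_words is hypothetical. The candidates below are the two splits the example word set allows for the second sentence.

```python
def pick_fewest_words(candidates):
    """Return the candidate split with the fewest words, or None if there are none."""
    return min(candidates, key=len, default=None)


# The two valid splits of 'andhereisanothersentense' with the example word set.
candidates = [
    ["and", "here", "is", "an", "other", "sentense"],
    ["and", "here", "is", "another", "sentense"],
]

print(pick_fewest_words(candidates))
# ['and', 'here', 'is', 'another', 'sentense']
```

Fewer, longer words are often the intended reading, but the heuristic can easily pick the wrong split, so it is no substitute for real language analysis.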
I want to find the three most frequent words in my UITextView and sort them by count.
For example:
"good good good very very good good. bad bad unfortunately bad."
It should produce this:
good (5 times)
bad (3 times)
very (2 times)
How can I do this?
Thanks.
You can use String.components(separatedBy:) to get the words of textView.text, then you can use an NSCountedSet to get the count of each word.
You can of course tweak the separator characters used as an input to components(separatedBy:) to meet your exact criteria.
import Foundation

let textViewText = "good good good very very good good. bad bad unfortunately bad."
// Separate the text into words and get rid of the "" results
let words = textViewText.components(separatedBy: [" ", "."]).filter({ !$0.isEmpty })
// Count the occurrences of each word
let wordCounts = NSCountedSet(array: words)
// Sort the words by their counts in descending order, then take the first three elements
// (prefix(3) does not crash if there are fewer than three distinct words)
let sortedWords = wordCounts.allObjects.sorted(by: { wordCounts.count(for: $0) > wordCounts.count(for: $1) }).prefix(3)
for word in sortedWords {
    print("\(word) \(wordCounts.count(for: word))times")
}
Output:
good 5times
bad 3times
very 2times
Here's a one liner that will give you the top 3 words in order of frequency:
let words = "good good good very very good good. bad bad unfortunately bad"
let top3words = Set(words.components(separatedBy:" "))
.map{($0,words.components(separatedBy:$0).count-1)}
.sorted{$0.1 > $1.1}[0..<3]
print(top3words) // [("good", 5), ("bad", 3), ("very", 2)]
It creates a set of the distinct words and then maps each of them to the count of its occurrences in the string (words). Finally it sorts the (word, count) tuples on the count and returns the first 3 elements.
[EDIT] The only issue with the above method is that, although it works with your example string, it assumes that no word is contained in another and that words are only separated by spaces.
To do a proper job, the words must first be isolated in an array, eliminating any special characters (i.e. non-letters). It may also be appropriate to ignore upper and lower case, but you didn't specify that and I didn't want to add to the complexity.
Here's how the same approach would be used on an array of words (produced from the same string):
let wordList = words.components(separatedBy:CharacterSet.letters.inverted)
.filter{!$0.isEmpty}
let top3words = Set(wordList)
.map{ word in (word, wordList.filter{$0==word}.count) }
.sorted{$0.1>$1.1}[0..<3]
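For readers who want to check the expected counts outside of Swift, here is a small Python sketch of the same "isolate the words, then count" idea. It is only a cross-check, not part of the Swift answer; it uses collections.Counter and a simple ASCII-letters pattern, which is an assumption about what counts as a word.

```python
import re
from collections import Counter

text = "good good good very very good good. bad bad unfortunately bad."

# Pull out runs of letters, which drops spaces and punctuation in one step.
words = re.findall(r"[A-Za-z]+", text)

# most_common(3) returns the three (word, count) pairs with the highest counts.
top3 = Counter(words).most_common(3)

for word, count in top3:
    print(f"{word} ({count} times)")
# good (5 times)
# bad (3 times)
# very (2 times)
```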
I'm trying to write some code that looks at two data sets and matches them (if they match). At the moment I am using string.find, which kind of works but is very rigid. For example, it works on check1 but not on check2 or check3, because there is a space or some other word in the feed. I'd like to get a match on all three of them, but how can I do that? (Match on more than 4 characters, maybe?)
check1 = 'jan'
check2 = 'janAnd'
check3 = 'jan kevin'
input = 'jan is friends with kevin'
if string.find(input.. "" , check1 ) then
print("match on jan")
end
if string.find( input.. "" , check2 ) then
print("match on jan and")
end
if string.find( input.. "" , check3 ) then
print("match on jan kevin")
end
PS: I have tried gfind, gmatch, and match, but had no luck with them.
find only does a direct match, so if the string you are searching for is not a substring of the string you are searching in (with some pattern processing for character sets and special characters), you get no match.
If you are interested in matching those strings you listed in the example, you need to look at fuzzy search. This SO answer may help as well as this one. I've implemented the algorithm listed in the second example, but got better results with two- and tri-gram matching based on this algorithm.
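To give a feel for what n-gram matching does, here is a rough Python sketch (the question is in Lua, and this is not the algorithm linked above, just a simple overlap score). The names ngrams and containment and the choice to ignore case and spaces are my own assumptions.

```python
def ngrams(text, n=2):
    """Return the set of character n-grams of text, ignoring case and spaces."""
    cleaned = text.lower().replace(" ", "")
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def containment(needle, haystack, n=2):
    """Fraction of the needle's n-grams that also occur in the haystack (0.0 to 1.0)."""
    needle_grams = ngrams(needle, n)
    if not needle_grams:
        return 0.0
    return len(needle_grams & ngrams(haystack, n)) / len(needle_grams)


feed = "jan is friends with kevin"
for check in ["jan", "janAnd", "jan kevin"]:
    print(check, round(containment(check, feed), 3))
# jan 1.0
# janAnd 0.75
# jan kevin 0.857
```

With a threshold of, say, 0.7 all three checks from the question would count as matches, while an unrelated string like "xyz" scores 0.0. The threshold is something you would have to tune on real data.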
Lua's string.find works not just with exact strings but with patterns as well. But the syntax is a bit different from what you have in your "checks". You'd want check2 to be "jan.+" to match "jan" followed by one or more characters. Your third check will need to be jan.+kevin. Here the dot stands for any character, while the following plus sign indicates that this might be a sequence of one or more characters. There's more info at http://www.lua.org/pil/20.2.html.
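The same three patterns can be tried out quickly in Python, where "." and "+" have the same meaning for these simple cases. This is only a sketch for experimenting; Lua patterns and Python regular expressions differ in many other details.

```python
import re

text = "jan is friends with kevin"

# "." matches any character and "+" means "one or more of the preceding item".
patterns = ["jan", "jan.+", "jan.+kevin"]

for pattern in patterns:
    found = re.search(pattern, text) is not None
    print(pattern, "->", "match" if found else "no match")
```

All three patterns match the input, whereas the literal "janAnd" from the question does not, because it is not a substring of the input.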
An interesting Google Spreadsheet problem: I have a language file based on key=value that I have copied into a spreadsheet, e.g.
titleMessage=Welcome to My Website
youAreLoggedIn=Hello #{user.name} you are now logged in
facebookPublish=Facebook Publishing
I have managed to split the key/value pairs into two columns, translate the value column, and re-join it with the keys, and voilà, this gives me a translated language file back.
But as you may have spotted, there are some variables in there (e.g. #{user.name}) which are injected by my application; obviously I don't want to translate them.
So here is my question, given the following cell contents...
Hello #{user.name} you are now logged in
Is there a function that will translate the contents using the TRANSLATE function but ignore anything inside #{ } (which could be at any point in the sentence)?
Do any Google Spreadsheet gurus have a solution for me?
Many thanks
If there is at most one occurrence of #{}, then you could use the SPLIT function to divide the string into three parts that are arranged as below.
A        | B                 | C           | D          | E
Original | =SPLIT(An, "#{}") | First piece | Tag        | Rest of string
         |                   | Translate   | Keep as is | Translate
Put the pieces together with CONCATENATE.
=CONCATENATE(Cn,Dn,En)
I ran into the same question.
Assume the escape pattern is #{sth.sth} (as a regex, #{[\w.]+}). Replace each occurrence with a string that Google Translate would treat as an untranslatable term, like VAR.
After translation, replace each placeholder with the original pattern.
Here is how I did this in script editor of spreadsheet:
function myTranslate(text, source_language, target_language) {
  if (text.toString()) {
    var str = text.toString();
    var regex = /#\{[\w.]+\}/g; // g flag for multiple matches
    var replace = 'VAR'; // Placeholder that stands in for #{variable} so it is not translated
    // Original patterns; match() returns null when there are none
    var vars = (str.match(regex) || []).reverse();
    str = str.replace(regex, replace);
    str = LanguageApp.translate(str, source_language, target_language);
    var ret = '';
    // indexOf returns -1 once no placeholder is left
    for (var idx = str.indexOf(replace); idx !== -1 && vars.length; idx = str.indexOf(replace)) {
      ret += str.slice(0, idx) + vars.pop();
      str = str.slice(idx + replace.length);
    }
    return ret + str; // append the text after the last placeholder
  }
  return null;
}
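The mask-and-restore logic can be tried out without any translation service. The following Python sketch is an illustration only: translate_keeping_vars is a made-up name, the translate argument is a hypothetical stand-in for the real translator, and the code assumes that the placeholder survives translation unchanged and that the variables keep their relative order.

```python
import re

VAR_RE = re.compile(r"#\{[\w.]+\}")
PLACEHOLDER = "VAR"  # assumed to come back from the translator unchanged


def translate_keeping_vars(text, translate):
    """Translate text while leaving #{...} variables untouched.

    `translate` is any callable str -> str; it is a hypothetical stand-in
    for the real translation service.
    """
    variables = VAR_RE.findall(text)        # original #{...} tokens, in order
    masked = VAR_RE.sub(PLACEHOLDER, text)  # hide them from the translator
    translated = translate(masked)
    # Put the variables back left to right; extra placeholders are left as-is.
    remaining = iter(variables)
    return re.sub(re.escape(PLACEHOLDER),
                  lambda m: next(remaining, m.group(0)),
                  translated)


# str.upper plays the role of a "translator" so the example runs offline.
print(translate_keeping_vars("Hello #{user.name} you are now logged in", str.upper))
# HELLO #{user.name} YOU ARE NOW LOGGED IN
```

If the target language reorders the sentence so that two variables swap places, this simple left-to-right restore would put them back in the wrong positions; numbered placeholders would be one way around that.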
You can't just split and concatenate, because different languages use different word order of subject/predicate/object etc., and also because several languages modify nouns with different prefixes/suffixes/spelling changes depending on what they are doing in the sentence. It's all very complicated. Google needs to enable some sort of enclosing parentheses around any term we want to be quoted rather than translated.
I have a big text and I'd like to remove everything before a certain string.
The problem is, there are several occurrences of that string in the text, and I want to decide which one is correct by later analyzing the found piece of text.
I can't include that analysis in a regular expression because of its complexity:
text = <<HERE
big big text
goes here
HERE
pos = -1
a = text.scan(/some regexp/im)
a.each do |m|
s = m[0]
# analysis of found string
...
if ( s is good ) # is the right candidate
pos = ??? # here I'd like to have a position of the found string in the text.
end
end
result_text = text[pos..-1]
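$~.offset(n) will give the start and end positions of the n-th part of a match (n = 0 is the whole match). Note that $~ always refers to the most recent match, so use the block form, text.scan(/some regexp/im) { ... }, and read $~.offset(0) inside the block rather than collecting the results into an array first.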
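The overall flow (iterate over the matches, analyze each one, remember the position of the good one) looks like this in a Python sketch, where match objects carry their own offsets. The sample text, the pattern, and the is_good function are all hypothetical stand-ins for the real text and the complex analysis from the question.

```python
import re

text = "intro MARK one ... MARK two ... MARK three"


def is_good(candidate):
    """Hypothetical stand-in for the complex analysis from the question."""
    return candidate.endswith("two")


pos = -1
for match in re.finditer(r"MARK \w+", text):
    if is_good(match.group(0)):
        pos = match.start()  # offset of this match within text
        break

# Drop everything before the chosen match; keep the text whole if none was good.
result_text = text[pos:] if pos >= 0 else text
print(result_text)
# MARK two ... MARK three
```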
I think you should count how many occurrences there are in your big string then use index to cut off all the occurrences that do not match the final pattern.