Ordered sequential text matching - machine-learning

I want to match pairs of strings and score them in the following manner:
string 1: 4556677, string 2: 2556677, score: 0
string 1: 123345873009, string 2: 123345873112, score: 9
string 1: 22334567, string 2: 22334500, score: 6
So the score is the number of leading digits the two strings have in common, reading from left to right.
I have a list of 100K values for string 1 and 30M values for string 2, and I would like to keep only the pairs (string 1, string 2) with a score greater than 'x'.
Is there an algorithm for this task other than brute-force sequential matching? My tables are stored in Apache Hive/HBase, and I would like to implement the approach in either Spark or Java MapReduce. Any help is much appreciated.

I conclude that your "score" represents the leftmost character position at which the strings differ.
Never mind "mapreduce," plain-Jane Java can do this very easily.
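public int score( String string1, String string2 ) {
    char sbuf1[] = string1.toCharArray();
    char sbuf2[] = string2.toCharArray();
    // Compare only up to the length of the shorter string.
    int complen = sbuf1.length;
    if( sbuf2.length < complen ) {
        complen = sbuf2.length;
    }
    for( int i = 0; i < complen; i++ ) {
        if( sbuf1[ i ] != sbuf2[ i ] ) {
            return i; // index of the first mismatch = number of matching leading characters
        }
    }
    // No mismatch detected before one string was exhausted: the whole shorter
    // string is a common prefix, so its length is the score. (Returning -1 here
    // would drop identical strings from a "score > x" filter.)
    return complen;
}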
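As for avoiding the brute-force comparison of every pair: a score greater than x means the two strings agree on their first x + 1 characters, so one option is to bucket the smaller list by that prefix and look up each string from the larger list in the buckets. The following is only a minimal sketch under stated assumptions: the class and method names (PrefixJoin, pairsAbove, commonPrefixLength) are hypothetical, x is assumed to be non-negative, and the smaller list is assumed to fit in memory. In Spark or MapReduce the same idea would correspond to keying both tables by the prefix and joining on that key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixJoin {
    /** Number of leading characters that a and b have in common. */
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /**
     * Returns {string1, string2, score} for every pair whose score exceeds x.
     * Assumes x >= 0 (hypothetical helper, not part of any library).
     */
    static List<String[]> pairsAbove(List<String> list1, List<String> list2, int x) {
        int keyLen = x + 1; // score > x  <=>  the first x + 1 characters are equal

        // Bucket the smaller list by its prefix of length keyLen.
        Map<String, List<String>> buckets = new HashMap<>();
        for (String s : list1) {
            if (s.length() >= keyLen) {
                buckets.computeIfAbsent(s.substring(0, keyLen), k -> new ArrayList<>()).add(s);
            }
        }

        // Stream the larger list and only compare within the matching bucket.
        List<String[]> result = new ArrayList<>();
        for (String t : list2) {
            if (t.length() < keyLen) {
                continue; // too short to share x + 1 characters with anything
            }
            List<String> candidates = buckets.get(t.substring(0, keyLen));
            if (candidates == null) {
                continue;
            }
            for (String s : candidates) {
                result.add(new String[] { s, t, String.valueOf(commonPrefixLength(s, t)) });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> list1 = Arrays.asList("4556677", "123345873009", "22334567");
        List<String> list2 = Arrays.asList("2556677", "123345873112", "22334500");
        for (String[] pair : pairsAbove(list1, list2, 5)) {
            System.out.println(String.join(" ", pair));
        }
    }
}
```

With this approach each string from the large list costs one hash lookup plus comparisons against its own bucket only, rather than a comparison against all 100K strings.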

Related

Handling null-safety Exceptions in Dart

In this simple Dart exercise, I was trying to handle null-safety exceptions.
However, I think the code below does not follow DRY principles. I would be glad if anyone could correct me.
void Exercise1() {
  /// declared these as global variables for availability to all blocks.
  int? age;
  String? name;
  stdout.write('What is your name? ');
  name = stdin.readLineSync();
  /// This do-while block will validate: name is not empty or less than 3 char.
  do {
    stdout.write('What is your name? ');
    name = stdin.readLineSync();
  } while (name == null || name.isEmpty);
  /// This do-while block will validate that age is entered and not -ve value.
  do {
    stdout.write('Hello $name, What is your age? ');
    age = int.tryParse(stdin.readLineSync().toString());
  } while (age == null || age.isNegative);
  /// This expression computes the age left till 100 years.
  int? yearsToHundred = 100 - age;
  /// This prints out the results.
  print(
      '\nHey $name, your are $age year of age now. \nYou will be 100 years in another $yearsToHundred years time. \nEnjoy your life to the fullest, Cheers!');
}
What I want here is to check that the input is not null, is not empty, and contains only digits. Otherwise, the prompt should keep repeating until valid input is provided.
void Exercise2() {
  String? number;
  do {
    stdout.write('Enter a number: ');
    number = stdin.readLineSync();
  } while (number == null || number.length == 0);
  int? num = int.parse(number.toString());
  if (num.isEven) {
    print('Even');
  } else {
    print('Odd');
  }
}

DOORS DXL. How to put each word in a string into separate strings

I want to put each word in a string into a separate string. So if my string has a list of words like "John, Mary, Barbara", and the words are separated by a carriage return (not a comma as shown in the example), how do I put John into one string, Mary into another string, and Barbara into a third string? The strings do not exist yet, so I will have to create them on the fly, and that is OK. This is what I have tried:
for (n; n<100; n++){
    s1 = s[n:n]
    if(s1 == "\n") {
        break
    }
}
Since I want this separation to occur for every object (a specific column in a module), I will have to put whatever the correct code is into a loop like "for o in m do{ }".
Thank you for helping me.
Maybe these functions would help. However, you would have to get familiar with the Skip type (the DXL manual helps with that).
stringSplit() divides str into substrings based on a delimiter, returning an array (Skip type) of these substrings. If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is empty or null, the value of a single space is used. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.
Skip stringSplit(string str, string pattern) {
    if (null str) return null
    if (null(pattern) || 0 == length(pattern))
        pattern = " ";
    Skip result = create;
    int i = 0; // index for searching in str
    int j = 0; // index counter for result array
    bool found = true;
    while (found) {
        // find pattern
        int pos = 0;
        int len = 0;
        found = findPlainText(str[i:], pattern, pos, len, true);
        if (found) {
            // insert into result
            put(result, j++, str[i:i+pos-1]);
            i += pos + len;
        }
    }
    // append the rest after last found pattern
    put(result, j, str[i:]);
    return result;
}
You might prefer to remove the commas first since the last word is not followed by a comma.
stringWipe() returns a copy of str with all occurrences of characters in chars eliminated:
string stringWipe(string str, string chars) {
    if (null str) return str
    int lenStr = length str
    if (lenStr == 0) return str
    int lenChars = length(chars);
    if (lenChars == 0) return str
    Buffer buf = create
    int i
    for (i=0; i<lenStr; i++) {
        char c = str[i]
        bool skip = false
        int j
        for (j=0; j<lenChars; j++) {
            if (c == chars[j]) {
                skip = true
                break
            }
        }
        if (skip)
            continue
        buf += c
    }
    string result = stringOf(buf);
    delete buf
    return result
}

Dart converting String to Array then compare two array

I'm trying to convert strings to arrays and then compare the two arrays. If the same value appears in both, it needs to be removed from both arrays. Finally, I want to merge the two arrays and find the length of the result. Below is my code:
String first_name = "siva";
String second_name = "lovee";
List<String> firstnameArray=new List();
List<String> secondnameArray=new List();
firstnameArray = first_name.split('');
secondnameArray = second_name.split('');
var totalcount=0;
for (int i = 0; i < first_name.length; i++) {
  for (int j = 0; j < second_name.length; j++) {
    if (firstnameArray[i] == secondnameArray[j]) {
      print(firstnameArray[i] + "" + " == " + secondnameArray[j]);
      firstnameArray.removeAt(i);
      secondnameArray.removeAt(i);
      break;
    }
  }
}
var finalList = new List.from(firstnameArray)..addAll(secondnameArray);
print(finalList);
print(finalList.length);
But I always get this error: Unsupported operation: Cannot remove from a fixed-length list. Can you help me fix this issue? Thanks.
It seems that what you are trying to do is count the unique characters in two given strings. The Set type is perfect for this use case. Here's an example of what you can do:
void main() {
  String first = 'abigail';
  String second = 'allie';
  var unique = '$first$second'.split('').toSet();
  print(unique);
}
This would give you an output of:
{a, b, i, g, l, e}
You can then call methods such as .toList() or .where() on the result, or read its .length.
Alternatively, you can ensure that firstnameArray and secondnameArray are not fixed-length lists by initializing them as below:
var firstnameArray = new List<String>.from(first_name.split(''));
var secondnameArray= new List<String>.from(second_name.split(''));
This makes firstnameArray and secondnameArray mutable copies of the input.

Compare string in list

I want to compare strings within the same list. I want to take the words from a string and insert them into a list, removing the white space and splitting on commas. Then I want to check whether two strings in that list match.
Here is my code:
main() {
  List<String> list = new List();
  String str = "dog , dog , cat, tiger, lion, cat";
  String strn = str.replaceAll(" " , "");
  list = strn.split(",");
  print(list.length);
  print(list);
  for (int i=0;i<list.length;i++){
    if (list[i] == list[i+1]) {
      print("same");
    } else{
      print("not same");
    }
    i++;
  }
}
Here the strings are only checked up to position 4, and the white space is not removed!
I also noticed that in the for loop you are incrementing i twice, the second being close to the bottom. This causes i to skip some of the indexes, so loop looks at index 0, then 2, then 4, then it stops.
I have refactored your solution slightly. I removed the second i++ and changed i < list.length to i < list.length - 1 to skip the last item as list[i + 1] will throw an out of range exception:
main() {
  List<String> list = new List();
  String str = "dog , dog , cat, tiger, lion, cat";
  String strn = str.replaceAll(" ", "");
  list = strn.split(",");
  print(list.length);
  print(list.join('|'));
  for(int i=0; i < list.length - 1; i++){
    if(list[i] == list[i+1]){
      print("same");
    }
    else{
      print("not same");
    }
  }
}
The loop prints the following:
same
not same
not same
not same
not same
You can test this out on DartPad

How can I insert a space character before every upper case letter except the first one in each element of a string array in a DXL script?

I would like to edit the elements of a string array in a DXL script, inside a for loop. The problem is as follows:
I would like to insert a space in front of every upper case letter except the first one, and this should be applied to every line in the string array.
Example:
There is a string array:
AbcDefGhi
GhiDefAbc
DefGhiAbc
etc.
and finally I would like to see the result as:
Abc Def Ghi
Ghi Def Abc
Def Ghi Abc
etc.
Thanks in advance!
Derived straight from the DXL manual:
Regexp upperChar = regexp2 "[A-Z]"
string s = "yoHelloUrban"
string sNew = ""
while (upperChar s) {
    sNew = sNew s[ 0 : (start 0) - 1] " " s [match 0]
    s = s[end 0 + 1:]
}
sNew = sNew s
print sNew
You might have to tweak this, since you do not want a space inserted before EVERY capital letter, only before those that are not at the beginning of your string.
Here's a solution written as a function that you can just drop into your code. It processes an input string character by character: it always outputs the first character as-is, then inserts a space before any subsequent upper-case character.
For efficiency, if processing a large number of strings, or very large strings (or both!), the function could be modified to append to a buffer instead of a string, before finally returning a string.
string spaceOut(string sInput)
{
    const int intA = 65 // DECIMAL 65 = ASCII 'A'
    const int intZ = 90 // DECIMAL 90 = ASCII 'Z'
    int intStrLength = length(sInput)
    int iCharCounter = 0
    string sReturn = ""
    sReturn = sReturn sInput[0] ""
    for (iCharCounter = 1; iCharCounter < intStrLength; iCharCounter++)
    {
        if ((intOf(sInput[iCharCounter]) >= intA)&&(intOf(sInput[iCharCounter]) <= intZ))
        {
            sReturn = sReturn " " sInput[iCharCounter] ""
        }
        else
        {
            sReturn = sReturn sInput[iCharCounter] ""
        }
    }
    return(sReturn)
}
print(spaceOut("AbcDefGHi"))
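The buffer-based variant mentioned above can be illustrated as follows. Note that this sketch is written in Java rather than DXL, purely to show the shape of the idea (append to a growable buffer, convert to a string once at the end); the class name SpaceOut is hypothetical, and a DXL version would use DXL's own Buffer type instead of StringBuilder.

```java
class SpaceOut {
    /** Inserts a space before every ASCII upper-case letter except the first character. */
    static String spaceOut(String input) {
        if (input.isEmpty()) {
            return input;
        }
        // Append to a buffer instead of repeatedly concatenating strings.
        StringBuilder out = new StringBuilder(input.length() * 2);
        out.append(input.charAt(0)); // first character is always copied as-is
        for (int i = 1; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(spaceOut("AbcDefGHi")); // prints "Abc Def G Hi"
    }
}
```

As in the function above, consecutive capitals each get their own space, so "GHi" becomes "G Hi".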
