Neural Network String Normalization - machine-learning

I am attempting to build a product string normalization algorithm and am curious whether this can be modeled with a neural network.
For example, I have the following input and target output data from eBay:
Input String 1: Brand New, Air Jordan 11 Retro XL Space Jam Size 10-US, Free Ship!
Target Output String 1: Space Jam Nike Air Jordan
Input String 2: Air Jordan 11 Retro XI 2016 Nike Space Jam 45 Men AJ11 DS Monstars 378037-003
Target Output String 2: Space Jam Nike Air Jordan
Input String 3: 2016 Nike Air Jordan Retro XI Space Jam US 8.5 - 9.5 IN HAND 11 378037 003 OG
Target Output String 3: Space Jam Nike Air Jordan
Input String 4: Air Jordans Nike 11 Retro XI 2016 45 Men AJ11 DS Monstars 378037-003
Target Output String 4: Space Jam Nike Air Jordan
If I feed the algorithm a new string, I'd like it to output something similar to below:
Input String: 2016 Air Jordan Retro 11 XI "Space Jam" MEN'S & GS Size 4Y-15 Nike
Target Output String: Space Jam Nike Air Jordan
I have used a CRF algorithm to do this somewhat successfully, but that requires preparing/bucketing the input strings/tokens into specific POS or attribute categories, which isn't really necessary here.
Instead, what I'd like to do is feed the input and output strings/tokens into a neural net, and have it give me a predicted output string based on the tokens that are present in a new string.
What I've tried so far is creating a sparse matrix with all of the input tokens and creating a model with the entire result string as an output, but this doesn't have great results.
Is there a better way to go about this? Thanks.
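For comparison, the token-overlap idea behind the sparse-matrix attempt can be sketched without a neural net at all: vectorize each listing into a token set and return the target of the nearest training example by Jaccard overlap. This is a hypothetical baseline in Python, not the asker's actual CRF or sparse-matrix code:

```python
import re

def tokens(s):
    """Lowercase word tokens; digits and one-letter fragments dropped."""
    return {t for t in re.findall(r"[a-z]+", s.lower()) if len(t) > 1}

def train(pairs):
    """pairs: (raw listing, canonical target) tuples."""
    return [(tokens(raw), target) for raw, target in pairs]

def predict(model, raw):
    """Return the target of the training listing with the highest
    Jaccard token overlap with the new listing."""
    query = tokens(raw)
    return max(model, key=lambda item: len(query & item[0]) / len(query | item[0]))[1]

# Two of the eBay examples from the question
pairs = [
    ("Brand New, Air Jordan 11 Retro XI Space Jam Size 10-US, Free Ship!",
     "Space Jam Nike Air Jordan"),
    ("2016 Nike Air Jordan Retro XI Space Jam US 8.5 - 9.5 IN HAND 11 378037 003 OG",
     "Space Jam Nike Air Jordan"),
]
model = train(pairs)
print(predict(model, '2016 Air Jordan Retro 11 XI "Space Jam" MEN\'S & GS Size 4Y-15 Nike'))
```

A baseline like this makes it easier to judge whether a neural model is actually adding anything.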

Related

Fitting a ZPL 128 barcode on a 2 inch wide Zebra printer label

With a Zebra printer at 203 dpi and 2-inch-wide labels, I am unable to fit the Code 128 barcode correctly.
^XA
^DFR:g7-1x1-sn.zpl^FS
^FXuuid:d76fd680-3c6c-4a3b-9acb-baf585c6f677
^FXdensity:12
^FX-Options_S-{"density": 12, "width": 3, "height": 1.5}-E_Options-
~TA000~JSN^LT0^MNW^MTT^PON^PMN^LH0,0^JMA^PR3,3~SD17^JUS^LRN^CI0^XZ
^XA
^MMT
^PW406
^LL0300
^LS0
^FT30,30^AAN,18,6^FH\^FDReq #: 0000123^FS
^FT30,55^AAN,18,6^FH\^FDAcct #: 987654321^FS
^FT30,80^AAN,18,6^FH\^FDLastName, FirstName^FS
^BY1
^FT30,160^BCN,50,Y,N,N
^FT30,170^AAN,18,6^FD9751378600002570^FS
^XZ
With ^BY1 the barcode fits but is too tight for some readers to read.
With ^BY2 the barcode prints better but the end gets cut off.
If someone can help with this, it would be greatly appreciated.
Your best bet is to specify automatic encoding mode (parameter m in the reference manual). Code128 has a high density numeric mode that encodes two digits for each codeword and will be selected by the ZPL encoder when it sees a string of digits.
This is your example with automatic mode and ^BY2:
^XA
^MMT
^PW406
^LL0300
^LS0
^FT30,30^AAN,18,6^FH\^FDReq #: 0000123^FS
^FT30,55^AAN,18,6^FH\^FDAcct #: 987654321^FS
^FT30,80^AAN,18,6^FH\^FDLastName, FirstName^FS
^BY2
^FT30,160^BCN,50,Y,N,N,A
^FT30,170^AAN,18,6^FD9751378600002570^FS
^XZ
You can try using a QR or Datamatrix code if your scanner reads 2D barcodes.

Google Sheets Split remove characters and unwanted words

I have this data as a sample in a column:
3 PACK BAG 1500 ML CONTAIN 600 ML AMINO ACID, 600 ML GLUCOSE, 300 ML LIPID EMULSION
I am using this formula to remove unwanted characters: =SPLIT(A2:A,"1234567890-=[]\;',./!##$%^&*()")
So it returns me:
PACK BAG ML C NTAIN ML AMIN ACID ML GLUC SE ML LIPID EMULSI N
Now I would like to add to my formula =SPLIT(A2:A,"1234567890-=[]\;',./!##$%^&*()") a function to remove "MC" and "C" or "SE".
How can I update my SPLIT formula to remove these specific chains of characters (words)?
=SPLIT(REGEXREPLACE(A2:A, "(MC|C|SE)", " "),"1234567890-=[]\;',./!##$%^&*()")
You could pre-process your string with REGEXREPLACE to substitute a specific character (eg. whitespace) for these specific words before applying the SPLIT function.
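One caveat: without word boundaries, the pattern (MC|C|SE) also matches those letters inside longer words. Google's RE2 engine supports \b just like Python's re module, so the two variants can be sanity-checked in Python (illustrative only, using the fragment string from the question):

```python
import re

text = "PACK BAG ML C NTAIN ML AMIN ACID ML GLUC SE ML LIPID EMULSI N"

# Without boundaries, the C inside PACK/ACID/GLUC is deleted too.
loose = re.sub(r"(MC|C|SE)", " ", text)

# With \b, only the standalone fragments are removed.
strict = re.sub(r"\b(MC|C|SE)\b", " ", text)
print(strict)
```

So `\b(MC|C|SE)\b` is the safer pattern if the fragments appear as standalone tokens.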

Parsing Dublin Core data

How do I parse Dublin Core data in iOS? The data is like this:
dc:contributor Binder, Henry.
dc:creator Crane, Stephen, 1871-1900.
dc:date c1982
dc:description This stirring tale of action in the American Civil War captures the immediacies and experiences of actual battle and army life.
dc:format ix, 173 p. ; 22 cm.
dc:identifier 0393013456
dc:identifier 9780393013450
dc:identifier 0380641135 (pbk.)
dc:identifier 9780380641130 (pbk.)
dc:language eng
dc:publisher Norton
dc:subject Chancellorsville, Battle of, Chancellorsville, Va., 1863--Fiction.
dc:subject Virginia--History--Civil War, 1861-1865--Fiction.
dc:title The red badge of courage : an episode of the American Civil War
dc:type War stories.
dc:type Historical fiction.
dc:type Text
oclcterms:recordCreationDate 811217
oclcterms:recordIdentifier 81022419
oclcterms:recordIdentifier 8114241
It looks like your data has a predictable format, i.e. each line has a clear tag followed by a tab-like space.
So if you are familiar with the tag names, you can definitely write a small program to consume each line accordingly.

How to recognize mobile number in a given text?

I want to extract mobile numbers that are valid (on the basis of format) from a text.
e.g. I/O some text (987) 456 7890, (987)-456-7890 again some text
O/P 9874567890 9874567890
The problem is that there are many valid mobile formats all over the world, like:
text = "Denmark 11 11 11 11, 1111 1111 "
// + "Germany 03333 123456, +49 (3333) 123456 "
// + "Netherlands + 31 44 12345678 Russia +7(555)123-123 "
// + "spain 12-123-12-12 switzerland +41 11 222 22 22 "
// + "Uk (01222) 333333 India +91-12345-12345 "
// + "Austrailia (04) 1231 1231 USA (011) 154-123-4567 "
// + "China 1234 5678 France 01-23-45-67-89 "
// + "Poland (12) 345 67 89 Singapore 123 4567 "
// + "Thailand (01) 234-5678, (012) 34-5678 "
// + "United Kingdom 0123 456 7890, 01234 567890 "
// + "United States (987) 456 7890, (987)-456-7890+ etc."
How can I cover all mobile formats?
What are the min and max lengths of a mobile number (with or without a country code)?
How can I recognize whether a mobile number includes a country code or not?
You might want to check if this fits your needs: A comprehensive regex for phone number validation
From experience I know how this works in my phone's OS: it looks for long-enough sequences of digits, separated by a set of allowed characters.
In principle something like:
[\+]?([0-9]|[\(\).- ]){min,max}
This regex is suboptimal since it also looks for long sequences of separator chars. You will probably need to filter those results out as well.
A very simple method with some false positives, but in my opinion false positives are better than misses.
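A rough Python version of that idea, with the digit filter applied afterwards; the 7-20 character window and 7-15 digit bounds are guesses, not a standard:

```python
import re

# Candidate: optional +, then a run of digits and allowed separator chars.
CANDIDATE = re.compile(r"\+?[\d().\- ]{7,20}")

def find_numbers(text, min_digits=7, max_digits=15):
    """Return digit strings for runs that look like phone numbers,
    filtering out candidates that are mostly separators."""
    results = []
    for m in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if min_digits <= len(digits) <= max_digits:
            results.append(digits)
    return results

print(find_numbers("some text (987) 456 7890, (987)-456-7890 again some text"))
```

The digit-count filter is what discards the long separator-only runs the regex alone would accept.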
You shouldn't use the list of samples you got as a guide to actual mobile phone numbers.
For example the number sequence shown for the Netherlands is incorrect, in that it doesn't cover just mobile numbers but ALL regular phone numbers (it doesn't cover such things as 0800 and 0900 numbers for which different rules apply) and is missing an element even for that.
I can only assume the list is similarly incorrect for other countries (and of course it's far from complete in that it doesn't cover all countries, but maybe you posted only a fragment).
To parse a phone number you'd have to first remove all white space and other formatting characters from what could be a phone number, then check whether it has the correct length to be one, then try to deduce whether it includes a country code or not.
If it includes a country code but doesn't start with either 00 or + (both are used to indicate an international number) it might not be a phone number after all.
Does it include an area code? If so, is the area code one associated with mobile phones? (For example, in the Netherlands all mobile phone numbers have area code 06, but in the past this wasn't always the case, so in an old document a 06 area code may not indicate a mobile number anyway.)
After you've deduced that (and AFAIK mobile numbers always include an area code), you have to check whether the remaining digits make up something that could be an actual phone number without the area code, based on the length of the number (hint: area code and number together have to be 10 digits long here, though lengths vary by country).
And all that while taking into consideration that the rules may well be different for different countries or even different networks within some countries.
And of course if you find a number that looks like a valid phone number it still may not be.
It could be some other number that just looks like a phone number but isn't.
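The checks described above (strip formatting, detect a 00/+ international prefix, verify length, then the mobile area code) might be sketched as follows; the Dutch-specific constants are assumptions taken from the answer:

```python
def classify(raw, country_code="31", mobile_prefix="06"):
    """Rough sketch of the checks above. The Dutch constants
    (country code 31, mobile area code 06, 10-digit total) are assumptions."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if raw.strip().startswith("+"):
        intl = digits                     # "+31 ..." style
    elif digits.startswith("00"):
        intl = digits[2:]                 # "0031 ..." style
    else:
        intl = None                       # domestic formatting
    if intl is not None:
        if not intl.startswith(country_code):
            return "unknown"              # wrong country, or not a number
        digits = "0" + intl[len(country_code):]
    if len(digits) != 10:                 # area code + number = 10 digits here
        return "unknown"
    return "mobile" if digits.startswith(mobile_prefix) else "landline"

print(classify("+31 6 1234 5678"))   # mobile
print(classify("010 123 4567"))      # landline
```

As the answer warns, every constant here would need to change per country and even per era.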
A simple search over all matching string formats is not the right way here. The optimal way is to use regular expressions to find all phone number matches, but BlackBerry Java doesn't have built-in support for regular expressions.
You can, however, use a third-party J2ME library that implements regex processing, something like this.
// Regex - check Singapore valid mobile numbers
// (Pattern/Matcher come from java.util.regex on Java SE,
//  or from the third-party regex library on J2ME)
public static boolean isSingaporeMobileNo(String str) {
    Pattern mobNO = Pattern.compile("^(((0|((\\+)?65([- ])?))|((\\((\\+)?65\\)([- ])?)))?[8-9]\\d{7})?$");
    Matcher matcher = mobNO.matcher(str);
    return matcher.find();
}

Canonicalize NFL team names

This is actually a machine learning classification problem but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams so it really just means finding the nearest of 32 bins to put a given string in.)
The incoming strings are not actually totally arbitrary (they're from structured data sources like this: http://www.repole.com/sun4cast/stats/nfl2008lines.csv) so it's not really necessary to handle every crazy corner case like in the 49ers example above.
I should also add that in case anyone knows of a source of data containing both moneyline Vegas odds as well as actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up these two disparate data sets, one with odds and one with outcomes:
http://www.footballlocks.com/nfl_odds.shtml
http://www.repole.com/sun4cast/freepick.shtml
Ideas for better, more parsable, sources of data are very welcome!
Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the nearest levenshtein distance?
Here's something plenty robust even for arbitrary user input, I think. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled out version with city and team name as well as any nicknames in parentheses between city and team name.
Scan[(fullname[First@#] = #[[2]])&, {
{"ari", "Arizona Cardinals"}, {"atl", "Atlanta Falcons"},
{"bal", "Baltimore Ravens"}, {"buf", "Buffalo Bills"},
{"car", "Carolina Panthers"}, {"chi", "Chicago Bears"},
{"cin", "Cincinnati Bengals"}, {"clv", "Cleveland Browns"},
{"dal", "Dallas Cowboys"}, {"den", "Denver Broncos"},
{"det", "Detroit Lions"}, {"gbp", "Green Bay Packers"},
{"hou", "Houston Texans"}, {"ind", "Indianapolis Colts"},
{"jac", "Jacksonville Jaguars"}, {"kan", "Kansas City Chiefs"},
{"mia", "Miami Dolphins"}, {"min", "Minnesota Vikings"},
{"nep", "New England Patriots"}, {"nos", "New Orleans Saints"},
{"nyg", "New York Giants NYG"}, {"nyj", "New York Jets NYJ"},
{"oak", "Oakland Raiders"}, {"phl", "Philadelphia Eagles"},
{"pit", "Pittsburgh Steelers"}, {"sdc", "San Diego Chargers"},
{"sff", "San Francisco 49ers forty-niners"}, {"sea", "Seattle Seahawks"},
{"stl", "St Louis Rams"}, {"tam", "Tampa Bay Buccaneers"},
{"ten", "Tennessee Titans"}, {"wsh", "Washington Redskins"}}]
Then, for any given string, find the longest common subsequence for each of the teams' full names. To give preference to strings matching at the beginning or the end (e.g., "car" should match "carolina panthers" rather than "arizona cardinals"), sandwich both the input string and the full names between spaces. Whichever team's full name has the longest longest-common-subsequence with the input string is the team we return. Here's a Mathematica implementation of the algorithm:
teams = keys@fullname;
(* argMax[f, domain] returns the element of domain for which f of that element is
maximal -- breaks ties in favor of first occurrence. *)
SetAttributes[argMax, HoldFirst];
argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]
canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ",
 " "<>fullname@#<>" ", IgnoreCase->True]&, teams]
Quick inspection by sight shows that both data sets contain the teams' locations (i.e. "Minnesota"). Only one of them has the teams' names. That is, one list looks like:
Denver
Minnesota
Arizona
Jacksonville
and the other looks like
Denver Broncos
Minnesota Vikings
Arizona Cardinals
Jacksonville Jaguars
Seems like, in this case, some pretty simple substring matching would do it.
If you know both the source and destination names, then you just need to map them.
In PHP, you would just use an array with keys from the data source and values from the destination, then reference them like:
$map = array('49ers'   => 'San Francisco 49ers',
             'packers' => 'Green Bay Packers');
foreach ($incoming_name as $name) {
    echo $map[$name];
}