Machine learning - language prediction using their full names - machine-learning

My dataset contains three columns: [first_name, Surname, language main].
For example, suppose 'Prathap Singh' speaks Punjabi and 'Ajith Komuravelly' speaks Telugu.
Then a person named 'Prathap Komuravelly' should be predicted to speak Telugu, since the surname carries more weight here, right?
How should we go about implementing this?
Do we need to assign separate weights to the columns, or what should I do at this step?
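One way to read the column-weight idea is sketched below. It assumes character trigram features, a naive Bayes scorer with Laplace smoothing, and a surname weight of 2.0. The class name, the weights, and the two training rows are illustrative assumptions rather than a tested recipe; real data would need many more rows and weights tuned on a held-out set.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased name padded with ^ and $."""
    padded = f"^{text.strip().lower()}$"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


class WeightedNameClassifier:
    """Naive Bayes over character n-grams, with one feature space per column."""

    def __init__(self, first_weight=1.0, surname_weight=2.0):
        # Hypothetical weights: surname evidence counts twice as much.
        self.weights = {"first": first_weight, "surname": surname_weight}
        self.counts = defaultdict(Counter)  # (column, language) -> n-gram counts
        self.vocab = defaultdict(set)       # column -> n-grams seen in training
        self.languages = Counter()          # language -> number of training rows

    def fit(self, rows):
        for first, surname, language in rows:
            self.languages[language] += 1
            for column, value in (("first", first), ("surname", surname)):
                grams = char_ngrams(value)
                self.counts[(column, language)].update(grams)
                self.vocab[column].update(grams)
        return self

    def predict(self, first, surname):
        total = sum(self.languages.values())
        scores = {}
        for language, n_rows in self.languages.items():
            score = math.log(n_rows / total)  # class prior
            for column, value in (("first", first), ("surname", surname)):
                counter = self.counts[(column, language)]
                denom = sum(counter.values()) + len(self.vocab[column])
                for gram in char_ngrams(value):
                    # Laplace-smoothed log-likelihood, scaled by the column weight
                    score += self.weights[column] * math.log((counter[gram] + 1) / denom)
            scores[language] = score
        return max(scores, key=scores.get)


rows = [("Prathap", "Singh", "Punjabi"), ("Ajith", "Komuravelly", "Telugu")]
model = WeightedNameClassifier().fit(rows)
print(model.predict("Prathap", "Komuravelly"))  # Telugu
```

Raising surname_weight makes the surname trigrams dominate the score, which is the behaviour the question asks about. A library classifier with separate vectorizers per column would be the more usual route on a dataset of realistic size.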

Related

Change this string - adding a space rather than dash

I have a script that changes a character's name and rank when the command CharRankPromote is used. Initially, the name automatically given to the player was something like TEXT-RANK-TEXT, but I have since decided to switch to a simpler format: RANK NAME. However, the current code does not recognize this new name format. Is there an easy fix? This is the code in question:
for index, rank in next, targetRanks do
    if (string.find(name, "[%D+]" .. rank .. "[%D+]")) then
        if (newRank == index) then
            return "#cRankSameRank", name, rank
        end
The pattern "[%D+]" .. rank .. "[%D+]" lets the script promote someone whose name is in the old format, such as TEXT-RANK.TEXT:TEST, because it can find the rank inside the name. I'd like to change it so that it recognizes the rank in the format "RANK NAME", with a space in between (matching only the word at the front of the string).

Summary of Results Based on Members of File

I am not quite sure how to word this, so I've also included a poorly formatted example :) Basically, I have a report exported from Cognos. The report contains a list of cases and the people associated with those cases, along with additional information about their First Language and Religion (as an example). What I would like to do is create a summary and/or chart of the results based on the unique case.
Any ideas? Example data below:
Case Reference - Name - First Language - Religion
1234 - Name1 - English - Catholic
1234 - Name2 - French - Protestant
4321 - Name3 - Tamil - Unknown
3345 - Name4 - English - Hindu
So for a summary I'd like to see that for languages there is 1 for Tamil and 1 for French (English would be the default if no other languages are present - so for file 1234 it would have been English if there was no French speaking person). For religions I'd like to be able to see that out of the 3 files, 1 is unknown, 1 is Hindu and also that the 3rd file is actually 2 religions (Catholic and Protestant).
I am not sure if any of this is making sense but hopefully one of you can shed some light on a possible solution. I'd like to template it out so that on line one of the case it would have an x under each heading, but do it automatically instead of manually. Basically, for each unique case are there any members that are French, any that are Tamil, any that are Catholic, any that are Christian, etc...
Thanks!
I hope I'm following correctly. It seems you want to show, for each language, how many cases it is associated with and, for each case, how many religions are associated with it.
For language, add a column to your report's query called Language Count with the following expression:
count(distinct [Case Reference] for [First Language])
This will count the number of unique cases for each language.
For religions, add a column to your report's query called Religion Count with the following expression:
count(distinct [Religion] for [Case Reference])
This will count the number of unique religions for each case.
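To make the two counts concrete, here is a minimal Python sketch run over the sample rows from the question. It is not Cognos syntax, just the same distinct-count logic written out by hand, and it does not apply the "English is the default" rule described in the question.

```python
from collections import defaultdict

# Sample rows from the question: (case reference, name, first language, religion)
rows = [
    ("1234", "Name1", "English", "Catholic"),
    ("1234", "Name2", "French", "Protestant"),
    ("4321", "Name3", "Tamil", "Unknown"),
    ("3345", "Name4", "English", "Hindu"),
]

cases_per_language = defaultdict(set)
religions_per_case = defaultdict(set)
for case, _name, language, religion in rows:
    cases_per_language[language].add(case)   # distinct cases for each language
    religions_per_case[case].add(religion)   # distinct religions for each case

language_count = {lang: len(cases) for lang, cases in cases_per_language.items()}
religion_count = {case: len(rels) for case, rels in religions_per_case.items()}

print(language_count)  # {'English': 2, 'French': 1, 'Tamil': 1}
print(religion_count)  # {'1234': 2, '4321': 1, '3345': 1}
```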

Cleaning data in SPSS with name misspellings

I have a dataset of 5M records in this basic format:
FName LName UniqueID DOB
John Smith 987678 10/08/1976
John Smith 987678 10/08/1976
Mary Martin 567834 2/08/1980
John Smit 987678 10/08/1976
Mary Martin 768987 2/08/1980
The DOB is always unique, but I have cases where:
Same ID, different name spellings or Different ID, same name
I got as far as making SPSS recognize that John Smit and John Smith with the same DOB are the same people, and I used aggregate to show how many times a spelling was used near the name (John Smith, 10; John Smit 5).
Case 1:
What I would like to do is to loop through all the records for the people identified to be the same person, and get the most common spelling of the person's name and use that as their standard name.
Case 2:
If I have multiple IDs for the same person, take the lowest one and make that the standard.
I am comfortable using basic syntax to clean my data, but this is the only thing that I'm stuck on.
If UniqueID is a real unique ID of individuals in the population and you want to find variations of name spellings (within groupings of these IDs) and assign the modal occurrence, then something like this would work:
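* Build a combined name field (RTRIM drops the trailing padding of fixed-width strings).
STRING FirstLastName (A99).
COMPUTE FirstLastName = CONCAT(RTRIM(FName), " ", RTRIM(LName)).
* Count how often each spelling occurs within each ID, then find the highest count per ID.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UniqueID FirstLastName /Count=N.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UniqueID /MaxCount=MAX(Count).
* Blank out non-modal spellings ($SYSMIS applies only to numeric variables).
IF (Count<>MaxCount) FirstLastName = "".
* Spread the surviving modal spelling to every record with the same ID.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES /BREAK=UniqueID /FirstLastName=MAX(FirstLastName).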
You could then also overwrite the FName and LName fields, but more assumptions would have to be made if, for example, FName or LName can contain space characters, etc.
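For readers who want to check the logic outside SPSS, here is a rough Python sketch of both cases on the sample rows from the question. It assumes that records sharing a UniqueID are the same person (Case 1) and that a standardized name plus DOB identifies a person (Case 2). The second assumption is mine, not something stated in the answer.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (FName, LName, UniqueID, DOB)
records = [
    ("John", "Smith", 987678, "10/08/1976"),
    ("John", "Smith", 987678, "10/08/1976"),
    ("Mary", "Martin", 567834, "2/08/1980"),
    ("John", "Smit", 987678, "10/08/1976"),
    ("Mary", "Martin", 768987, "2/08/1980"),
]

# Case 1: most common spelling within each UniqueID.
spellings = defaultdict(Counter)
for fname, lname, uid, _dob in records:
    spellings[uid][f"{fname} {lname}"] += 1

standard_name = {}
for uid, counter in spellings.items():
    top = max(counter.values())
    # Ties go to the alphabetically last spelling, like MAX() on a string.
    standard_name[uid] = max(name for name, n in counter.items() if n == top)

# Case 2: lowest UniqueID per person, where a person is (standard name, DOB).
lowest_id = {}
for _fname, _lname, uid, dob in records:
    person = (standard_name[uid], dob)
    lowest_id[person] = min(uid, lowest_id.get(person, uid))

print(standard_name[987678])                    # John Smith
print(lowest_id[("Mary Martin", "2/08/1980")])  # 567834
```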

Mahout for content-based recommendation

I have a list of user data: user name, age, sex, address, location, etc., and
a set of product data: product name, cost, description, etc.
Now I would like to build a recommendation engine that will be able to:
1 Figure out similar products
eg :
name : category : cost : ingredients
x : x1 : 15 : xx1, xx2, xx3
y : y1 : 14 : yy1, yy2, yy3
z : x1 : 12 : xx1, xy1
here x and z are similar.
2 Recommend relevant products from the product list to user
How can this kind of recommendation engine be implemented with Mahout? Which methods are available? Is there a useful tutorial or link? Please help.
In Mahout v1 from here https://github.com/apache/mahout you can use "spark-rowsimilarity" to create indicators for each type of metadata: category, cost, and ingredients. This will give you three matrices containing similar items for each item based on that particular metadata. This will give you a "more like this" type of recommendation. You can also try combining the metadata into one input matrix and see if that gives better results.
To personalize this, record which items the user has expressed some preference for. Index the indicator matrices in Solr, one indicator per Solr "field", all attached to the item ID (name?). Then the query is the user's history against each field. You can boost certain fields to increase their weight in the recommendations.
This is described on the Mahout site: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
There are also some slides here: http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/
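As a rough illustration of the "more like this" idea on the question's three products, here is a small Python sketch that combines category and ingredients into one token set per item and ranks items by Jaccard overlap. This is plain set overlap, not Mahout's own similarity measure or its Spark job; the helper names are hypothetical, and cost is left out for simplicity.

```python
def item_tokens(category, ingredients):
    """Combine the metadata into one token set, prefixed by field name."""
    return {f"category:{category}"} | {f"ingredient:{i}" for i in ingredients}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Products from the question (cost omitted in this sketch).
products = {
    "x": item_tokens("x1", ["xx1", "xx2", "xx3"]),
    "y": item_tokens("y1", ["yy1", "yy2", "yy3"]),
    "z": item_tokens("x1", ["xx1", "xy1"]),
}


def most_similar(name):
    """Return the best-matching other item and all similarity scores."""
    scores = {
        other: jaccard(products[name], tokens)
        for other, tokens in products.items()
        if other != name
    }
    return max(scores, key=scores.get), scores


best, scores = most_similar("x")
print(best, scores[best])  # z 0.4
```

Here x and z share the category x1 and the ingredient xx1, two tokens out of five in total, so z comes out as the item most similar to x, matching the expectation in the question.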

k nearest neighbor in SAS: how to get the neighbor list for each row?

Currently I'm using PROC DISCRIM in SAS to run a kNN analysis on a data set, but the problem may require me to get the top k neighbor list for each row in my table. How can I get this list from SAS?
Thanks for the answer, but I'm looking for the neighbor list for each data point. For example, if I have this data set:
name age zipcode alcohol
John 26 08439 yes
Cathy 49 47789 no
smith 37 90897 no
Tom 34 88642 yes
then I need this list:
name neighbor1 neighbor2
John Tom cathy
Cathy Tom Smith
Smith Cathy Tom
Tom John Cathy
I could not find this output in SAS. Is there any way that I can program it to get this list? Thank you!
I am not a SAS user, but a quick web lookup seems to give a good answer to your problem:
As far as I know, you do not have to implement it yourself. DISCRIM is enough.
Code for iris data from http://www.sas-programming.com/2010/05/k-nearest-neighbor-in-sas.html
ods select none;
proc surveyselect data=iris out=iris2
     samprate=0.5 method=srs outall;
run;
ods select all;
%let k=5;
proc discrim data=iris2(where=(selected=1))
     test=iris2(where=(selected=0))
     testout=iris2testout
     method=NPAR k=&k
     listerr crosslisterr;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
   title2 'Using KNN on Iris Data';
run;
A long and detailed description is also available here:
http://analytics.ncsu.edu/sesug/2012/SD-09.pdf
And from the SAS community:
Simply ask PROC DISCRIM to use nonparametric method by using option "METHOD=NPAR K=". Note that do not use "R=" option at the same time, which corresponds to radius-based of nearest-neighbor method. Also pay attention to how PROC DISCRIM treat categorical data automatically. Sometimes, you may want to change categorical data into metric coordinates in advance. Since PROC DISCRIM doesn't output the Tree it built internally, use "data= test= testout=" option to score new data set.
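Since the question is about the neighbor list itself rather than the classification, here is a brute-force Python sketch of what such a list looks like. It is not SAS and not what PROC DISCRIM does internally. It uses only the age column from the question's sample data as the feature, which is an assumption made for illustration, so the result differs from the illustrative list in the question.

```python
import math


def neighbor_list(points, k):
    """For each named point, return the names of its k nearest other points."""
    result = {}
    for name, features in points.items():
        # Euclidean distance to every other point, nearest first.
        others = sorted(
            (math.dist(features, other_features), other)
            for other, other_features in points.items()
            if other != name
        )
        result[name] = [other for _, other in others[:k]]
    return result


# Ages from the question's sample data; one-dimensional feature vectors.
people = {"John": (26,), "Cathy": (49,), "smith": (37,), "Tom": (34,)}

for name, neighbors in neighbor_list(people, k=2).items():
    print(name, *neighbors)
# John Tom smith
# Cathy smith Tom
# smith Tom John
# Tom smith John
```

With more columns, the features would need to be put on comparable scales first, and categorical columns such as alcohol would need a numeric encoding.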
