What might I add to this syntax to plot the main effects and interaction?
UNIANOVA grade BY gender school
/RANDOM=school
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/PRINT ETASQ DESCRIPTIVE HOMOGENEITY
/CRITERIA=ALPHA(.05)
/DESIGN=gender school gender*school.
This works:
UNIANOVA grade BY gender school
/RANDOM=school
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/PLOT=PROFILE(school gender school*gender) TYPE=LINE ERRORBAR=NO MEANREFERENCE=NO YAXIS=AUTO
/PRINT ETASQ DESCRIPTIVE HOMOGENEITY
/CRITERIA=ALPHA(0.05)
/DESIGN=gender school gender*school.
I'm trying to classify the sentences in a specific column into three labels with BART Large MNLI. The problem is that the model's output is the sentence, plus the three labels, plus the score for each label. Output example:
{'sequence': 'Growing special event set production/fabrication company
is seeking a full-time accountant with experience in entertainment
accounting. This position is located in a fast-paced production office
located near downtown Los Angeles.Responsibilities:• Payroll
management for 12+ employees, including processing new employee
paperwork.', 'labels': ['senior', 'middle', 'junior'], 'scores':
[0.5461998581886292, 0.327671617269516, 0.12612852454185486]}
What I need is to get a single column with only the label with the highest score, in this case "senior".
Any suggestions on how to do this? Right now my code looks like this:
from transformers import pipeline

df_test = df.sample(frac=0.0025)
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
sequence_to_classify = df_test["full_description"]
candidate_labels = ['senior', 'middle', 'junior']
df_test["seniority_label"] = df_test.apply(lambda x: classifier(x.full_description, candidate_labels, multi_label=True,), axis=1)
df_test.to_csv("Seniority_Classified_SampleTest.csv")
(I'm using a sample of the DataFrame to test the code.)
The code I followed comes from this page, where they do get a column of labels as output, though I don't know how: https://practicaldatascience.co.uk/machine-learning/how-to-classify-customer-service-emails-with-bart-mnli
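One way to keep only the best label is to post-process each result dict. The sketch below uses a hypothetical helper, `top_label`, that picks the label whose score is largest. The example dict is shortened from the output shown above, so no model is loaded here.

```python
def top_label(result):
    """Return the label with the highest score from a zero-shot result dict."""
    scores = result["scores"]
    # Index of the largest score; labels and scores are parallel lists.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return result["labels"][best_index]


# Result dict shaped like the example output above (sequence shortened).
example = {
    "sequence": "Growing special event set production/fabrication company ...",
    "labels": ["senior", "middle", "junior"],
    "scores": [0.5461998581886292, 0.327671617269516, 0.12612852454185486],
}

print(top_label(example))

# With the real pipeline, the same helper could wrap each call, e.g.:
# df_test["seniority_label"] = df_test["full_description"].apply(
#     lambda text: top_label(classifier(text, candidate_labels))
# )
```

Taking the maximum explicitly avoids relying on the order in which the labels are returned.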
I'm trying to use Numbers: Spreadsheet style formula to write a nested IF statement and I keep getting this error: Error parsing response. We got: " Show details. Here is the statement:
If("{{Alpine School District Schools}}"<>"", "{{Alpine School District Schools}}", If ("{{Bonsall Unified School District Schools}}"<>"","{{Bonsall Unified School District Schools}}", "NA")
(As an FYI, when I pasted this here I removed the numbers that preceded the field names, e.g. {{110470936__Alpine School District Schools}}.)
The statement is meant to say: if the Alpine School District field (which is a picklist) is not blank, use the school selected in that field; if it is blank but the Bonsall School District field is not blank, use the school selected in the Bonsall field; if Bonsall is also blank, use "NA".
Any guidance on how to write it correctly is welcome.
I believe you don't have enough closing parens on the end of your statement. Here's your input, but spaced out to be more informative:
If(
    "{{Alpine School District Schools}}"<>"",
    "{{Alpine School District Schools}}",
    If (
        "{{Bonsall Unified School District Schools}}"<>"",
        "{{Bonsall Unified School District Schools}}",
        "NA"
    )
<-- missing paren
Fixing it may be as easy as adding ) to the end of your statement.
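A quick way to confirm this kind of problem is to count unmatched parentheses while ignoring anything inside double quotes. The following is a minimal sketch in Python, not part of the formula tool itself; `missing_closing_parens` is a hypothetical helper name.

```python
def missing_closing_parens(formula):
    """Count '(' that are never closed, skipping text inside double quotes."""
    depth = 0
    in_string = False
    for ch in formula:
        if ch == '"':
            in_string = not in_string  # entering or leaving a quoted string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
    return depth


# The statement from the question, split over lines only for readability.
formula = (
    'If("{{Alpine School District Schools}}"<>"", '
    '"{{Alpine School District Schools}}", '
    'If ("{{Bonsall Unified School District Schools}}"<>"",'
    '"{{Bonsall Unified School District Schools}}", "NA")'
)

print(missing_closing_parens(formula))        # unmatched '(' in the original
print(missing_closing_parens(formula + ")"))  # after appending one ')'
```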
I currently have a database of about 600,000 records representing merchandise with category information, like this:
{'title': 'Canon camera', 'category': 'Camera'},
{'title': 'Panasonic refrigerator', 'category': 'Refrigerator'},
{'title': 'Logo', 'category': 'Toys'},
....
But some merchandise has no category information:
{'title': 'Iphone6', 'category': ''},
So I'm wondering whether it is possible to train a text classifier on my items' names using scikit-learn to predict which category an item should belong to. I'm framing this as a multi-class text classification problem, but each item also has one or more pictures, so maybe deep learning/Keras could be used as well?
I don't know the best way to solve this problem, so any suggestion or advice is welcome. Thank you for reading.
P.S. the actual text is in Japanese
You could build a character bigram or trigram model and count, for example, how often the trigram "pho" appears in the category "Camera".
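# Count how often each character trigram occurs in each category.
categories = sorted({r['category'] for r in records if r['category']})
trigrams = {}
for record in records:
    if not record['category']:  # only use the records that have a category
        continue
    title = record['title']
    cat = record['category']
    for trigram in zip(title, title[1:], title[2:]):
        if trigram not in trigrams:
            # Start every category at zero for a newly seen trigram.
            trigrams[trigram] = {category: 0 for category in categories}
        trigrams[trigram][cat] += 1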
Now you can use the title's trigrams to calculate a score:
scores = []
for trigram in zip(title, title[1:], title[2:]):
    if trigram not in trigrams:  # never seen in training: no evidence, skip it
        continue
    score = []
    for cat in categories:
        score.append(trigrams[trigram][cat])
    # Normalize the counts for this trigram so they sum to 1
    sum_ = float(sum(score))
    score = [s / sum_ for s in score]
    scores.append(score)
Now scores contains one probability distribution per trigram: P(class | trigram). It does not take into account that some classes are simply more common (the prior; see Bayes' theorem). I'm also not sure whether something should be done about titles that are very long and therefore have many trigrams. I guess including the prior handles that already.
If it turns out that you have many trigrams missing, you could switch to bigrams. Or simply do Laplace smoothing.
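A minimal sketch of how the prior and Laplace smoothing mentioned above could be combined. Note that this swaps in a slightly different formulation: a multinomial naive Bayes over character trigrams that scores P(trigram | class) together with the class prior, rather than the P(class | trigram) table built above. The function names, the `alpha` parameter and the toy records are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def char_trigrams(title):
    """Return all overlapping 3-character substrings of a title."""
    return [title[i:i + 3] for i in range(len(title) - 2)]


def train(records):
    """Count trigrams per category, skipping records without a category."""
    trigram_counts = defaultdict(Counter)  # category -> trigram -> count
    category_counts = Counter()            # category -> number of records
    vocabulary = set()
    for record in records:
        cat = record["category"]
        if not cat:
            continue
        category_counts[cat] += 1
        for tg in char_trigrams(record["title"]):
            trigram_counts[cat][tg] += 1
            vocabulary.add(tg)
    return trigram_counts, category_counts, vocabulary


def predict(title, trigram_counts, category_counts, vocabulary, alpha=1.0):
    """Pick the category with the highest log prior + smoothed log likelihood."""
    total_records = sum(category_counts.values())
    best_cat, best_score = None, -math.inf
    for cat, n_records in category_counts.items():
        score = math.log(n_records / total_records)  # class prior
        total = sum(trigram_counts[cat].values())
        denom = total + alpha * len(vocabulary)
        for tg in char_trigrams(title):
            # Laplace smoothing: unseen trigrams get a small non-zero probability.
            score += math.log((trigram_counts[cat][tg] + alpha) / denom)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


records = [
    {"title": "Canon camera", "category": "Camera"},
    {"title": "Nikon camera", "category": "Camera"},
    {"title": "Panasonic refrigerator", "category": "Refrigerator"},
    {"title": "Sharp refrigerator", "category": "Refrigerator"},
    {"title": "Iphone6", "category": ""},
]
model = train(records)
print(predict("Sony camera", *model))
```

Working in log space keeps long titles from underflowing to zero when many small probabilities are multiplied.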
Edit: I've just seen that the text is in Japanese. I think the n-gram approach might be useless there. You could translate the names. However, it is probably easier to use other sources for this information (e.g. Wikipedia / Amazon / eBay?).
I have a dataset of 5 million records in this basic format:
FName  LName   UniqueID  DOB
John   Smith   987678    10/08/1976
John   Smith   987678    10/08/1976
Mary   Martin  567834    2/08/1980
John   Smit    987678    10/08/1976
Mary   Martin  768987    2/08/1980
The DOB is always unique, but I have cases with the same ID and different name spellings, or with different IDs and the same name.
I got as far as making SPSS recognize that John Smit and John Smith with the same DOB are the same person, and I used AGGREGATE to show, next to each name, how many times that spelling was used (John Smith, 10; John Smit, 5).
Case 1:
What I would like to do is to loop through all the records for the people identified to be the same person, and get the most common spelling of the person's name and use that as their standard name.
Case 2:
If I have multiple IDs for the same person, take the lowest one and make that the standard.
I am comfortable using basic syntax to clean my data, but this is the only thing that I'm stuck on.
If UniqueID really is a unique ID for individuals in the population and you want to find variations in name spelling (within each ID) and assign the modal spelling, then something like this would work:
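* Build a combined name and count each spelling within each ID.
STRING FirstLastName (A99).
COMPUTE FirstLastName = CONCAT(RTRIM(FName), " ", LName).
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UniqueID FirstLastName /Count=N.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=UniqueID /MaxCount=MAX(Count).
* Blank out non-modal spellings (a string variable cannot take $SYSMIS).
IF (Count<>MaxCount) FirstLastName = "".
* Spread the modal spelling to every record with the same ID.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES /BREAK=UniqueID /FirstLastName=MAX(FirstLastName).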
You could then overwrite the FName and LName fields as well, but more assumptions would have to be made if, for example, FName or LName can contain space characters, etc.
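For comparison, the same modal-spelling logic, plus the lowest-ID rule from Case 2 that the syntax above does not cover, can be sketched outside SPSS. This is a plain-Python illustration on the five sample rows from the question. The variable names are hypothetical, and matching people on exact name plus DOB for Case 2 is an assumption.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (FName, LName, UniqueID, DOB).
records = [
    ("John", "Smith", 987678, "10/08/1976"),
    ("John", "Smith", 987678, "10/08/1976"),
    ("Mary", "Martin", 567834, "2/08/1980"),
    ("John", "Smit", 987678, "10/08/1976"),
    ("Mary", "Martin", 768987, "2/08/1980"),
]

# Case 1: for each ID, count every spelling and keep the most common one.
spellings = defaultdict(Counter)
for fname, lname, uid, dob in records:
    spellings[uid][(fname, lname)] += 1
standard_name = {
    uid: counts.most_common(1)[0][0] for uid, counts in spellings.items()
}

# Case 2: for each (name, DOB) combination, keep the lowest ID seen.
lowest_id = {}
for fname, lname, uid, dob in records:
    key = (fname, lname, dob)
    lowest_id[key] = min(uid, lowest_id.get(key, uid))

print(standard_name[987678])
print(lowest_id[("Mary", "Martin", "2/08/1980")])
```

When two spellings are tied, `most_common` returns whichever was seen first, so a real cleaning job would need an explicit tie-breaking rule.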
I'm currently using PROC DISCRIM in SAS to run a kNN analysis on a data set, but the problem may require me to get the top-k neighbor list for each row in my table. How can I get this list from SAS?
Thanks for the answer, but I'm looking for the neighbor list for each data point. For example, if I have this data set:
name   age  zipcode  alcohol
John   26   08439    yes
Cathy  49   47789    no
Smith  37   90897    no
Tom    34   88642    yes
then I need this list:
name   neighbor1  neighbor2
John   Tom        Cathy
Cathy  Tom        Smith
Smith  Cathy      Tom
Tom    John       Cathy
I could not find this output in SAS. Is there any way I can program it to get this list? Thank you!
I am not a SAS user, but a quick web search seems to give good answers to your problem:
As far as I know, you do not have to implement it yourself. DISCRIM is enough.
Code for iris data from http://www.sas-programming.com/2010/05/k-nearest-neighbor-in-sas.html
ods select none;
proc surveyselect data=iris out=iris2
     samprate=0.5 method=srs outall;
run;
ods select all;
%let k=5;
proc discrim data=iris2(where=(selected=1))
     test=iris2(where=(selected=0))
     testout=iris2testout
     method=NPAR k=&k
     listerr crosslisterr;
     class Species;
     var SepalLength SepalWidth PetalLength PetalWidth;
     title2 'Using KNN on Iris Data';
run;
A long and detailed description is also available here:
http://analytics.ncsu.edu/sesug/2012/SD-09.pdf
And from the SAS community:
Simply ask PROC DISCRIM to use the nonparametric method with the option "METHOD=NPAR K=". Note: do not use the "R=" option at the same time, which corresponds to the radius-based nearest-neighbor method. Also pay attention to how PROC DISCRIM treats categorical data automatically. Sometimes you may want to convert categorical data into metric coordinates in advance. Since PROC DISCRIM doesn't output the tree it builds internally, use the "data= test= testout=" options to score a new data set.
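If the per-row neighbor list from the question is still needed, one option outside SAS is a brute-force computation. Below is a minimal sketch in Python, assuming Euclidean distance on numeric features only. It uses just the age column from the example data (treating a zip code as a number would not be meaningful), so the neighbors differ from the illustrative list in the question. `neighbor_lists` is a hypothetical helper name.

```python
import math


def neighbor_lists(points, k):
    """Return, for every name, the names of its k nearest other points."""
    result = {}
    for name, features in points.items():
        # Distance from this point to every other point.
        others = [
            (math.dist(features, other_features), other_name)
            for other_name, other_features in points.items()
            if other_name != name
        ]
        others.sort()  # nearest first; ties are broken by name
        result[name] = [other_name for _, other_name in others[:k]]
    return result


# Age only, taken from the example table in the question.
points = {
    "John": (26,),
    "Cathy": (49,),
    "Smith": (37,),
    "Tom": (34,),
}

neighbors = neighbor_lists(points, k=2)
for name, nearest in neighbors.items():
    print(name, *nearest)
```

This compares every pair of rows, so it is only practical for small tables; larger data would need an index structure such as a k-d tree.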