Data Visualization & Machine Learning

Preprocess the data and compare the results before and after preprocessing (report as accuracy).
Draw the following charts:
Correlation heatmap chart
Missing-values heatmap chart
Line chart / scatter chart for Country vs Purchased, Age vs Purchased, and Salary vs Purchased
Country   Age   Salary   Purchased
France     44    72000   No
Spain      27    48000   Yes
Germany    30    54000   No
Spain      38    61000   No
Germany    40            Yes
France     35    58000   Yes
Spain            52000   No
France     48    79000   Yes
Germany    50    83000   No
France     37            Yes
France           18888   No
Spain      17    67890   Yes
Germany          12000   No
Spain      38    98888   No
Germany    50            Yes
France     35    58000   Yes
Spain            12345   No
France     23            Yes
Germany    55    78456   No
France           43215   Yes
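If Data.csv is not at hand, the table can be rebuilt inline. The NaN placement below follows the table above, reading the five-digit numbers in incomplete rows as salaries (an inference, since ages that large are impossible):

import numpy as np
import pandas as pd

# Rebuild the table above so the script below runs without the original file
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France',
                'Spain', 'France', 'Germany', 'France', 'France', 'Spain',
                'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France',
                'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37, np.nan, 17,
            np.nan, 38, 50, 35, np.nan, 23, 55, np.nan],
    'Salary': [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000,
               83000, np.nan, 18888, 67890, 12000, 98888, np.nan, 58000,
               12345, np.nan, 78456, 43215],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No',
                  'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes',
                  'No', 'Yes'],
})
df.to_csv('Data.csv', index=False)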

Sometimes a scatter plot like Country vs Purchased is hard to interpret: all three countries in the list have some purchases, so the points overlap. A heatmap can be more helpful here.
import pandas as pd
from matplotlib import pyplot as plt

# Read the CSV with pandas
df = pd.read_csv('Data.csv')
copydf = df.copy()  # independent copy: keeps the NaNs for the 'before' printout

# Before preprocessing
print(copydf)

# Fill NaN values with the column means of Age and Salary
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# After preprocessing
print(df)

plt.figure(1)

# Country vs Purchased
plt.subplot(221)
plt.scatter(df['Country'], df['Purchased'])
plt.title('Country vs Purchased')
plt.grid(True)

# Age vs Purchased
plt.subplot(222)
plt.scatter(df['Age'], df['Purchased'])
plt.title('Age vs Purchased')
plt.grid(True)

# Salary vs Purchased
plt.subplot(223)
plt.scatter(df['Salary'], df['Purchased'])
plt.title('Salary vs Purchased')
plt.grid(True)

plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95,
                    hspace=0.75, wspace=0.5)
plt.show()
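The correlation and missing-values heatmaps the assignment asks for are quick to add. A minimal sketch continuing the script above, assuming seaborn is installed (seaborn is an assumption, not part of the original code):

import seaborn as sns  # assumed available; not used in the original script

# Correlation heatmap of the numeric columns
plt.figure(2)
sns.heatmap(df[['Age', 'Salary']].corr(), annot=True)
plt.title('Correlation heatmap')

# Missing-values heatmap; use copydf, which still contains the NaNs
plt.figure(3)
sns.heatmap(copydf.isnull(), cbar=False)
plt.title('Missing values heatmap')

# Country vs Purchased as a count heatmap instead of a scatter
plt.figure(4)
sns.heatmap(pd.crosstab(df['Country'], df['Purchased']), annot=True, fmt='d')
plt.title('Country vs Purchased (counts)')
plt.show()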

Related

Averaging a Data Series in a Google Sheet to a single entry per period regardless of the number of samples in the larger period?

I have a small data set of ~200 samples taken over twenty years, with two columns of data that sometimes have multiple entries per period (i.e. age or date). When I go to plot it, even though the data spans 20 years, the graph heavily reflects the number of samples in each period rather than the period itself. For example, during age 23 there may be 2 or 3 samples, 1 for age 24, 20 for age 25, and 10 for age 35; the number of samples depended entirely on the need for additional data at the time, so there is simply no consistency in the sample rate.
How do I get a Max or an Average for a period (age) and ensure there is only one entry per period in the sheet (about one entry per year), without having to create a separate sheet full of separate queries and charting off of that?
What I have tried in Google Sheets (where my data is) is choosing "aggregate" on the x-series of the chart (which is on the age period), which helps flatten the graph a bit but doesn't reduce the series.
A read-only link to the spreadsheet is HERE for reference.
Data Looking something like this:
3/27/2013 36.4247 2.5 29.3
4/10/2013 36.4630 1.8 42.8
4/15/2013 36.4767 2.2 33.9
5/2/2013 36.5233 2.2 33.9
5/21/2013 36.5753 1.91 39.9
5/29/2013 36.5973 1.94 39.2
7/29/2013 36.7644 1.98 38.3
10/25/2013 37.0055 1.7 45.6
2/28/2014 37.3507 1.85 50 41.3
6/1/2014 37.6055 1.98 38 38.1
12/1/2014 38.1068 37
6/1/2015 38.6055 2.18 34 33.9
12/11/2015 39.1342 3.03 23 23.1
12/14/2015 39.1425 3.18 22 21.9
12/15/2015 39.1452 3.44 20 20.0
12/17/2015 39.1507 3.61 19 18.9
12/21/2015 39.1616 3.62 19 18.8
12/23/2015 39.1671 3.32 21 20.8
12/25/2015 39.1726 3.08 23 22.7
12/28/2015 39.1808 3.12 22 22.4
12/29/2015 39.1836 2.97 24 23.7
12/30/2015 39.1863 3.57 19 19.1
12/31/2015 39.1890 3.37 20 20.5
1/1/2016 39.1918 3.37 20 20.5
1/3/2016 39.1973 2.65 27 27.0
1/4/2016 39.2000 2.76 26 25.8
try:
=QUERY(SORTN(SORT({YEAR($A$6:$A), B6:B}, 1, 0, 2, 0), 9^9, 2, 1, 1),
"where Col1 <> 1899")
demo spreadsheet
and build a chart from there
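The QUERY/SORTN formula above collapses the data to one row per year inside the sheet. For comparison, the average-per-period idea sketched in pandas, with samples.csv and the column name Date standing in as assumptions for an export of the sheet:

import pandas as pd

# One row per period (the year of each sample), averaging all samples in it;
# swap .mean() for .max() if a maximum per period is wanted instead
df = pd.read_csv('samples.csv', parse_dates=['Date'])
yearly = df.groupby(df['Date'].dt.year).mean(numeric_only=True)
print(yearly)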

Select every hour query

I have a simple weather station DB with example content:
time humi1 humi2 light pressure station-id temp1 temp2
---- ----- ----- ----- -------- ---------- ----- -----
1530635257289147315 66 66 1834 1006 bee1 18.6 18.6
1530635317385229860 66 66 1832 1006 bee1 18.6 18.6
1530635377466534866 66 66 1829 1006 bee1 18.6 18.6
The station writes data every minute. I want the SELECT to return not all the series, but just the points written every hour (every 60th point, simply said). How can I achieve it?
I tried to experiment with ...WHERE time % 60 = 0, but it didn't work. It seems that the time column doesn't permit any math operations (/, %, etc.).
GROUP BY along with one of the selector functions can do what you want:
SELECT FIRST("humi1"), FIRST("humi2"), ... GROUP BY time(1h)
I would imagine for most climate data you'd want the MEAN or MEDIAN rather than a single data point every hour.
basic example, and more complex example
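If the minute-level data is pulled into Python anyway, the same hourly downsampling is short in pandas. A sketch, with weather.csv and its time column as assumptions for illustration:

import pandas as pd

# First reading of each hour (like the FIRST() query above), plus the
# hourly mean the second comment suggests
df = pd.read_csv('weather.csv', parse_dates=['time']).set_index('time')
hourly_first = df.resample('1h').first()
hourly_mean = df.resample('1h').mean(numeric_only=True)
print(hourly_first.head())
print(hourly_mean.head())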

GLMM glmer and glmmADMB - comparison error

I am trying to compare whether there are differences in the number of obtained seeds across five populations under different applied treatments, with maternal plant and paternal plant as random effects. First I tried to fit a glmer model.
dat <- dat[, c(12, 7, 6, 13, 8, 11)]
dat$parents <- factor(paste(dat$mother, dat$father, sep = "_"))

compareTreat <- function(d)
{
  d$treatment <- factor(d$treatment)
  print(tapply(d$pop, list(d$pop, d$treatment), length))
  print(summary(fit <- glmer(seed_no ~ treatment + (1 | pop/mother) +
                               (1 | pop/father),
                             data = d, family = "poisson")))
}
Then I compared two treatments in two populations (pop 64 and pop 121, in this case). The other populations do not have these particular treatments, so I get NA values for them.
compareTreat(subset(dat,treatment%in%c("IE 5x","IE 7x")&pop%in%c(64,121)))
This is the output:
    IE 5x  IE 7x
10     NA     NA
45     NA     NA
64     31     27
121    33     28
144    NA     NA
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: seed_no ~ treatment + (1 | pop/mother) + (1 | pop/father)
Data: d
AIC BIC logLik deviance df.resid
592.5 609.2 -290.2 580.5 113
Scaled residuals:
Min 1Q Median 3Q Max
-1.8950 -0.8038 -0.2178 0.4440 1.7991
Random effects:
Groups Name Variance Std.Dev.
father.pop (Intercept) 3.566e-01 5.971e-01
mother.pop (Intercept) 9.456e-01 9.724e-01
pop (Intercept) 1.083e-10 1.041e-05
pop.1 (Intercept) 1.017e-10 1.008e-05
Number of obs: 119, groups: father:pop, 81; mother:pop, 24; pop, 2
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.74664 0.24916 2.997 0.00273 **
treatmentIE 7x -0.05789 0.17894 -0.324 0.74629
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
tretmntIE7x -0.364
It seems there are no differences between treatments. But as there are many zeros in the data, a zero-inflated model would be worth trying. I tried glmmADMB, writing the script like this:
compareTreat <- function(d)
{
  d$treatment <- factor(d$treatment)
  print(tapply(d$pop, list(d$pop, d$treatment), length))
  print(summary(fit_zip <- glmmadmb(seed_no ~ treatment + (1 | pop/mother) +
                                      (1 | pop/father),
                                    data = d, family = "poisson",
                                    zeroInflation = TRUE)))
}
Then I compared the treatments again, with the call itself unchanged.
compareTreat(subset(dat,treatment%in%c("IE 5x","IE 7x")&pop%in%c(64,121)))
But in that case, the output is
    IE 5x  IE 7x
10     NA     NA
45     NA     NA
64     31     27
121    33     28
144    NA     NA
Error in pop:father : NA/NaN argument
In addition: Warning messages:
1: In pop:father :
numerical expression has 119 elements: only the first used
2: In pop:father :
numerical expression has 119 elements: only the first used
3: In eval(parse(text = x), data) : NAs introduced by coercion
Called from: eval(parse(text = x), data)
I tried changing everything I could think of, but I still don't know where the problem is. If I remove (1|pop/father) from the glmmadmb call, the model runs, but that does not feel correct. I wonder whether the mistake is in the function before the glmmadmb call (although it worked fine for the glmer model) or in the comparison itself after the model. I also tried removing NAs with na.omit in case that was the issue, but it made no difference. Why does the script stop instead of continuing to run?
I am a beginner with RStudio; my R version is 3.4.2 ("Short Summer"). If someone with experience could point me in the right direction I would be very grateful!
H.

SPSS Calculate percentiles with weighted average

My background is in databases and SQL coding. I've used the CTABLES feature in SPSS a little, mostly for calculating percentiles, which is slow in SQL. But now I have a data set where I need to calculate percentiles with a weighted average, which is not as straightforward, and I can't figure out whether it's possible in SPSS or not.
I have data similar to the following
Country  Region   District   Units  Cost per Unit
USA      Central  DivisionQ     10  3
USA      Central  DivisionQ     12  2.5
USA      Central  DivisionQ     25  1.5
USA      Central  DivisionQ      6  4
USA      Central  DivisionA      3  3.25
USA      Central  DivisionA     76  1.75
USA      Central  DivisionA     42  1.5
USA      Central  DivisionA      1  8
USA      Eastern  DivisionQ     14  3
USA      Eastern  DivisionQ     25  2.5
USA      Eastern  DivisionQ     75  1.5
USA      Eastern  DivisionQ      9  4
USA      Eastern  DivisionA    100  3.25
USA      Eastern  DivisionA      4  1.75
USA      Eastern  DivisionA     33  1.5
USA      Eastern  DivisionA     17  8
Total                          452  51
For every possible segmentation (Country, Country-Region, Country-Region-District, Country-District, etc.)
I want to get the average Cost per Unit, i.e. Cost per Unit weighted by Units, which is SUM(Units*CostPerUnit)/SUM(Units).
And I need to get the 10th, 25th, 50th, 75th, and 90th percentiles for each possible segmentation.
The way I do this part in SQL is to extract all the rows in the segment, sort and rank them by Cost per Unit, and keep a running sum of Units for each row. The ratio of that running sum to the total units determines which row holds the Cost per Unit for each percentile. An example, for Country = USA and Division = Q:
Country  Region   District   Units  Cost per Unit  Running Unit Total  Running / Total Units  Percentile
USA      Central  DivisionQ     25  1.5                            25  0.14                   10th
USA      Eastern  DivisionQ     75  1.5                           100  0.56                   25th/50th
USA      Central  DivisionQ     12  2.5                           112  0.63
USA      Eastern  DivisionQ     25  2.5                           137  0.77                   75th
USA      Central  DivisionQ     10  3                             147  0.83
USA      Eastern  DivisionQ     14  3                             161  0.91                   90th
USA      Central  DivisionQ      6  4                             167  0.94
USA      Eastern  DivisionQ      9  4                             176  1.00
This takes a very long time to do for each segment. Is it possible to leverage SPSS to do the same thing more easily?
Use SPLIT FILE (Data > Split File) to define the groups, weight the cases by Units (Data > Weight Cases, or WEIGHT BY Units) so the percentiles are unit-weighted, and then use FREQUENCIES (Analyze > Descriptive Statistics > Frequencies) to calculate the statistics. Suppress the actual frequency tables (/FORMAT=NOTABLE).
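Outside SPSS, the running-sum logic the question describes can be sketched directly. A minimal pandas/numpy version, with costs.csv as a hypothetical export of the table above:

import numpy as np
import pandas as pd

def weighted_percentiles(costs, units, pcts=(10, 25, 50, 75, 90)):
    # Sort by cost, accumulate unit weights, and take the first row whose
    # cumulative share of total units reaches each requested percentile,
    # which is the same running-sum logic as the SQL approach above
    costs = np.asarray(costs, dtype=float)
    units = np.asarray(units, dtype=float)
    order = np.argsort(costs)
    costs, units = costs[order], units[order]
    share = np.cumsum(units) / units.sum()
    idx = np.minimum(np.searchsorted(share, np.asarray(pcts) / 100),
                     len(costs) - 1)
    return dict(zip(pcts, costs[idx]))

df = pd.read_csv('costs.csv')  # hypothetical export of the table above
for keys, seg in df.groupby(['Country', 'Region', 'District']):
    print(keys, weighted_percentiles(seg['Cost per Unit'], seg['Units']))

For the USA/DivisionQ example above this returns 1.5 for the 10th, 25th, and 50th percentiles, 2.5 for the 75th, and 3 for the 90th, matching the hand-worked table.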

What are the expected values for the various "ENUM" types returned by the SurveyMonkey API?

There are multiple endpoints which return "ENUM" types, such as:
the ENUM-integer language_id field from the get_survey_list and the get_survey_details endpoints
the String-ENUM type field from the get_collector_list endpoint
the String-ENUM collection_mode and status fields from the get_respondent_list endpoint
I understand what this means but I don't see the possible values documented anywhere. So, to repeat the title: what are the possible values for each of these enums?
I don't have enough reputation to comment, so to complement Miles' answer may I offer this list mapping the types to the QType in the Relational Database format, as we are transitioning from that. The list is mapped to SM's ResponseTable.html, but that file does not give QType 160, or 70, which I guess is the ranking one.
Question Family   Question Subtype     QType
single_choice     vertical             10
                  vertical_two_col     11
                  vertical_three_col   12
                  horiz                13
                  menu                 14
multiple_choice   vertical             20
                  vertical_two_col     21
                  vertical_three_col   22
                  horiz                23
matrix            single               30
                  multi                40
                  menu                 50
                  rating               60
                  ranking
open_ended        numerical            80
                  single               90
                  multi                100
                  essay                110
demographic       us                   120
                  international        130
datetime          date_only            140
                  time_only            141
                  both                 142
presentation      image
                  video
                  descriptive_text     160
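If the mapping is needed in code during such a transition, here is a direct transcription of the table above as a Python dict (ranking, image, and video are omitted because their codes are not given):

# (question_family, question_subtype) -> QType, transcribed from the table above
QTYPE = {
    ('single_choice', 'vertical'): 10,
    ('single_choice', 'vertical_two_col'): 11,
    ('single_choice', 'vertical_three_col'): 12,
    ('single_choice', 'horiz'): 13,
    ('single_choice', 'menu'): 14,
    ('multiple_choice', 'vertical'): 20,
    ('multiple_choice', 'vertical_two_col'): 21,
    ('multiple_choice', 'vertical_three_col'): 22,
    ('multiple_choice', 'horiz'): 23,
    ('matrix', 'single'): 30,
    ('matrix', 'multi'): 40,
    ('matrix', 'menu'): 50,
    ('matrix', 'rating'): 60,
    ('open_ended', 'numerical'): 80,
    ('open_ended', 'single'): 90,
    ('open_ended', 'multi'): 100,
    ('open_ended', 'essay'): 110,
    ('demographic', 'us'): 120,
    ('demographic', 'international'): 130,
    ('datetime', 'date_only'): 140,
    ('datetime', 'time_only'): 141,
    ('datetime', 'both'): 142,
    ('presentation', 'descriptive_text'): 160,
}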
The language_id, status and collection_mode enums are documented here: https://developer.surveymonkey.com/mashery/data_types
The String-ENUM type field from the get_collector_list endpoint:
Collector Types
url Url Collector
embedded Embedded Collector
email Email Collector
facebook Facebook Collector
audience SM Audience Collector
The String-ENUM collection_mode and status fields from the get_respondent_list endpoint:
Respondent Collection Modes
normal Collected response online
manual Admin entered response in settings
survey_preview Collected response on a preview screen
edited Collected via an edit to a previous response
Respondent Statuses
completed Finished answering the survey
partial Started but did not finish answering the survey
The ENUM-integer language_id field from the get_survey_list and the get_survey_details endpoints:
Language Ids
1 English
2 Chinese(Simplified)
3 Chinese(Traditional)
4 Danish
5 Dutch
6 Finnish
7 French
8 German
9 Greek
10 Italian
11 Japanese
12 Korean
13 Malay
14 Norwegian
15 Polish
16 Portuguese(Iberian)
17 Portuguese(Brazilian)
18 Russian
19 Spanish
20 Swedish
21 Turkish
22 Ukrainian
23 Reverse
24 Albanian
25 Arabic
26 Armenian
27 Basque
28 Bengali
29 Bosnian
30 Bulgarian
31 Catalan
32 Croatian
33 Czech
34 Estonian
35 Filipino
36 Georgian
37 Hebrew
38 Hindi
39 Hungarian
40 Icelandic
41 Indonesian
42 Irish
43 Kurdish
44 Latvian
45 Lithuanian
46 Macedonian
47 Malayalam
48 Persian
49 Punjabi
50 Romanian
51 Serbian
52 Slovak
53 Slovenian
54 Swahili
55 Tamil
56 Telugu
57 Thai
58 Vietnamese
59 Welsh
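When consuming these enums in code, a small lookup helper keeps unknown values from blowing up. The dicts below transcribe the respondent lists above; the response shape is a hypothetical stand-in for what get_respondent_list returns:

# Transcribed from the respondent lists above; unknown values fall back
# to a placeholder instead of raising
COLLECTION_MODES = {
    'normal': 'Collected response online',
    'manual': 'Admin entered response in settings',
    'survey_preview': 'Collected response on a preview screen',
    'edited': 'Collected via an edit to a previous response',
}
STATUSES = {
    'completed': 'Finished answering the survey',
    'partial': 'Started but did not finish answering the survey',
}

def describe_respondent(respondent):
    # respondent is a hypothetical dict with the fields the question names
    mode = COLLECTION_MODES.get(respondent.get('collection_mode'), 'unknown mode')
    status = STATUSES.get(respondent.get('status'), 'unknown status')
    return f'{status} ({mode})'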
