My background is in databases and SQL coding. I've used the CTABLES feature in SPSS a little, mostly for calculating percentiles, which is slow in SQL. But now I have a data set where I need to calculate percentiles for a weighted average, which is not as straightforward, and I can't figure out whether it's possible in SPSS or not.
I have data similar to the following
Country  Region   District   Units  Cost per Unit
USA      Central  DivisionQ     10  3
USA      Central  DivisionQ     12  2.5
USA      Central  DivisionQ     25  1.5
USA      Central  DivisionQ      6  4
USA      Central  DivisionA      3  3.25
USA      Central  DivisionA     76  1.75
USA      Central  DivisionA     42  1.5
USA      Central  DivisionA      1  8
USA      Eastern  DivisionQ     14  3
USA      Eastern  DivisionQ     25  2.5
USA      Eastern  DivisionQ     75  1.5
USA      Eastern  DivisionQ      9  4
USA      Eastern  DivisionA    100  3.25
USA      Eastern  DivisionA      4  1.75
USA      Eastern  DivisionA     33  1.5
USA      Eastern  DivisionA     17  8
Total                          452  51
For every possible segmentation (Country, Country-Region, Country-Region-District, Country-District, etc.) I want to get the average Cost per Unit, i.e. Cost per Unit weighted by Units, which is SUM(Units*CostPerUnit)/SUM(Units).
And I need the 10th, 25th, 50th, 75th, and 90th percentiles for each possible segmentation.
The way I do this part in SQL is: extract all the rows in the segment, sort and rank them by Cost per Unit, and get a running sum of Units for each row. The ratio of that running sum to the total units determines which row holds the Cost per Unit for a given percentile. An example, for Country = USA and District = DivisionQ:
Country  Region   District   Units  Cost Per Unit  Running Unit Total  Running / Total Units  Percentile
USA      Central  DivisionQ     25  1.5                            25  0.14                   10th
USA      Eastern  DivisionQ     75  1.5                           100  0.56                   25th/50th
USA      Central  DivisionQ     12  2.5                           112  0.63
USA      Eastern  DivisionQ     25  2.5                           137  0.77                   75th
USA      Central  DivisionQ     10  3                             147  0.83
USA      Eastern  DivisionQ     14  3                             161  0.91                   90th
USA      Central  DivisionQ      6  4                             167  0.94
USA      Eastern  DivisionQ      9  4                             176  1.00
This takes a very long time to do for each segment. Is it possible to leverage SPSS to do the same thing more easily?
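For reference, the running-sum procedure described above can be sketched as a small Python helper (illustrative only, not SPSS syntax; the data below is the DivisionQ segment from the example):

```python
def weighted_percentile(rows, p):
    """rows: list of (units, cost_per_unit); p: percentile as a fraction in [0, 1].
    Sort by cost, accumulate units, and return the cost at the first row
    whose cumulative share of total units reaches p."""
    rows = sorted(rows, key=lambda r: r[1])
    total = sum(units for units, _ in rows)
    running = 0
    for units, cost in rows:
        running += units
        if running / total >= p:
            return cost
    return rows[-1][1]

# DivisionQ rows from the example, as (Units, Cost per Unit)
division_q = [(10, 3), (12, 2.5), (25, 1.5), (6, 4),
              (14, 3), (25, 2.5), (75, 1.5), (9, 4)]

weighted_percentile(division_q, 0.10)  # -> 1.5 (running share 0.14)
weighted_percentile(division_q, 0.75)  # -> 2.5 (running share 0.77)

# The weighted average for the same segment:
weighted_avg = (sum(u * c for u, c in division_q)
                / sum(u for u, _ in division_q))
```

The 10th/75th results match the worked table above (0.14 and 0.77 running shares).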
Use SPLIT FILE (Data > Split File) to define the groups, WEIGHT BY Units (Data > Weight Cases) so the statistics are unit-weighted, and then use FREQUENCIES (Analyze > Descriptive Statistics > Frequencies) with /PERCENTILES to calculate the statistics. Suppress the actual frequency tables with /FORMAT=NOTABLE.
I have a small data set of ~200 samples taken over twenty years, with two columns of data that sometimes have multiple entries per period (i.e. age or date). When I go to plot it, even though the data spans 20 years, the graph heavily reflects the number of samples in each period rather than the period itself. For example, during age 23 there may be 2 or 3 samples, 1 for age 24, 20 for age 25, and 10 for age 35; the number of samples was driven entirely by the need for additional data at the time, so there is simply no consistency to the sample rate.
How do I get a Max or an Average/Max per period (age), and ensure there is only one entry per period in the sheet (about one entry per year), without having to create a separate sheet full of separate queries and charting off of that?
What I have tried in Google Sheets (where my data is): on the x-series of the chart, choosing "aggregate" (on the age period), which helps flatten the graph a bit but doesn't reduce the series.
A read-only link to the spreadsheet is HERE for reference.
Data Looking something like this:
3/27/2013 36.4247 2.5 29.3
4/10/2013 36.4630 1.8 42.8
4/15/2013 36.4767 2.2 33.9
5/2/2013 36.5233 2.2 33.9
5/21/2013 36.5753 1.91 39.9
5/29/2013 36.5973 1.94 39.2
7/29/2013 36.7644 1.98 38.3
10/25/2013 37.0055 1.7 45.6
2/28/2014 37.3507 1.85 50 41.3
6/1/2014 37.6055 1.98 38 38.1
12/1/2014 38.1068 37
6/1/2015 38.6055 2.18 34 33.9
12/11/2015 39.1342 3.03 23 23.1
12/14/2015 39.1425 3.18 22 21.9
12/15/2015 39.1452 3.44 20 20.0
12/17/2015 39.1507 3.61 19 18.9
12/21/2015 39.1616 3.62 19 18.8
12/23/2015 39.1671 3.32 21 20.8
12/25/2015 39.1726 3.08 23 22.7
12/28/2015 39.1808 3.12 22 22.4
12/29/2015 39.1836 2.97 24 23.7
12/30/2015 39.1863 3.57 19 19.1
12/31/2015 39.1890 3.37 20 20.5
1/1/2016 39.1918 3.37 20 20.5
1/3/2016 39.1973 2.65 27 27.0
1/4/2016 39.2000 2.76 26 25.8
try:
=QUERY(SORTN(SORT({YEAR($A$6:$A), B6:B}, 1, 0, 2, 0), 9^9, 2, 1, 1),
"where Col1 <> 1899")
demo spreadsheet
and build a chart from there
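Outside Sheets, the same one-row-per-period reduction can be sketched with pandas (the column names here are hypothetical stand-ins for the sheet's age and reading columns):

```python
import pandas as pd

# Hypothetical columns mirroring the sheet: fractional age and one reading.
df = pd.DataFrame({
    "age":   [36.42, 36.46, 39.13, 39.14, 39.19],
    "value": [2.5, 1.8, 3.03, 3.18, 2.76],
})

# Collapse to one row per whole-year period: the max (or mean) of the
# readings that fall in that year, regardless of how many samples it has.
per_year = df.groupby(df["age"].astype(int))["value"].agg(["max", "mean"])
print(per_year)
```

Charting `per_year` then gives one point per age, independent of the sample rate.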
I'm trying to import a table where commas are the thousands separator;
for example, 32,100 means 32100, but it is being treated as 32.1 instead.
This is a similar table (first one / top left):
https://en.wikipedia.org/wiki/Demographics_of_the_world
imgur for screenshots:
https://imgur.com/a/hJR9tox
I want it to say:
Year million
1500 458
1600 580
1700 682
1750 791
1800 978
1850 1262
1900 1650
1950 2521
1999 5978
2008 6707
2011 7000
2015 7350
2018 7600
2020 7750
But it comes out as:
Year million
1500 458
1600 580
1700 682
1750 791
1800 978
1850 1,262
1900 1,65
1950 2,521
1999 5,978
2008 6,707
2011 7
2015 7,35
2018 7,6
2020 7,75
This is the function I'm using:
=IMPORTHTML("https://en.wikipedia.org/wiki/Demographics_of_the_world"; "table"; 1)
I have also tried using this function:
=IMPORTXML("https://en.wikipedia.org/wiki/Demographics_of_the_world"; "//*[@id='mw-content-text']/div/table[1]/tbody")
But that comes out like this, which is extremely hard to understand, and it still removes the zeros:
World Population[1][2] Yearmillion 1500458 1600580 1700682 1750791 1800978 18501,262 19001,65 19502,521 19995,978 20086,707 20117 20157,35 20187,6 20207,75
Another thing I have tried is forcing it to always print three decimals, but that won't work since it adds digits to the end of every number.
The easiest possible solution is to change your spreadsheet's locale setting to one that uses , as the thousands separator.
As an alternative, if changing this setting is really not a possibility, you could create a script that uses URLFetchApp to retrieve the page's contents and parses the values, taking into account the usage of , as the thousands separator.
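The parsing step such a script needs is small; a minimal sketch in Python of treating , as a thousands separator (an Apps Script version would do the equivalent string replacement before `Number(...)`):

```python
def parse_with_thousands_sep(text):
    """Treat ',' as a thousands separator, not a decimal mark,
    by stripping it before converting to a number."""
    return int(text.replace(",", ""))

parse_with_thousands_sep("1,262")  # -> 1262
parse_with_thousands_sep("458")    # -> 458
```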
Preprocess the data and report the results before and after preprocessing (report as accuracy).
Draw the following charts:
Correlation heatmap chart
Missing-values heatmap chart
Line chart / scatter chart for Country vs Purchased, Age vs Purchased, and Salary vs Purchased
Country Age Salary Purchased
France 44 72000 No
Spain 27 48000 Yes
Germany 30 54000 No
Spain 38 61000 No
Germany 40 Yes
France 35 58000 Yes
Spain 52000 No
France 48 79000 Yes
Germany 50 83000 No
France 37 Yes
France 18888 No
Spain 17 67890 Yes
Germany 12000 No
Spain 38 98888 No
Germany 50 Yes
France 35 58000 Yes
Spain 12345 No
France 23 Yes
Germany 55 78456 No
France 43215 Yes
Sometimes it's hard to read a scatter plot like Country vs Purchased: all three countries in your list have some purchases, so the points overlap. It can be helpful to draw a heatmap here.
import pandas as pd
from matplotlib import pyplot as plt

# read the csv using pandas
df = pd.read_csv('Data.csv')
copydf = df.copy()  # .copy() so the "before" snapshot isn't aliased to df

# before data preprocessing
print(copydf)

# fill NaN values with the average age and salary
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# after data preprocessing
print(df)
plt.figure(1)
# Country Vs Purchased
plt.subplot(221)
plt.scatter(df['Country'], df['Purchased'])
plt.title('Country vs Purchased')
plt.grid(True)
# Age Vs Purchased
plt.subplot(222)
plt.scatter(df['Age'], df['Purchased'])
plt.title('Age vs Purchased')
plt.grid(True)
# Salary Vs Purchased
plt.subplot(223)
plt.scatter(df['Salary'], df['Purchased'])
plt.title('Salary vs Purchased')
plt.grid(True)
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.75,
wspace=0.5)
plt.show()
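The heatmap suggested above could be built from a Country × Purchased crosstab; a minimal sketch (assumes the same Data.csv columns; the tiny inline frame here is only illustrative, and `imshow` stands in for a dedicated heatmap library):

```python
import pandas as pd
from matplotlib import pyplot as plt

# Illustrative rows with the same columns as Data.csv
df = pd.DataFrame({
    "Country":   ["France", "Spain", "Germany", "Spain", "France"],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# Count responses per (Country, Purchased) pair...
counts = pd.crosstab(df["Country"], df["Purchased"])

# ...and render the count matrix as a heatmap.
plt.imshow(counts, cmap="Blues")
plt.xticks(range(len(counts.columns)), counts.columns)
plt.yticks(range(len(counts.index)), counts.index)
plt.colorbar()
# plt.show()
```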
I have a simple weather station DB with example content:
time humi1 humi2 light pressure station-id temp1 temp2
---- ----- ----- ----- -------- ---------- ----- -----
1530635257289147315 66 66 1834 1006 bee1 18.6 18.6
1530635317385229860 66 66 1832 1006 bee1 18.6 18.6
1530635377466534866 66 66 1829 1006 bee1 18.6 18.6
The station writes data every minute. I want a SELECT that returns not all the points, but just the points written every hour (or every 60th point, simply put). How can I achieve that?
I tried to experiment with ...WHERE time % 60 = 0, but it didn't work. It seems that the time column doesn't permit math operations (/, %, etc.).
GROUP BY along with one of the selector functions can do what you want:
SELECT FIRST("humi1"), FIRST("humi2"), ... GROUP BY time(1h)
Though for most climate data you'd want the MEAN or MEDIAN rather than a single data point every hour.
basic example, and more complex example
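Outside InfluxDB, the same hourly thinning can be sketched in pandas (hypothetical column name; `resample("1h").first()` mirrors FIRST(...) GROUP BY time(1h), and `.mean()` the suggested MEAN):

```python
import pandas as pd

# Minute-by-minute readings, as the station writes them (synthetic values).
idx = pd.date_range("2018-07-03 16:00", periods=180, freq="min")
df = pd.DataFrame({"humi1": range(180)}, index=idx)

# One row per hour: the first sample in each hour...
hourly_first = df.resample("1h").first()
# ...or the hourly average instead of a single point.
hourly_mean = df.resample("1h").mean()
```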
There are multiple endpoints which return "ENUM" types, such as:
the ENUM-integer language_id field from the get_survey_list and the get_survey_details endpoints
the String-ENUM type field from the get_collector_list endpoint
the String-ENUM collection_mode and status fields from the get_respondent_list endpoint
I understand what this means but I don't see the possible values documented anywhere. So, to repeat the title: what are the possible values for each of these enums?
I don't have enough reputation to comment, so to supplement Miles' answer, may I offer this list mapping the question types to the QType in the Relational Database format, as we are transitioning from that. The list is mapped to SM's ResponseTable.html, but that file does not give QType 160, or 70, which I guess is the ranking one.
Question Family Question Subtype QType
single_choice vertical 10
vertical_two_col 11
vertical_three_col 12
horiz 13
menu 14
multiple_choice vertical 20
vertical_two_col 21
vertical_three_col 22
horiz 23
matrix single 30
multi 40
menu 50
rating 60
ranking
open_ended numerical 80
single 90
multi 100
essay 110
demographic us 120
international 130
datetime date_only 140
time_only 141
both 142
presentation image
video
descriptive_text 160
The language_id, status and collection_mode enums are documented here: https://developer.surveymonkey.com/mashery/data_types
The String-ENUM type field from the get_collector_list endpoint:
Collector Types
url Url Collector
embedded Embedded Collector
email Email Collector
facebook Facebook Collector
audience SM Audience Collector
The String-ENUM collection_mode and status fields from the get_respondent_list endpoint:
Respondent Collection Modes
normal Collected response online
manual Admin entered response in settings
survey_preview Collected response on a preview screen
edited Collected via an edit to a previous response
Respondent Statuses
completed Finished answering the survey
partial Started but did not finish answering the survey
The ENUM-integer language_id field from the get_survey_list and the get_survey_details endpoints:
Language Ids
1 English
2 Chinese(Simplified)
3 Chinese(Traditional)
4 Danish
5 Dutch
6 Finnish
7 French
8 German
9 Greek
10 Italian
11 Japanese
12 Korean
13 Malay
14 Norwegian
15 Polish
16 Portuguese(Iberian)
17 Portuguese(Brazilian)
18 Russian
19 Spanish
20 Swedish
21 Turkish
22 Ukrainian
23 Reverse
24 Albanian
25 Arabic
26 Armenian
27 Basque
28 Bengali
29 Bosnian
30 Bulgarian
31 Catalan
32 Croatian
33 Czech
34 Estonian
35 Filipino
36 Georgian
37 Hebrew
38 Hindi
39 Hungarian
40 Icelandic
41 Indonesian
42 Irish
43 Kurdish
44 Latvian
45 Lithuanian
46 Macedonian
47 Malayalam
48 Persian
49 Punjabi
50 Romanian
51 Serbian
52 Slovak
53 Slovenian
54 Swahili
55 Tamil
56 Telugu
57 Thai
58 Vietnamese
59 Welsh