I have data in the following structure, and my goal is to group it by the columns Region, Manager, and Employee Name and then show all the product IDs sold by each group. The data is in the following format, where Product ID Sold is the unique key value.
Region        Manager  Employee Name  Product ID Sold
Mid Atlantic  Jim      Name 1         11112
North East    Bob      Name 2         21323
Mid Atlantic  Cat      Name 3         43124
Mid Atlantic  Cat      Name 4         123421
North East    Bob      Name 5         3245
North West    Kate     Name 6         12343124
North West    Mike     Name 7         1234324
North West    Mike     Name 8         3234
North West    Kate     Name 9         53125
Mid Atlantic  Jim      Name 1         2133
North East    Bob      Name 2         123123
Mid Atlantic  Cat      Name 3         123213
Mid Atlantic  Cat      Name 4         213123
North East    Bob      Name 5         123213
North West    Kate     Name 6         123123
North West    Mike     Name 7         123123
North West    Mike     Name 8         123123
North West    Kate     Name 9         123213
Goal Structure
Region        Manager  Employee Name  Product ID Sold
Mid Atlantic  Jim      Name 1         11112
                                      2133
                                      213213
                                      1232
                                      1123
              Cat      Name 3         43124
                                      123213
                       Name 4         123421
                                      213123
North East    Bob      Name 2         21323
                                      123123
                       Name 5         …
                                      …
North West    Kate     Name 6         …
                                      …
                                      …
                                      …
                       Name 9         …
                                      …
              Mike     Name 7         …
                                      …
                       Name 8         …
                                      …
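One way to get this nested layout can be sketched in plain Python. This is only a sketch under assumptions: the rows are taken from the sample table above and held as tuples, and the names `group_products` and `format_nested` are hypothetical helpers, not part of any library. Because it uses only the sample rows, Name 1 shows two product IDs rather than the five in the goal structure.

```python
from collections import defaultdict

# Rows copied from the sample table: (region, manager, employee, product_id).
rows = [
    ("Mid Atlantic", "Jim", "Name 1", 11112),
    ("North East", "Bob", "Name 2", 21323),
    ("Mid Atlantic", "Cat", "Name 3", 43124),
    ("Mid Atlantic", "Cat", "Name 4", 123421),
    ("North East", "Bob", "Name 5", 3245),
    ("North West", "Kate", "Name 6", 12343124),
    ("North West", "Mike", "Name 7", 1234324),
    ("North West", "Mike", "Name 8", 3234),
    ("North West", "Kate", "Name 9", 53125),
    ("Mid Atlantic", "Jim", "Name 1", 2133),
    ("North East", "Bob", "Name 2", 123123),
    ("Mid Atlantic", "Cat", "Name 3", 123213),
    ("Mid Atlantic", "Cat", "Name 4", 213123),
    ("North East", "Bob", "Name 5", 123213),
    ("North West", "Kate", "Name 6", 123123),
    ("North West", "Mike", "Name 7", 123123),
    ("North West", "Mike", "Name 8", 123123),
    ("North West", "Kate", "Name 9", 123213),
]


def group_products(records):
    """Collect product IDs per (region, manager, employee), keeping input order."""
    grouped = defaultdict(list)
    for region, manager, employee, product_id in records:
        grouped[(region, manager, employee)].append(product_id)
    return dict(grouped)


def format_nested(grouped):
    """Build one line per product ID, blanking labels that repeat the line above."""
    lines = []
    prev_region = prev_manager = None
    for region, manager, employee in sorted(grouped):
        new_region = region != prev_region
        new_manager = new_region or manager != prev_manager
        for i, product_id in enumerate(grouped[(region, manager, employee)]):
            first = i == 0
            show_region = region if first and new_region else ""
            show_manager = manager if first and new_manager else ""
            show_employee = employee if first else ""
            lines.append(
                f"{show_region:<14}{show_manager:<9}{show_employee:<15}{product_id}"
            )
        prev_region, prev_manager = region, manager
    return lines


grouped = group_products(rows)
lines = format_nested(grouped)
for line in lines:
    print(line)
```

Sorting the group keys puts regions, managers, and employees in alphabetical order, so Cat is listed before Jim here; a different ordering would need a custom sort key.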
Preprocess the data and compare the results before and after preprocessing (report as accuracy).
Draw the following charts:
Correlation heatmap chart
Missing values heatmap chart
Line chart / scatter chart for Country vs Purchased, Age vs Purchased, and Salary vs Purchased
Country Age Salary Purchased
France 44 72000 No
Spain 27 48000 Yes
Germany 30 54000 No
Spain 38 61000 No
Germany 40 Yes
France 35 58000 Yes
Spain 52000 No
France 48 79000 Yes
Germany 50 83000 No
France 37 Yes
France 18888 No
Spain 17 67890 Yes
Germany 12000 No
Spain 38 98888 No
Germany 50 Yes
France 35 58000 Yes
Spain 12345 No
France 23 Yes
Germany 55 78456 No
France 43215 Yes
Sometimes a scatter plot such as Country vs Purchased is hard to read: all three countries in your list have made purchases, so every category gets a point. A heatmap can be more helpful here.
import pandas as pd
from matplotlib import pyplot as plt
# read the CSV with pandas
df = pd.read_csv('Data.csv')
# keep an independent copy of the raw data (plain assignment would only alias df)
copydf = df.copy()
# before data preprocessing
print(copydf)
# fill NaN values in Age and Salary with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
# after data preprocessing
print(df)
plt.figure(1)
# Country Vs Purchased
plt.subplot(221)
plt.scatter(df['Country'], df['Purchased'])
plt.title('Country vs Purchased')
plt.grid(True)
# Age Vs Purchased
plt.subplot(222)
plt.scatter(df['Age'], df['Purchased'])
plt.title('Age vs Purchased')
plt.grid(True)
# Salary Vs Purchased
plt.subplot(223)
plt.scatter(df['Salary'], df['Purchased'])
plt.title('Salary vs Purchased')
plt.grid(True)
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95,
                    hspace=0.75, wspace=0.5)
plt.show()
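As a rough illustration of the heatmap idea, the counts that a Country vs Purchased heatmap would display can be tabulated first. This is a minimal sketch using only the standard library; the (Country, Purchased) pairs are copied from the sample rows above, and the variable names are assumptions made for the example.

```python
from collections import Counter

# (Country, Purchased) pairs copied from the 20 sample rows.
pairs = [
    ("France", "No"), ("Spain", "Yes"), ("Germany", "No"), ("Spain", "No"),
    ("Germany", "Yes"), ("France", "Yes"), ("Spain", "No"), ("France", "Yes"),
    ("Germany", "No"), ("France", "Yes"), ("France", "No"), ("Spain", "Yes"),
    ("Germany", "No"), ("Spain", "No"), ("Germany", "Yes"), ("France", "Yes"),
    ("Spain", "No"), ("France", "Yes"), ("Germany", "No"), ("France", "Yes"),
]

counts = Counter(pairs)
countries = sorted({country for country, _ in pairs})
answers = ["No", "Yes"]

# grid[i][j] = number of rows with countries[i] and answers[j]
grid = [[counts[(country, answer)] for answer in answers] for country in countries]

print(f"{'Country':<8}", *answers)
for country, row in zip(countries, grid):
    print(f"{country:<8}", *row)
```

The nested list `grid` could then be passed to a plotting call such as `plt.imshow(grid)` to draw the heatmap, with `countries` and `answers` as the tick labels.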
I am using Google Sheets and I am trying to concatenate multiple Column A values in Column C whenever Column B has a duplicate:
Sample data:
Row  Column A  Column B     Column C
1    1247      Santa Fe     1250/1150
2    1250      Santa Fe     1247/1150
3    1258      North Shore  1354
4    1341      Hogan        1255
5    1255      Hogan        1341
6    1354      North Shore  1258
7    1150      Santa Fe     1247/1250
Here, Column C needs to have multiple concatenated values of A, corresponding to the duplicates in column B.
C1:
=JOIN("/",FILTER($A$1:$A$7,$B$1:$B$7=B1,ROW($B$1:$B$7)<>ROW(B1)))
Drag fill down.
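To make the logic of the formula easier to follow, here is the same idea sketched in Python: for each row, keep the Column A values whose Column B matches, excluding the row itself, and join them with "/". The function name `join_matching` is hypothetical and the data is copied from the sample above.

```python
# Column A and Column B values copied from the sample data.
col_a = [1247, 1250, 1258, 1341, 1255, 1354, 1150]
col_b = ["Santa Fe", "Santa Fe", "North Shore", "Hogan",
         "Hogan", "North Shore", "Santa Fe"]


def join_matching(values, keys, sep="/"):
    """For each row, join the values of all OTHER rows that share its key."""
    result = []
    for i, key in enumerate(keys):
        matches = [
            str(value)
            for j, (value, other_key) in enumerate(zip(values, keys))
            if other_key == key and j != i  # same key, but not the row itself
        ]
        result.append(sep.join(matches))
    return result


col_c = join_matching(col_a, col_b)
for a, b, c in zip(col_a, col_b, col_c):
    print(a, b, c)
```

One difference to keep in mind: this sketch yields an empty string for a row whose Column B value has no duplicate, whereas FILTER in the sheet would report an error when nothing matches.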
I have a dataset like this
and want output like this. How can I do that?
Here is sample dataset
ID COMP_ID CAR_ID ENGINE COLOR CC
1 c1 car3 xyz blue 2500
2 c2 car4 xyz white 1000
3 c1 car6 xyz green 3500
4 c2 car1 xyz black 4500
5 c3 car5 xyz green 4000
6 c1 car2 xyz red 3000
7 c2 car3 xyz gray 1500
8 c3 car4 xyz silver 2000
You can try a tJavaRow with something like:
output_row.foo = input_row.row1 + "\n" + input_row.row2;
foo must exist in your output schema,
and row1 and row2 in your input schema.
Otherwise, you can concatenate them in a tMap in the same way.
I am trying to understand database normalisation. I saw this example of a table that is in second normal form (2NF) but not in third normal form (3NF):
Tournament            Year  Winner          Winner_Date_of_Birth
Indiana Invitational  1998  Al Fredrickson  21 July 1975
Cleveland Open        1999  Bob Albertson   28 September 1968
Des Moines Masters    1999  Al Fredrickson  21 July 1975
Indiana Invitational  1999  Chip Masterson  14 March 1977
Here the primary key is (Tournament, Year). Since no non-key attribute is functionally dependent on a proper subset of the primary key, the table is in 2NF.
However, according to Wikipedia, it is not in 3NF because
(Tournament, Year) -> Winner and
Winner -> Winner_Date_of_Birth
So there is a transitive functional dependency. I understand this part, but what I would like to know is this: since for our key (Tournament, Year) there can only be one Winner_Date_of_Birth, is it right to say that (Tournament, Year) -> Winner_Date_of_Birth without using the transitive property above?
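Yes. Transitive means that you can derive A -> C from A -> B and B -> C, so (Tournament, Year) -> Winner_Date_of_Birth does hold; it is called transitive only because it can be derived through Winner.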
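The three dependencies can be checked against the sample rows with a small Python sketch. The helper `holds` is a hypothetical name, and note the limitation: a check like this can only show that the sample data does not violate a dependency, not that the dependency is guaranteed by the schema.

```python
# Sample rows copied from the tournament table.
rows = [
    {"Tournament": "Indiana Invitational", "Year": 1998,
     "Winner": "Al Fredrickson", "Winner_Date_of_Birth": "21 July 1975"},
    {"Tournament": "Cleveland Open", "Year": 1999,
     "Winner": "Bob Albertson", "Winner_Date_of_Birth": "28 September 1968"},
    {"Tournament": "Des Moines Masters", "Year": 1999,
     "Winner": "Al Fredrickson", "Winner_Date_of_Birth": "21 July 1975"},
    {"Tournament": "Indiana Invitational", "Year": 1999,
     "Winner": "Chip Masterson", "Winner_Date_of_Birth": "14 March 1977"},
]


def holds(records, lhs, rhs):
    """Return True if no two records agree on lhs but differ on rhs."""
    seen = {}
    for record in records:
        key = tuple(record[column] for column in lhs)
        value = tuple(record[column] for column in rhs)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True


key = ["Tournament", "Year"]
print(holds(rows, key, ["Winner"]))                         # A -> B
print(holds(rows, ["Winner"], ["Winner_Date_of_Birth"]))    # B -> C
print(holds(rows, key, ["Winner_Date_of_Birth"]))           # A -> C
print(holds(rows, ["Year"], ["Winner"]))                    # not a dependency
```

The first three checks pass on the sample, while Year alone does not determine Winner because 1999 has three different winners.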
My background is in databases and SQL coding. I’ve used the CTABLES feature in SPSS a little, mostly for calculating percentiles, which is slow in SQL. But now I have a data set where I need to calculate percentiles for a weighted average, which is not as straightforward, and I can’t figure out whether it’s possible in SPSS or not.
I have data similar to the following
Country Region District Units Cost per Unit
USA Central DivisionQ 10 3
USA Central DivisionQ 12 2.5
USA Central DivisionQ 25 1.5
USA Central DivisionQ 6 4
USA Central DivisionA 3 3.25
USA Central DivisionA 76 1.75
USA Central DivisionA 42 1.5
USA Central DivisionA 1 8
USA Eastern DivisionQ 14 3
USA Eastern DivisionQ 25 2.5
USA Eastern DivisionQ 75 1.5
USA Eastern DivisionQ 9 4
USA Eastern DivisionA 100 3.25
USA Eastern DivisionA 4 1.75
USA Eastern DivisionA 33 1.5
USA Eastern DivisionA 17 8
Total 452 51
For every possible segmentation (Country, Country-Region, Country-Region-District, Country-District, etc.)
I want to get the average Cost per Unit, i.e. Cost per Unit weighted by Units, which is SUM(Units*CostPerUnit)/SUM(Units).
I also need to get the 10th, 25th, 50th, 75th, and 90th percentiles for each possible segmentation.
The way I do this part in SQL is to extract all the rows in the segment, then sort and rank them by Cost per Unit. I get a running sum of Units for each row and determine the ratio of that running sum to the total units; that ratio determines which row holds the Cost per Unit for each percentile. An example, for Country = USA and District = DivisionQ:
Country  Region   District   Units  Cost Per Unit  Running Units  Running Units / Total Units  Percentile
USA      Central  DivisionQ  25     1.5            25             0.14                         10th
USA      Eastern  DivisionQ  75     1.5            100            0.56                         25th/50th
USA      Central  DivisionQ  12     2.5            112            0.63
USA      Eastern  DivisionQ  25     2.5            137            0.77                         75th
USA      Central  DivisionQ  10     3              147            0.83
USA      Eastern  DivisionQ  14     3              161            0.91                         90th
USA      Central  DivisionQ  6      4              167            0.94
USA      Eastern  DivisionQ  9      4              176            1
This takes a very long time to do for each segment. Is it possible to leverage SPSS to do the same thing more easily?
Use SPLIT FILE (Data > Split File) to define the groups, and then use FREQUENCIES (Analyze > Descriptive Statistics > Frequencies) to calculate the statistics. Suppress the actual frequency tables (/FORMAT=NOTABLE). To have the percentiles weighted by Units, apply WEIGHT BY Units (Data > Weight Cases) first.
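For comparison, the running-sum method described in the question can be sketched in a few lines of Python. This is an illustration under assumptions, not an SPSS equivalent: the rows are copied from the sample data, the helper names are hypothetical, and a percentile is taken as the Cost per Unit of the first row whose running units reach that share of the total.

```python
# (country, region, district, units, cost_per_unit) copied from the sample data.
data = [
    ("USA", "Central", "DivisionQ", 10, 3.0), ("USA", "Central", "DivisionQ", 12, 2.5),
    ("USA", "Central", "DivisionQ", 25, 1.5), ("USA", "Central", "DivisionQ", 6, 4.0),
    ("USA", "Central", "DivisionA", 3, 3.25), ("USA", "Central", "DivisionA", 76, 1.75),
    ("USA", "Central", "DivisionA", 42, 1.5), ("USA", "Central", "DivisionA", 1, 8.0),
    ("USA", "Eastern", "DivisionQ", 14, 3.0), ("USA", "Eastern", "DivisionQ", 25, 2.5),
    ("USA", "Eastern", "DivisionQ", 75, 1.5), ("USA", "Eastern", "DivisionQ", 9, 4.0),
    ("USA", "Eastern", "DivisionA", 100, 3.25), ("USA", "Eastern", "DivisionA", 4, 1.75),
    ("USA", "Eastern", "DivisionA", 33, 1.5), ("USA", "Eastern", "DivisionA", 17, 8.0),
]


def weighted_average(pairs):
    """SUM(units * cost) / SUM(units) for a list of (units, cost) pairs."""
    return sum(units * cost for units, cost in pairs) / sum(units for units, _ in pairs)


def weighted_percentile(pairs, pct):
    """Cost of the first row, sorted by cost, whose running units reach pct% of the total."""
    total = sum(units for units, _ in pairs)
    running = 0
    for units, cost in sorted(pairs, key=lambda pair: pair[1]):
        running += units
        # integer comparison avoids floating-point rounding at the boundaries
        if running * 100 >= pct * total:
            return cost
    return None  # only reached for an empty segment


# Segment: Country = USA and District = DivisionQ, as in the worked example.
segment = [(units, cost) for country, _, district, units, cost in data
           if country == "USA" and district == "DivisionQ"]

average = weighted_average(segment)
percentiles = {pct: weighted_percentile(segment, pct) for pct in (10, 25, 50, 75, 90)}
print(average)
print(percentiles)
```

Other segmentations only change the filter that builds `segment`; the two helper functions stay the same.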