Hive Query with multiple joins not executing - join

I wrote a Hive query to compute the 33rd and 66th percentiles on multiple columns of a table that contains integer values (including 0).
To filter out outliers, I added a >0 filter before computing each percentile.
I have 46 columns, and I calculate the 33rd and 66th percentiles on each column, with the >0 filter applied to that column.
Then I join these results to get a table with the 33rd and 66th percentiles of all the columns.
My issue is that the query doesn't execute. I tried it with 2 columns and it works fine, but it doesn't work with this many joins. Can someone suggest an alternative?
Data looks like this:
C1| C2| C3
---------------
0 | 2 | 3
1 | 0 | 2
2 | 0 | 0
After filtering, the data for C1 will be [1,2]; for C2, [2]; for C3, [3,2].

You don't need the joins.
Just use Hive's percentile UDF:
select percentile(C1,0.33),.....,percentile(C46,0.33) from table
UNION ALL
select percentile(C1,0.66),.....,percentile(C46,0.66) from table
This gives you a table with 46 columns, where the first row holds the 33rd percentile of each column and the second row holds the 66th percentile of each column.
Or you can put everything in a single row:
select percentile(C1,0.33),.....,percentile(C46,0.33) , percentile(C1,0.66),.....,percentile(C46,0.66) from table
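To make the per-column >0 filter from the question concrete, here is a minimal Python sketch of the computation, not Hive code. It assumes the percentile is defined by linear interpolation between the two nearest ranks; the function names `percentile` and `column_percentiles` are hypothetical and chosen only for illustration. In Hive itself, one way to get the same per-column filter without joins would presumably be to pass NULL for excluded values inside the aggregate, but that should be checked against your Hive version.

```python
import math

def percentile(values, p):
    """Percentile with linear interpolation between the two nearest ranks."""
    data = sorted(values)
    if not data:
        return None  # no values survived the filter
    pos = (len(data) - 1) * p
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:
        return float(data[lo])
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

def column_percentiles(rows, ps=(0.33, 0.66)):
    """For each column, drop values <= 0, then compute each percentile."""
    result = {}
    for name in rows[0]:
        kept = [row[name] for row in rows if row[name] > 0]
        result[name] = [percentile(kept, p) for p in ps]
    return result

# Sample data from the question.
rows = [
    {"C1": 0, "C2": 2, "C3": 3},
    {"C1": 1, "C2": 0, "C3": 2},
    {"C1": 2, "C2": 0, "C3": 0},
]
stats = column_percentiles(rows)
print(stats)
```

Each column is filtered independently in a single pass over the rows, which is why no join is needed.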

Related

How to do a full outer join?

I am trying to do the full join for the data below in two different sheets.
Sheet 9:
Product ID | Name | Quantity
----------------------------
1 | addi | 55
2 | fadi | 66
3 | sadi | 33
Sheet10:
Product ID | Variants | Model
-----------------------------
1 | xyz | 2000
2 | differ | 2001
3 | saddd | 336
4 | fsdfe | 2005
Desired output sheet:
Product ID | Name | Quantity | Variants | Model
-----------------------------------------------
1 | addi | 55 | xyz | 2000
2 | fadi | 66 | differ | 2001
3 | sadi | 33 | saddd | 336
4 | | | fsdfe | 2005
Please also explain what I should change in your proposed solution if there are more columns to join, for example if sheet 1 and sheet 2 each have two more columns such as Year, Product Label, etc.
I am using this formula, but it's not returning the desired result:
=ARRAYFORMULA({QUERY(SORT(UNIQUE({Sheet9!A1:D; Sheet10!A1:D})), "where Col1 is not null"),IFERROR(VLOOKUP(TRANSPOSE(QUERY(TRANSPOSE(QUERY(SORT(UNIQUE({Sheet9!A1:D; Sheet10!A1:D})), "where Col1 is not null")),,999^99)), TRANSPOSE(QUERY(TRANSPOSE(Sheet9!A1:D),,999^99)), Sheet9!C1:C}, 2, 0),""),IFERROR(VLOOKUP(TRANSPOSE(QUERY(TRANSPOSE(QUERY(SORT(UNIQUE({Sheet9!A1:D; Sheet10!A1:D})), "where Col1 is not null")),,999^99)), {TRANSPOSE(QUERY(TRANSPOSE(Sheet10!A1:D),,999^99)), Sheet10!C1:C}, 2, 0),"")}})
EDITED to consider dynamic row matching.
See this spreadsheet for an illustration. Your setup raises some questions, but overall I would break the problem into two steps.
Get a distinct list of IDs
You can get that with this formula:
=unique(transpose(split(textjoin(",",true,
iferror(INDEX(Sheet2!$A$2:$Z,0,MATCH(A1,Sheet2!1:1,0)),""),
iferror(INDEX(Sheet1!$A$2:$Z,0,MATCH(A1,Sheet1!1:1,0)),"")),",")))
Rest of Headers
Then, for each remaining header: will it always appear in exactly one of the two sheets (never both)? Assuming so, this should work for each additional column. If a value ever exists in both sheets, the formula concatenates the two values in the same cell.
=filter(
iferror(VLOOKUP($A$2:$A,Sheet1!$A:$Z,match(E$1,Sheet1!1:1,0),false),"")
&iferror(VLOOKUP($A$2:$A,Sheet2!$A:$Z,match(E$1,Sheet2!1:1,0),false),"")
,$A$2:$A<>"")
There's probably a way to use the join function to do this more elegantly (if someone posts an answer showing me I'll upvote).
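For readers who want to see the two steps (distinct IDs, then a per-column lookup) outside of a spreadsheet, here is a minimal Python sketch of a full outer join on the question's sample data. The function name `full_outer_join` is hypothetical, and the sketch assumes each ID appears at most once per sheet and that both sheets have at least one row.

```python
def full_outer_join(left, right, key):
    """Join two lists of dict rows on `key`, keeping unmatched rows from both."""
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    # Step 1: distinct list of IDs, in first-seen order across both inputs.
    keys = list(dict.fromkeys(list(left_by_key) + list(right_by_key)))
    joined = []
    for k in keys:
        out = {key: k}
        # Step 2: look up every other column, leaving "" where there is no match.
        for col in left_cols:
            out[col] = left_by_key.get(k, {}).get(col, "")
        for col in right_cols:
            out[col] = right_by_key.get(k, {}).get(col, "")
        joined.append(out)
    return joined

sheet9 = [
    {"Product ID": 1, "Name": "addi", "Quantity": 55},
    {"Product ID": 2, "Name": "fadi", "Quantity": 66},
    {"Product ID": 3, "Name": "sadi", "Quantity": 33},
]
sheet10 = [
    {"Product ID": 1, "Variants": "xyz", "Model": 2000},
    {"Product ID": 2, "Variants": "differ", "Model": 2001},
    {"Product ID": 3, "Variants": "saddd", "Model": 336},
    {"Product ID": 4, "Variants": "fsdfe", "Model": 2005},
]
joined = full_outer_join(sheet9, sheet10, "Product ID")
for row in joined:
    print(row)
```

Because the column lists are read from the data, extra columns such as Year or Product Label are picked up without changing the function.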

Can I filter out pivot table results that only have one row for a value in column A?

I created a pivot table in Google Sheets, and it returns results that look like:
first | second | CountOf3
--------------------------
thing | value | 23
| newVal | 3
| cool | 34
that | value | 234
otherThing | cool | 4
| newVal | 345
And I want to filter out results with just one resulting row for the item in the first column.
So in this example, that would be the row: that | value | 234.
I would like the filter to remove that row, and leave the remaining rows. This is a pivot table in a 2nd sheet that updates when Sheet1 changes.
I have been trying all day, and have not been able to come up with a solution. I was hoping there would be some sort of filter, or spreadsheet formula to do this. I've tried multiple combinations of filters, but nothing seems to work - I'm starting to wonder if this is even possible.
It isn't pretty, but a brute-force way is to add a check column beside your pivot table, with the formula below on the first data row, i.e. beside "thing | value | 23".
It flags each row where both the current cell and the next cell in column D are non-blank, which means the item has only one row. Then use a query (or filter) to list only the output rows you want. Note that you would hide the columns or rows with the actual (unfiltered) pivot output.
This is the simplest version, to see the logic:
=AND(LEN(D3),LEN(D4))
which results in a TRUE value for pivot table rows whose first-column item has only one row.
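The same check can be sketched in Python to make the logic easier to follow. This is only an illustration of the formula above; the function name `flag_single_row_groups` is hypothetical, and blank pivot cells are represented as empty strings.

```python
def flag_single_row_groups(first_col):
    """Return one flag per row: True when this cell and the next are both non-blank."""
    flags = []
    for i, cell in enumerate(first_col):
        nxt = first_col[i + 1] if i + 1 < len(first_col) else ""
        flags.append(bool(cell) and bool(nxt))
    return flags

# Pivot output from the question; "" marks a blank cell in the first column.
pivot = [
    ("thing", "value", 23),
    ("", "newVal", 3),
    ("", "cool", 34),
    ("that", "value", 234),
    ("otherThing", "cool", 4),
    ("", "newVal", 345),
]
flags = flag_single_row_groups([row[0] for row in pivot])
kept = [row for row, skip in zip(pivot, flags) if not skip]
print(kept)
```

Like the formula, this relies on the next row's first cell being non-blank, so a single-row item at the very end of the data is only flagged if another labelled row follows it.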
A more elegant version is an arrayformula that adds the header label and uses "Skip" as the flag for rows to filter out.
={"Better Check";ARRAYFORMULA(IF(LEN(D3:D998)*LEN(D4:D999)*LEN(E3:E998),"Skip",))}
Note that this formula allows for a pivot table result that extends effectively to the bottom of the sheet, but it does have a finite range, due to the constraint of checking two rows at once. It could be enhanced by using COUNTA on the third data column to measure the exact length of the pivot table results and control the range dynamically, like this:
={"Better Check";
ARRAYFORMULA( IF( LEN(INDIRECT("D3:D" & (COUNTA(F$3:F)+ROW(F$2)))) *
LEN(INDIRECT("D4:D" & (COUNTA(F$3:F)+1+ROW(F$2)))),
"Skip",))}
Let us know if this helps at all.

select less than and replace with value in column A

I have a table with a few thousand rows and columns. It looks something like this:
ID Distance1 Distance2
1 102 101
2 101 100
3 100 99
4 99 98
5 98 97
...
I would like to select all values/distances in columns B and C that are less than 100 and replace them with the value in column A (their ID number).
All distances above 100 I want to delete. The real table has several thousand columns. How can I do this?
I have tried search and replace, and I have tried conditional formatting with a new rule using INDEX + MATCH, but I encounter errors.
Assuming ID is in A1 of Sheet1, copy the heading row into A1 of a new sheet and enter this in B2 of that sheet:
=IF(AND(Sheet1!B2<100,Sheet1!B2>0),Sheet1!$A2,"")
Copy across and down to suit, then select the new sheet, Copy, and Paste Special, Values over the top.
The above treats a distance of exactly 100 the same as one above 100 (the cell is left blank) and assumes there are no zero or negative values.
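The rule in that formula can be sketched in Python as follows. This is an illustration only; the function name `replace_with_id` and the `limit` parameter are hypothetical, and each row is assumed to be a list whose first element is the ID.

```python
def replace_with_id(rows, limit=100):
    """Replace each distance strictly between 0 and `limit` with the row's ID; blank the rest."""
    out = []
    for row in rows:
        row_id, distances = row[0], row[1:]
        out.append([row_id] + [row_id if 0 < d < limit else "" for d in distances])
    return out

# Sample rows from the question: ID, Distance1, Distance2.
table = [
    [1, 102, 101],
    [2, 101, 100],
    [3, 100, 99],
    [4, 99, 98],
    [5, 98, 97],
]
result = replace_with_id(table)
for row in result:
    print(row)
```

The comprehension handles any number of distance columns, matching the "copy across and down" step.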

SPSS: Inconsistent totals due to rounding of numbers

I am using weights when running the data with SPSS custom tables.
Thus it is expected that the column or row values may not add up to the row total, column total, or table total, due to rounding of decimals.
Sample table result:
           |            | variable 2 |            |
           |            | category 1 | category 2 | Total
variable 1 | category 1 | 45         | 52         | 97
           | category 2 | 60         | 56         | 115
Total      |            | 105        | 107        | 211
Is there a way to force SPSS to output the correct row, column, or table totals?
Expected table output:
           |            | variable 2 |            |
           |            | category 1 | category 2 | Total
variable 1 | category 1 | 45         | 52         | 97
           | category 2 | 60         | 56         | 116
Total      |            | 105        | 108        | 213
If you are using the CROSSTABS procedure to produce these figures, then you should use the ASIS option.
To be clear: the total displayed by CTABLES is mathematically correct. However, if you want to display as the total the sum of the displayed values in the rows, instead, the only way to do this is by using the STATS TABLE CALC extension command to recompute the totals using the rounded values.
Here is how to do that.
First, you need to create a Python module named customcalc.py with the following contents:
def custom(datacells, ncells, roworcol):
    '''Calculate sum of formatted values'''
    total = sum(float(datacells.GetValueAt(roworcol,i)) for i in range(ncells))
    return total
This file should be saved in the python\lib\site-packages directory under your Statistics installation or anywhere else that Python can find it.
Then, after your CTABLES command, run this syntax
STATS TABLE CALC SUBTYPE="customtable" PROCESS=PRECEDING
/TARGET custommodule="customcalc"
FORMULA="customcalc.custom(datacells, ncells, roworcol)" DIMENSION=COLUMNS LEVEL = -2 LOCATION="Total"
LABEL="Rounded Count".
That custom function adds up the formatted values in each row instead of the full-precision values. If you have suppressed the default statistic name, Count, so that "Total" is the innermost label, use LEVEL=-1 instead of the LEVEL=-2 shown above.
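The difference between the two kinds of total can be seen with a small standalone Python sketch. It does not use any SPSS API; the weighted cell values 59.7 and 55.6 are hypothetical, picked only so that the displayed figures match the second row of the sample table, and `displayed_row` is a made-up helper name.

```python
def displayed_row(cells):
    """Return (rounded cells, rounded full-precision total, sum of rounded cells)."""
    shown = [round(c) for c in cells]
    return shown, round(sum(cells)), sum(shown)

# Hypothetical weighted counts behind the displayed cells 60 and 56.
shown, true_total, rounded_sum = displayed_row([59.7, 55.6])
print(shown, true_total, rounded_sum)
```

The full-precision total rounds to 115, which is what CTABLES displays, while the displayed cells add up to 116, which is what the recomputed total shows.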

Is there a multiply-and-add formula in Google's spreadsheet?

What I want is to easily multiply a number by another number for each column and add them up at the end in Google Sheets. For example:
User | Points 1 | Points 2 | Points 3 | Total
| 5 | 1 | 4 |
-----+----------+----------+----------+------
Jane | 2 | 3 | 0 | 13 (2*5 + 3*1 + 0*4)
John | 1 | 11 | 4 | 32 (1*5 + 11*1 + 4*4)
So it's easy enough to make this formula for the total:
= B3*$B$2 + C3*$C$2 + D3*$D$2
The problem is I frequently need to insert additional columns or even remove some columns. So then I have to mess with all the formulas. It's a pain... we have many spreadsheets with these formulas. I wish there was a formula like SUM(B3:D3) where I could just specify a range. Is there anything like MULTIPLY_AND_SUM(B2:D2, B3:D3) that would do this? Then I could insert columns in the middle and the range would still work.
There is a built-in function in Google Sheets that does exactly what you are looking for: SUMPRODUCT.
In your example the formula would be:
=sumproduct(B$2:D$2,B3:D3)
Click here for more information about this function.
You can accomplish that without requiring a special-purpose function.
In E3, try this (and copy it to the rest of your rows):
=sum(arrayformula(B3:D3*B$2:D$2))
You can read about arrayformula here.
As long as you introduce new columns between B and D, this formula will automatically adjust. If you add new columns outside of that range, you'll need to edit (and cut & paste).
On its own, arrayformula(B3:D3*B$2:D$2) operates over each value in B3:D3 in turn, multiplying it by the corresponding value in B$2:D$2. (Note the use of absolute references to 'lock down' to row 2.) The result in this case is three values, [10,3,0], arranged horizontally in three columns because that matches the dimensions of the ranges.
The enveloping sum() function adds up the values of the array produced by arrayformula, which is 13 in this case.
As you copy that formula to other rows, the relative range references get updated for the new row.
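Both formulas compute the same multiply-and-add. As a rough illustration in Python (the function name `sumproduct` here is just a local helper, not a library call):

```python
def sumproduct(weights, values):
    """Multiply the two sequences element by element and add up the products."""
    if len(weights) != len(values):
        raise ValueError("ranges must have the same length")
    return sum(w * v for w, v in zip(weights, values))

weights = [5, 1, 4]                      # row 2 in the example sheet
jane = sumproduct(weights, [2, 3, 0])    # Jane's row
john = sumproduct(weights, [1, 11, 4])   # John's row
print(jane, john)
```

Adding a column corresponds to appending one element to both lists, so the calculation itself never has to be rewritten.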

Resources