So, I have a relatively simple query which sums values across 11 different columns. The query itself runs very fast, but it seems that grouping by 11 columns is extremely slow.
This is an excerpt from an explain file; it shows that the query ran for about two seconds but spent almost 17 seconds grouping the data. Is there any way to optimize this?
Query statistics:
-----------------
Table map :
----------------------------
Internal name Table name
----------------------------
t1 ustroj
t2 tfinkbilanca
t3 tfinkkontogodina
t4 tbaseanalitikanaziv
type table rows_prod est_rows rows_scan time est_cost
-------------------------------------------------------------------
scan t1 77 12 77 00:00.00 2
type table rows_prod est_rows rows_scan time est_cost
-------------------------------------------------------------------
scan t2 76231 77376 76231 00:00.36 90
type table rows_prod est_rows rows_scan time est_cost
-------------------------------------------------------------------
scan t3 152462 19203 76231 00:00.67 1
type rows_prod est_rows time est_cost
-------------------------------------------------
nljoin 152462 41266 00:01.05 337835
type rows_prod est_rows time est_cost
-------------------------------------------------
nljoin 76231 644 00:01.08 37456
type table rows_prod est_rows rows_scan time est_cost
-------------------------------------------------------------------
scan t4 151780 33 75890 00:00.77 0
type rows_prod est_rows time est_cost
-------------------------------------------------
nljoin 76231 632 00:01.86 38828
type rows_prod est_rows rows_cons time
-------------------------------------------------
group 15158 316 76231 00:16.55
type rows_sort est_rows rows_cons time
-------------------------------------------------
sort 55 316 15158 00:16.93
First, I suggested you validate the timing of the query by "saving" it into a temporary table (include the INTO TEMP tmp01 WITH NO LOG clause); you answered in the comments that the time does not change.
Considering the large fields mentioned (char(256)) used as descriptions of something, my suggestion is:
In the GROUP BY, use only key columns. Use their ids/codes if possible.
Execute the SQL with the GROUP BY, saving the result into a temporary table.
Execute another SQL over this temporary table and join to get the descriptions (the char(256) fields), as sketched below.
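A minimal sketch of those two steps in Informix SQL, assuming hypothetical key and amount column names (konto_id, org_id, iznos, naziv are illustrative; only the table names come from the plan above):
-- Sketch only: column names are illustrative, not taken from the original query.
-- Step 1: group by the narrow key columns only, saving into a temp table.
SELECT konto_id, org_id, SUM(iznos) AS iznos_sum
FROM tfinkbilanca
GROUP BY konto_id, org_id
INTO TEMP tmp01 WITH NO LOG;
-- Step 2: join the small aggregated table back to pick up the wide char(256) descriptions.
SELECT t.konto_id, t.org_id, n.naziv, t.iznos_sum
FROM tmp01 t, tbaseanalitikanaziv n
WHERE n.konto_id = t.konto_id;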
Please accept the answer if it really works for you; if it is useful but does not solve your problem, just give it +1.
Good morning,
Sorry, I am completely self-taught so I am probably missing something simple. I am trying to create a table based on values from other tables. I am not sure how best to explain what I want, so here is an example:
Table1
Name Lname Issue1 Issue2 Issue3
Tom Smith 1234 1258 1175
Dick Scott 1258 1158 1852
Jane Davis 1234 1385 1111
Sarah Bennet 1158 1672 1234
Table2
Issue Desc
1234 A
1258 B
1175 C
1158 D
1852 E
1385 F
1111 G
1672 H
1468 I
Want
Name Lname Issue1 Desc1 Issue2 Desc2 Issue3 Desc3
Tom Smith 1234 A 1258 B 1175 C
Dick Scott 1258 B 1158 D 1852 E
Jane Davis 1234 A 1385 F 1111 G
Sarah Bennet 1158 D 1672 H 1234 A
I have done this previously by doing multiple joins to a single table, but it seems like there should be a better way. Here is what I am currently using:
Proc SQL;
Select
a.Name,
a.Lname,
a.Issue1,
b.Desc as Desc1,
a.Issue2,
c.Desc as Desc2,
a.Issue3,
d.Desc as Desc3
From work.Table1 a
Left Join work.Table2 b
on a.Issue1 eq b.Issue
Left Join work.Table2 c
on a.Issue2 eq c.Issue
Left Join work.Table2 d
on a.Issue3 eq d.Issue;
Quit;
So basically I want a table that has data from both tables, but I need multiple descriptions from Table2 to match the issue values from Table1.
Thank you for your help!
You should transpose your data from wide to long, e.g. with PROC TRANSPOSE as in this example. It is often better to have data in the "long" format, e.g. for BY-group processing in statistical procedures.
First, sort the data by the BY-variables.
proc sort data=have;
by Name Lname;
run;
Then transpose all variables Issue1-3.
proc transpose data=have out=want;
by Name Lname;
var Issue:;
run;
Then join with Table2.
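A minimal sketch of that join with PROC SQL, assuming the transposed issue values end up in the default PROC TRANSPOSE output column COL1 (want_desc is an illustrative name):
proc sql;
/* Sketch only: COL1 is the default name PROC TRANSPOSE gives the transposed values */
create table want_desc as
select w.*, t2.desc as issue_desc
from want as w
left join table2 as t2
on w.col1 = t2.issue;
quit;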
Create a format from Table2.
Use an array in a data step to create the new columns in Table1 if required, or just apply the format.
data issue_fmt;
set table2;
start=issue;
label=desc;
fmtname='$Issue_fmt';
type='C';
run;
proc format cntlin=issue_fmt;
run;
*apply format;
proc print data=table1 (obs=10);
var issue1-issue3;
format issue1-issue3 $issue_fmt.;
run;
*create new variables with the format;
data want;
set table1;
array issues(*) issue1-issue3;   /* assumes Issue1-Issue3 are character, matching the $ format */
array desc(3) $ desc1-desc3;     /* $ so the new Desc variables are created as character */
do i=1 to dim(issues);
desc(i) = put(issues(i), $issue_fmt.);
end;
drop i;
run;
Hello all Sheet users out there.
I have a sheet with a list of resources with their production and usage being calculated on the left side and the overall prod/use being monitored on the right side.
A B C D | E F G H
1 Input In Output Out | Resource totIn totOut effective
2 Iron 20 FeIngot 30 | Iron 30 =SUMIF(...) =totIn-totOut
3 Copper 20 CuIngot 20 | Copper 25 =SUMIF(...) =totIn-totOut
4 Stone 10 Gravel 50 | CuIngot =SUMIF(...) =SUMIF(...) =totIn-totOut
5 FeIngot 10 FePlate 5 | FeIngot =SUMIF(...) =SUMIF(...) =totIn-totOut
6 CuIngot 25 Wire 75 | Stone 45 =SUMIF(...) =totIn-totOut
7 CuIngot 10 Cable 20 | Gravel =SUMIF(...) =SUMIF(...) =totIn-totOut
The actual sheet would look more like this:
A B C D | E F G H
1 Input In Output Out | Resource totIn totOut effective
2 Iron 20 FeIngot 30 | Iron 30 20 10
3 Copper 20 CuIngot 20 | Copper 25 20 5
4 Stone 10 Gravel 50 | CuIngot 20 35 -15
5 FeIngot 10 FePlate 5 | FeIngot 30 10 20
6 CuIngot 25 Wire 75 | Stone 45 10 35
7 CuIngot 10 Cable 20 | Gravel 50 0 50
On the left side, I want to mark red all cells in column "In" whose resource has a negative effective production calculated on the right side. I thought about using conditional formatting: loop through every text cell in the "Resource" column to find the one that equals the "Input" in the same row as the cell I want to check, and then test whether the "effective" value of the "Resource" I found is less than 0. The problem is that I don't know how to loop through the values and store the matching row so I can check whether its H value is negative.
Example 1: B6 is checked. A6 needs to be compared to every cell in E2:E and when there is a match, in this case E4, check if H4 is negative. It is, so there is formatting applied.
Example 2: B3 is checked. A3 needs to be compared to every cell in E2:E and when there is a match, in this case E3, check if H3 is negative. It is not, so there is no formatting applied.
Is there any way that I can apply this formatting in the conditional formatting tool?
Keep in mind that my sheet is much more complex than these examples; it has about 120 resources that can't simply be lined up in the same order as the left side, because multiple rows can use the same resource as input or output.
Thank you in advance for every ounce of your help.
Try this custom formula in conditional formatting, applied to the range B2:B:
=VLOOKUP($A2,$E:$H,4,FALSE)<0
The VLOOKUP finds the row in E:H whose "Resource" matches the "Input" in column A, the index 4 returns its "effective" value from column H, and the <0 test applies the formatting only when that value is negative.
Pyspark: I have two dataframes. The first has one column containing a long string. The second is a lookup dataframe holding values that indicate where each substring starts and ends, plus its length. I'd like to use the second dataframe to split up the first and produce a resultant dataframe with the original data plus the split string values:
Dataframe A:
Data
000 456 9b
876 998 1c
Dataframe B:
Description   Start   End   Length
City          1       3     3
Country       5       7     3
IheartSpark   9       10    2
The result would be this:
Data         City   Country   IheartSpark
000 456 9b   000    456       9b
876 998 1c   876    998       1c
Dataframe B is only 30 rows or so, and I was thinking of broadcasting it if possible (this will run on a cluster).
Any thoughts?
Try crossJoin (with broadcast) and pivot to get the desired output.
Example:
df.show()
#+----------+
#| Data|
#+----------+
#|000 456 9b|
#|876 998 1c|
#+----------+
df1.show()
#+-----------+-----+---+------+
#| Descr|start|end|length|
#+-----------+-----+---+------+
#| City| 1| 3| 3|
#| Country| 5| 7| 3|
#|IheartSpark| 9| 10| 2|
#+-----------+-----+---+------+
from pyspark.sql.functions import *
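# cross join each Data row with the broadcast lookup, slice each field out with substring,
# then pivot the Descr values back into one column per description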
df.crossJoin(broadcast(df1)).\
withColumn("nn",expr("""substring(Data,start,length)""")).\
groupBy("Data").\
pivot("Descr").\
agg(first(col("nn"))).\
show()
#+----------+----+-------+-----------+
#| Data|City|Country|IheartSpark|
#+----------+----+-------+-----------+
#|000 456 9b| 000| 456| 9b|
#|876 998 1c| 876| 998| 1c|
#+----------+----+-------+-----------+
Suppose a user ordered the same product under two different order_ids, and the orders were created within the same date-hour granularity, for example:
order#1 2019-05-05 17:23:21
order#2 2019-05-05 17:33:21
In the data warehouse, should we put them into two rows like this (Option 1):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 1 |
| 002 | 1111 | 22 | 123 | 456 | 10 | 2 |
Or just put them in one row with the aggregated quantity (Option 2):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 3 |
I know that if I put the order_id into the fact table as a degenerate dimension, it should be Option 1. But in our case, we don't really want to keep the order_id.
I also once read an article that says that when you filter on all of the dimensions, there should be only one row of data left in the fact table. If that statement is correct, Option 2 would be the choice.
Is there a principle I can refer to?
Conceptually, fact tables in a data warehouse should be designed at the most detailed grain available. You can always aggregate data from the lower granularity to the higher one, while the opposite is not true - if you combine the records, some information is lost permanently. If you ever need it later (even though you might not see it now), you'll regret the decision.
I would recommend the following approach: in the data warehouse, keep the order number as a degenerate dimension. Then, when you publish a star schema, you might build a pre-aggregated version of the table (skip the order number, group identical records by date/hour). This way, you can have a smaller/cleaner fact table in your dimensional model and yet preserve the more detailed data in the DW, as sketched below.
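A minimal sketch of that layout in SQL, with illustrative names (fact_sales_detail and fact_sales_agg are assumptions, not from the original design):
-- Sketch only: table and column names are illustrative.
-- The detailed fact table keeps order_id as a degenerate dimension (Option 1 grain).
-- The published star schema exposes a pre-aggregated view without order_id.
CREATE VIEW fact_sales_agg AS
SELECT user_key, product_key, date_key, time_key, price,
       SUM(quantity) AS quantity
FROM fact_sales_detail
GROUP BY user_key, product_key, date_key, time_key, price;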
In my database I have many columns, which can be summarized as:
Code input
Amount 1
Amount 2
Code Phase
Code Sector
Code Group
Here is an example of the rows that I have:
+--------------+----------+----------+-------+--------+-------+
| code_input | amount_1 | amount_2 | phase | sector | group |
+--------------+----------+----------+-------+--------+-------+
| 0171090150 | 22 | 14 | 09 | 90 | 10 |
| 0258212527 | 12 | 99 | 08 | 30 | 20 |
| 0359700504 | 30 | 10 | 09 | 20 | 20 |
+--------------+----------+----------+-------+--------+-------+
The user has a place where he can decide what goes first, second, third and fourth. So, he can decide that code_phase goes first, code_sector second, code_group third, and finally code_input. Or the user can rearrange that order (code_sector first, code_phase second, etc.).
In my database, the amounts are recorded on the inputs. Therefore, if a sector includes 2 inputs, the total for that sector is the sum of those two inputs.
Example of result with one order:
# => Order: Phase, Sector, Group, Input
- Phase 09 52 24
- Sector 90 22 14
- Group 10 22 14
- Input 0171090150 22 14
- Sector 20 30 10
- Group 20 30 10
- Input 0359700504 30 10
- Phase 08 12 99
- Sector 30 12 99
- Group 20 12 99
- Input 0258212527 12 99
I use Ruby on Rails, and I have 3,579 lines of code covering all the combinations users can choose. But that code is not maintainable, it is cumbersome, and sometimes I even confuse myself about where a mistake might be.
So, I wonder if any of you can tell me whether there is a gem that can help; it does not have to do the whole ordering, but it should help me greatly to simplify my code. Or can you recommend a method or algorithm that produces this ordering?
Sorry for my English.
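For what it is worth, a minimal Ruby sketch of one way to produce the ordering shown in the example above, assuming the rows are available as an array of hashes and the user's chosen order is passed in as a list of keys (grouped_totals and every name here are illustrative, not from the original code):
# Sketch only: a recursive group-and-sum over a user-chosen key order.
# rows is an array of hashes like the table above; keys is the order the
# user picked, e.g. [:phase, :sector, :group, :code_input].
def grouped_totals(rows, keys, level = 0)
  return [] if keys.empty?

  key, *rest = keys
  rows.group_by { |r| r[key] }.flat_map do |value, group|
    line = {
      level:    level,
      label:    "#{key} #{value}",              # label kept simple for the sketch
      amount_1: group.sum { |r| r[:amount_1] },
      amount_2: group.sum { |r| r[:amount_2] }
    }
    [line] + grouped_totals(group, rest, level + 1)
  end
end

rows = [
  { code_input: "0171090150", amount_1: 22, amount_2: 14, phase: "09", sector: "90", group: "10" },
  { code_input: "0258212527", amount_1: 12, amount_2: 99, phase: "08", sector: "30", group: "20" },
  { code_input: "0359700504", amount_1: 30, amount_2: 10, phase: "09", sector: "20", group: "20" }
]

grouped_totals(rows, [:phase, :sector, :group, :code_input]).each do |line|
  puts "#{'  ' * line[:level]}- #{line[:label]} #{line[:amount_1]} #{line[:amount_2]}"
end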