Use Case
I have some terabytes of US property data to merge. It is spread across two distinct file formats and thousands of files. The source data is split geographically.
I can't find a way to branch a single pipeline into many independent processing flows.
This is especially difficult because the DataFrame API doesn't seem to support applying a PTransform to a collection of filenames.
Detailed Background
The distribution of files is like this:
- StateData: 51 files total (US states + DC)
- CountyData: ~2,000 files total (county-specific, grouped by state)
The ideal pipeline would split into thousands of independent processing steps and complete in minutes.
1 -> 51 (each US state + DC starts processing)
51 -> thousands (each US state then spawns a process that merges the counties, combining at the end for the whole state)
The directory structure is like this:
📂state-data/
|-📜AL.zip
|-📜AK.zip
|-📜...
|-📜WY.zip
📂county-data/
|-📂AL/
|-📜COUNTY1.csv
|-📜COUNTY2.csv
|-📜...
|-📜COUNTY68.csv
|-📂AK/
|-📜...
|-📂.../
|-📂WY/
|-📜...
Sample Data
This is extremely abbreviated, but imagine something like this:
State Level Data - 51 of these (~200 cols wide)
| uid    | census_plot  | flood_zone |
|--------|--------------|------------|
| abc121 | ACVB-1249575 | R50        |
| abc122 | ACVB-1249575 | R50        |
| abc123 | ACVB-1249575 | R51        |
| abc124 | ACVB-1249599 | R51        |
| abc125 | ACVB-1249599 | R50        |
| ...    | ...          | ...        |
County Level Data - thousands of these (~300 cols wide)
| uid    | county | subdivision    | tax_id |
|--------|--------|----------------|--------|
| abc121 | 04021  | Roland Heights | 3t4g   |
| abc122 | 04021  | Roland Heights | 3g444  |
| abc123 | 04021  | Roland Heights | 09udd  |
| ...    | ...    | ...            | ...    |
So we join many county-level records to a single state-level record, and thus get an aggregated, more complete state-level data set.
Then we aggregate all the states, and we have a national level data set.
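Conceptually, the per-state merge is something like this pandas sketch (the uid join key comes from the samples above; the real merge logic is more involved):

import glob
import pandas as pd

# Per-state: many county files join one state file on uid.
state_df = pd.read_csv("state-data/AL.zip", compression="zip")
county_df = pd.concat(pd.read_csv(f) for f in glob.glob("county-data/AL/*.csv"))
al_merged = state_df.merge(county_df, on="uid", how="left")

# Nationally: the per-state results are concatenated into one data set.
# us_merged = pd.concat([al_merged, ak_merged, ...])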
Desired Outcome
I can successfully merge one state at a time (many counties to one state). I built a pipeline to do that, but the pipeline starts with a single CountyData CSV and a single StateData CSV. The issue is getting to the point where I can load the CountyData and StateData.
In other words:
#
# I need to find a way to generalize this flow to
# dynamically created COUNTY and STATE variables.
#
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

COUNTY = "county-data/AL/*.csv"
STATE = "state-data/AL.zip"

def key_by_uid(elem):
    return (elem.uid, elem)

with beam.Pipeline() as p:
    county_df = p | read_csv(COUNTY)
    county_rows_keyed = to_pcollection(county_df) | beam.Map(key_by_uid)

    state_df = pd.read_csv(STATE, compression="zip")
    state_rows_keyed = to_pcollection(state_df, pipeline=p) | beam.Map(key_by_uid)

    merged = (
        {"state": state_rows_keyed, "county": county_rows_keyed}
        | beam.CoGroupByKey()
        | beam.Map(merge_logic)  # merge_logic defined elsewhere
    )
    merged | WriteToParquet()  # parameters elided
1. Start with a list of states.
2. By state, generate filepatterns to the source data.
3. By state, load and merge the source files.
4. Flatten the output from each state into a US data set.
5. Write to a Parquet file.
with beam.Pipeline(options=pipeline_options) as p:
    merged_data = (
        p
        | beam.Create(cx.STATES)
        | "PathsKeyedByState" >> tx.PathsKeyedByState()
        # ('AL', {'county-data': 'gs://data/county-data/AL/COUNTY*.csv', 'state-data': 'gs://data/state-data/AL.zip'})
        | "MergeSourceDataByState" >> tx.MergeSourceDataByState()
        | "MergeAllStateData" >> beam.Flatten()
    )
    merged_data | "WriteParquet" >> tx.WriteParquet()
The issue I'm having is something like this:
- I have two filepatterns in a dictionary, per state. To access those I need a DoFn to get at the element.
- To express how the data flows, I need access to the Pipeline object, since read_csv is a PTransform applied to it. Ex: df = p | read_csv(...)
These appear to be incompatible needs.
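To make the conflict concrete (a sketch; names are for illustration only):

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

# At pipeline construction time, read_csv needs its pattern as a plain value:
#     df = p | read_csv("gs://data/county-data/AL/COUNTY*.csv")
#
# But my patterns only exist as PCollection elements at execution time,
# inside a DoFn, where there is no pipeline to apply read_csv to:
class MergeSourceDataByState(beam.DoFn):
    def process(self, element):
        state, paths = element
        # df = p | read_csv(paths['county-data'])  # <- no `p` available here
        yield element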
Here's an alternative answer.
Read the state data one at a time and flatten them, e.g.
state_dataframe = None
for state in STATES:
    df = p | read_csv('/path/to/state')
    df['state'] = state
    if state_dataframe is None:
        state_dataframe = df
    else:
        state_dataframe = state_dataframe.append(df)
Similarly for county data. Now join them using dataframe operations.
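For example, assuming both flattened frames carry the uid column from the question (Beam's deferred DataFrames mirror the pandas merge API):

merged = state_dataframe.merge(county_dataframe, on="uid", suffixes=("_state", "_county"))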
I'm not sure exactly what kind of merging you're doing here, but one way to structure this pipeline might be to have a DoFn that takes the county data in as a filename as an input element (i.e. you'd have a PCollection of county data filenames), opens it up using "normal" Python (e.g. pandas), and then reads the relevant state data in as a side input to do the merge.
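A minimal sketch of that pattern; the class name, the uid join, and loading the state frames at pipeline-construction time are all assumptions of mine:

import apache_beam as beam
import pandas as pd

class MergeCountyWithState(beam.DoFn):
    def process(self, element, state_frames):
        state, county_path = element
        county_df = pd.read_csv(county_path)   # "normal" Python I/O, not a Beam source
        state_df = state_frames[state]         # side input: {state: DataFrame}
        for row in county_df.merge(state_df, on="uid").to_dict("records"):
            yield row

with beam.Pipeline() as p:
    # One element per county file, keyed by state.
    county_files = p | "CountyFiles" >> beam.Create(
        [("AL", "county-data/AL/COUNTY1.csv")])
    # One (state, DataFrame) pair per state, turned into a dict side input.
    state_frames = p | "StateFrames" >> beam.Create(
        [("AL", pd.read_csv("state-data/AL.zip", compression="zip"))])
    county_files | beam.ParDo(
        MergeCountyWithState(), state_frames=beam.pvalue.AsDict(state_frames))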
I) PROBLEM
Let’s say I have a matrix like this with distances (in kilometers) between the homes of different people.
| | Person 1 | Person 2 | Person 3 |
|----------|----------|----------|----------|
| Person 1 | | | |
| Person 2 | 24 | | |
| Person 3 | 17 | 153 | |
And I have a data table like this:
| Person | Party |
|----------|----------|
| Person 1 | Party A |
| Person 2 | Party B |
| Person 3 | Party C |
I want to do multidimensional scaling (dissimilarity by distance) to visualize i) how close each person lives to another; ii) which party each person votes for (different colors for each party)
II) CURRENT RESULT
My current MDS plot (made with SPSS, using menu commands rather than syntax) looks like this:

[current MDS plot]
III) EXPECTED RESULT
I want to add a different color for each person depending on which party that person votes for:

[expected MDS plot, with points colored by party]
IV) QUESTION(S)
Can I do this in SPSS? How do I add the voting data to the matrix, and how do I show it in the MDS plot?
EDIT
There is a very similar problem, with a solution, for R:
R) Create double-labeled MDS plot
But I want to do it in SPSS.
I don't believe it's possible to create a plot like the one you show directly from either of the MDS procedures currently available in SPSS Statistics, PROXSCAL or ALSCAL. What you'd need to do is save the common space coordinates to a new dataset or file, add the Party variable to that dataset or file, define it as Nominal in the measurement-level designation in the Data Editor, and then use the Grouped Scatter option under Scatter/Dot in the chart Gallery of the Chart Builder, defining groups by the Party variable.
The PROXSCAL procedure lets you save the coordinates from the dialogs, in the Output sub-dialog. The ALSCAL procedure only supports saving the common space coordinates (and other results) through command syntax, specifically the OUTFILE subcommand (you can paste the command from the dialogs, then add this subcommand).
I have the following in my Google Cloud Storage:

| Advertiser | Event      |
|------------|------------|
| 100        | Click      |
| 101        | Impression |
| 100        | Impression |
| 100        | Impression |
| 101        | Impression |

My output of the pipeline should be something like:

| Advertiser | Count |
|------------|-------|
| 100        | 3     |
| 101        | 2     |
First I used GroupByKey; the output looks like:
100 Click, Impression, Impression
101 Impression, Impression
How to proceed from here?
Instead of a GroupByKey, you may want to use a combining transform: a composite that optimizes the aggregation by combining values both before and after the group-by-key step. Your pipeline can look something like this:
Python
import apache_beam as beam
from apache_beam.transforms.combiners import Count

collection_contents = [(100, 'Click'),
                       (101, 'Impression'),
                       (100, 'Impression'),
                       (100, 'Impression'),
                       (101, 'Impression')]

input_collection = pipeline | beam.Create(collection_contents)
counts = input_collection | Count.PerKey()
This should output a collection with the shape you are looking for. The Count transforms live in the apache_beam.transforms.combiners module (e.g. from apache_beam.transforms.combiners import Count).
Java
The same transforms exist for Java in the org.apache.beam.sdk.transforms package:
PCollection<KV<Integer, Long>> resultColl = inputColl.apply(Count.perKey());
This counting pattern is described in Apache Beam's 'word count' sample.
See wordcount.py in the Apache Beam GitHub repository; the counting starts at line 95.
Using Google Sheets, I want to automatically number rows like so:

[example screenshot of the desired numbering]
The key is that I want this to use built-in functions only.
I have an implementation working where child items are in separate columns (e.g. "Foo" is in column B, "Bar" is in column C, and "Baz" is in column D). However, it uses a custom JavaScript function, and the slow way that custom JavaScript functions are evaluated, combined with the dependencies, possibly combined with a slow Internet connection, means that my solution can take over one second per row (!) to calculate.
For reference, here's my custom function (that I want to abandon in favor of native code):
/**
 * Calculate the Work Breakdown Structure id for this row.
 *
 * @param {range} priorIds IDs that precede this one.
 * @param {range} names The names for this row.
 * @return A WBS string id (e.g. "2.1.5") or an empty string if there are no names.
 * @customfunction
 */
function WBS_ID(priorIds, names) {
  if (Array.isArray(names[0])) names = names[0];
  if (!names.join("")) return "";                // blank row: no WBS id
  var lastId, pieces = [];
  // Find the most recent non-empty prior id, scanning from the bottom up.
  for (var i = priorIds.length; i-- && !lastId;) lastId = priorIds[i][0];
  if (lastId) pieces = (lastId + "").split('.').map(function (s) { return s * 1 });
  // The first non-empty name column determines the depth of this row's id.
  for (var i = 0; i < names.length; i++) {
    if (names[i]) {
      pieces.length = i + 1;
      pieces[i] = (pieces[i] || 0) + 1;
      return pieces.join(".");
    }
  }
}
For example, cell A7 would use the formula:
=WBS_ID(A$2:A6,B7:D7)
...to produce the result "1.3.2"
Note that in the above example blank rows are skipped during numbering. An answer that does not honor this (where the ID is calculated deterministically from ROW()) is acceptable (and possibly even desirable).
Edit: Yes, I've tried to do this myself. I have a solution that uses three extra columns which I chose not to include in the question. I have been writing equations in Excel for at least 25 years (and Google Spreadsheets for 1 year). I have looked through the list of functions for Google Spreadsheets and none of them jumps out to me as making possible something that I didn't think of before.
When the question is a programming problem and the problem is an inability to see how to get from point A to point B, I don't know that it's useful to "show what I've done". I've considered splitting by periods. I've looked for a map equivalent function. I know how to use isblank() and counta().
Lol, this is hilariously the longest (and very likely the most unnecessarily complicated) way to combine formulas, but I thought it was interesting that it does in fact work. Just add a 1 in the first row, then in the second row add:
=if(row()=1,1,if(and(istext(D2),counta(split(A1,"."))=3),left(A1,4)&n(right(A1,1)+1),if(and(isblank(B2),isblank(C2),isblank(D2)),"",if(and(isblank(B2),isblank(C2),isnumber(indirect(address(row()-1,column())))),indirect(address(row()-1,column()))&"."&if(istext(D2),round(max(indirect(address(1,column())&":"&address(row()-1,column())))+0.1,)),if(and(isblank(B2),istext(C2)),round(max(indirect(address(1,column())&":"&address(row()-1,column())))+0.1,2),if(istext(B2),round(max(indirect(address(1,column())&":"&address(row()-1,column())))+1,),))))))
In my defense, I've had a very long day at work - complicating what should be a simple thing seems to be my thing today :)
Foreword
Google Sheets' built-in functions don't include an equivalent of JavaScript's .map. The alternative is to use the spreadsheet's array-handling features and iteration patterns.
A "complete solution" could use built-in functions to automatically transform the user input into a simple table and then return the Work Breakdown Structure (WBS) number. Some people refer to transforming the user input into a simple table as "normalization", but including that would make this post too long for the Stack Overflow format, so it focuses on presenting a short formula to obtain the WBS.
It's worth saying that using formulas to transform large data sets into a simple table as part of the continuous spreadsheet recalculation will, in this case, make the spreadsheet slow to refresh.
Short answer
To keep the WBS formula short and simple, first transform the user input into a simple table including task name, ID, and parent ID columns, then use a formula like the following:
=ArrayFormula(
IFERROR(
INDEX($D$2:$D,MATCH($C2,$B$2:$B,0))
&"."
&COUNTIF($C$2:$C2,C2),
RANK($B2,FILTER($B$2:B,LEN($C$2:$C)=0),TRUE)&"")
)
Explanation
First, prepare your data
Put each task in one row. Include a General task / project to be used as the parent of all the root level tasks.
Add an ID to each task.
Add a reference to the ID of the parent task for each task. Leave it blank for the General task / project.
After the above steps the data should look like the following:
+---+--------------+----+-----------+
|   |      A       | B  |     C     |
+---+--------------+----+-----------+
| 1 | Task         | ID | Parent ID |
| 2 | General task | 1  |           |
| 3 | Subtask 1    | 2  | 1         |
| 4 | Subtask 2    | 3  | 1         |
| 5 | Subsubtask 1 | 4  | 2         |
| 6 | Subsubtask 2 | 5  | 2         |
+---+--------------+----+-----------+
Remark: This could also help to reduce the processing time required by a custom function.
Second, add the below formula to D2, then fill down as needed:
=ArrayFormula(
IFERROR(
INDEX($D$2:$D,MATCH($C2,$B$2:$B,0))
&"."
&COUNTIF($C$2:$C2,C2),
RANK($B2,FILTER($B$2:B,LEN($C$2:$C)=0),TRUE)&"")
)
The result should look like the following:
+---+--------------+----+-----------+-------+
|   |      A       | B  |     C     |   D   |
+---+--------------+----+-----------+-------+
| 1 | Task         | ID | Parent ID | WBS   |
| 2 | General task | 1  |           | 1     |
| 3 | Subtask 1    | 2  | 1         | 1.1   |
| 4 | Subtask 2    | 3  | 1         | 1.2   |
| 5 | Subsubtask 1 | 4  | 2         | 1.1.1 |
| 6 | Subsubtask 2 | 5  | 2         | 1.1.2 |
+---+--------------+----+-----------+-------+
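Purely to clarify the logic (not part of the spreadsheet solution), the recursion the formula implements (parent's WBS, a dot, then the position among siblings) looks like this in Python:

# Illustrative sketch: (task, id, parent_id) rows as in the table above.
tasks = [
    ("General task", 1, None),
    ("Subtask 1",    2, 1),
    ("Subtask 2",    3, 1),
    ("Subsubtask 1", 4, 2),
    ("Subsubtask 2", 5, 2),
]

wbs = {}
siblings_seen = {}   # parent_id -> children numbered so far (the COUNTIF part)
root_rank = 0        # position among root tasks (the RANK-over-blanks part)
for name, task_id, parent_id in tasks:
    if parent_id is None:
        root_rank += 1
        wbs[task_id] = str(root_rank)
    else:
        siblings_seen[parent_id] = siblings_seen.get(parent_id, 0) + 1
        wbs[task_id] = f"{wbs[parent_id]}.{siblings_seen[parent_id]}"

for name, task_id, _ in tasks:
    print(name, wbs[task_id])   # ends with: Subsubtask 2 1.1.2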
Here's an answer that does not allow a blank line between items, and requires that you manually type "1" into the first cell (A2). This formula is applied to cell A3, with the assumption that there are at most three levels of hierarchy in columns B, C, and D.
=IF(
COUNTA(B3), // If there is a value in the 1st column
INDEX(SPLIT(A2,"."),1)+1, // find the 1st part of the prior ID, plus 1
IF( // ...otherwise
COUNTA(C3), // If there's a value in the 2nd column
INDEX(SPLIT(A2,"."),1) // find the 1st part of the prior ID
& "." // add a period and
& IFERROR(INDEX(SPLIT(A2,"."),2),0)+1, // add the 2nd part of the prior ID (or 0), plus 1
INDEX(SPLIT(A2,"."),1) // ...otherwise find the 1st part of the prior ID
& "." // add a period and
& IFERROR(INDEX(SPLIT(A2,"."),2),1) // add the 2nd part of the prior ID or 1 and
& "." // add a period and
& IFERROR(INDEX(SPLIT(A2,"."),3)+1,1) // add the 3rd part of the prior ID (or 0), plus 1
)
) & "" // Ensure the result is a string ("1.2", not 1.2)
Without comments:
=IF(COUNTA(B3),INDEX(SPLIT(A2,"."),1)+1,IF(COUNTA(C3),INDEX(SPLIT(A2,"."),1)& "."& IFERROR(INDEX(SPLIT(A2,"."),2),0)+1,INDEX(SPLIT(A2,"."),1)& "."& IFERROR(INDEX(SPLIT(A2,"."),2),1)& "."& IFERROR(INDEX(SPLIT(A2,"."),3)+1,1))) & ""
I'm trying to generate a line chart with data I take from a database.
The data basically has a date field, an estimated progress field, and a real progress field.
The progress values may be nil, but the date is always there.
Since I don't know what the date intervals are, and I need them distributed uniformly, I want to generate the data from the first date to the last date in steps of one day.
For example, let's say I have this in the database:
| date       | estimated progress | real progress |
|------------|--------------------|---------------|
| 2012-08-01 | 0.0                |               |
| 2012-08-02 |                    | 0.15          |
| 2012-08-05 | 0.3                |               |
I would like to generate a line chart with this info:
x = [2012-08-01, 2012-08-02, 2012-08-03, 2012-08-04, 2012-08-05]
ep = [0.0 , 0.0, 0.0, 0.0, 0.3]
rp = [nil , 0.15, 0.15, 0.15, 0.15 ]
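In other words, the carry-forward I want is this (sketched in Python for clarity, though my app is Ruby; the logic is what matters):

from datetime import date, timedelta

# (estimated, real) progress keyed by date, as in the table above.
rows = {date(2012, 8, 1): (0.0, None),
        date(2012, 8, 2): (None, 0.15),
        date(2012, 8, 5): (0.3, None)}

x, ep, rp = [], [], []
last_ep = last_rp = None
d, end = min(rows), max(rows)
while d <= end:
    est, real = rows.get(d, (None, None))
    last_ep = est if est is not None else last_ep   # carry last value forward
    last_rp = real if real is not None else last_rp
    x.append(d); ep.append(last_ep); rp.append(last_rp)
    d += timedelta(days=1)

# ep == [0.0, 0.0, 0.0, 0.0, 0.3]
# rp == [None, 0.15, 0.15, 0.15, 0.15]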
But since the start date and the finish date can be far apart, I'd like to show the x-axis labels at a custom interval: every 3, 5, or 7 days, depending on the distance between those dates.
I'm trying this with gchartrb, which uses the Google Chart API, but I realized I can't have nil values in my data. So I'd have to replace them with 0.0 even though the value isn't 0; it's unknown.
The other problem I found is that I don't know how to specify the labels to show the intervals I mentioned before. It just shows me every label, and therefore it's not readable.
I'm looking for another gem, a solution for gchartrb, or ideas to generate the data differently and make it understandable.
Maybe you should check this link:
http://railscasts.com/episodes/223-charts
Another good library for charts is Flotr2, with its Rails gem, flotr2-rails.
For Flotr2, it's worth checking the example with time/date labels on the axis.