How to get multiple files as Apache Beam input? - google-cloud-dataflow

I am working on this scenario:
In Google Cloud Storage my files are stored in this structure:
PS: the two files are in the same folder (it was an indentation mistake)
What I want to do is:
1] Read the two files "client_info.csv" + "client_events.csv" from each day
2] Join columns based on a common column in each file to get one PCollection
3] Apply transformations
4] Load the data to BigQuery
I wrote code that reads from only one date and it works well, but I couldn't solve the part about iterating over all the dates.
If you have any suggestions, please share them.

A solution may be to consider a pipeline that merges two branches: in each branch you read one input file separately, and then you join them.
Please check out the illustration and the sample code available here.
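For illustration, here is a minimal Python/Apache Beam sketch of that idea, assuming one folder per date under a bucket; the bucket name, the column order, the join key in column 0 and the BigQuery table are placeholders, not details taken from the question. A wildcard in the file pattern picks up every date at once, and CoGroupByKey joins the two branches:

import apache_beam as beam

def key_by_client(line, id_index):
    # Naive CSV split; keys each line by the common client-id column.
    fields = line.split(",")
    return fields[id_index], fields

with beam.Pipeline() as p:
    # The * matches every date folder, e.g. gs://my-bucket/2021-01-01/client_info.csv
    info = (p
            | "ReadInfo" >> beam.io.ReadFromText(
                "gs://my-bucket/*/client_info.csv", skip_header_lines=1)
            | "KeyInfo" >> beam.Map(key_by_client, id_index=0))

    events = (p
              | "ReadEvents" >> beam.io.ReadFromText(
                  "gs://my-bucket/*/client_events.csv", skip_header_lines=1)
              | "KeyEvents" >> beam.Map(key_by_client, id_index=0))

    joined = (
        {"info": info, "events": events}
        | "Join" >> beam.CoGroupByKey()  # one PCollection keyed by client id
        | "Pair" >> beam.FlatMap(
            lambda kv: [{"client_id": kv[0], "info": i, "event": e}
                        for i in kv[1]["info"]
                        for e in kv[1]["events"]]))

    # ...apply your transformations here, then load to BigQuery, e.g.:
    # joined | beam.io.WriteToBigQuery("my-project:my_dataset.clients")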

Related

Combining multiple spreadsheet files into one master overview (read only)

Let me start by saying that I know too little about coding etc. to translate some of the solutions given on this platform to solve my issue, so hopefully someone can help me get started.
I am trying to combine a certain section of multiple Google spreadsheet files with multiple tabs into one file. The names and number of the various tabs differ (and change over time).
To explain: we have, for various people, an overview of their projects (each project on its own tab). Each project/tab contains a number of to-dos. What I need to achieve is to import all the to-dos into a master list so that we have one master overview (basically a big to-do list that I can sort by date).
Two examples with dummy information. The relevant information starts on line 79:
https://docs.google.com/spreadsheets/d/1FsQd9sKaAG7hKynVIR3sxqx6_yR2_hCMQWAWsOr4tj0/edit?usp=sharing
https://docs.google.com/spreadsheets/d/155J24uQpRC7uGvZEhQdkiSBnYU28iodAn-zR7rUhg1o/edit?usp=sharing
Since this information is dynamic and you are restricted from using Apps Script, you can create a "definitions" or "parameters" sheet where each person reports the NAMES of their projects, the ROW their tasks start on, and the total length. From there you can use the IMPORTRANGE function to get their definitions, and from the definitions you can use further IMPORTRANGE calls to get their task lists. Concatenating it all is going to be a pretty big issue for you, though.
This would unfortunately be much easier to accomplish with a different architecture for your docs/sheets. The more a spreadsheet looks like a database (column headers and rows of data that match those headers), the easier it is to work with. The more it looks like forms/paper worksheets, the more code you would need to parse that format.

Reorganizing Google Sheets data dynamically

I'm currently working with Google Sheets to import data from Contact Form 7 in WordPress. All the data is coming over fine, but I wanted to see about formatting it in a more user-friendly fashion. I've simplified the example a bit, but the gist of the form I have created allows the user to request multiple versions of a graphic file with different wording as needed, up to 5 (my example has just 2 for simplicity's sake).
All the data is imported using the CF7 variables, and ideally I wanted to clean this up a bit. What I had thought of as a solution was creating a second sheet that pulls the data submitted in the first sheet into a more user-friendly format, as I intend to use this as a work form for a designer to create the requested graphic once the data is received. With each request the name/department/email/date all stay the same, but I'd like to display the version and line 1 and 2 data on another line. Is it possible to reorganize data like this on the fly, so that when a new form is submitted and adds data to sheet 1, sheet 2 then updates with the properly formatted info?
Is this even possible to do? I did some looking online, but didn't find anything that really related to this type of data manipulation.
Solution:
Here's what ended up working for my example
=ArrayFormula(QUERY({
Sheet1!A2:D,Sheet1!E2:G,ROW(Sheet1!A2:A);
IFERROR(LEN(Sheet1!A2:D)/0),Sheet1!H2:J,ROW(Sheet1!A2:A);
IFERROR(LEN(Sheet1!A2:D)/0),Sheet1!K2:M,ROW(Sheet1!A2:A);
IFERROR(LEN(Sheet1!A2:D)/0),Sheet1!N2:P,ROW(Sheet1!A2:A);
IFERROR(LEN(Sheet1!A2:D)/0),Sheet1!Q2:S,ROW(Sheet1!A2:A)
},"select Col1,Col2,Col3,Col4,Col5,Col6,Col7 where Col5<>'' order by Col8",1))
Yes, it's possible.
One way is to use arrays and the QUERY function.
For simplicity, let's say that:
Columns A and B have the general information of the order
Columns C and D have the data for version 1
Columns E and F have the data for version 2
Columns G and H have the data for version 3
On the output sheet, add the headers.
Below them, add a formula like the following:
=ArrayFormula(QUERY({A2:B,C2:D,ROW(A2:A);IFERROR(LEN(A2:B)/0),E2:F,ROW(A2:A);IFERROR(LEN(A2:B)/0),G2:H,ROW(A2:A)},"select Col1,Col2,Col3,Col4 where Col3<>'' order by Col5"))
References start on row 2 to skip the headers, so they aren't included in the output sheet.
ROW(A2:A) is used to keep the order; in the QUERY select parameter it's referred to as Col5 in the order by clause.
IFERROR(LEN(A2:B)/0) is a "trick" used to "hide" the order (general information) data on the second and following rows of the same order.
It's assumed that lookup-choice-1 will never be empty.
NOTES:
If more columns were added, the column numbers should be updated accordingly
Don't use the order by clause to sort the result by the general-information columns, because of the "trick" used to hide those "labels". If you need to apply a sort, do it before applying the above formula; you could do this by sorting the source range through the Data > Sort range... feature, so the data is sorted before it's transformed by the formula.
See also
Sort and filter your data, an official help article describing Data > Sort range...

Split a KV<K,V> PCollection into multiple PCollections

Hi, after performing a GroupByKey on a KV PCollection, I need to:
1) Make every element in that PCollection a separate individual PCollection.
2) Insert the records in those individual PCollections into a BigQuery table.
Basically my intention is to create a dynamic date partition in the BigQuery table.
How can I do this?
An example would really help.
For Google Dataflow to be able to perform the massive parallelisation that makes it one of its kind (as a service on the public cloud), the job flow needs to be predefined before it is submitted to the Google Cloud console. Every time you execute the jar file that contains your pipeline code (which includes the pipeline options and the transforms), a JSON file with the description of the job is created and submitted to the Google Cloud platform. The managed service then uses this to execute your job.
The use case mentioned in the question demands that the input PCollection be split into as many PCollections as there are unique dates. For the split, the TupleTags needed to split the collection would have to be created dynamically, which is not possible at this time. Creating tuple tags dynamically is not allowed because it prevents the job-description JSON file from being built up front and defeats the design/purpose with which Dataflow was built.
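To make the constraint concrete, here is a small Python sketch (the field name and dates are assumptions) of the kind of split that is possible: a static beam.Partition whose number of outputs is fixed when the job graph is built, not derived from the data.

import apache_beam as beam

KNOWN_DATES = ["20171223", "20171224", "20171225"]  # must be fixed at graph-construction time

def by_date(element, num_partitions):
    # "event_date" is a hypothetical field holding a yyyyMMdd string.
    return KNOWN_DATES.index(element["event_date"])

with beam.Pipeline() as p:
    rows = p | beam.Create([{"event_date": "20171224", "value": 42}])
    # One output PCollection per known date; the number of outputs cannot
    # depend on the data itself, which is exactly the limitation described above.
    per_date = rows | beam.Partition(by_date, len(KNOWN_DATES))
    per_date[1] | "PrintDec24" >> beam.Map(print)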
I can think of a couple of solutions to this problem (each with its own pros and cons):
Solution 1 (a workaround for the exact use case in the question):
Write a Dataflow transform that takes the input PCollection and, for each element in the input:
1. Checks the date of the element.
2. Appends the date to a pre-defined BigQuery table name as a decorator (in the format yyyyMMdd).
3. Makes an HTTP request to the BQ API to insert the row into the table with the decorated table name.
You will have to take the cost perspective into consideration with this approach, because there is a single HTTP request for every element rather than the BQ load job that would have handled it if we had used the BigQueryIO Dataflow SDK module (see the sketch below).
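As a rough sketch of Solution 1 in Python (not the author's code): a DoFn that derives the partition decorator from each element's date and streams the row with the google-cloud-bigquery client library rather than a hand-rolled HTTP request. The project, dataset, table and the "event_date"/"payload" fields are all placeholders.

import apache_beam as beam
from google.cloud import bigquery

class StreamToDatedTable(beam.DoFn):
    """Inserts each element into <base_table>$<yyyyMMdd>, one request per element."""

    def __init__(self, project, dataset, base_table):
        self.project = project
        self.dataset = dataset
        self.base_table = base_table

    def setup(self):
        self.client = bigquery.Client(project=self.project)

    def process(self, element):
        # 1. Check the date of the element ("event_date" is assumed to look like "2017-12-25").
        date_suffix = element["event_date"].replace("-", "")
        # 2. Append the date to the table name as a decorator (yyyyMMdd).
        table_id = f"{self.project}.{self.dataset}.{self.base_table}${date_suffix}"
        # 3. Insert the row through the BigQuery API ("payload" is assumed to be a dict of column values).
        errors = self.client.insert_rows_json(table_id, [element["payload"]])
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")

# usage: rows | beam.ParDo(StreamToDatedTable("my-project", "my_dataset", "events"))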
Solution 2 (best practice that should be followed in these types of use cases):
1. Run the Dataflow pipeline in streaming mode instead of batch mode.
2. Define a time window with whatever duration is suitable for the scenario in which it is being used.
3. For the PCollection in each window, write it to a BQ table with the decorator being the date of the time window itself.
You will have to consider re-architecting your data source to send data to Dataflow in real time, but you will have a dynamically date-partitioned BigQuery table with the results of your data processing available in near real time (a sketch of the windowing step follows below).
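A minimal Python sketch of the windowing side of Solution 2, assuming a Pub/Sub source (the topic name is a placeholder) and daily fixed windows; the write step would then use the window's date as the decorator, for example with the DoFn from the Solution 1 sketch above:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

class DateFromWindow(beam.DoFn):
    """Attaches the start date (yyyyMMdd) of the enclosing window to every element."""
    def process(self, element, window=beam.DoFn.WindowParam):
        yield {"event_date": window.start.to_utc_datetime().strftime("%Y%m%d"),
               "payload": element}

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     # "projects/my-project/topics/events" is a hypothetical topic name.
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Parse" >> beam.Map(lambda msg: {"message": msg.decode("utf-8")})
     | "DailyWindows" >> beam.WindowInto(FixedWindows(24 * 60 * 60))
     | "DateFromWindow" >> beam.ParDo(DateFromWindow())
     # ...write each element to "<table>$<event_date>", e.g. with StreamToDatedTable above.
     | "DebugPrint" >> beam.Map(print))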
References -
Google Big Query Table Decorators
Google Big Query Table insert using HTTP POST request
How job description files work
Note: Please ask in the comments and I will elaborate the answer with code snippets if needed.

How do I use Power BI Desktop with version control?

Greetings beloved comrades,
I am building a series of Power BI dashboards, and as they go into production I'd like to put them into TFS. However, due to the large datasets involved, some of these report definitions are quite large (1.6 GB).
It doesn't seem like a good idea to force TFS to store all of the actual data, when only the definition really matters.
Is there a simple way to remove the data from a .pbix file or save only the definition?
Edit: It looks like Microsoft has rendered this question obsolete with the creation of Power BI templates: April Update for Power BI.
Nevertheless, the workaround in the answer could be used for other purposes.
I would add a "Parameters" Query (a table with a single row - created using Edit Queries / Edit Data) with a column called [Data Load], with a single row containing "Yes".
Then I would add a Filter step to the end of all the other Queries, referring to that "Parameters" Query. The Filter syntax would be:
Parameters{0}[Data Load] = "Yes"
That syntax is a bit obscure - it means:
Go to the Parameters Query, get the value from the 1st row, in the [Data Load] column, test if it equals "Yes".
When you want to empty all the data from the .pbix file, edit the Source step in the "Parameters" Query, change the [Data Load] value to "No", then Apply and Refresh.
I've built a working example of this which you can download from my OneDrive and try out:
http://1drv.ms/1AzPAZp
It's the file: Power BI Demo - Dynamically filter all data.pbix
Convert the .pbix files to .pbit files using the "Save As..." option, and then version those .pbit files in TFS, using Visual Studio but controlling them on the server.
This approach is a bit interesting: when you commit a .pbix, it is uploaded to Premium, the JSON metadata for the model is extracted, and that metadata is committed back to DevOps next to the .pbix. That way you can see a diff of the model metadata over time, including Power Query changes, measure changes, etc.

Syntax to extract variable labels from SPSS file

I have hundreds of SPSS .sav files. For each one I want to extract the variable NAMES and variable LABELS as a two-column table to a CSV file. I know that this is straightforward by simply copying and pasting from the "Variable view" window, but I would really like to know how to do this using syntax. Is this possible?
Many thanks in advance for any help!
You might be interested in the GATHERMD extension command. It takes a wildcard for the file names and builds a dataset with three variables: the file name, the variable name, and the variable label. You could then just save that as a CSV file.
This command requires the Python Essentials, available with your Statistics installation or via the SPSS Community website (www.ibm.com/developerworks/spssdevcentral).
Using native Statistics syntax, DISPLAY DICTIONARY and CODEBOOK might be helpful, but they won't give you all this information in one table.
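If stepping outside SPSS syntax is an option, the same extraction can be scripted in Python with the pyreadstat package (an alternative not mentioned in the answer above; the folder and output paths below are placeholders):

import csv
import glob
import pyreadstat

# Write one CSV with the file name, variable name and variable label of every .sav file.
with open("variable_labels.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "variable_name", "variable_label"])
    for path in glob.glob("C:/data/*.sav"):
        # metadataonly=True reads just the dictionary, not the data.
        _, meta = pyreadstat.read_sav(path, metadataonly=True)
        for name in meta.column_names:
            writer.writerow([path, name, meta.column_names_to_labels.get(name, "")])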
