Deal with null values from BigQuery in Google Cloud Dataflow Python - google-cloud-dataflow

The issue I have is that when I read data from BigQuery (GBQ) that contains null values, mapping any function over the column with nulls raises errors.
When I write the input_data from GBQ to text, the JSON output file does not contain the keys whose values are null. I believe this is an issue that needs to be fixed.
For example:
Input:
key_1,key_2,key_3
value_1,,value_3
Expected output:
{"key_1":"value_1","key_2":null,"key_3":"value_3"}
Output from Dataflow:
{"key_1":"value_1","key_3":"value_3"}

For now, there is not much we can do at the Dataflow level. As you pointed out, the JSON coming out of BigQuery does not contain the null values. This will be improved (but not in the next immediate release) if we switch to Avro as an intermediate format for exports. In the meantime, you can insert a one-line function to "clean up" the data by adding the missing nullable fields. See the example below:
def add_null_field(row, field):
    row.update({field: row.get(field, None)})
    return row

(p
 | df.io.Read(df.io.BigQuerySource('PROJECT:DATASET.TABLE'))
 | df.Map(add_null_field, field='value')
 | df.io.Write(df.io.TextFileSink('gs://BUCKET/FILES')))
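If you are on the current Apache Beam Python SDK rather than the old google.cloud.dataflow (df) package, the same workaround looks roughly like this sketch (the project, table, and bucket are the same placeholders as above, and 'key_2' is just an illustrative field name taken from the example):

import json
import apache_beam as beam

def add_null_field(row, field):
    # re-insert the key with None when BigQuery dropped it from the export
    row[field] = row.get(field, None)
    return row

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromBigQuery(table='PROJECT:DATASET.TABLE')
     | beam.Map(add_null_field, field='key_2')
     | beam.Map(json.dumps)
     | beam.io.WriteToText('gs://BUCKET/FILES'))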
Hope this helps.

Related

How to add missing values to a string column using SimpleImputer()?

I have a dataset that I read into a pandas dataframe.
Most of the columns are string columns.
Column structure of my dataframe:
['id', 'currently working', column3, column4, ....]
The column with missing data is 'currently working'. It contains only two values -> YES, NO and there are null values as well.
I have applied SimpleImputer() before, but that was on an integer column containing salaries, where I used the mean strategy to preprocess the dataset and replace nulls, like below.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
But in my current scenario the column is of string type, so I certainly can't apply any numeric methods to it.
Could anyone let me know how I can preprocess the existing data and replace nulls in a string column of a pandas dataframe?
What preprocessing method should I follow when working with string columns?
You can use the most_frequent strategy: SimpleImputer will replace missing values with the most frequent value in the column. It may also be useful to set add_indicator=True. In this case, the output of the imputer's transform will stack an additional column with the value from the MissingIndicator, so your model will have a clue that the value was missing before.
Code example:
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan,
                    strategy='most_frequent',
                    add_indicator=True)
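To apply it to the string column, something like the following sketch should work (the toy dataframe and the 'was_missing' column name are just stand-ins for your real data):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame standing in for the real dataset
df = pd.DataFrame({'currently working': ['YES', 'NO', np.nan, 'YES']})

imp = SimpleImputer(missing_values=np.nan,
                    strategy='most_frequent',
                    add_indicator=True)

# first output column is the imputed value, second is the missing-value flag
out = imp.fit_transform(df[['currently working']])
df['currently working'] = out[:, 0]
df['was_missing'] = out[:, 1]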

Google DataFlow: Read from BigQuery, combine three string fields, write key/value fields to Google Cloud Spanner

None of the provided Dataflow templates match what I need to do, so I'm trying to write my own. I managed to run example code like the word count example without issue, so I tried to stitch together parts of separate examples that read from BigQuery and write to Spanner, but there are just so many things in the source code I don't understand and cannot adapt to my own problem.
I'm REALLY lost on this and any help is greatly appreciated!
The goal is to use Dataflow and the Apache Beam SDK to read from a BigQuery table with 3 string fields and 1 integer field, concatenate the 3 string fields into one string, and put that new string in a new field called "key". I then want to write the key field and the (unchanged) integer field to a Spanner table that already exists, ideally appending rows with a new key and updating the integer field of rows whose key already exists.
I'm trying to do this in Java because there is no I/O connector for Python. Any advice on doing this with Python would be much appreciated.
For now I would be super happy if I could just read a table from BigQuery and write whatever I get from that table to a table in Spanner, but I can't even make that happen.
Problems:
I'm using Maven and I don't know what dependencies I need to put in the pom file
I don't know which packages and imports I need at the beginning of my Java file
I don't know if I should use readTableRows() or read(SerializableFunction) to read from BigQuery
I have no idea how to access the string fields in the PCollection to concatenate them or how to make the new PCollection with only the key and integer field
I somehow need to make the PCollection into a Mutation to write to Spanner
I want to use an INSERT OR UPDATE query to write to the Spanner table, which doesn't seem to be an option in the Spanner I/O connector.
Honestly, I'm too embarrassed to even show the code I'm trying to run.
public class SimpleTransfer {
    public static void main(String[] args) {
        // Create and set your PipelineOptions.
        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        // For Cloud execution, set the Cloud Platform project, staging location, and specify DataflowRunner.
        options.setProject("myproject");
        options.setStagingLocation("gs://mybucket");
        options.setRunner(DataflowRunner.class);
        // Create the Pipeline with the specified options.
        Pipeline p = Pipeline.create(options);
        String tableSpec = "database.mytable";
        // Read the whole table from BigQuery.
        PCollection<TableRow> rowsFromBigQuery =
            p.apply(
                BigQueryIO.readTableRows()
                    .from(tableSpec));
        // Hopefully some day add a transform.
        // Somehow make a Mutation.
        PCollection<Mutation> mutation = rowsFromBigQuery;
        // Only way I found to write to Spanner, not even sure if that works.
        SpannerWriteResult result = mutation.apply(
            SpannerIO.write().withInstanceId("myinstance").withDatabaseId("mydatabase").grouped());
        p.run().waitUntilFinish();
    }
}
It's intimidating to deal with these strange data types, but once you get used to the TableRow and Mutation types, you'll be able to code robust pipelines.
The first thing you need to do is take your PCollection of TableRows, and convert those into an intermediate format that is convenient for you. Let's use Beam's KV, which defines a key-value pair. In the following snippet, we're extracting the values from the TableRow and concatenating the strings you want:
rowsFromBigQuery
    .apply(
        MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                             TypeDescriptors.integers()))
            .via(tableRow -> KV.of(
                (String) tableRow.get("myKey1")
                    + (String) tableRow.get("myKey2")
                    + (String) tableRow.get("myKey3"),
                (Integer) tableRow.get("myIntegerField"))));
Finally, to write to Spanner, we use Mutation-type objects, which define the kind of mutation that we want to apply to a row in Spanner. We'll do it with another MapElements transform, which takes N inputs, and returns N outputs. We define the insert or update mutations there:
myKvPairsPCollection
    .apply(
        MapElements.into(TypeDescriptor.of(Mutation.class))
            .via(elm -> Mutation.newInsertOrUpdateBuilder("myTableName")
                .set("key").to(elm.getKey())
                .set("value").to(elm.getValue())
                .build()));
And then you can pass the output of that to SpannerIO.write. The whole pipeline looks something like this:
Pipeline p = Pipeline.create(options);
String tableSpec = "database.mytable";

// Read the whole table from BigQuery.
PCollection<TableRow> rowsFromBigQuery =
    p.apply(
        BigQueryIO.readTableRows().from(tableSpec));

// Take in a TableRow, and convert it into a key-value pair.
PCollection<Mutation> mutations = rowsFromBigQuery
    // First we make the TableRows into the appropriate key-value
    // pair of string key and integer.
    .apply(
        MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                             TypeDescriptors.integers()))
            .via(tableRow -> KV.of(
                (String) tableRow.get("myKey1")
                    + (String) tableRow.get("myKey2")
                    + (String) tableRow.get("myKey3"),
                (Integer) tableRow.get("myIntegerField"))))
    // Now we construct the mutations.
    .apply(
        MapElements.into(TypeDescriptor.of(Mutation.class))
            .via(elm -> Mutation.newInsertOrUpdateBuilder("myTableName")
                .set("key").to(elm.getKey())
                .set("value").to(elm.getValue())
                .build()));

// Now we pass the mutations to Spanner.
SpannerWriteResult result = mutations.apply(
    SpannerIO.write()
        .withInstanceId("myinstance")
        .withDatabaseId("mydatabase"));

p.run().waitUntilFinish();

How to exclude the time field from Sumo Logic results?

How do I exclude the Time (_messagetime) metadata field from my result set?
I've tried:
field -_messagetime
But it gives me the error
Field _messagetime not found, please check the spelling and try again.
Using:
fields -time
does not remove the field either.
Currently I'm getting around this by using an aggregate (count) that has no effect on the data.
[EDIT]
Here's an example query:
Removing the Message (_raw) works. But removing the time (_messagetime) doesn't.
These results are used as email alerts, so removing the Time field from the Display isn't really an option.
The easiest way is to just turn off the field in the field browser window on the left-hand side of the results.
The other option is to aggregate and then remove the aggregate field - even if you just aggregate on _raw (which is the raw message):
_sourceCategory=blah
| count by _raw
| fields -_count
If you're still having trouble, can you share the rest of your query?
Edit based on your new query:
*
| parse "Description=\"*\"" as Description
| parse "Date=\"*\"" as Date
| count by Description, Date, Action
| fields -_count
The Time field is there as a result of the timeslice operation, as far as I'm aware. The following should do the trick:
| fields - _timeslice

How to aggregate data using apache beam api with multiple keys

I am new to the Google Cloud data platform as well as to the Apache Beam API. I would like to aggregate data based on multiple keys. In my requirement I will get a transaction feed with fields like customer id, customer name, transaction amount and transaction type. I would like to aggregate the data based on customer id & transaction type. Here is an example.
customer id,customer name,transaction amount,transaction type
cust123,ravi,100,D
cust123,ravi,200,D
cust234,Srini,200,C
cust444,shaker,500,D
cust123,ravi,100,C
cust123,ravi,300,C
Output should be:
cust123,ravi,300,D
cust123,ravi,400,C
cust234,Srini,200,C
cust444,shaker,500,D
Most of the examples I found on Google are based on a single key, like grouping by a single key. Can anyone please help me with what my PTransform should look like for this requirement, and how to produce the aggregated data along with the rest of the fields?
Regards,
Ravi.
Here is an easy way. I concatenated all the keys together to form a single key, then did the sum, and after that split the key to organize the output the way you wanted. Please let me know if you have any questions.
The code does not expect a header in the CSV file. I just kept it short to show the main point you are asking about.
import apache_beam as beam
import sys


class Split(beam.DoFn):
    def process(self, element):
        """
        Splits each row on commas and returns a tuple representing the row to process.
        """
        customer_id, customer_name, transaction_amount, transaction_type = element.split(",")
        return [
            (customer_id + "," + customer_name + "," + transaction_type, float(transaction_amount))
        ]


if __name__ == '__main__':
    p = beam.Pipeline(argv=sys.argv)
    input_file = 'aggregate.csv'
    output_prefix = 'C:\\pythonVirtual\\Mycodes\\output'
    (p
     | 'ReadFile' >> beam.io.ReadFromText(input_file)
     | 'parse' >> beam.ParDo(Split())
     | 'sum' >> beam.CombinePerKey(sum)
     | 'convertToString' >> beam.Map(
         lambda kv: '%s,%s,%s,%s' % (kv[0].split(",")[0], kv[0].split(",")[1], kv[1], kv[0].split(",")[2]))
     | 'write' >> beam.io.WriteToText(output_prefix))
    p.run().wait_until_finish()
it will produce output as below:
cust234,Srini,200.0,C
cust444,shaker,500.0,D
cust123,ravi,300.0,D
cust123,ravi,400.0,C
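If you would rather not pack the fields into a comma-joined string, a variant of the same idea (just a sketch, assuming the same header-less aggregate.csv layout) keys on a tuple instead and formats the output at the end:

import apache_beam as beam


def to_kv(line):
    # key on (customer id, customer name, transaction type); sum the amount
    cust_id, name, amount, ttype = line.split(",")
    return ((cust_id, name, ttype), float(amount))


with beam.Pipeline() as p:
    (p
     | 'ReadFile' >> beam.io.ReadFromText('aggregate.csv')
     | 'ToKV' >> beam.Map(to_kv)
     | 'Sum' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: '%s,%s,%s,%s' % (kv[0][0], kv[0][1], kv[1], kv[0][2]))
     | 'Write' >> beam.io.WriteToText('output'))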

Dask groupby and apply : Value error Expected axis has 6 elements, new values have 5 elements

I am trying to collapse rows of a dataframe based on a key. My file is big and pandas throws a memory error, so I am currently trying to use dask. I am attaching the snippet of the code here.
def f(x):
    p = x.groupby(id).agg(''.join).reset_index()
    return p

metadf = pd.DataFrame(columns=['c1','p1','pd1','d1'])
df = df.groupby(idname).apply(f, meta=metadf).reset_index().compute()
p has the same structure as metadf, and the shapes of both dataframes are the same.
When I execute this, I get the following error:
"ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements"
What am I missing here? Is there any other way to collapse rows based on a key in dask?
The task at hand is to do the following (shown as a sample) in a dask dataframe:
Input csv file :
key,c1,c2,c3......,cn
1,car,phone,cat,.....,kite
2,abc,def,hij,.......,pot
1,yes,no,is,.........,hello
2,hello,yes,no,......,help
Output csv file:
key,c1,c2,c3,.......,cn
1,caryes,phoneno,catis,.....,kitehello
2,abchello,defyes,hijno,....,pothelp
In this case meta= corresponds to the output of df.groupby(...).apply(f) and not just to the output of f. Perhaps these differ in some subtle way?
I would address this by first not providing meta= at all. Dask.dataframe will give you a warning asking you to be explicit but things should hopefully progress anyway if it is able to determine the right dtypes and columns by running some sample data through your function.
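For example, a minimal sketch along those lines (assuming the key column is literally named 'key' as in the sample above and the input file is called input.csv):

import dask.dataframe as dd

ddf = dd.read_csv('input.csv', dtype=str)  # columns: key, c1, c2, ..., cn

def collapse(group):
    # each group arrives as a plain pandas DataFrame, so this joins the
    # string values of every column per key
    return group.groupby('key').agg(''.join).reset_index()

# No meta= here: dask will warn and infer the output columns by running
# the function on a small sample; that warning also tells you what to pass
# to meta= later if you want to silence it.
result = ddf.groupby('key').apply(collapse).compute()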
