I'm struggling to find a real-world example of how to use Google Cloud Dataflow combiners to run a common ETL task which aggregates records on multiple keys (e.g. Date, Location) and sums values over different measures (e.g. GrossValue, NetValue, Quantity). I can only find examples with a typical Key/Value (e.g. Day/Value) aggregation. Any hints on how this is done with the Python SDK would be appreciated.
I'm not 100% sure I understand your question. Do you have separate elements you are trying to join the data together for, in which case you may wish to use CoGroupByKey? Or does a single element have multiple fields?
Hope some of this info helps,
I would suggest looking at windowing, which will allow you to subdivide a PCollection according to the timestamps of its individual elements. If you want to see all the events for a particular day this may be useful. Python examples of windowing. You may want to window across a day's worth of data. This link is useful as well to understand how you can use GroupByKey in different ways.
Another option is to determine which date each element belongs to, and use a GroupByKey with a key built as "[location][date][other]". You may need to do something like this if you want to join the data based on multiple fields.
See this GroupByKey example, but change the key to use your multiple fields concatenated.
Here is an example for reducing with a custom combiner. You can add logic here to do a custom aggregation for multiple different measurements.
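As a rough illustration of the composite key plus custom combiner idea, here is a minimal sketch in plain Python. It only mirrors the four-method shape of Beam's CombineFn (create_accumulator, add_input, merge_accumulators, extract_output) so it can run without the SDK installed. In a real pipeline I would expect the class to subclass apache_beam.CombineFn and be applied with beam.CombinePerKey after a Map step that emits ((Date, Location), record). The names SumMeasuresFn and combine_per_key, and the sample records, are made up for this example.

```python
MEASURES = ("GrossValue", "NetValue", "Quantity")


class SumMeasuresFn:
    """Sums several measures at once; same method names as Beam's CombineFn."""

    def create_accumulator(self):
        # One running total per measure.
        return {m: 0 for m in MEASURES}

    def add_input(self, accumulator, record):
        for m in MEASURES:
            accumulator[m] += record[m]
        return accumulator

    def merge_accumulators(self, accumulators):
        # Needed in a distributed run, where partial sums are built per worker.
        merged = self.create_accumulator()
        for acc in accumulators:
            for m in MEASURES:
                merged[m] += acc[m]
        return merged

    def extract_output(self, accumulator):
        return accumulator


def combine_per_key(keyed_records, combine_fn):
    """Hypothetical local stand-in for a per-key combine step."""
    accumulators = {}
    for key, record in keyed_records:
        acc = accumulators.get(key)
        if acc is None:
            acc = combine_fn.create_accumulator()
        accumulators[key] = combine_fn.add_input(acc, record)
    return {k: combine_fn.extract_output(a) for k, a in accumulators.items()}


records = [
    {"Date": "2016-01-01", "Location": "Berlin", "GrossValue": 120, "NetValue": 100, "Quantity": 2},
    {"Date": "2016-01-01", "Location": "Berlin", "GrossValue": 80, "NetValue": 70, "Quantity": 1},
    {"Date": "2016-01-01", "Location": "Paris", "GrossValue": 50, "NetValue": 40, "Quantity": 5},
]

# The composite key is a tuple of the grouping fields.
keyed = [((r["Date"], r["Location"]), r) for r in records]
totals = combine_per_key(keyed, SumMeasuresFn())
```

Using a tuple as the key avoids having to split a concatenated string back into its parts later, and adding a measure only means extending MEASURES.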
I have a complex data model consisting of around hundred tables containing business data. Some tables are very wide, up to four hundred columns. Columns can have various data types - integers, decimals, text, dates etc. I'm looking for a way to identify relevant / important information stored in these tables.
I fully understand that business knowledge is essential to correctly process a data model. What I'm looking for are some strategies to pre-process tables and identify columns that should be taken to a later stage where analysts will actually look into them. For example, I could use data profiling and statistics to find and exclude columns that don't have any data at all, or columns where all records have the same value. This way I could potentially eliminate 30% of fields. However, I'm interested in exploring how AI and Machine Learning techniques could be used to identify important columns, hoping I could identify around 80% of relevant data. I'm aware that relevant information will depend on the questions I want to ask. But even then, I hope I could narrow down the columns to simplify the manual assessment taking place in the next stage.
Could anyone provide some guidance on how to use AI and Machine Learning to identify relevant columns in such wide tables? What strategies and techniques can be used to pre-process tables and identify columns that should be taken to the next stage?
Any help or guidance would be greatly appreciated. Thank you.
F.
The most common approach I've seen to evaluate the analytical utility of columns is the correlation method. This would tell you if there is a relationship (positive or negative) among specific column pairs. In my experience you'll be able to more easily build analysis outputs when columns are correlated - although, these analyses may not always be the most accurate.
However, before you even do that, like you indicate, you would probably need to narrow down your list of columns using much simpler methods. For example, you could surely eliminate a whole bunch of columns based on datatype and basic count statistics.
Less common analytic data types (ids, blobs, binary, etc) can probably be excluded first, followed by running a simple COUNT(DISTINCT ColName), and COUNT(*) WHERE ColName IS NULL. This will help to eliminate UniqueIDs, Keys, and other similar data types. If all the rows are distinct, this would not be a good field for analysis. Same process for NULLs: if the percentage of nulls is greater than some threshold then you can eliminate those columns as well.
In order to automate it depending on your database, you could create a fairly simple stored procedure or function that loops through all the tables and columns and does a data type, count_distinct and a null percentage analysis on each field.
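To make the filtering rules concrete, here is a minimal sketch of the same profiling pass in plain Python, run over rows that have already been fetched as dictionaries. The function names, the 50% null threshold and the sample rows are assumptions for illustration only. A stored procedure would do the equivalent per table and column inside the database.

```python
def profile_columns(rows):
    """Return {column: {"distinct": int, "null_pct": float}} for a list of dict rows."""
    total = len(rows)
    columns = list(rows[0].keys()) if rows else []
    profile = {}
    for col in columns:
        non_null = [row[col] for row in rows if row[col] is not None]
        profile[col] = {
            "distinct": len(set(non_null)),
            "null_pct": 100.0 * (total - len(non_null)) / total,
        }
    return profile


def candidate_columns(rows, max_null_pct=50.0):
    """Keep columns that are not mostly null, not constant and not all-distinct."""
    total = len(rows)
    keep = []
    for col, stats in profile_columns(rows).items():
        if stats["null_pct"] > max_null_pct:
            continue  # mostly empty
        if stats["distinct"] <= 1:
            continue  # same value in every row
        if stats["distinct"] == total:
            continue  # every row distinct: likely an id or key
        keep.append(col)
    return keep


rows = [
    {"id": 1, "region": "north", "status": "active", "notes": None, "amount": 10.0},
    {"id": 2, "region": "south", "status": "active", "notes": None, "amount": 12.5},
    {"id": 3, "region": "north", "status": "active", "notes": "x", "amount": 10.0},
    {"id": 4, "region": "east", "status": "active", "notes": None, "amount": 7.0},
]

kept = candidate_columns(rows)
```

Note that the all-distinct rule will also drop genuinely continuous measures on small samples, so it is probably worth restricting that check to id-like data types.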
Once you've narrowed down the list of columns, you can consider a .corr() function (for example DataFrame.corr() in pandas) to run the analysis against all the remaining columns in something like a Python script.
If you wanted to keep everything in the database, Postgres also supports a corr() aggregate function, but you'll only be able to run this on 2 columns at a time, like this:
SELECT corr(column1,column2) FROM table;
so you'll need to build a procedure that evaluates multiple columns at once.
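The loop over column pairs might look roughly like the sketch below, written in plain Python with a hand-rolled Pearson coefficient so that it does not depend on any library. The table layout (a dict of column name to list of numbers) and the function names are assumptions for this example. Inside Postgres the same loop would issue one corr() query per pair.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(table):
    """table maps column name -> list of numbers; returns {(col_a, col_b): r}."""
    return {
        (a, b): pearson(table[a], table[b])
        for a, b in combinations(table, 2)
    }


table = {
    "quantity": [1, 2, 3, 4],
    "gross": [10, 20, 30, 40],
    "discount": [4, 3, 2, 1],
}

result = pairwise_correlations(table)
```

With n columns this evaluates n * (n - 1) / 2 pairs, which is one more reason to prune the column list first.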
I've thought about this technical challenge for some time. In general it’s an AI-solvable problem, since there are easy features to extract such as unique values, clustering, distribution, etc.
We want to bake this ability into https://columns.ai. We obviously haven’t gotten there yet, but as a first step we collect stats for all columns upon a data connection, identify columns that have a similar range of unique values, and generate a bunch of query templates for users to explore their dataset.
If interested, please take a look. As we keep advancing this part, it will become closer to an AI model for finding relevant columns. Cheers!
I have a list of IPs that I want to filter out of many queries that I have in sumo logic. Is there a way to store that list of IPs somewhere so it can be referenced, instead of copy pasting it in every query?
For example, in a perfect world it would be nice to define a list of things like:
things=foo,bar,baz
And then in another query reference it:
where mything IN things
Right now I'm just copying/pasting. I think there may be a way to do this by setting up a custom data source and putting the IPs in there, but that seems like a very round-about way of doing it, and wouldn't help to re-use parts of a query that aren't data (eg re-use statements). Also their template feature is about parameterizing a query, not re-use across many queries.
Yes. There's a notion of Lookup Tables in Sumo Logic. Consult:
https://help.sumologic.com/docs/search/lookup-tables/create-lookup-table/
for details.
It allows you to store some values (either manually once, or on a schedule as the result of some query) with the | save operator.
And then you can refer to these values using | lookup which is conceptually similar to SQL's JOIN.
Disclaimer: I am currently employed by Sumo Logic.
I’m using a bucket for collecting tick data for multiple symbols in Binance (e.g. ETH/BTC and BNB/BTC) and storing on different measurements (binance_ethbtc and binance_bnbbtc respectively) and that’s working fine. Other than that, I’d like to make aggregations of OHLC data into another bucket, just like this guy here. I’ve already managed to write Flux code for aggregating this data for a single measurement but then it got me wondering: do I need to write a task for EVERY measurement I have? Isn’t there a way of iterating over measurements in a bucket and aggregating the data into another one?
Thanks to FixTestRepeat on the InfluxDB community, I've managed to do it (and iterating over measurements is not necessary). He showed me that if I remove the filter on the _measurement field, the query will yield as many series as there are measurements. More information here
We have several BigQuery tables that we're reading from through DataFlow. At the moment those tables are flattened and a lot of the data is repeated. In Dataflow, all operations must be idempotent, so any output only depends on the input to the function, there's no state kept anywhere else. This is why it makes sense to first group all the records together that belong together and in our case, this probably means creating complex objects.
Example of a complex object (there are many other types like this). We can obviously have millions of instances of each type:
Customer{
customerId
address {
street
zipcode
region
...
}
first_name
last_name
...
contactInfo: {
"phone1": {type, number, ... },
"phone2": {type, number, ... }
}
}
The examples we found for DataFlow only process very simple objects and the examples demonstrate counting, summing and averaging.
In our case, we eventually want to use DataFlow to perform more complicated processing in accordance with sets of rules. Those rules apply to the full contact of a customer, invoice or order for example and eventually produce a whole set of indicators, sums and other items.
We considered doing this 100% in BigQuery, but this gets very messy very quickly due to the rules that apply per entity.
At this time I'm still wondering whether DataFlow is really the right tool for this job. There are almost no examples for DataFlow that demonstrate how it's used for these types of more complex objects with one or two collections. The closest I found was the use of a "LogMessage" object for log processing, but this didn't have any collections and therefore didn't do any hierarchical processing.
The biggest problem we're facing is hierarchical processing. We're reading data like this:
customerid  ...  street  zipcode  region  ...  phoneid  type  number
1                a       b        c            phone1   1     555-2424
1                a       b        c            phone2   1     555-8181
And the first operation should be to group those rows together to construct a single entity, so we can make our operations idempotent. What is the best way to do that in DataFlow? Or can you point us to an example that does that?
You can use any object as the elements in a Dataflow pipeline. The TrafficMaxLaneFlow example uses a complex object (although it doesn't have a collection).
In your example you would do a GroupByKey to group the elements. The result is a KV<K, Iterable<V>>. The KV here is just an object and has a collection-like value inside. You could then take that KV<K, Iterable<V>> and turn it into whatever kind of objects you wanted.
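To sketch what that second step could look like, here is a plain-Python version that groups the flat rows by customerid and folds each group into one nested object. The helper group_by_key only imitates the (key, iterable of values) output of a GroupByKey, and build_customer, the dictionary layout and the sample rows are assumptions based on the example in the question. In a pipeline, build_customer would be the body of the transform that follows the GroupByKey.

```python
def group_by_key(rows, key):
    """Local stand-in for GroupByKey: yields (key, list_of_rows) pairs."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row)
    return grouped.items()


def build_customer(customer_id, rows):
    """Fold the flattened rows of one customer into a single nested object."""
    rows = list(rows)
    first = rows[0]  # the non-repeated fields are identical on every row
    return {
        "customerId": customer_id,
        "address": {
            "street": first["street"],
            "zipcode": first["zipcode"],
            "region": first["region"],
        },
        # One entry per repeated phone row.
        "contactInfo": {
            r["phoneid"]: {"type": r["type"], "number": r["number"]} for r in rows
        },
    }


flat_rows = [
    {"customerid": 1, "street": "a", "zipcode": "b", "region": "c",
     "phoneid": "phone1", "type": 1, "number": "555-2424"},
    {"customerid": 1, "street": "a", "zipcode": "b", "region": "c",
     "phoneid": "phone2", "type": 1, "number": "555-8181"},
]

customers = [build_customer(k, v) for k, v in group_by_key(flat_rows, "customerid")]
```

Each resulting customer is a self-contained element, so later rule-based steps only need that one element as input.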
The only thing to be aware of is that if you have very few elements that are really big you may run into some parallelism limits. Specifically, each element needs to be small enough to be processed on a single machine.
You may also be interested in withoutFlatteningResults on BigQueryIO. It only supports reading from a query (rather than a table) but it should provide the results without flattening.
There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst it is easy to construct a Cypher query to find all events on a day, in a range, etc, this could create a lot of unnecessary nodes. It would also make it very easy to change individual events' times, locations etc, because there is already a node with the basic information.
The second option is to store the recurrence temporal pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset whose temporal pattern gives a date within the search range, then comparing that to any exceptions and merging (or ignoring) the results as necessary (this could probably be partially achieved when pulling the initial result set as part of the query).
Whilst the second option is the one I would choose currently, it seems quite inefficient in that it processes the data twice, albeit a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, but the processing would be done on the database server rather than the requesting client.
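For illustration, the second pass over the returned nodes might look something like this minimal Python sketch. The way the pattern is stored here (a fixed interval in days plus a set of exception dates) is purely an assumption for the example, as are the property names. A real recurrence rule would be richer than a fixed interval.

```python
from datetime import date, timedelta


def occurrences(event, range_start, range_end):
    """Yield the dates of a recurring event that fall within [range_start, range_end]."""
    interval = event["interval_days"]
    step = timedelta(days=interval)
    current = event["start"]
    last = min(event["end"], range_end)
    if current < range_start:
        # Jump straight to the first occurrence on or after range_start.
        missed_days = (range_start - current).days
        steps = -(-missed_days // interval)  # ceiling division
        current += steps * step
    while current <= last:
        if current not in event["exceptions"]:
            yield current
        current += step


# Hypothetical event node properties as returned by the initial query.
event = {
    "start": date(2024, 1, 1),
    "end": date(2024, 3, 31),
    "interval_days": 7,
    "exceptions": {date(2024, 1, 15)},
}

found = list(occurrences(event, date(2024, 1, 8), date(2024, 1, 29)))
```

The initial query only has to check that the event's start/end window overlaps the search range. This function then does the per-event expansion and drops the exceptions.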
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
Whilst I'm not 100% sure I understand your requirement, I'd have a look at this blog post; perhaps you'll find a bit of inspiration there: http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html