Efficiently store multiple metrics in a single series - influxdb

Is it efficient to store multiple metrics in a single series? There is support for multiple columns, but, at least based on the 0.9 documentation, there seems to be a preference for a single series per metric with a single "value" column.
What I'm looking for is a way to store some related data (such as hard-disk free, used, and total). Keeping 3 separate series seems like a pain and would certainly complicate queries that need to span the series.
Are there some general best practices for storing metrics such as these?

InfluxDB 0.9 will happily support up to 255 fields per series. The examples in the docs mostly show a single field with a field key of "value", but there's nothing preventing you from having multiple fields. Since fields aren't indexed, additional fields should have no performance impact at all.
For example, here's a point with three field values:
{
  "database": "mydb",
  "points": [
    {
      "measurement": "disk",
      "tags": {
        "host": "server01",
        "type": "SSD"
      },
      "time": "2009-11-10T23:00:00Z",
      "fields": {
        "free": 318465464,
        "used": 682324110,
        "total": 1000789574
      }
    }
  ]
}
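If you're writing points from code, here's a minimal sketch using the Python influxdb client; the package, connection details, and database name are assumptions to adjust for your setup:

from influxdb import InfluxDBClient  # pip install influxdb

# Connection parameters are placeholders.
client = InfluxDBClient(host="localhost", port=8086, database="mydb")

point = {
    "measurement": "disk",
    "tags": {"host": "server01", "type": "SSD"},
    "time": "2009-11-10T23:00:00Z",
    # All three related metrics live in one point as separate (unindexed) fields.
    "fields": {"free": 318465464, "used": 682324110, "total": 1000789574},
}

client.write_points([point])

# One query can then read all the fields of the series together.
result = client.query("SELECT free, used, total FROM disk WHERE host = 'server01'")
print(list(result.get_points()))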

Related

Is Firebase efficient in 'Group By' operations?

I am trying to adapt my normalized data model to a Firebase-friendly data model.
My application primarily runs 'summary queries' on the normalised tables e.g. things analogous to:
'SELECT ctryid, avg(age) FROM users GROUP BY ctryid'
Having an integer ctryid speeds things up tremendously, presumably because the database compares an integer ctryid instead of strings (USA, FRA, ITA, ...).
Now I learned that Firebase generates keys like 'Xj34Fhe2sP0'. Would that indeed imply less efficiency as compared to my SQL queries?
What would such a query look like in Firebase? I do not wish to denormalize any calculated results.
Edit: Denormalizing to avoid costly joins would also imply including ctryname in the users object, right?
Thanks a lot.
Firebase doesn't support group-by clauses in its queries, nor any other aggregation operations. I don't think the keys that it generates are very important though.
What I often recommend is that you model your database to reflect what you show on the screens of your app. Your SQL query seems to deliver a list of country IDs with the average age of the users in each country.
If that's what you want to show, consider storing exactly that data in Firebase:
averageAgeByCountryId: {
  "NL": 43.3,
  "US": 38.1
}
Now, to write this data you'll need to update the existing average each time you add a new user to a country. To allow that, you'll probably want to instead store the total number of users in each country and their total age:
averageAgeByCountryId: {
  "NL": { userCount: 5, sumOfAge: 217 },
  "US": { userCount: 10, sumOfAge: 381 }
}
Now you can still easily calculate the average age, but in this format it is also easier to update the average ages as you add users.
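As a minimal sketch of that bookkeeping, assuming the Python Admin SDK and the Realtime Database (the credentials file, database URL, and path are placeholders):

import firebase_admin
from firebase_admin import credentials, db

# Placeholders: point these at your own service account and database.
cred = credentials.Certificate("serviceAccountKey.json")
firebase_admin.initialize_app(cred, {"databaseURL": "https://<your-project>.firebaseio.com"})

def add_user_age(country_id, age):
    """Fold a new user's age into the per-country aggregate atomically."""
    ref = db.reference(f"averageAgeByCountryId/{country_id}")

    def update(current):
        current = current or {"userCount": 0, "sumOfAge": 0}
        current["userCount"] += 1
        current["sumOfAge"] += age
        return current

    ref.transaction(update)

def average_age(country_id):
    """Compute the average on read from the stored count and sum."""
    stats = db.reference(f"averageAgeByCountryId/{country_id}").get() or {}
    return stats.get("sumOfAge", 0) / max(stats.get("userCount", 0), 1)

Many teams run the same update from a Cloud Functions trigger instead, so the aggregate stays correct no matter which client writes the user.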

AWS Glue - avro to parquet - Glue job getting an empty frame from catalog

I'm using an AWS Glue Crawler to crawl roughly 170 GB of Avro data to create a Data Catalog table.
There are a couple of different schema versions in the Avro data, but the crawler still manages to combine the data into a single table (I have enabled the "Group by data compatibility and schema similarity" mode).
Here is when things get problematic.
I can only use Athena to run a SELECT COUNT(*) FROM <DB>.<TABLE> query on the data - any other query raises the following error:
GENERIC_INTERNAL_ERROR: Unknown object inspector category: UNION
A brief Google check leads me to believe that this has something to do with the schema in the avro files.
Normally, this is where I would focus my efforts, BUT: I have been able to do this exact same procedure (Avro -> crawler -> Glue job -> Parquet) before, with a smaller Avro data set (50 GB) that had the same issue (only being able to run a count query). Moving on.
The conversion job previously took about an hour. Now, when running the same job on the 170 GB data, it finishes in a minute because glueContext.create_dynamic_frame.from_catalog now returns an empty frame - no errors, nothing. The confusion is real, as I am able to run a COUNT query in Athena on the same table the job uses, which returns a count of 520M objects.
Does anyone have an idea what the problem might be?
A couple of things that might be relevant:
The COUNT query returns 520M, but the recordCount in the table properties says 170M records.
The data is stored in 300k .avro files ranging from 2 MB to 30 MB in size.
Yes, the crawler is pointed at the folder containing all the files, not at a single file (a common crawler gotcha).
The previous attempt with a smaller data set (50 GB) was 100% successful - I could crawl the Parquet data and query it with Athena (tested many different queries, all working).
We had the same issue and were able to solve it as follows.
In our Avro schema there was a record with mixed field types, i.e., some were of the form "type" : [ "string" ], others of the form "type" : [ "null", "string" ].
After changing this manually to [ "null", "string" ] everywhere, we were able to use the table in Athena without any issues.
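If editing the schema by hand isn't practical at your scale, the same normalization can be scripted; here is a rough Python sketch that rewrites union types in a schema loaded as a dict (the file name is hypothetical, and it only handles unions expressed as lists, as in the example above):

import json

def normalize_unions(schema):
    """Ensure every union type includes "null", e.g. ["string"] becomes ["null", "string"]."""
    if isinstance(schema, dict):
        t = schema.get("type")
        if isinstance(t, list) and "null" not in t:
            schema["type"] = ["null"] + t
        for value in schema.values():
            normalize_unions(value)
    elif isinstance(schema, list):
        for item in schema:
            normalize_unions(item)
    return schema

with open("schema.avsc") as f:  # hypothetical schema file name
    fixed = normalize_unions(json.load(f))

print(json.dumps(fixed, indent=2))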

Cross JOIN collections and GroupBy CosmosDB Javascript API

I am searching for a solution in the Javascript API for CosmosDB, where you can perform an INNER/OUTER JOIN between two document collections.
I have been unsuccessful.
From my understanding, Javascript Stored Procedures run within a collection, and cannot access/reference data in another collection.
If the above is true, where does this leave our application's data source, which has been designed in a relational way? Say the business requires an immediate query to collect the following data:
All agreements/contracts that have been migrated to a new product offering, within a specific region, for a given time frame. How would I go about this query if there are about 5 collections containing all the information related to it?
Any guidance?
UPDATE
Customer
{
  "id": "d02e6668-ce24-455d-b241-32835bb2dcb5",
  "Name": "Test User One",
  "Surname": "Test"
}
Agreement
{
  "id": "ee1094bd-16f4-45ec-9f5e-7ecd91d4e729",
  "CustomerId": "d02e6668-ce24-455d-b241-32835bb2dcb5",
  "RetailProductVersionInstance": [
    {
      "id": "8ce31e7c-7b1a-4221-89a3-449ae4fd6622",
      "RetailProductVersionId": "ce7a44a4-7e49-434b-8a51-840599fbbfbb",
      "AgreementInstanceUser": {
        "FirstName": "Luke",
        "LastName": "Pothier",
        "AgreementUserTypeId": ""
      },
      "AgreementInstanceMSISDN": {
        "IsoCountryDialingCode": null,
        "PhoneNumber": "0839263922",
        "NetworkOperatorId": "30303728-9983-47f9-a494-1de853d66254"
      },
      "RetailProductVersionInstanceState": "IN USE",
      "IsPrimaryRetailProduct": true,
      "RetailProductVersionInstancePhysicalItems": [
        {
          "id": "f8090aba-f06b-4233-9f9e-eb2567a20afe",
          "PhysicalItemId": "75f64ab3-81d2-f600-6acb-d37da216846f",
          "RetailProductVersionInstancePhysicalItemNumbers": [
            {
              "id": "9905058b-8369-4a64-b9a5-e17e28750fba",
              "PhysicalItemNumberTypeId": "39226b5a-429b-4634-bbce-2213974e5bab",
              "PhysicalItemNumberValue": "KJDS959405"
            },
            {
              "id": "1fe09dd2-fb8a-49b3-99e6-8c51df10adb1",
              "PhysicalItemNumberTypeId": "960a1750-64be-4333-9a7f-c8da419d670a",
              "PhysicalItemNumberValue": "DJDJ94943"
            }
          ],
          "RetailProductVersionInstancePhysicalItemState": "IN USE",
          "DateCreatedUtc": "2018-11-21T13:55:00Z",
          "DateUpdatedUtc": "2020-11-21T13:55:00Z"
        }
      ]
    }
  ]
}
RetailProduct
{
  "id": "ce7a44a4-7e49-434b-8a51-840599fbbfbb",
  "FriendlyName": "Data-Package 100GB",
  "WholeSaleProductId": "d054dae5-173d-478b-bb0e-7516e6a24476"
}
WholeSaleProduct
{
  "id": "d054dae5-173d-478b-bb0e-7516e6a24476",
  "ProductName": "Data 100",
  "ProviderLiabilities": []
}
Above, I have added some sample documents.
Relationships:
Agreement.CustomerId links to Customer.id
Agreement.RetailProductVersionInstance.RetailProductVersionId links to RetailProduct.id
RetailProduct.WholeSaleProductId links to WholeSaleProduct.id
How would I write a JavaScript stored procedure in CosmosDB to perform joins between these 4 collections?
Short answer is that you cannot perform joins between different collections via SQL in Cosmos DB.
Generally, the solution to this type of question is multiple queries or different schema. In your scenario, if you can denormalize your schema into one collection without duplicating data, then it is easy.
If you provide your schemas, it'd be possible to provide a more comprehensive answer.
-- Edit 1 --
Stored procedures are only good candidates for operations that require multiple operations on the same collection + partition key. This makes them good for bulk insert/delete/update, transactions (which need at least a read and a write), and a few other things. They aren't good for CPU-intensive work, but rather for work that would normally be IO-bound by network latency. They cannot be used for cross-partition or cross-collection scenarios; in those cases, you must perform the operations from the remote client.
In your case, it's a fairly straightforward 2 + 2N separate reads, where N is the number of products. You need to read the agreement first. Then you can look up the customer and the product records in parallel, and then you can look up the wholesale record last, so you should have a latency of 3s + C, where s is the average duration of a given read request and C is some constant CPU time to perform the join/issue the request/etc.
It's worth considering whether you can consolidate RetailProduct and WholeSaleProduct into a single record, where the wholesale record contains all of its RetailProducts in an array, or keep them as separate documents partitioned by the wholesale id, with a well-known id for the document that holds the wholesale product info. That would cut your latency by a third. If you go with partitioning by wholesale id, you could write one query for any records that share a wholesale id, so you'd get 2 + log(N) reads but the same effective latency. For that strategy, you'd store a composite index of "wholesaleid+productid" in the agreement. One thing to watch is that it duplicates the wholesale+product relationship, but as long as that relationship doesn't change, I don't think there is anything to worry about, and it provides a good optimization for info lookup.
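To make the client-side "join" above concrete, here is a rough sketch with the Python SDK (azure-cosmos). The account URL, key, container names, and the assumption that each document's id doubles as its partition key are all placeholders to adapt:

from azure.cosmos import CosmosClient  # pip install azure-cosmos

# Placeholders for your account endpoint, key, and database/container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("mydb")
agreements = database.get_container_client("Agreement")
customers = database.get_container_client("Customer")
retail_products = database.get_container_client("RetailProduct")
wholesale_products = database.get_container_client("WholeSaleProduct")

def read(container, doc_id):
    # Assumes id is also the partition key; swap in the real key if it differs.
    return container.read_item(item=doc_id, partition_key=doc_id)

# 1) Read the agreement.
agreement = read(agreements, "ee1094bd-16f4-45ec-9f5e-7ecd91d4e729")

# 2) Look up the customer and the retail products (these could run in parallel).
customer = read(customers, agreement["CustomerId"])
products = [read(retail_products, instance["RetailProductVersionId"])
            for instance in agreement["RetailProductVersionInstance"]]

# 3) Finally resolve the wholesale products referenced by the retail products.
wholesale = [read(wholesale_products, product["WholeSaleProductId"])
             for product in products]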

Core Data: fetch list from one table and details from another without a relation

I have two entities.
First one:
{
  "ID": 5777,
  "Name": "",
  "EventID": 18341,
  "ParentID": 19702,
  "LastModifiedDate": "2016-11-30 09:36:04",
  "EntityType": 3,
  "InstanceID": 916787
}
Second one:
{
  "ID": 19702,
  "Name": "Google",
  "EventID": 18341,
  "ParentID": 0,
  "LastModifiedDate": "2016-12-01 06:20:49",
  "EntityType": 0,
  "FileAttribute": "",
  "InstanceID": 0,
  "IsFile": false,
  "ResourceURL": "http://www.google.com",
  "Taxonomies": "0",
  "ViewCount": 2
}
Now I need to fetch from the second entity where its "ID" equals the "ParentID" of the first one, using Core Data.
The equivalent MySQL query would be: SELECT * FROM tbl_two WHERE `ID` IN (SELECT `ParentID` FROM tbl_ONE WHERE `InstanceID` = '916787' AND `EventID` = '18341')
The first step is to stop thinking about Core Data like it's SQL. It has a different API, and thinking in SQL terms will lead you to poor designs.
In the most general terms, don't think of records, think of objects that you might use in an app and how those objects relate to each other. Instead of using foreign keys to relate one record to another, use object properties to relate one object to another.
From what I can make of your SQL query, you want something like this:
Two Core Data entities called Resource (your first sample) and ResourceMapping (your second sample).
Resource has a property called something like mapping, of type ResourceMapping.
You relate one to another using the relationship, not by ID. You can store the IDs if you need to sync the data to a remote server but they're usually not useful in Core Data.
The equivalent of your SQL query would be, I think:
Fetch the single instance of Resource (the first example) using an NSPredicate that matches one or more of its properties.
When you have that instance, ask it for the value of its relationship mapping (or whatever you call it), and you'll get the related object from the other entity.
Beyond that, Apple provides extensive, detailed documentation with code samples that will help.

Given 2 multivariate datasets, identify records representing the same entity, which differ slightly

Let's take the example of having 2 data sources, with data sizes "m" and "n" respectively. Both datasets are SQL tables having the same schema, but different data. Our goal is to "flag" fuzzy-matches (between the datasets) that are similar enough to consider "identical".
CREATE TABLE player(
  id Integer,
  fname VARCHAR(64),
  lname VARCHAR(64),
  birth_dt datetime,
  weight Integer
)
While the majority of total combinations (m*n) will not be matches, we would like to flag "similar" matches like the following:
{"fname": "John", "lname": "Smith", "birth_dt": "6/6/91", "weight": 220}
{"fname": "Jack", "lname": "Smith", "birth_dt": "6/6/91", "weight": 210}
Are there any tools (open-sourced or not) that do a great job of identifying and flagging these "matches"?
This is a problem of "record linkage", and that keyword will help you find a large literature on the problem.
The open-source Python library dedupe provides one comprehensive approach.
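To give a feel for what such tools automate, below is a minimal hand-rolled pairwise scorer using only the Python standard library. It is not the dedupe API; the field weights and threshold are arbitrary assumptions, and a real record-linkage tool adds blocking and learned weights on top of this idea:

from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_pair(rec1, rec2):
    """Weighted similarity across the player fields (weights are assumptions)."""
    name = 0.5 * similarity(rec1["fname"], rec2["fname"]) + \
           0.5 * similarity(rec1["lname"], rec2["lname"])
    dob = 1.0 if rec1["birth_dt"] == rec2["birth_dt"] else 0.0
    weight = max(0.0, 1.0 - abs(rec1["weight"] - rec2["weight"]) / 50.0)
    return 0.5 * name + 0.3 * dob + 0.2 * weight

a = {"fname": "John", "lname": "Smith", "birth_dt": "6/6/91", "weight": 220}
b = {"fname": "Jack", "lname": "Smith", "birth_dt": "6/6/91", "weight": 210}

score = score_pair(a, b)
print(round(score, 3), "candidate match" if score >= 0.75 else "not a match")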
