InfluxDB mathematics across measurements

I have two measurements in my InfluxDB, say, mem_used and mem_available.
I tried to query across those measurements and do the math with
SELECT mean("mem_used_value") / mean("mem_available_value") FROM
(
    SELECT mean("value") AS "mem_used_value",
           mean("value") AS "mem_available_value"
    FROM "dbname"."autogen"."mem_used",
         "dbname"."autogen"."mem_available"
    GROUP BY time(1m)
)
GROUP BY time(1m)
The result of the query is very weird, and I was wondering whether it's possible for InfluxDB to perform math across measurements.
I did some research on this feature and found that issue 3552, Mathematics across measurements, is still open, even though it was requested three years ago.
Is there any way to do this? Any advice is welcome.

There are no JOINs in InfluxQL.
Please remember: this is NOT a relational DB; the query language may look familiar, but it is a totally different thing.
Here's what you can do.
1) The smartest and most legitimate way: shape your measurement properly (see the sketch after these two options).
Currently you didn't: there should not be two measurements, but one, like this (in line protocol notation):
memusage,host=yourhost,othertag=something,yetanothertag=anything mem_used=123,mem_available=321 yourtimestamp
2) Use Kapacitor to join your measurements.
There, you can do the math right in Kapacitor, or simply write the result of the join back into a single measurement and later do your aggregations in plain InfluxQL.
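As a rough illustration of option 1 (this sketch is not from the answer above; it assumes the influxdb-python client, and the host, database, tag and field names are made up):

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="dbname")

# Write both values as fields of a single point in one measurement,
# instead of two separate measurements.
client.write_points([{
    "measurement": "memusage",
    "tags": {"host": "yourhost"},
    "fields": {"mem_used": 123.0, "mem_available": 321.0},
}])

# With one measurement, the math is plain InfluxQL:
result = client.query(
    'SELECT mean("mem_used") / mean("mem_available") '
    'FROM "memusage" WHERE time > now() - 1h GROUP BY time(1m)'
)
print(list(result.get_points()))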

Related

How to identify relevant columns in very wide tables using AI and Machine Learning?

I have a complex data model consisting of around a hundred tables containing business data. Some tables are very wide, up to four hundred columns. Columns can have various data types: integers, decimals, text, dates, etc. I'm looking for a way to identify relevant / important information stored in these tables.
I fully understand that business knowledge is essential to correctly process a data model. What I'm looking for are some strategies to pre-process tables and identify columns that should be taken to a later stage where analysts will actually look into them. For example, I could use data profiling and statistics to find and exclude columns that don't have any data at all, or where all records have the same value. This way I could potentially eliminate 30% of fields. However, I'm interested in exploring how AI and Machine Learning techniques could be used to identify important columns, hoping I could identify around 80% of the relevant data. I'm aware that relevant information will depend on the questions I want to ask. But even then, I hope I could narrow down the columns to simplify the manual assessment taking place in the next stage.
Could anyone provide some guidance on how to use AI and Machine Learning to identify relevant columns in such wide tables? What strategies and techniques can be used to pre-process tables and identify columns that should be taken to the next stage?
Any help or guidance would be greatly appreciated. Thank you.
F.
The most common approach I've seen to evaluate the analytical utility of columns is the correlation method. This would tell you if there is a relationship (positive or negative) among specific column pairs. In my experience you'll be able to more easily build analysis outputs when columns are correlated - although, these analyses may not always be the most accurate.
However, before you even do that, like you indicate, you would probably need to narrow down your list of columns using much simpler methods. For example, you could surely eliminate a whole bunch of columns based on datatype and basic count statistics.
Less common analytic data types (IDs, blobs, binary, etc.) can probably be excluded first, followed by running a simple COUNT(DISTINCT ColName) and a COUNT(*) WHERE ColName IS NULL. This will help eliminate unique IDs, keys, and other similar columns. If all the rows are distinct, the column is not a good field for analysis. The same goes for NULLs: if the percentage of nulls is greater than some threshold, you can eliminate those columns as well.
To automate this, depending on your database, you could create a fairly simple stored procedure or function that loops through all the tables and columns and runs a data type check, a distinct count, and a null-percentage analysis on each field.
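As a rough sketch of that kind of automation in Python/pandas rather than a stored procedure (the connection string, table name, and thresholds below are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/dbname")  # placeholder
df = pd.read_sql_table("some_wide_table", engine)             # placeholder

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "distinct_ratio": df.nunique() / len(df),  # close to 1.0 suggests an ID or key
    "null_ratio": df.isna().mean(),            # close to 1.0 suggests an empty column
})

# Keep columns that are neither (almost) all-distinct nor mostly null.
candidates = profile[(profile["distinct_ratio"] < 0.95) & (profile["null_ratio"] < 0.5)]
print(candidates)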
Once you've narrowed down the list of columns, you can consider a .corr() function to run the analysis against all the remaining columns in something like a Python script.
If you wanted to keep everything in the database, Postgres also supports a corr() aggregate function, but you'll only be able to run this on 2 columns at a time, like this:
SELECT corr(column1,column2) FROM table;
so you'll need to build a procedure that evaluates multiple columns at once.
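In pandas this can cover all the remaining columns at once; continuing the hypothetical df and candidates from the sketch above:

numeric = df[candidates.index].select_dtypes("number")
corr_matrix = numeric.corr()  # pairwise Pearson correlation across every remaining numeric column

# Flag strongly related pairs; the 0.7 cutoff is an arbitrary placeholder.
strong = (corr_matrix.abs() > 0.7) & (corr_matrix.abs() < 1.0)
print(corr_matrix.where(strong).stack())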
I've thought about this technical challenge for some time. In general it's an AI-solvable problem, since there are easy features to extract, such as unique values, clustering, distribution, etc.
We want to bake this ability into https://columns.ai. Obviously we haven't gotten there yet; the first step we have taken is to collect stats for all columns when a data source is connected, identify columns that have a similar range of unique values, and generate a set of query templates for users to explore their dataset.
If you're interested, please take a look; as we keep advancing this part, it will get closer to an AI model that finds relevant columns. Cheers!

Is Google Sheet Pivot Table Generally Slower Than QUERY Function? What are the best practices?

It seems that the Google Sheets pivot table is slower than running a QUERY function. Does anyone know if that's true? Is it generally better to use QUERY than a pivot table?
My pivot table draws on about 8,000 rows by 26 columns and also has about 4-5 calculated fields. Is QUERY generally faster than a pivot table? A pivot table is a bit easier to build than complicated QUERY formulas. I'm not sure if the pivot table's slowness is due to the calculated fields.
Are there any good/best practices?
Timing both with Apps Script shows no significant difference between those methods, which probably means that any difference you see comes from the frontend. There are no official best practices for frontend efficiency, but in general trying not to have massive amounts of data is important. To handle large amounts of data it is actually better to use other services that are more specific to those cases (which one depends on how you would use it).

how to permanently merge two measurements in InfluxDB

I have a set of temperature sensors in my house and I've just upgraded them to new devices.
They record exactly the same way into my InfluxDB (i.e. temperature and humidity every 15 mins), but the device names are different in InfluxDB.
I'm keen not to lose years of history, so I'd like to rename all my historical records from TempSensor to the new name, which is ESP_TempSensor (and thus merge the records). There's no overlap, as I literally swapped the devices, and the data format is identical.
I've googled and I know InfluxDB doesn't seem very capable at joins and other simple things, but in this case I'm happy to manually and permanently merge the datasets.
Any pointers/help much appreciated!
You can use the INTO clause:
SELECT * INTO ESP_TempSensor FROM TempSensor GROUP BY *
Make sure you include the GROUP BY *, otherwise InfluxDB will convert all tags into field keys.
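If you want to script the copy and sanity-check it, here's a minimal sketch assuming the influxdb-python client (the host, database, and field names are made up):

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="home_sensors")  # placeholders

# Copy everything across, keeping tags as tags thanks to GROUP BY *.
client.query('SELECT * INTO "ESP_TempSensor" FROM "TempSensor" GROUP BY *')

# Print point counts per measurement to sanity-check that the history came over
# before deciding what to do with the old series.
for m in ("TempSensor", "ESP_TempSensor"):
    count = client.query('SELECT COUNT("temperature") FROM "{}"'.format(m))
    print(m, list(count.get_points()))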

How to get descriptive statistics on questionnaire items by group using SPSS?

I have carried out an evaluation of a product using a Likert-scale questionnaire and imported the data into SPSS. I have my columns arranged as follows:
ID, Group, Q1, Q2, Q3, Q4
I have two different groups completing the questionnaire, with each person having a different numerical ID. Under the Q columns, I have the score given by that person (from 1-5) on the Likert scale.
In all there are over 300 responses.
I am running analysis using 'descriptive statistics/frequencies' from the menubar and not getting the tables I am looking for. Basically, it is including all respondents together, whereas I would like it to compare the two groups in the tables.
How can I get descriptive statistics on questionnaire items by group using SPSS?
In addition, if you have any further tips as to what analysis I could perform on this type of data in SPSS I'd be most grateful. I'd like to show that there isn't a significant difference in opinions between the groups, and from looking at the data, it appears that this is the case.
One option:
1) split the file by group
2) run descriptive statistics as usual
See this SPSS FAQ item from UCLA on how to analyze data by categories.
The short answer to your question is that crosstabs Q1 to Q4 by group. will produce the table you want. Or, if you have the CTABLES package available, a more compact table will be produced by:
variable level group_id Q1 to Q4 (nominal).
ctables
/table Q1 + Q2 + Q3 + Q4 by group_id.
Either can be elaborated on to produce other statistics if wanted. It seems to me a chi-square test would be sufficient for your question.
As far as further analysis it is a bit of an open-ended question that needs more focus to be able to effectively answer. I frequently suggest visual exploration for such exploratory analysis, and hence I would suggest perusing this question on the site, Visualizing Likert responses using R or SPSS for potential ideas about how to visualize the responses. Another motivating post may be How to visualize 3D contingency matrix?.
There are a ton of other questions related to analyzing likert responses on this site though, and it is difficult to give any more specific advice without a more specific motivation for the analysis.
While the above answers all have their good points, I usually prefer this procedure (type the following into a syntax window and Run):
means q1 to q4 by group/stat anova.
This will give you group means, sample sizes, and standard deviations as well as tests of the difference in means between the groups, for each of the variables Q1 to Q4. Of course, the tests will only give you valid results to the extent that your data meet the standard assumptions of anova. Some may say that variables measured on an ordinal 1-5 scale are not suitable for anova, and in academic contexts this is often true, but in business contexts most people are willing to sacrifice some rigor for the sake of convenience. It's much more convenient to compare 4x2=8 means than it is to compare the distributions of 4x5x2=40 categories of responses.
This can easily be done by using the "Crosstabs" function in SPSS for Windows:
Analyze --> Descriptive Statistics --> Crosstabs. Move the dependent variable(s) into the "Row(s)" box, then move the grouping variable into the "Column(s)" box, then click OK.

Can 2 Cubes in a Data Warehouse be directly compared against each other?

Is there a way to compare all information (aggregates, down to the detail level) between two OLAP cubes? For example, say I wanted to compare one cube created to work with SQL Server 2000 to that same cube migrated to run on SQL Server 2005/2008. Technically they should both return the same information for all dimension / measure combinations, but I need a way to verify that.
I am definitely NOT a developer, but I do have access to Enterprise Manager, and potentially SAS tools etc., and I know a bit of SQL but not much else. I know that you can compare two-dimensional (i.e. table) data sets with SQL queries, and also with SAS, but I have never heard of a way to compare three-dimensional cubes.
Am I out of luck on this one? The last thing that I want to have to do is view both cubes and compare all possible results side by side via excel or something, I hope that it can be automated somehow.
Comparing cubes means doing enough "slice-and-dice" queries to prove that you've queried all of the facts.
You can simply get a sum and count of the various fact and dimension tables. If those are the same, odds are good that any particular query will be the same between the two.
Without details on the dimensions and facts in question, it's hard to make a more specific recommendation.
However, consider that you can easily compute a set of subtotals for each dimension of the cube. If the dimensions are the same number of rows, the results will be the same number of rows. If the grand total is the same, then all that's left is row-by-row comparison of the subtotals.
If you do this once for each dimension, you should have some confidence that they're the same. Or, you'll find a difference that you can explore with more detailed queries.
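One way to automate that subtotal comparison, assuming both fact tables are reachable with plain SQL (the connection strings, table, and column names below are placeholders):

import pandas as pd
from sqlalchemy import create_engine

old = create_engine("mssql+pyodbc://user:pass@olap2000_dsn")  # placeholder DSN
new = create_engine("mssql+pyodbc://user:pass@olap2008_dsn")  # placeholder DSN

# Same subtotal query against both sources: one row per dimension member.
query = """
    SELECT dim_key, COUNT(*) AS row_count, SUM(measure_value) AS subtotal
    FROM fact_table
    GROUP BY dim_key
"""
a = pd.read_sql(query, old).set_index("dim_key")
b = pd.read_sql(query, new).set_index("dim_key")

# Row-by-row comparison: members missing on one side or with differing
# counts/subtotals end up in the mismatch frame.
merged = a.join(b, lsuffix="_old", rsuffix="_new", how="outer")
mismatch = merged[(merged["row_count_old"] != merged["row_count_new"]) |
                  (merged["subtotal_old"] != merged["subtotal_new"])]
print("subtotals match" if mismatch.empty else mismatch)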
The best approach is to compare the cube data by interchanging the rows and columns and verifying that all the counts and totals match properly.
For example, if you have year-wise totals for a particular location, it is a good approach to interchange the locations and the months and verify whether they still match.
