Restrict Sumo Logic search to one timeslice bucket

I have logs being pushed to Sumo Logic once every day, but other co-workers have the ability to force a push to update statistics. This causes an issue where some Sumo Logic searches will find and return double (or more) what is expected, because more than one message falls within the allocated time range.
I am wondering if there is some way I can use timeslice so that I am only looking at the last set of results within a 24h period?
My search that works when there is only one log in 24h:
| json field=_raw "Policy"
| count by policy
| sort by _count
What I am trying to achieve:
| json field=_raw "Policy"
| timeslice 1m
| where last(_timeslice)
| count by policy
| sort by _count

Found a solution, not sure if optimal.
| json field=_raw "Policy"
| timeslice 1m
| count by policy , _timeslice
| filter _timeslice in (sort by _timeslice desc | limit 1)
| sort by _count
| fields policy, _count

If I'm understanding your question right, I think you could try something with the accum operator:
*
| json field=_raw "Policy"
| timeslice 1m
| count by _timeslice, policy
| 1 as rank
| accum rank by _timeslice
| where _accum = 1
This would be similar to doing a window partition in SQL to get rid of duplicates.
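For comparison, a rough SQL sketch of that window-partition idea (table and column names here are made up for illustration, not from the question): number the rows within each partition and keep only the first one.
-- Hypothetical table "logs" with columns "policy" and "ts_bucket"; keep one
-- row per bucket by ranking rows within the partition and filtering to 1.
select policy, ts_bucket, cnt
from (
  select policy, ts_bucket, count(*) as cnt,
         row_number() over (partition by ts_bucket) as rn
  from logs
  group by policy, ts_bucket
) ranked
where rn = 1;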

Related

When working with QuestDB, are symbol columns good for performance with huge numbers of rows per symbol?

When working with regular SQL databases, indexes are useful for fetching a few rows, but not so useful when you are fetching a large amount of data from a table. For example, imagine you have a table with stock valuations of 10 stocks over time:
+------+--------+-------+
| time | stock  | value |
+------+--------+-------+
| ...  | stock1 | ...   |
| ...  | stock2 | ...   |
| ...  | ...    | ...   |
+------+--------+-------+
As far as I can tell, indexing it by stock (even with an enum/int/foreign key) is usually not very useful in a database like Postgres if you want to get data over a large period of time. You end up with an index spanning a large part of the table, and it ends up being faster for the database to do a sequential scan, for example, to get the average value over the whole dataset for each stock:
SELECT stock, avg(value) FROM stock_values GROUP BY stock
Given that QuestDB is column oriented, I would guess it would perform better to have a separate column for each stock.
So, what schema is recommended in QuestDB for a situation like this? One column for each stock, or would a symbol column for the stock symbol be as good (or good enough) even if there are millions of rows for each symbol?
A column per stock is not easy to achieve in QuestDB. If you create a table like this
+------+--------+--------+--------+
| time | stock1 | stock2 | stock3 |
+------+--------+--------+--------+
Then you'll have to insert all values together in one row, or you end up with gaps:
+------+--------+--------+--------+
| time | stock1 | stock2 | stock3 |
+------+--------+--------+--------+
| t1   | 1.1    |        |        |
| t2   |        | 3.45   |        |
| t3   |        |        | 103.45 |
+------+--------+--------+--------+
Even if t1 == t2 == t3, doing the insert as three separate operations will still result in three rows.
So symbols are a better choice here.
Symbol columns can be indexed or non-indexed, and non-indexed symbols can be beneficial when the number of distinct values is low. Whether reading the full table or reading by index is faster is a matter of index selectivity, not data range. If the selectivity is high (e.g. the distinct symbol count is, say, 10k), fetching by index is faster than a range scan.
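As a rough sketch (names are illustrative, not from the question), the symbol-based schema discussed above might look like this in QuestDB SQL; an index on the symbol column is optional and, per the above, mainly pays off when the distinct count is high.
-- One row per observation; "stock" is a SYMBOL, so each distinct name is
-- stored once internally and referenced by an integer.
CREATE TABLE stock_values (
  time TIMESTAMP,
  stock SYMBOL,   -- add INDEX here only if selectivity justifies it
  value DOUBLE
) TIMESTAMP(time) PARTITION BY DAY;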

What is the best way to attach a running total to selected row data?

I have a table that looks like this:
Created at | Amount | Register Name
--------------+---------+-----------------
01/01/2019... | -150.01 | Front
01/01/2019... | 38.10 | Back
What is the best way to attach a running total, ascending by date, to each record that applies only to that record's register name? I can do this in Ruby, but doing it in the database will be much faster as it is a web application.
The application is a Rails application running Postgres 10, although the answer can be Rails-agnostic of course.
Use the aggregate sum() as a window function, e.g.:
with my_table (created_at, amount, register_name) as (
values
('2019-01-01', -150.01, 'Front'),
('2019-01-01', 38.10, 'Back'),
('2019-01-02', -150.01, 'Front'),
('2019-01-02', 38.10, 'Back')
)
select
created_at, amount, register_name,
sum(amount) over (partition by register_name order by created_at)
from my_table
order by created_at, register_name;
created_at | amount | register_name | sum
------------+---------+---------------+---------
2019-01-01 | 38.10 | Back | 38.10
2019-01-01 | -150.01 | Front | -150.01
2019-01-02 | 38.10 | Back | 76.20
2019-01-02 | -150.01 | Front | -300.02
(4 rows)

Removing duplicates in InfluxDB

I would like to perform a query to remove duplicates. What I define as a duplicate here is a measurement with more than one data point for the same timestamp. They have different tags, so they are not overwritten by default, but I would like to remove the oldest inserted, regardless of the tags.
So for example, measurement of logins (it doesn't really make sense but it's to avoid using abstract entities):
Email   | Name    | TS        | Login Time
--------+---------+-----------+-----------
a#a.com | Alice   | xxxxx1000 | 2017-05-19
a#a.com | Alice   | xxxxx1000 | 2017-05-18
a#a.com | Alice   | xxxxx1000 | 2017-05-17
b#b.com | Bob     | xxxxx1000 | 2017-05-18
c#c.com | Charlie | xxxxx1200 | 2017-05-19
I would like to remove the second and third lines: they have the same timestamp as the first and belong to the same measurement, just with different login times, and I would like to keep only the last.
I know well that I could solve this with a query, but the requirement is more complex than this (visualization in Grafana of weird KPI data) and I need to remove actual duplicates (generated and loaded twice).
Thank you.
You can fetch all login user names using group by and then order by time, so that the latest login time comes up first, and then delete the remaining ones.
Also, you might need to copy your latest items into another measurement, since you can't remove rows in InfluxDB.
For this you might use limit 1 offset 0, so that only the latest login time comes out of the query.
Let me know if I understood it correctly.
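A hedged InfluxQL sketch of that idea, assuming the measurement is called logins: GROUP BY * groups by every tag set, ORDER BY time DESC puts the newest point first, and LIMIT 1 OFFSET 0 keeps only that newest point per series.
-- Latest point per series (per tag set); measurement name is assumed.
SELECT * FROM "logins" GROUP BY * ORDER BY time DESC LIMIT 1 OFFSET 0
The surviving points would then be written into another measurement (for example with a SELECT ... INTO query or via the client), since, as noted above, rows can't simply be removed in place.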

How do I know the query_group of a query which was run?

I need help figuring out the query_group of a query that was run on Redshift. I have set a query_group in the WLM config and want to make sure the query is getting executed from that query group.
query_group is part of the WLM (workload management) configuration, which lets you manage how queries are run through queues on the Redshift cluster. To use query_group, you have to set up your own queue with a query_group name (label) in advance, through the AWS console ([Amazon Redshift] -> [Parameter Groups] -> select parameter group -> [WLM]) or the CLI.
Here is an example, snipped from the Redshift docs.
set query_group to 'Monday';
select * from category limit 1;
...
reset query_group;
You have to set the query_group before starting the query which you want to assign to the specific queue, and reset the query_group after finishing.
You can track the queries of a query_group as follows; 'label' is the name of the query_group.
select query, pid, substring, elapsed, label
from svl_qlog where label ='Monday'
order by query;
query | pid | substring | elapsed | label
------+------+------------------------------------+-----------+--------
789 | 6084 | select * from category limit 1; | 65468 | Monday
790 | 6084 | select query, trim(label) from ... | 1260327 | Monday
791 | 6084 | select * from svl_qlog where .. | 2293547 | Monday
792 | 6084 | select count(*) from bigsales; | 108235617 | Monday
...
This document is good for understanding how WLM works and how to use it:
http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
This link is about query_group.
http://docs.aws.amazon.com/redshift/latest/dg/r_query_group.html

select distinct records based on one field while keeping other fields intact

I've got a table like this:
table: searches
+------------------------------+
| id | address | date |
+------------------------------+
| 1 | 123 foo st | 03/01/13 |
| 2 | 123 foo st | 03/02/13 |
| 3 | 456 foo st | 03/02/13 |
| 4 | 567 foo st | 03/01/13 |
| 5 | 456 foo st | 03/01/13 |
| 6 | 567 foo st | 03/01/13 |
+------------------------------+
And want a result set like this:
+------------------------------+
| id | address | date |
+------------------------------+
| 2 | 123 foo st | 03/02/13 |
| 3 | 456 foo st | 03/02/13 |
| 4 | 567 foo st | 03/01/13 |
+------------------------------+
But ActiveRecord seems unable to achieve this result. Here's what I'm trying:
Model has a 'most_recent' scope: scope :most_recent, order('date_searched DESC')
Model.most_recent.uniq returns the full set (SELECT DISTINCT "searches".* FROM "searches" ORDER BY date DESC) -- obviously the query is not going to do what I want, but neither is selecting only one column. I need all columns, but only rows where the address is unique in the result set.
I could do something like Model.select('distinct(address), date, id'), but that feels...wrong.
You could do a
select max(id), address, max(date) as latest
from searches
group by address
order by latest desc
According to sqlfiddle that does exactly what I think you want.
It's not quite the same as your required output, which doesn't seem to care about which ID is returned. Still, the query needs to specify something, and here that is done by the max aggregate function.
I don't think you'll have any luck with ActiveRecord's autogenerated query methods for this case. So just add your own query method using that SQL to your model class. It's completely standard SQL that'll also run on basically any other RDBMS.
Edit: One big weakness of the query is that it doesn't necessarily return actual records. If the highest ID for a given address doesn't correlate with the highest date for that address, the resulting "record" will be different from the one actually stored in the DB. Depending on the use case, that might or might not matter. For MySQL, simply changing max(id) to id would fix that problem, but IIRC Oracle has a problem with that.
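If returning the actual stored rows matters, one standard-SQL sketch (not tied to a particular RDBMS) is to join back on the per-address maximum date:
-- Find each address's latest date, then join back to pick up the full rows.
select s.*
from searches s
join (
  select address, max(date) as max_date
  from searches
  group by address
) latest on latest.address = s.address and s.date = latest.max_date;
Note that ties (two rows with the same address and the same latest date, like ids 4 and 6 in the sample data) would both be returned, so a tie-breaker is still needed if exactly one row per address is required.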
To show unique addresses:
Searches.group(:address)
Then you can select columns if you want:
Searches.group(:address).select('id,date')
