Processing webserver log data using Kafka for messaging and KSQL for processing - ksqldb

I am trying to find the position of a substring within a main string, but I could not find any function for this in ksqlDB. Could anyone please suggest a function that returns the position of a substring?
3/11/20 1:32:02 PM CDT, 00005, {"value":" 472 Dynamic 11 SQL 0 Start=2020/03/11 05:51:05.730 MOdelName: SELECT DISTINCT "TIME_DIM".YEAR_DESC, "TIME_DIM".LEVEL1_KEY FROM "TEST_TIME_DIM" WHERE (("TIME_DIM".HIER_FLAG_TEST = 59) AND ("TIME_DIM".LEVEL1_KEY = "TIME_DIM".UNIFORM_KEY_TEST) AND ("TIME_DIM".LEVEL2_KEY_TEST IS NULL) AND ("TIME_DIM".LEVEL3_KEY_TEST IS NULL) AND ("TIME_DIM".LEVEL5_KEY_TEST IS NULL)) ORDER BY "TIME_DIM".YEAR_DESC_TEST ASC, "TIME_DIM".LEVEL1_KEY_TEST ASC [nodeid=1]"}
In the above message I want to extract the entire SQL, and I am trying to find the position of the substring "SQL" so that, based on that position, I can extract the SQL.

INSTR will find the position of a substring in a string.
SUBSTRING will allow you to extract a substring, based on position.
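For example, combining the two (a sketch in ksqlDB; the stream log_stream and column message are hypothetical names, not from your setup):
-- INSTR returns the 1-based position of 'SQL' in the message (0 if absent);
-- SUBSTRING then extracts everything from that position to the end.
SELECT SUBSTRING(message, INSTR(message, 'SQL')) AS sql_text
FROM log_stream
EMIT CHANGES;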

Related

Duplicates in the result of a subquery

I am trying to count distinct sessionIds from a measurement. sessionId being a tag, I count the distinct entries in a "parent" query, since distinct() doesn't work on tags.
In the subquery, I use a group by sessionId limit 1 to still benefit from the index (if there is a more efficient technique, I'm all ears, but I'd still like to understand what's going on).
I have those two variants:
> select count(distinct(sessionId)) from (select * from UserSession group by sessionId limit 1)
name: UserSession
time count
---- -----
0 3757
> select count(sessionId) from (select * from UserSession group by sessionId limit 1)
name: UserSession
time count
---- -----
0 4206
To my understanding, those should return the same number, since group by sessionId limit 1 already returns distinct sessionIds (in the form of groups).
And indeed, if I execute:
select * from UserSession group by sessionId limit 1
I have 3757 results (groups), not 4206.
In fact, as soon as I put this in a subquery and re-select fields in a parent query, some sessionIds have multiple occurrences in the final result. Not all of them, since there are 17549 rows in total, but some are.
This is a sign that the limit 1 is somewhat working, but some sessionIds still get multiple entries when re-selected. Maybe some kind of undefined behaviour?
I can confirm that I get the same result.
In my experience using nested queries does not always deliver what you expect/want.
Depending on how you use this you could retrieve a list of all values for a tag with:
SHOW TAG VALUES FROM UserSession WITH KEY=sessionId
Or to get the cardinality (number of distinct values for a tag):
SHOW TAG VALUES EXACT CARDINALITY FROM UserSession WITH KEY=sessionId
This will return a single row with a single column, count, containing a number. You can remove the EXACT modifier if the result doesn't have to be exact: see SHOW TAG VALUES CARDINALITY in the Influx documentation.
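With the data from your question, the output would look something like this (illustrative, reusing the distinct count of 3757 from above):
> SHOW TAG VALUES EXACT CARDINALITY FROM UserSession WITH KEY=sessionId
name: UserSession
time count
---- -----
0 3757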

Handcrafted OData queries on Exact Online with Invantive

We are currently running a number of hand-crafted and optimized OData queries on Exact Online using Python. These run on several thousand divisions. However, I want to migrate them to Invantive SQL for ease of maintenance.
But some of the optimizations, like an explicit orderby in the OData query, are not forwarded to Exact Online by Invantive SQL; it just retrieves all data, or the top x, and then does the orderby itself.
Especially for maximum-value determination that can be a lot slower.
A simple sample on a small table:
https://start.exactonline.nl/api/v1/<<division>>/financial/Journals?$select=BankAccountIBAN,BankAccountDescription&$orderby=BankAccountIBAN desc&$top=5
Is there an alternative to optimize the actual OData queries executed by Invantive SQL?
You can either use the Data Replicator or send the hand-crafted OData query through a native platform request, such as:
insert into NativePlatformScalarRequests
( url
, orig_system_group
)
select replace('https://start.exactonline.nl/api/v1/{division}/financial/Journals?$select=BankAccountIBAN,BankAccountDescription&$orderby=BankAccountIBAN desc&$top=5', '{division}', code)
, 'MYSTUFF-' || code
from systempartitions#datadictionary
limit 100 /* First 100 divisions. */
create or replace table exact_online_download_journal_top5#inmemorystorage
as
select jte.*
from ( select npt.result
from NativePlatformScalarRequests npt
where npt.orig_system_group like 'MYSTUFF-%'
and npt.result is not null
) npt
join jsontable
( null
passing npt.result
columns BankAccountDescription varchar2 path 'd[0].BankAccountDescription'
, BankAccountIBAN varchar2 path 'd[0].BankAccountIBAN'
) jte
From here on you can use the in-memory table, such as:
select * from exact_online_download_journal_top5#inmemorystorage
But of course you can also 'insert into sqlserver'.
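For instance, a rough sketch of that last option (the sqlserver data container alias and the target table name are assumptions, not taken from your setup):
insert into journal_top5@sqlserver
select *
from exact_online_download_journal_top5#inmemorystorage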

Transform SQL JOIN SELECT to Esper EPL syntax

Let's consider a simple object with the same representation in a SQL database, with properties (columns): Id, UserId, Ip.
I would like to prepare a query that generates an event in case one user logs in from 2 IP addresses (or more) within a 1 hour period.
My SQL looks like:
SELECT id,user_id,ip FROM w_log log
LEFT JOIN
(SELECT user_id, count(distinct ip) AS ip_count FROM w_log GROUP BY user_id) ips
ON log.user_id = ips.user_id
WHERE ips.ip_count > 1
Transformation to EPL:
SELECT * FROM LogEntry.win:time(1 hour) logs LEFT INNER join
(select UserId,count(distinct Ip) as IpCount FROM LogEntry.win:time(1 hour)) ips
ON logs.UserId = ips.UserId where ips.IpCount>1
Exception:
Additional information: Incorrect syntax near '(' at line 1 column 100,
please check the outer join within the from clause near reserved keyword 'select'
UPDATE:
I was successfully able to create a schema, a named window, and insert data into it (or update it). I would like to increase the counter when a new LogEvent arrives in the .win:time(10 seconds) window and decrease it when the event is leaving the 10 seconds window. Unfortunately istream() doesn't seem to provide the true/false when the event is in the remove stream.
create schema IpCountRec as (ip string, hitCount int)
create window IpCountWindow.win:time(10 seconds) as IpCountRec
on LogEvent.win:time(10 seconds) log
merge IpCountWindow ipc
where ipc.ip = log.ip
when matched and istream()
then update set hitCount = hitCount + 1
when matched and not istream()
then update set hitCount = hitCount - 1
when not matched
then insert select ip, 1 as hitCount
Is there something I missed?
In EPL I don't think it is possible to put a query into the from-part. You can restructure it using "insert into". A named window or a table is also an EPL alternative.
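For example, a minimal sketch along those lines, reusing the LogEntry event and win:time window from your question (the UserIpCount stream name is made up): aggregate the distinct IPs per user into a derived stream with insert into, then filter that stream instead of joining a subquery.
// feed a derived stream instead of using a subquery in the from-part
insert into UserIpCount
select UserId, count(distinct Ip) as IpCount
from LogEntry.win:time(1 hour)
group by UserId;
// fires whenever a user's distinct-IP count within the hour exceeds 1
select * from UserIpCount where IpCount > 1;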

Faster search for records where 1st character of field doesn't match [A-Za-z]?

I currently have the following:
User (id, fname, lname, deleted_at, guest)
I can query for a list of users by their fname initial like so:
User Load (9.6ms) SELECT "users".* FROM "users" WHERE (users.deleted_at IS NULL) AND (lower(left(fname, 1)) = 's') ORDER BY fname ASC LIMIT 25 OFFSET 0
This is fast thanks to the following index:
CREATE INDEX users_multi_idx
ON users (lower(left(fname, 1)), fname)
WHERE deleted_at IS NULL;
What I want to do now is be able to query for all Users whose fname does not start with the letters A-Z. I got this to work like so:
SELECT "users".* FROM "users" WHERE (users.deleted_at IS NULL) AND (lower(left(fname, 1)) ~ E'^[^a-zA-Z].*') ORDER BY fname ASC LIMIT 25 OFFSET 0
But the problem is this query is very slow and does not appear to be using the index that makes the first query fast. Any suggestions on how I can elegantly make the 2nd query (non a-z) faster?
I'm using Postgres 9.1 with rails 3.2
Thanks
Updated answer
Preceding question here.
My first idea (an index with text_pattern_ops) did not work with the regular expression in my tests. Better to rewrite your query to:
SELECT *
FROM users
WHERE deleted_at IS NULL
AND (lower(left(fname, 1)) < 'a' COLLATE "C"
  OR lower(left(fname, 1)) > 'z' COLLATE "C")
ORDER BY fname
LIMIT 25 OFFSET 0;
Besides these expressions generally being faster, your regular expression also had capital letters in it, which did not match the index with lower(). And the trailing characters were pointless when comparing to a single char.
And use this index:
CREATE INDEX users_multi_idx
ON users (lower(left(fname, 1)) COLLATE "C", fname)
WHERE deleted_at IS NULL;
The COLLATE "C" part is optional and only contributes a very minor gain in performance. It's purpose is to reset collation rules to default posix collation, which just uses byte order and is generally faster. Useful, where collation rules are not relevant anyway.
If you create the index with it, only queries that match the collation can use it. So you might just skip it to simplify things if performance is not your paramount requirement.
As an alternative to @ErwinBrandstetter's general solution, PostgreSQL supports partial indexes. You can say:
CREATE INDEX users_nonalphanumeric_not_deleted_key
ON users (id)
WHERE (users.deleted_at IS NULL) AND (lower(left(fname, 1)) ~ E'^[^a-zA-Z].*');
This index won't help for any other lookups, but it will precompute the answer for this particular query. This technique is often useful for queries that return a small, predefined subset from a much larger table, since the resulting index will disregard the vast majority of the table and contain only the rows of interest.
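To verify the planner actually picks it up, you can inspect the plan for the original query (the exact output will vary):
EXPLAIN SELECT "users".* FROM "users"
WHERE (users.deleted_at IS NULL)
AND (lower(left(fname, 1)) ~ E'^[^a-zA-Z].*')
ORDER BY fname ASC LIMIT 25 OFFSET 0;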

Linq to Sql and T-SQL Performance Discrepancy

I have an MVC web site that presents a paged list of data records from a SQL Server database. The UI allows the user to filter the returned data on a number of different criteria, e.g. email address. Here is a snippet of code:
Stopwatch stopwatch = new Stopwatch();
var temp = SubscriberDB
.GetSubscribers(model.Filter, model.PagingInfo);
// Inspect SQL expression here
stopwatch.Start();
model.Subscribers = temp.ToList();
stopwatch.Stop(); // 9 seconds plus compared to < 1 second in Query Analyzer
When this code is run, the StopWatch shows an execution time of around 9 seconds. If I capture the generated SQL expression (just before it is evaluated with the .ToList() method) and cut and paste that as a query into SQL Server Management Studio, the execution time drops to less than 1 second. For reference, here is the generated SQL expression:
SELECT [t2].[SubscriberId], [t2].[Email], [t3].[Reference] AS [DataSet], [t4].[Reference] AS [DataSource], [t2].[Created]
FROM (
SELECT [t1].[SubscriberId], [t1].[SubscriberDataSetId], [t1].[SubscriberDataSourceId], [t1].[Email], [t1].[Created], [t1].[ROW_NUMBER]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY [t0].[Email], [t0].[SubscriberDataSetId]) AS [ROW_NUMBER], [t0].[SubscriberId], [t0].[SubscriberDataSetId], [t0].[SubscriberDataSourceId], [t0].[Email], [t0].[Created]
FROM [dbo].[inbox_Subscriber] AS [t0]
WHERE [t0].[Email] LIKE '%_EMAIL_ADDRESS_%'
) AS [t1]
WHERE [t1].[ROW_NUMBER] BETWEEN 0 + 1 AND 0 + 20
) AS [t2]
INNER JOIN [dbo].[inbox_SubscriberDataSet] AS [t3] ON [t3].[SubscriberDataSetId] = [t2].[SubscriberDataSetId]
INNER JOIN [dbo].[inbox_SubscriberDataSource] AS [t4] ON [t4].[SubscriberDataSourceId] = [t2].[SubscriberDataSourceId]
ORDER BY [t2].[ROW_NUMBER]
If I remove the email filter clause, then the controller's StopWatch returns a similar response time to the SQL Management Studio query, less than 1 second - so I am assuming that the basic interface to SQL plumbing is working correctly and that the problem lies with the evaluation of the Linq expression. I should also mention that this is quite a large database with upwards of 1M rows in the subscriber table.
Can anyone throw any light on why there should be such a high (x10) performance differential and what, if anything can be done to address this?
Well, I'm not sure about that. A full LIKE over 1M rows can take quite some time. Is Email indexed? Can you run the query with Email% instead of %Email% and see what happens?
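To illustrate the difference (a sketch; the index name is made up, the table is from your generated SQL):
-- An index on Email lets a prefix pattern use an index seek.
CREATE INDEX IX_inbox_Subscriber_Email ON dbo.inbox_Subscriber (Email);
-- Sargable: trailing wildcard only, can seek on the prefix.
SELECT SubscriberId FROM dbo.inbox_Subscriber WHERE Email LIKE 'smith%';
-- Not sargable: the leading wildcard forces a scan of every row.
SELECT SubscriberId FROM dbo.inbox_Subscriber WHERE Email LIKE '%smith%';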
