Comparing search results from two separate searches - comparison

I am new to using Splunk and wanted some help combining two search results and organizing the output so that it displays the matching information from both searches.
What I am searching for is something like the following (I had to edit some of the info for security):
index=INDEX sourcetype=SOURCETYPE authresult (UNIQUEID)
This provides several events with the fields I need, but I have to compare the UNIQUEHASH field from this search with the same field from a similar search that uses a different UNIQUEID. I only want the UNIQUEHASH values that appear in both searches, along with how many times each one is returned.
So if I search for UNIQUEID1, I get the following number of events with the following UNIQUEHASH values.
UNIQUEHASH Times
123 10
456 20
789 30
I would like to do the same search for UNIQUEID2 which provides the following UNIQUEHASH values.
UNIQUEHASH Times
123 20
789 400
With these two searches I would like to build a simple table showing each UNIQUEHASH and how many times each UNIQUEID returned it. So in this example the UNIQUEHASH with a value of 456 isn't included because UNIQUEID2 doesn't return it at all.
UNIQUEHASH UNIQUEID1 UNIQUEID2
123 10 20
789 30 400

What you're describing can be done either with join (the more "obvious" path), or stats:
join:
index=ndx1 sourcetype=srctp1 authresult=* uniquehash=* times=* uniqueid="1"
| stats count by uniquehash times
| fields - count
| rename times as unique1
| join uniquehash
[| search index=ndx1 sourcetype=srctp1 authresult=* uniquehash=* times=* uniqueid="2"
| stats count by uniquehash times
| fields - count
| rename times as unique2 ]
Note: using join is generally not recommended; the innermost search is capped at a 60-second run time or 50k rows returned (so run the fastest/shortest search innermost).
Additionally, this gets very cumbersome if you need to do more than a couple of "uniqueid" comparisons.
stats:
index=ndx sourcetype=srctp uniquehash=* times=* uniqueid=*
| eval idkt=uniqueid+","+times
| stats values(idkt) as idkt by uniquehash
| where mvcount(idkt)>1
| mvexpand idkt
| rex field=idkt "(?<uniqueid>[^,]+),(?<times>.+)"
| table uniquehash uniqueid times
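If you want the wide layout from the original example (one column per UNIQUEID), one extra command can pivot the output of the stats search; a sketch, assuming the same uniquehash/uniqueid/times field names:
index=ndx sourcetype=srctp uniquehash=* times=* uniqueid=*
| eval idkt=uniqueid+","+times
| stats values(idkt) as idkt by uniquehash
| where mvcount(idkt)>1
| mvexpand idkt
| rex field=idkt "(?<uniqueid>[^,]+),(?<times>.+)"
| xyseries uniquehash uniqueid times
xyseries takes the row key (uniquehash), the column key (uniqueid), and the cell value (times), so the result has one column per uniqueid value.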

Related

How to get the list of maximum value for a column for a list of unique key

Let's say I have a table on Parse that has two columns: an identifier set by hand and a numeric property.
I need to write a query that gets me the maximum number on the numeric property per each unique identifier. So in the example below:
| identifier | value |
----------------------
| 1 | 10 |
| 2 | 5 |
| 1 | 7 |
| 2 | 9 |
I would expect the following output:
| identifier | value |
----------------------
| 1 | 10 |
| 2 | 9 |
Now I know Parse doesn't have anything like Group By statements, so this is probably not doable as a single query.
What alternative would you suggest in this case? I see some solutions, each with serious drawbacks:
Compose the result from multiple queries. This would require a query that gets the unique list of identifiers and then a separate query for each identifier to get the maximum value. This will probably not scale well if the table grows in size. Also, the result is not exactly consistent, as the DB can change between queries (for my use case slightly stale data is not too bad). This will heavily impact the request quota limit, as a single request can now trigger a large number of requests.
Keep a separate table that keeps track of this result. This table would have a single row for each identifier, containing the max value. For this I would need a beforeSave trigger that updates the second table. From what I've read there is no guarantee that beforeSave triggers are not executed concurrently so it's very tricky to ensure that I don't accidentally insert multiple values for the same identifier. I would probably have to run a background job that removes duplicates.
For my use case I'll need to get the data on an iOS device so network traffic is also an issue.
Given the constraints, I think your best option is to use Cloud Code afterSave events. beforeSave can cause too much extra slowdown for your users' experience.
You can get around the concurrent triggers problem by querying the max-value table before you change it and deleting any values that exist for that identifier. Something like this:
Parse.Cloud.afterSave("Yourclass", function(request) {
    var objectToSave = request.object;
    // Find any rows in the max-value table for this identifier that are now beaten
    var query = new Parse.Query("Maxvaluetableclass");
    query.equalTo("identifier", objectToSave.get("identifier"));
    query.lessThan("value", objectToSave.get("value"));
    query.find({
        success: function(results) {
            // Delete all the objects in the max-value table with smaller values
            for (var i = 0; i < results.length; i++) {
                var object = results[i];
                object.destroy();
            }
            // Record the new maximum for this identifier
            var Maxvaluetableclass = Parse.Object.extend("Maxvaluetableclass");
            var maxValueObject = new Maxvaluetableclass();
            maxValueObject.set("identifier", objectToSave.get("identifier"));
            maxValueObject.set("value", objectToSave.get("value"));
            maxValueObject.save();
        },
        error: function(error) {
            // All current values are larger, so do nothing
        }
    });
});
UPDATE: You'll notice that this improved setup is 'self-cleaning' - each time it's run, it removes all the smaller items. This means you don't have to run a background function.
@Ryan Kreager: I don't have 50 rep so I am unable to comment on previous answers.
Referring to your answer, the OP should consider how frequently this afterSave would be triggered, because if you have many records, each destroy() in the for loop counts as 1 API request, if I understood the pricing at Parse correctly:
https://www.parse.com/plans/faq
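One way to soften that per-request cost (a sketch, not from the original thread): the Parse JavaScript SDK also provides Parse.Object.destroyAll, which deletes a batch of objects in a single call, so the loop in the success callback above could be replaced with something like this, where results is the array handed to that callback:
// Batch-delete the beaten rows instead of calling destroy() once per object
Parse.Object.destroyAll(results, {
    success: function() {
        // all of the smaller max-value rows were removed in one request
    },
    error: function(error) {
        // an aggregate error here describes which objects could not be deleted
    }
});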

Getting the most recent record for each unique user in Parse using PFQuery

I'm using Parse.com and have two classes: User and Report. A User may issue several reports during a day, but I'm only interested in the most recent one. I need to get all the reports that meet specific criteria, but only the most recent one per user.
The end result is an array of Reports, where the User is unique on each one, something like this:
ObjectId | ReportedValue | User | CreatedAt
1234 | 100 | aaaa | 2013-05-20T04:23:41.907Z
1235 | 100 | bbbb | 2013-04-29T05:10:41.907Z
1236 | 100 | cccc | 2013-05-20T02:14:41.907Z
1237 | 100 | dddd | 2013-05-19T04:03:41.907Z
So, User aaaa might have 20 reports, but I only need the most recent, for each user. However, I'm searching based on the ReportedValue being 100, and the desired result is the report objects, not the user, so I'd prefer not to go through every user.
Is this possible in Parse?
Consider using another object in the data model to assist with this. It would basically be a container with a relationship to Report. When any new report is saved, a bit of cloud code runs which:
Finds the previous latest Report for the associated user
Removes that report from the container relation
Adds the new report to the container relation
Working this way, your app can make a single, simple query on the relation to get all of the latest Reports; a sketch of the idea follows.
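A minimal Cloud Code sketch of that idea, using a simplified variant that keeps a pointer to the single latest Report per user rather than a relation; the LatestReport class and its user and report field names are hypothetical:
Parse.Cloud.afterSave("Report", function(request) {
    var report = request.object;
    // Find (or create) the container row for this report's user
    var query = new Parse.Query("LatestReport");
    query.equalTo("user", report.get("user"));
    query.first({
        success: function(container) {
            if (!container) {
                var LatestReport = Parse.Object.extend("LatestReport");
                container = new LatestReport();
                container.set("user", report.get("user"));
            }
            // Point the container at the newest report, replacing the previous one
            container.set("report", report);
            container.save();
        },
        error: function(error) {
            console.error(error);
        }
    });
});
Your app can then query LatestReport (with include("report"), and optionally matchesQuery against an inner Report query on ReportedValue) to pull one latest Report per user in a single request.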
From the REST API... this works provided the user's OID is in the ACL segment of the records in the class you are querying.
In addition to the other predicates of your query, Parse can limit the number of returned rows:
--data-urlencode 'limit=1' \
--data-urlencode 'skip=0' \
For the user, GET the row from the User table for the user you are querying and read its 'token' field value. Then, on your Report query, set an extra header to that session token value and you will get ONLY that user's report objects:
-H "X-Parse-Session-Token: pn..." \
With that, you will get just that user's reports,
AND
results.size = 1
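Pulled together, the REST call described above could look roughly like the following; the app id, REST key, and session token values are placeholders, and the where clause assumes ReportedValue is a numeric column on Report:
# query the Report class, newest first, one row, scoped by the user's session token
curl -X GET \
  -H "X-Parse-Application-Id: YOUR_APP_ID" \
  -H "X-Parse-REST-API-Key: YOUR_REST_KEY" \
  -H "X-Parse-Session-Token: pn..." \
  -G https://api.parse.com/1/classes/Report \
  --data-urlencode 'where={"ReportedValue":100}' \
  --data-urlencode 'order=-createdAt' \
  --data-urlencode 'limit=1' \
  --data-urlencode 'skip=0'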

How can I speed up this feed-building process, which integrates multiple model types?

I have a feed that is displayed to the user, which includes 4 different model types.
It simulates grouping the entries by day, by inserting a day object into the feed.
The feed is sorted chronologically and paginated.
This is how I currently build the feed.
def get_feed initial_feed_position=0, number_of_entries_to_get=30
  # display is called on each model to get them all into a standard format
  feed = (first_models + second_models + third_models + forth_models).map { |feed_entry| feed_entry.display }
  feed += day_entries_for_feed(feed)
  end_feed_position = initial_feed_position + number_of_entries_to_get
  (feed.sort_by { |feed_entry| -feed_entry[:comparison_time].to_i })[initial_feed_position...end_feed_position]
end

def day_entries_for_feed feed
  # iterate over the set of dates that contain feed entries
  feed.map { |feed_entry| feed_entry[:date] }.uniq.map do |day|
    # build the day object in the standard feed entry format; fields that are not relevant to this question have been left out
    {
      type: 'day',
      comparison_time: (day + 24.hours - 1.second).time # to ensure that the day appears above its corresponding entries in the feed, the comparison time is set to 1 second before the day ends
    }
  end
end
Over time, the number of objects in the system has built up, and now the feed takes a long time to build using this method. Is there a better way to do it?
I'm using Rails 3.2.13, Ruby 1.9.3 & PostgreSQL 9.1.9.
Because you're getting all the entries in the database, a lot of models are loaded into memory. To solve this problem you could look into UNION (which is a pain to maintain, and you will have to have literal SQL in your codebase). A good example of it is here: PostgreSQL: How to union 3 tables sorted by date
Another option would be to derive a base class and do the querying on that, which would result in something like this:
+-------------+
| BASE FEED |
| |
+------^------+
|
+-------------------+--------+---------+----------------+
| | | |
+-----+-------+ +------+------+ +-------+-----+ +------+-----+
| MODEL ONE | | MODEL TWO | | MODEL THREE | | MODEL FOUR |
| | | | | | | |
+-------------+ +-------------+ +-------------+ +------------+
Once you have your models set up like this, it's a simple matter of querying this base table, which could look something like this:
def get_feed(initial_feed_position = 0, number_of_entries_to_get = 30)
  feeds = BaseFeed.
    limit(number_of_entries_to_get).
    offset(initial_feed_position).
    order("DATE(date_field) DESC")
end
The above example is not the exact solution, but if you elaborate a bit more on what you're trying to get as a result set I can adjust it; it's more about the approach to take.
Hope this helps.
Solution without changing the DB:
The reason your code is getting slower is that you query all the objects and then take only the requested 30 (number_of_entries_to_get).
Because it's a feed, we can assume that most of the time users will look at the first few pages.
Instead of taking all the first_models/second_models etc., you can take just the newest end_feed_position records for each model straight from the DB (ordered by date).
Something like:
models = FirstModel.order("date DESC").limit(end_feed_position)
models += SecondModel.order("created_at DESC").limit(end_feed_position)
So if, for example, you are on page 2 of a feed paginated by 30:
You only query 240 objects from the DB ((first_models + second_models + third_models + forth_models) * 60), and positions 30..60 within those 240 are guaranteed to be the 30..60 of all the objects (so it won't get slower as the DB grows).
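Put together, the whole method could look something like the sketch below; the model class names mirror the placeholder names in the question, display and day_entries_for_feed are the methods already shown above, and created_at stands in for whatever timestamp column the feed actually sorts on:
def get_feed initial_feed_position=0, number_of_entries_to_get=30
  end_feed_position = initial_feed_position + number_of_entries_to_get

  # only load the newest rows that could possibly land on the requested page
  entries  = FirstModel.order("created_at DESC").limit(end_feed_position).to_a
  entries += SecondModel.order("created_at DESC").limit(end_feed_position).to_a
  entries += ThirdModel.order("created_at DESC").limit(end_feed_position).to_a
  entries += ForthModel.order("created_at DESC").limit(end_feed_position).to_a

  feed  = entries.map { |feed_entry| feed_entry.display }
  feed += day_entries_for_feed(feed)
  feed.sort_by { |feed_entry| -feed_entry[:comparison_time].to_i }[initial_feed_position...end_feed_position]
end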

How to get the first elements of COLLECT without limiting the global query?

In a Twitter-like app, I would like to get only the last 3 USERS which have PUBLISHed a tweet for particular HASHTAGs (A,B,C,D,E).
START me=node(X), hashtag=node(A,B,C,D,E)
MATCH hashtag-[:USED_IN]->tweet<-[p:PUBLISH]-user-[:FRIEND_OF]->me
WITH p.date? AS date, hashtag, user ORDER BY date DESC
WITH hashtag, COLLECT(user.name) AS users
RETURN hashtag._id, users;
This is the result I get with this query. This is good but if the friend list is big, I could have a very large array in the second column.
+-------------------------------------------+
| hashtag | users |
+-------------------------------------------+
| "paradis" | ["Alexandre","Paul"] |
| "hello" | ["Paul"] |
| "public" | ["Alexandre"] |
+-------------------------------------------+
If I add a LIMIT clause, at the end of the query, the entire result set is limited.
Because a user can have a very large number of friends, I do not want to get back all those USER nodes, but only the last 2 or 3 which have published in those hashtags.
Is there any solution with filter/reduce to get what I expect?
Running neo4j 1.8.2
Accessing sub-collections will be worked on; meanwhile you can use this workaround: http://console.neo4j.org/r/f7lmtk
start n=node(*)
where has(n.name)
with collect(n.name) as names
return reduce(a=[], x in names : a + filter(y in [x] : length(a)<2)) as two_names
reduce is used to build up the result list in the accumulator, and filter is used instead of the conditional case ... when ..., which is only available in 2.0.
filter(y in [x] : length(a)<2) returns a list containing the element when the condition is true and an empty list when it is false.
Adding that result to the accumulator with reduce builds up the list incrementally.
Be careful, the new filter syntax is:
filter(x IN a.array WHERE length(x)= 3)

Comparing values in two columns of two different Splunk searches

I am new to Splunk and facing an issue comparing values in two columns from two different queries.
Query 1
index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**" A_to="**" A_from="**" | transaction call_id keepevicted=true | search "xyz event:" | table _time, call_id, A_from, A_to | rename call_id as Call_id, A_from as From, A_to as To
Query 2
index="abc_ndx" source="*/ jkdhgsdjk.log" call_id="**" B_to="**" B_from="**" | transaction call_id keepevicted=true | search " xyz event:"| table _time, call_id, B_from, B_to | rename call_id as Call_id, B_from as From, B_to as To
These are my two different queries. I want to compare each value in the A_from column with each value in the B_from column, and if a value matches, display those matching values of A_from.
Is it possible?
I have run the two queries separately, exported the results of each to CSV, and used the VLOOKUP function. But the problem is that there is a limit of 10,000 rows that can be exported, so I miss a lot of data because my search returns more than 10,000 records.
Any help?
Haven't got any data to test this on at the moment; however, the following should point you in the right direction.
When you have the table for the first query sorted out, you should 'pipe' the search string to an appendcols command containing your second search string. This command lets you run a subsearch and "import" its columns into your base search.
Once you have the two columns in the same table, you can use the eval command to create a new field which compares the two values and assigns a value as you desire; see the sketch after the documentation links below.
Hope this helps.
http://docs.splunk.com/Documentation/Splunk/5.0.2/SearchReference/Appendcols
http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Eval
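Not tested against your data, but a sketch of what that could look like using the field names from your question; note that appendcols pastes the subsearch's columns onto the base results row by row, so both searches need to return their rows in the same order for the comparison to line up:
index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**" A_to="**" A_from="**"
| transaction call_id keepevicted=true
| search "xyz event:"
| table _time, call_id, A_from, A_to
| appendcols
    [ search index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**" B_to="**" B_from="**"
      | transaction call_id keepevicted=true
      | search "xyz event:"
      | table B_from, B_to ]
| eval matched_from=if(A_from == B_from, A_from, null())
| table _time, call_id, matched_from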
I'm not sure why there is a need to keep this as two separate queries. Everything is coming from the same sourcetype, and is using almost identical data. So I would do something like the following:
index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**" (A_to="**" A_from="**") OR (B_to="**" B_from="**")
| transaction call_id keepevicted=true
| search "xyz event:"
| eval to=if(A_from == B_from, A_from, "no_match")
| table _time, call_id, to
This grabs all events from your specified sourcetype and index which have a call_id and either A_to and A_from or B_to and B_from. Then it runs transaction over all of that and lets you filter based on "xyz event:" (whatever that is).
Then it creates a new field called 'to' which shows A_from when A_from == B_from; otherwise it shows "no_match" (a placeholder, since you didn't specify what should be done when they don't match).
There is also a way to potentially tackle this without using transactions. Although without more details into the underlying data, I can't say for sure. The basic idea is that if you have a common field (call_id in this case) you can just use stats to collect values associated with that field instead of an expensive transaction command.
For example:
index="abc_ndx" index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**"
| stats last(_time) as earliest_time first(A_to) as A_to first(A_from) as A_from first(B_to) as B_to first(B_from) as B_from by call_id
Using first() or last() doesn't actually matter if there is only one value per call_id (you can even use min(), max(), or avg() and you'll get the same thing). Perhaps this will help you get to the output you need more easily.
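If you go the stats route, the comparison from the first answer can be bolted straight onto that output; a sketch reusing the same field names:
index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**"
| stats last(_time) as earliest_time first(A_to) as A_to first(A_from) as A_from first(B_to) as B_to first(B_from) as B_from by call_id
| eval matched_from=if(A_from == B_from, A_from, "no_match")
| table earliest_time, call_id, matched_from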
