How to get the first elements of COLLECT without limiting the global query? - neo4j

In a Twitter-like app, I would like to get only the last 3 USERs who have PUBLISHed a tweet for particular HASHTAGs (A,B,C,D,E):
START me=node(X), hashtag=node(A,B,C,D,E)
MATCH hashtag-[:USED_IN]->tweet<-[p:PUBLISH]-user-[:FRIEND_OF]->me
WITH p.date? AS date, hashtag, user ORDER BY date DESC
WITH hashtag, COLLECT(user.name) AS users
RETURN hashtag._id, users;
This is the result I get with this query. It is good, but if the friend list is big, I could end up with a very large array in the second column.
+----------------------------------+
| hashtag   | users                |
+----------------------------------+
| "paradis" | ["Alexandre","Paul"] |
| "hello"   | ["Paul"]             |
| "public"  | ["Alexandre"]        |
+----------------------------------+
If I add a LIMIT clause at the end of the query, the entire result set is limited.
Because a user can have a very large number of friends, I do not want to get back all of those USERs, but only the last 2 or 3 who have published in those hashtags.
Is there any solution with filter/reduce to get what I expect?
Running neo4j 1.8.2

Accessing sub-collections will be worked on; meanwhile, you can use this workaround: http://console.neo4j.org/r/f7lmtk
start n=node(*)
where has(n.name)
with collect(n.name) as names
return reduce(a=[], x in names : a + filter(y in [x] : length(a)<2)) as two_names
reduce is used to build up the result list in the accumulator, and filter is used instead of the conditional case ... when ..., which is only available in 2.0.
filter(y in [x] : length(a)<2) returns a list containing the element when the condition is true and an empty list when it is false.
Adding that result to the accumulator with reduce builds up the list incrementally.
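Applied to the original query, the same trick would presumably look like this (an untested sketch, keeping at most the 3 most recent names per hashtag, and assuming reduce/filter are available on your version):
START me=node(X), hashtag=node(A,B,C,D,E)
MATCH hashtag-[:USED_IN]->tweet<-[p:PUBLISH]-user-[:FRIEND_OF]->me
WITH p.date? AS date, hashtag, user ORDER BY date DESC
WITH hashtag, COLLECT(user.name) AS users
RETURN hashtag._id,
       reduce(a=[], x in users : a + filter(y in [x] : length(a)<3)) AS users;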

Be careful, the new filter syntax is:
filter(x IN a.array WHERE length(x) = 3)
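Under that 2.0 syntax, the console workaround would presumably be rewritten as (untested sketch):
return reduce(a=[], x in names | a + filter(y in [x] WHERE length(a) < 2)) as two_names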

Related

Splunk join with an in-memory record

Sorry for the lame question, I am new to Splunk.
What I am trying to do is to join my search result with a fake record declared in the search body, something like:
index=...
| join type=outer <column>
[ | <here declare a record to join with>
......
The idea is to make sure there is at least one record in the resulting search. The following cases are expected:
the original search returns records
the original search does not return anything because the result is filtered
the original search does not return anything because the source is empty
I need to distinguish cases 2 and 3, which is what the join is for. The fake record will eliminate case 3, so I will only need to filter the result.
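For reference, one way to declare such a fake record inline would presumably be an append with makeresults (a hypothetical sketch; the index and column name are made up):
index=my_index
| append [ | makeresults | eval column="fake record" | fields - _time ]
This unconditionally appends one synthetic row, which then has to be filtered out again whenever real results exist.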
There's a better way to handle the case of no results returned. Use the appendpipe command to test for that condition and add fields needed in later commands.
| appendpipe [ stats count | eval column="The source is empty"
| where count=0 | fields - count ]
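For context, appendpipe runs the bracketed sub-pipeline over the current result set and appends its output, so the inner stats count sees however many rows the base search produced. A minimal sketch of a full pipeline (the index, sourcetype, and column field below are hypothetical):
index=my_index sourcetype=my_sourcetype
| table _time column
| appendpipe [ stats count | eval column="The source is empty"
    | where count=0 | fields - count ]
When the base search returns nothing, stats count yields a single row with count=0, the where clause keeps it, and the eval/fields steps turn it into a placeholder row, so the overall search always returns at least one result.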

How to count and compare the number of regex matches

I want to use Sumo Logic to count how often different APIs are called. I want to have a table with API call name and value. My current query is like this:
_sourceCategory="my_category"
| parse regex "GET.+443 (?<getUserByUserId>/user/v1/)\d+" nodrop
| parse regex "GET.+443 (?<getUserByUserNumber>/user/v1/userNumber)\d+"
| count by getUserByUserId, getUserByUserNumber
This gets correct values, but they go into different columns. When I have more variables, the table becomes very wide and hard to read.
I figured it out: I need to use the same group name for all regexes, like this:
_sourceCategory="my_category"
| parse regex "GET.+443 (?<endpoint>/user/v1/)\d+" nodrop
| parse regex "GET.+443 (?<endpoint>/user/v1/userNumber)\d+"
| count by endpoint

Neo4j Query to Table Problems

My problem is that I put data into Neo4j from what was essentially a large spreadsheet. Now I want to be able to get that data back out in a similar tabular format.
Let's say I have some notional spreadsheet of data that went in looking something like the following.
| Artist | Album | Song | Live | Filename | Genre | Year | Source | Label |
|--------|-------|------|------|----------|-------|------|--------|-------|
| .... | ..... | .... | .... | ........ | ..... | .... | ...... | ..... |
The spreadsheet was a listing of files with some metadata about each file. For analytic purposes it made more sense not to have the file at the center of the graph, but rather the Album, so every record in the table above maps to a handful of nodes and relationships. The data model might look something like this:
(Song)-[_IS_ON_]->(Album)
(Artist)-[_SINGS_]->(Song)
(Album)-[_IS_IN_]->(Genre)
(Song)-[_IS_IN_]->(Genre)
(Album)-[_IS_]->(Live)
(Album)-[_FROM_]->(Year)
(Album)-[_IS_ON_]->(Source)
(Label)-[_PRODUCED_]->(Album)
I am able to query a single record from my spreadsheet above using a query similar to this.
MATCH (a:Album {name: "Hells Bells"})-[r]-(b)
OPTIONAL MATCH (s:Song)<-[:_SINGS_]-(aa:Artist)
RETURN *
I have two questions here.
How do I make the above query return a table that looks similar to the original normalized table? If I do RETURN b.filename, b.genre ..., I get a table that has a lot of null values. It would seem I need to do a DISTINCT on each of the fields, but I am still really new to Neo4j and am not positive I understand how to do this.
It would be great if there were a way to get all the fields in all the nodes without having to type them out in the query like RETURN b.filename, b.genre .... I think I figured this out once, but I stupidly didn't save it.
I hope this was clear enough. I can't share my graph model or data so I had to make this up on the fly.
TIA
Try the following (but, since you did not state how to get the filename, that value might be missing):
MATCH
(artist:Artist)-[:_SINGS_]->(song:Song)-[:_IS_ON_]->(album:Album {name: "Hells Bells"})-[:_FROM_]-(year:Year),
(album)-[:_IS_IN_]->(genre:Genre),
(album)-[:_IS_]->(live:Live),
(album)-[:_IS_ON_]->(source:Source),
(label:Label)-[:_PRODUCED_]->(album)
RETURN *
In a RETURN clause, if you specify a node/relationship (without a property name), that generates a map of all its properties. The above query, for example, would return a map for each matched node.
If you actually want to have a single merged map, you can use the APOC function apoc.map.mergeList. For example:
MATCH
(artist:Artist)-[:_SINGS_]->(song:Song)-[:_IS_ON_]->(album:Album {name: "Hells Bells"})-[:_FROM_]-(year:Year),
(album)-[:_IS_IN_]->(genre:Genre),
(album)-[:_IS_]->(live:Live),
(album)-[:_IS_ON_]->(source:Source),
(label:Label)-[:_PRODUCED_]->(album)
RETURN apoc.map.mergeList([artist,song,year,genre,live,source,label,album]) AS result
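As a quick check of the merge semantics (to my knowledge, when the same key appears in several maps, the value from the later map in the list wins), you can call the function on literal maps:
RETURN apoc.map.mergeList([{a: 1, b: 2}, {b: 3}]) AS merged
// {a: 1, b: 3}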

Comparing search results from two separate searches

I am new to using Splunk and wanted to get some help combining two search results and organizing the output so that it displays matching information from the two searches.
What I am searching for is something like the following (I had to edit some of the info for security):
index=INDEX sourcetype=SOURCETYPE authresult (UNIQUEID)
This provides me with several events with the necessary fields, but I need to compare the field UNIQUEHASH from this search with the same field from another similar search with a different UNIQUEID. I only want the UNIQUEHASH information if both searches include the same value, along with how many times it is returned.
So if I do a search for UNIQUEID1, I get the following number of events with the following UNIQUEHASH values:
UNIQUEHASH  Times
123         10
456         20
789         30
I would like to do the same search for UNIQUEID2, which provides the following UNIQUEHASH values:
UNIQUEHASH  Times
123         20
789         400
I would like to combine these two searches into a simple table showing each UNIQUEHASH and how many times each UNIQUEID returned it. In this example, the UNIQUEHASH with a value of 456 isn't included because UNIQUEID2 doesn't return it at all:
UNIQUEHASH  UNIQUEID1  UNIQUEID2
123         10         20
789         30         400
What you're describing can be done either with join (the more "obvious" path), or stats:
join:
index=ndx1 sourcetype=srctp1 authresult=* uniquehash=* times=* uniqueid="1"
| stats count by uniquehash times
| fields - count
| rename times as unique1
| join uniquehash
[| search index=ndx1 sourcetype=srctp1 authresult=* uniquehash=* times=* uniqueid="2"
| stats count by uniquehash times
| fields - count
| rename times as unique2 ]
Note: using join is generally not suggested; the innermost search will be capped at a 60s run time or 50k rows returned (so run the fastest/shortest search innermost).
Additionally, this will get very cumbersome if you need to do more than a couple of "uniqueid" comparisons.
stats:
index=ndx sourcetype=srctp uniquehash=* times=* uniqueid=*
| eval idkt=uniqueid." ".times
| stats values(idkt) as idkt by uniquehash
| where mvcount(idkt)>1
| mvexpand idkt
| rex field=idkt "(?<uniqueid>\S+)\s(?<times>.+)"
| table uniquehash uniqueid times

Comparing values in two columns of two different Splunk searches

I am new to Splunk and facing an issue comparing values in two columns of two different queries.
Query 1
index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**" A_to="**" A_from="**" | transaction call_id keepevicted=true | search "xyz event:" | table _time, call_id, A_from, A_to | rename call_id as Call_id, A_from as From, A_to as To
Query 2
index="abc_ndx" source="*/ jkdhgsdjk.log" call_id="**" B_to="**" B_from="**" | transaction call_id keepevicted=true | search " xyz event:"| table _time, call_id, B_from, B_to | rename call_id as Call_id, B_from as From, B_to as To
These are my two different queries. I want to compare each value in the A_from column with each value in the B_from column, and if a value matches, display those values of A_from.
Is it possible?
I have run the two queries separately, exported the results of each into CSV, and used the VLOOKUP function. But the problem is that there is a limit of at most 10000 rows of data that can be exported, so I miss out on lots of data, as my search has more than 10000 records.
Any help?
I haven't got any data to test this on at the moment; however, the following should point you in the right direction.
When you have the table for the first query sorted out, you should 'pipe' the search string to an appendcols command with your second search string. This command allows you to run a subsearch and "import" columns into your base search.
Once you have the two columns in the same table, you can use the eval command to create a new field that compares the two values and assigns a value as you desire.
Hope this helps.
http://docs.splunk.com/Documentation/Splunk/5.0.2/SearchReference/Appendcols
http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Eval
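A minimal sketch of that appendcols approach, reusing the searches from the question (untested):
index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**" A_to="**" A_from="**"
| transaction call_id keepevicted=true
| search "xyz event:"
| table _time, call_id, A_from, A_to
| appendcols [ search index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**" B_to="**" B_from="**"
    | transaction call_id keepevicted=true
    | search "xyz event:"
    | table B_from, B_to ]
| eval match=if(A_from == B_from, A_from, "no_match")
Keep in mind that appendcols pastes the subsearch's columns onto the base results row by row, so it only lines up meaningfully if the two searches return rows in a corresponding order.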
I'm not sure why there is a need to keep this as two separate queries. Everything is coming from the same sourcetype and is using almost identical data, so I would do something like the following:
index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**" (A_to="**" A_from="**") OR (B_to="**" B_from="**")
| transaction call_id keepevicted=true
| search "xyz event:"
| eval to=if(A_from == B_from, A_from, "no_match")
| table _time, call_id, to
This grabs all events from your specified sourcetype and index which have a call_id and either A_to and A_from or B_to and B_from. Then it transactions all of that and lets you filter based on the "xyz event:" (whatever that is).
Then it creates a new field called 'to' which shows A_from when A_from == B_from; otherwise it shows "no_match" (a placeholder, since you didn't specify what should be done when they don't match).
There is also a way to potentially tackle this without using transactions, although without more details on the underlying data I can't say for sure. The basic idea is that if you have a common field (call_id in this case), you can just use stats to collect values associated with that field instead of an expensive transaction command.
For example:
index="abc_ndx" index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**"
| stats last(_time) as earliest_time first(A_to) as A_to first(A_from) as A_from first(B_to) as B_to first(B_from) as B_from by call_id
Using first() or last() doesn't actually matter if there is only one value per call_id. (You can even use min(), max(), or avg() and you'll get the same thing.) Perhaps this will help you get to the output you need more easily.
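If a call_id can carry more than one distinct value for these fields, values() (which collects all distinct values into a multivalue field) might be the safer aggregation; a hypothetical variant:
index="abc_ndx" source="*/jkdhgsdjk.log" call_id="**"
| stats values(A_from) as A_from values(B_from) as B_from by call_id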
