Riak: search by key prefix - erlang

I'm a newcomer to Riak and I've been reading this chapter from Riak's docs. It shows that by adding structural information to buckets and keys, one can overcome some of the limitations of key/value operations.
The article gives an example of how such a key could be structured:
sensor data keys could be prefaced by sensor_ or temp_sensor1_ followed by a timestamp
(e.g. sensor1_2013-11-05T08:15:30-05:00)
However, no method is mentioned for querying the data by key prefix (e.g. sensor1_). Looking around Stack Overflow I found this question, in which MapReduce and key filtering are mentioned as a possible solution. But the documentation on key filters states that they are a soon-to-be deprecated feature. I also looked at Riak Search as a possible approach but wasn't able to find a way to query data by key prefix.
My question is: What is the best way to search data by key prefix? I would greatly appreciate an example.

The best way to search for a key prefix is to not do it if you don't need to, i.e. design around that search pattern if you can. The primary way to do that is to use deterministic keys that your application can easily compute. That said, if your application cannot avoid searching on key prefixes, there are a couple of things you can do (all of which have their drawbacks).
Key Filters - http://docs.basho.com/riak/latest/dev/references/keyfilters/ - as you noted already these are marked as deprecated and not recommended at this point.
MapReduce - http://docs.basho.com/riak/latest/dev/advanced/mapreduce/ - a good option if you can query in batches but not really suited for real time querying. You could cache the query results if precomputing the queries is helpful.
Riak Search 2.0 (Solr) - http://docs.basho.com/riak/latest/dev/using/search/ - this is probably the easiest method to implement from an application perspective, and it lets you query your keys with a query along the lines of: 'curl "$RIAK_HOST/search/sensor?wt=json&q=_yz_rk:sensor1_*"'. Using search does come with a performance hit compared with straight key-based queries, but you can cache queries.
Data Modeling - querying by key directly is always going to provide the best performance, as mentioned above. One option is to take advantage of Riak's Data Types (CRDTs) and create a bucket that uses sets. You could create a set for each sensor that contains the keys associated with that sensor in the first bucket. Then you can iterate over the keys in the set and do a multi-get to return all of the associated records.
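The data modeling option can be sketched roughly as follows. This is not Riak client code: plain Python dictionaries stand in for the two buckets, and the class and method names (SensorStore, put, get_all) are made up for illustration. It only shows the shape of the pattern: write the record, add its key to the sensor's set, then read by fetching the set and getting each key directly.

```python
from collections import defaultdict


class SensorStore:
    """In-memory stand-in for two Riak buckets (an assumption for illustration)."""

    def __init__(self):
        # Stands in for the bucket holding the readings, keyed by "<sensor>_<timestamp>".
        self.readings = {}
        # Stands in for the bucket of sets: one set of reading keys per sensor.
        self.key_index = defaultdict(set)

    def put(self, sensor_id, timestamp, value):
        """Store a reading and record its key in the sensor's set."""
        key = f"{sensor_id}_{timestamp}"
        self.readings[key] = value
        self.key_index[sensor_id].add(key)
        return key

    def get_all(self, sensor_id):
        """Fetch the sensor's key set, then get each record by key (the 'multi-get')."""
        keys = sorted(self.key_index.get(sensor_id, ()))
        return {k: self.readings[k] for k in keys if k in self.readings}


store = SensorStore()
store.put("sensor1", "2013-11-05T08:15:30-05:00", 21.5)
store.put("sensor1", "2013-11-05T08:16:30-05:00", 21.7)
store.put("sensor2", "2013-11-05T08:15:30-05:00", 19.0)

for key, value in store.get_all("sensor1").items():
    print(key, value)
```

In a real cluster the record write and the set update are two separate operations, so the read path should tolerate a key that is in the set but whose record is missing, which is why get_all skips such keys.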
Hope this gives you some ideas.

Related

Is there a way in SumoLogic to store some data and use it in queries?

I have a list of IPs that I want to filter out of many queries that I have in sumo logic. Is there a way to store that list of IPs somewhere so it can be referenced, instead of copy pasting it in every query?
For example, in a perfect world it would be nice to define a list of things like:
things=foo,bar,baz
And then in another query reference it:
where mything IN things
Right now I'm just copying/pasting. I think there may be a way to do this by setting up a custom data source and putting the IPs in there, but that seems like a very round-about way of doing it, and wouldn't help to re-use parts of a query that aren't data (eg re-use statements). Also their template feature is about parameterizing a query, not re-use across many queries.
Yes. There's a notion of Lookup Tables in Sumo Logic. Consult:
https://help.sumologic.com/docs/search/lookup-tables/create-lookup-table/
for details.
It lets you store values (either manually, once, or on a schedule as the result of a query) with the | save operator.
You can then refer to these values using | lookup, which is conceptually similar to SQL's JOIN.
Disclaimer: I am currently employed by Sumo Logic.

Indices in Neo4j - questions and doubts

The only indices that I know about are indices on properties (these indices are created on particular labels, i.e. node types). I have some doubts, however.
Do indices on edges/relationships exist?
I often read that Neo4j leveraged the Lucene index. Is it still used? What is its purpose?
Are there any indices other than indices on properties?
Thanks in advance,
Neo4j has two indexing systems.
The more modern one is referred to as "schema indexes". These are automatic and apply to properties of a given label, providing quick lookup by those properties when the properties and label are specified in a query. They do not currently support indexing of relationship properties. They started out based on Lucene, but we've gradually replaced the implementation with our own native indexing solution. Discussion of these, as well as any noteworthy information and limitations, can be found in our index configuration documentation.
The other indexing system is an older manual system called "explicit indexes", previously known as "manual indexes". It is also based on Lucene, but these indexes are not automatic: it is up to the user to manually add or remove entries and keep them up to date when data in the database changes. This makes usage and maintenance cumbersome, and we recommend avoiding this system if possible.
Built-in procedures are the means to create explicit indexes and look up entries in them, as these indexes are never used automatically under the hood (as opposed to schema indexes). APOC Procedures also offers various means of interfacing with explicit indexes.
The main reason one would use explicit indexes is because you are able to create an index on relationships for properties and get fast lookup when querying the index. This also allows for a full text lookup across multiple labels and properties, provided the index has been configured in such a way.
Separate from all of these, it should be noted that usage of labels is itself a kind of index, as it provides quick access to all nodes with the given label.

How to implement fuzzy search

I'm using the Neo4j 3 REST API and I have a node named customer with properties such as name. I need to search customers by name so that, for example, the input "joan" returns results for the name "john". How can I implement fuzzy search to get the results I want?
Thanks in advance
First off, I want to make sure you know that if you're using Neo4j 3.x, 3.x is currently in beta and isn't considered stable yet.
You have two options to implement a fuzzy search in Neo4j. You can use the legacy indexes to implement Lucene-based indexing. That should provide anything that Lucene can do, though you'd probably need to do a bit more work. You can also implement your own unmanaged extension, which will allow you to use Lucene a bit more directly.
Perhaps the easier alternative is to use Elasticsearch with Neo4j and have Elasticsearch do your full-text indexing. You might take a look at the Neo4j and ElasticSearch page on neo4j.com. It links to a GitHub repository containing a Neo4j plugin that automatically updates Elasticsearch with data from Neo4j and provides an endpoint for querying your graph fuzzily. There is also a video tutorial on how to do this.
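To make "fuzzy" concrete: Lucene-style fuzzy matching is based on edit distance between the query term and the indexed terms. The sketch below is not Lucene or Neo4j code; it is a plain Levenshtein implementation (a simplification, with hypothetical function names) that shows why "joan" matches "john" at a distance of one edit.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def fuzzy_matches(query, names, max_edits=1):
    """Return the names within max_edits of the query, ignoring case."""
    q = query.lower()
    return [n for n in names if edit_distance(q, n.lower()) <= max_edits]


print(fuzzy_matches("joan", ["John", "Joanie", "Alice"]))
```

Scanning every name like this is only workable for small lists; a full-text index avoids comparing the query against each stored value.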
Try https://neo4j.com/developer/kb/how-to-perform-a-soundex-search/, which will work in this case. With a plain prefix match, the input Joan will not return John; you would only get both by giving just jo as input. To get the result you are expecting, you will have to use the soundex search.
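As a rough illustration of why a Soundex search helps here, the following is a small standalone implementation of the standard American Soundex code. It is not the procedure from the linked article, and the function name is made up; it only shows that "Joan" and "John" reduce to the same phonetic code.

```python
# Consonant groups and their Soundex digits; vowels, y, h and w have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


print(soundex("Joan"), soundex("John"))
```

Both names print the same code, so a lookup on a stored Soundex value of the name would return John for the input Joan.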
Stepping back a little, what is the problem you are trying to solve with fuzzy matching?
My experience has been that misspellings and typos are far less common than you might think, and humans prefer exact matches whenever possible. If there is no exact match (often just missing a space between words), that's a good time to use a spellchecker, and that's where the fuzzy matching should kick in.
In addition, your example would match "joan" to "john", but some synonyms like "joanie" would be more useful. If you have a big corpus of content to work with, you may be able to extract some relationships, using fuzzy & machine learning to identify "joanne" and "joni" as possible synonyms and then submit that to a human curator. "Jon" looks like a related name but it's not, while "jo" and even "nonie" may or may not be nicknames in these groupings.

Search a document in Elasticsearch by a list of Wildcarded statements on a single field

If I have documents in ElasticSearch that have a field called url and the contents of the url field are strings like "http://www.foo.com" or "http://www.bar.com/some/url/segment/the-page.html", is it possible to search for documents matching a list of wildcarded url fragments e.g., ["http://www.foo.", "http://www.bar.com//segment/.html", "://*bar.com/**"]?
If it is possible, what is the best approach to do this? I have explored wildcard query which only seems to support 1 fragment not multiple. Filters don't seem to support wildcarding as I have tried using * in a term filter without any luck.
To make it a little more complex, I'm also interested in being able to search by a lot of these fragments. I have come across terms filter lookup which seems like it is a good solution for dealing with many search terms, but I'm not sure wildcarding works with filters.
Any thoughts?

surveymonkey Where is qtype and respondent_id in the get_survey_details extract?

I'm trying to replicate the survey monkey relational database format (A relational database view of your data with a separate file created for each database table. Knowledge of SQL (Structured Query Language) is necessary.) to download responses for our reporting analytics using the Survey Monkey API. However I'm not able to find the QType and respondent_id data in the get_survey_details API extract method. Can someone help?
1. QType is found in the Questions.xls data in the current relational database format download.
I was able to find all of the other data from Questions.xls in the get_survey_details API (question_id, page_id, position, heading) but not QType.
2. Respondent_id is found in the Responses.xls data in the relational database format download.
I can see that respondent_id is in the get_responses API method but that does not have the associated Key1 data that I also need. Key1 data is answer_id data in the get_survey_details API which is why I expected to find the corresponding respondent_id there as well.
SurveyMonkey's deprecated relational database download (RDD) format and API provide data using very different paradigms. Using the API to recreate the RDD format in order to work with an old integration is probably a poor use of time. A more productive idea would be to use the API to build a more modern integration from the ground-up taking advantage of things like real-time data availability to modernize the functionality. But if you're determined:
You will need to map the family and subtype of the question type to the QTypes you're used to. The information you need to build the mapping can be found on SurveyMonkey's developer portal in Data Types.
get_responses returns answer_id as row and/or col. For matrix question types, you will have both, which cross-reference to an answer and answer items from get_survey_details. For matrix questions, you might consider concatenating the row and col to create a single unique key value like the Key1 you're accustomed to.
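The concatenation idea could look something like this. The dictionary shape, the underscore separator and the function name are assumptions for illustration, not the API's documented format; adapt them to what get_responses actually returns for your surveys.

```python
def make_key(answer: dict, sep: str = "_") -> str:
    """Build one key from an answer's row and/or col ids (hypothetical shape)."""
    parts = [str(answer[field])
             for field in ("row", "col")
             if answer.get(field) not in (None, "")]
    return sep.join(parts)


# Matrix question: both ids are present, so they are joined into one key.
print(make_key({"row": "111", "col": "222"}))
# Single-choice question: only one id is present, so it is used as-is.
print(make_key({"row": "111"}))
```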
I've done this. It got over the immediate need when the RDD format was withdrawn.
Now that I have more time, I'm looking at a better design but as always backwards compatibility with a large code base is the drag.
To answer your question on Qtype, see my reply at
What are the expected values for the various "ENUM" types returned by the SurveyMonkey API?
