Can Cypher do phonetic text search with only a part of the text, without using Elasticsearch? - neo4j

Say I have a job as financial administrator (j:Job {name: 'financial administrator'}).
Many people use different titles for a 'financial administrator'. Therefore, I want the above-mentioned job to come up as a hit, even if people type only 'financial' or 'administrator' and their input has typos (like 'fynancial').
CONTAINS only gives results when the match is 100%, so it does not tolerate typos.
Thanks a lot!

First, you could try fuzzy matching with a full text index and see if it solves the issue.
An example would be:
Set up the index:
CALL db.index.fulltext.createNodeIndex('jobs', ['Job'], ['name'], {})
Query the index with fuzzy matching (note the ~):
CALL db.index.fulltext.queryNodes('jobs', 'fynancial~')
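If the search string comes from user input, one option is to append ~ to each token before handing it to the index, so both partial titles and typos still match. Below is a minimal sketch assuming the official Neo4j Java driver (4.x); the connection details and example input are placeholders, while the index name 'jobs' matches the example above:
import java.util.Arrays;
import java.util.stream.Collectors;

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class FuzzyJobSearch {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {

            // Turn "fynancial administratr" into "fynancial~ administratr~"
            String userInput = "fynancial administratr";
            String fuzzyQuery = Arrays.stream(userInput.trim().split("\\s+"))
                    .map(token -> token + "~")
                    .collect(Collectors.joining(" "));

            session.run("CALL db.index.fulltext.queryNodes('jobs', $q) "
                            + "YIELD node, score RETURN node.name AS name, score",
                    Values.parameters("q", fuzzyQuery))
                    .list()
                    .forEach(row -> System.out.println(
                            row.get("name").asString() + " " + row.get("score").asDouble()));
        }
    }
}
The ~ operator is Lucene fuzzy matching (edit distance of up to 2 by default), which is what catches 'fynancial', while matching individual tokens covers the case where only 'financial' or 'administrator' is typed.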
If you want to go further and use Lucene's phonetic searches, then you could write a little Java code to register a custom analyzer.
Include the lucene-analyzers-phonetic dependency like so:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-phonetic</artifactId>
    <version>8.5.1</version>
</dependency>
Then create a custom analyzer:
// Imports below assume Neo4j 4.x; the AnalyzerProvider/ServiceProvider packages differ in older versions.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.phonetic.DoubleMetaphoneFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.neo4j.annotations.service.ServiceProvider;
import org.neo4j.graphdb.schema.AnalyzerProvider;

@ServiceProvider
public class PhoneticAnalyzer extends AnalyzerProvider {

    public PhoneticAnalyzer() {
        super("phonetic");
    }

    @Override
    public Analyzer createAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String s) {
                Tokenizer tokenizer = new StandardTokenizer();
                // Double Metaphone encoding, max code length 6; inject=true keeps the original tokens too.
                TokenStream stream = new DoubleMetaphoneFilter(tokenizer, 6, true);
                return new TokenStreamComponents(tokenizer, stream);
            }
        };
    }
}
I used the DoubleMetaphoneFilter but you can experiment with others.
Package it as a jar, and put it into Neo4j's plugin directory along with the Lucene phonetic jar and restart the server.
Then, create a full text index using this analyzer:
CALL db.index.fulltext.createNodeIndex('jobs', ['Job'], ['name'], {analyzer:'phonetic'})
Querying the index looks the same:
CALL db.index.fulltext.queryNodes('jobs', 'fynancial')

It took a while, but this is how I solved my question.
MATCH (a)-[:IS]->(hs)
UNWIND a.naam AS namelist
CALL apoc.text.phonetic(namelist) YIELD value
WITH value AS search_str, SPLIT('INPUT FROM DATABASE', ' ') AS input, a
CALL apoc.text.phonetic(input) YIELD value
WITH value AS match_str, search_str, a
WHERE search_str CONTAINS match_str OR search_str = match_str
RETURN DISTINCT a.naam, labels(a)

Related

How do we generate a cypher query by using Java and APOC for Neo4j?

I am trying to create my own procedure in Java in order to use it in Neo4j. I wanted to know how we can execute Cypher code in Java.
I tried to use the graphDB.execute() function but it doesn't work.
I just want to execute a basic code in Java by using Neo4j libraries.
Example of a basic code I want to execute:
[EDIT]
public class Test
{
    @Context
    public GraphDatabaseService graphDb;

    @UserFunction
    public Result test() {
        Result result = graphDb.execute("MATCH (n:Actor)\n" +
                "RETURN n.name AS name\n" +
                "UNION ALL MATCH (n:Movie)\n" +
                "RETURN n.title AS name", new HashMap<String, Object>());
        return result;
    }
}
If you want to display nodes (as in the graphical result view in the browser), then you have to return the nodes themselves (and/or relationships and/or paths), not the properties alone (names and titles). You'll also need this to be a procedure, not a function. Procedures can yield streams of nodes, functions can only return single values.
Change this to a procedure, and change your return type to be something like Stream<NodeResult> where NodeResult is a POJO that has a public Node field.
You'll need to change your return accordingly.
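A minimal sketch of that shape, assuming the Neo4j 3.x embedded API (where GraphDatabaseService.execute is available); the procedure name example.actorsAndMovies and the result POJO are illustrative:
import java.util.stream.Stream;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Result;
import org.neo4j.procedure.Context;
import org.neo4j.procedure.Mode;
import org.neo4j.procedure.Procedure;

public class Test {

    // One public field per column the procedure yields.
    public static class NodeResult {
        public Node node;

        public NodeResult(Node node) {
            this.node = node;
        }
    }

    @Context
    public GraphDatabaseService graphDb;

    @Procedure(value = "example.actorsAndMovies", mode = Mode.READ)
    public Stream<NodeResult> actorsAndMovies() {
        Result result = graphDb.execute(
                "MATCH (n:Actor) RETURN n UNION ALL MATCH (n:Movie) RETURN n");
        // Wrap each returned node in the result POJO and stream it back.
        return result.stream().map(row -> new NodeResult((Node) row.get("n")));
    }
}
Running CALL example.actorsAndMovies() YIELD node RETURN node in the browser will then render the nodes graphically.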

Auto alignment of data in Xtext

I have one custom parser rule in which I have defined all my keywords, such as _self, _for, _loop etc. Because of this, if I type _s and press Ctrl+Space, it shows _self. But what I need is that even if I type self or SE, it should be auto-completed to _self. Is this possible? If so, could anyone please suggest a solution? Thanks in advance.
There are multiple things to pay attention to:
There needs to be exactly one proposal. Otherwise the user has to select the proposal to be applied and no auto-insert takes place.
Proposals are created based on the error recovery, so you might not get the proposal you are looking for at all.
So let's assume you have a grammar like
Model:
greetings+=Greeting*;
Greeting:
'_self' name=ID '!';
and a model file like
SE
Then the error recovery will work fine and a proposal of "_self" will be added to the list of proposals.
Proposals are filtered based on the current prefix in the model; that would be the place you could start customizing,
e.g. this very naive implementation:
import org.eclipse.xtext.ui.editor.contentassist.FQNPrefixMatcher;

public class MyPrefixMatcher extends FQNPrefixMatcher {

    @Override
    public boolean isCandidateMatchingPrefix(String name, String prefix) {
        return super.isCandidateMatchingPrefix(name, prefix) || super.isCandidateMatchingPrefix(name, "_" + prefix);
    }
}
and don't forget to bind it:
import org.eclipse.xtend.lib.annotations.FinalFieldsConstructor
import org.eclipse.xtext.ui.editor.contentassist.PrefixMatcher
import org.xtext.example.mydsl4.ui.contentassist.MyPrefixMatcher

@FinalFieldsConstructor
class MyDslUiModule extends AbstractMyDslUiModule {

    override Class<? extends PrefixMatcher> bindPrefixMatcher() {
        return MyPrefixMatcher;
    }
}
There is another feature that does not use proposals at all but works on the text that is actually typed; if it recognizes something, it can replace it with something else. This feature is called "Auto Edit". The extension point in Xtext for this is IAutoEditStrategy / AbstractEditStrategyProvider; a rough sketch follows.
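For illustration, here is a very naive auto-edit sketch that rewrites a freshly typed self into _self. Only the plain JFace strategy is shown; registering it through your grammar's AbstractEditStrategyProvider subclass is omitted, and the detection logic is deliberately simplistic:
import org.eclipse.jface.text.BadLocationException;
import org.eclipse.jface.text.DocumentCommand;
import org.eclipse.jface.text.IAutoEditStrategy;
import org.eclipse.jface.text.IDocument;

public class SelfKeywordAutoEdit implements IAutoEditStrategy {

    @Override
    public void customizeDocumentCommand(IDocument document, DocumentCommand command) {
        if (!"f".equals(command.text)) {
            return; // only react when the last letter of "self" is typed
        }
        try {
            int start = command.offset - 3;
            if (start >= 0 && "sel".equals(document.get(start, 3))
                    && (start == 0 || !Character.isJavaIdentifierPart(document.getChar(start - 1)))) {
                // Replace the already typed "sel" plus the incoming "f" with the keyword "_self".
                command.offset = start;
                command.length = 3;
                command.text = "_self";
            }
        } catch (BadLocationException e) {
            // leave the command untouched
        }
    }
}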

Can I write a Neo4j Plugin to intercept and modify CYPHER queries

In my system, I would like to intercept and change Cypher queries as they come in, one alternative is to modify them before sending them from my middle layer to the graph - but is there a way to have a plugin do the conversion for me in the graph itself?
I'd like to do some of the following:
If someone identifies themselves as a member of group A, imagine I'd like to change their request from:
MATCH(f:Film)-[r:REVIEWED_BY]-(u:User {id:"1337"})
to:
MATCH(p:Product)-[r:PURCHASED_BY]-(u:User {id:"1337"})
Is something like this possible? Or do I have to write the traversals in Java directly to achieve this?
Of course you can. You can do ANYTHING in Neo4j. Just grab the Cypher string in an unmanaged extension that receives a POST request, alter it any way you want, execute it with the graphdb.execute method, and return the result as normal.
@POST
@Path("/batch")
public Response alterCypher(String body, @Context GraphDatabaseService db) throws IOException, InterruptedException {
    ArrayList<Result> results = new ArrayList<>();
    // Validate our input or exit right away
    HashMap input = Validators.getValidCypherStatements(body);
    ArrayList<HashMap> statements = (ArrayList<HashMap>) input.get("statements");
    for (HashMap statement : statements) {
        // write the alterQuery method to change the queries
        String alteredQuery = alterQuery((String) statement.get("statement"));
        Result result = db.execute(alteredQuery, (Map) statement.getOrDefault("parameters", new HashMap<>()));
        results.add(result);
    }
    // or go through the results and return them however you want
    // see https://github.com/dmontag/neo4j-unmanaged-extension-template/blob/master/src/main/java/org/neo4j/example/unmanagedextension/MyService.java#L36
    return Response.ok().build();
}
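The alterQuery method is where the rewriting itself happens. For the Film/Product example from the question, a naive sketch could be plain string substitution; checking the caller's group membership and anything smarter than a string replace is left to you:
// Naive illustration only: rewrites the labels/relationship types from the question.
// A real implementation would check who is asking and parse (or at least tokenize)
// the query instead of blindly replacing substrings.
private String alterQuery(String query) {
    return query
            .replace(":Film", ":Product")
            .replace(":REVIEWED_BY", ":PURCHASED_BY");
}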
At this time it's not possible to extend or modify Cypher queries.
If you need that, I recommend using the Transaction Event API - http://graphaware.com/neo4j/transactions/2014/07/11/neo4j-transaction-event-api.html
With that you should be able to change what the query returns.

How to Get Filename when using file pattern match in google-cloud-dataflow

Does anyone know how to get the filename when using file pattern matching in google-cloud-dataflow?
I'm a newbie to Dataflow. How do I get the filename when using a file pattern match like this?
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*.txt"))
I'd like to know how I can detect filenames such as kinglear.txt, Hamlet.txt, etc.
If you would like to simply expand the filepattern and get a list of filenames matching it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see GcsIoChannelFactory).
If you would like to access the "current filename" from inside one of the DoFns downstream in your pipeline - that is currently not supported (though there are some workarounds - see below). It is a common feature request and we are still thinking about how best to fit it into the framework in a natural, generic and high-performance way.
Some workarounds include:
Writing a pipeline like this (the tf-idf example uses this approach):
DoFn readFile = ...(takes a filename, reads the file and produces records)...
p.apply(Create.of(filenames))
.apply(ParDo.of(readFile))
.apply(the rest of your pipeline)
This has the downside that dynamic work rebalancing features won't work particularly well, because they currently apply only at the level of Read PTransforms, not at the level of ParDos with high fan-out (like the one here, which would read a file and produce all records); and parallelization will only work at the level of files, but files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different sizes, some extremely large, then it may become an issue.
Implementing your own FileBasedSource (javadoc, general documentation) which would return records of type something like Pair<String, T> where the String is the filename and the T is the record you're reading. In this case the framework would handle the filepattern matching for you, dynamic work rebalancing would work just fine, however it is up to you to write the reading logic in your FileBasedReader.
Both of these work-arounds are non-ideal, but depending on your requirements, one of them may do the trick for you.
Update based on latest SDK
Java (sdk 2.9.0):
Beam's TextIO readers do not give access to the filename itself. For these use cases we need to use FileIO to match the files and gain access to the information stored in the file name. Unlike TextIO, the reading of the file needs to be taken care of by the user in transforms downstream of the FileIO read. The result of a FileIO read is a PCollection of ReadableFile; the ReadableFile class contains the file name as metadata, which can be used along with the contents of the file.
ReadableFile does have a convenience method readFullyAsUTF8String() which will read the entire file into a String object; this reads the whole file into memory first. If memory is a concern, you can work directly with the file using utility classes like FileSystems.
From the FileIO documentation:
PCollection<KV<String, String>> filesAndContents = p
    .apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
    // withCompression can be omitted - by default compression is detected from the filename.
    .apply(FileIO.readMatches().withCompression(GZIP))
    .apply(MapElements
        // uses imports from TypeDescriptors
        .into(kvs(strings(), strings()))
        .via((ReadableFile f) -> KV.of(
            f.getMetadata().resourceId().toString(), f.readFullyAsUTF8String())));
Python (sdk 2.9.0):
For 2.9.0 for Python you will need to collect the list of URIs from outside of the Dataflow pipeline and feed it in as a parameter to the pipeline, for example making use of FileSystems to read in the list of files via a glob pattern and then passing that to a PCollection for processing.
Once fileio (see PR https://github.com/apache/beam/pull/7791/) is available, the following code will also be an option for Python.
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    readable_files = (p
                      | fileio.MatchFiles('hdfs://path/to/*.txt')
                      | fileio.ReadMatches()
                      | beam.Reshuffle())
    files_and_contents = (readable_files
                          | beam.Map(lambda x: (x.metadata.path,
                                                x.read_utf8())))
One approach is to build a List<PCollection> where each entry corresponds to an input file, then use Flatten. For example, if you want to parse each line of a collection of files into a Foo object, you might do something like this:
public static class FooParserFn extends DoFn<String, Foo> {
    private String fileName;

    public FooParserFn(String fileName) {
        this.fileName = fileName;
    }

    @Override
    public void processElement(ProcessContext processContext) throws Exception {
        String line = processContext.element();
        // here you have access to both the line of text and the name of the file
        // from which it came.
    }
}
public static void main(String[] args) {
    ...
    List<String> inputFiles = ...;
    List<PCollection<Foo>> foosByFile =
        Lists.transform(inputFiles,
            new Function<String, PCollection<Foo>>() {
                @Override
                public PCollection<Foo> apply(String fileName) {
                    return p.apply(TextIO.Read.from(fileName))
                            .apply(ParDo.of(new FooParserFn(fileName)));
                }
            });
    PCollection<Foo> foos = PCollectionList.<Foo>empty(p).and(foosByFile).apply(Flatten.<Foo>pCollections());
    ...
}
One downside of this approach is that, if you have 100 input files, you'll also have 100 nodes in the Cloud Dataflow monitoring console. This makes it hard to tell what's going on. I'd be interested in hearing from the Google Cloud Dataflow people whether this approach is efficient.
I also had the 100 input files = 100 nodes on the Dataflow diagram when using code similar to @danvk's. I switched to an approach like the following, which resulted in all the reads being combined into a single block that you can expand to drill down into each file/directory that was read. The job also ran faster using this approach rather than the Lists.transform approach in our use case.
GcsOptions gcsOptions = options.as(GcsOptions.class);
List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
List<String> filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());

PCollectionList<SomeClass> pcl = PCollectionList.empty(p);
for (String fileName : filesToProcess) {
    pcl = pcl.and(
        p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
                .from(fileName)
                .withSchema(SomeClass.class)
        )
        .apply(ParDo.of(new MyDoFn(fileName)))
    );
}

// flatten the PCollectionList, combining all the PCollections together
PCollection<SomeClass> flattenedPCollection = pcl.apply(Flatten.pCollections());
This might be a very late post for the above question, but I wanted to add an answer using Beam's bundled classes.
This could also be seen as extracted code from the solution provided by @Reza Rokni.
PCollection<String> listOfFilenames =
    pipe.apply(FileIO.match().filepattern("gs://apache-beam-samples/shakespeare/*"))
        .apply(FileIO.readMatches())
        .apply(
            MapElements.into(TypeDescriptors.strings())
                .via(
                    (FileIO.ReadableFile file) -> {
                        String f = file.getMetadata().resourceId().getFilename();
                        System.out.println(f);
                        return f;
                    }));

pipe.run().waitUntilFinish();
The above PCollection<String> will contain the list of files available in the provided directory.
I was struggling with the same use case while using a wildcard to read files from GCS, but I also needed to modify the collection based on the file name. The key is to use ReadFromTextWithFilename instead of ReadFromText. In Java you already have a way out and can use:
String filename =context.element().getMetadata().resourceId().getCurrentDirectory().toString()
inside your processElement method.
But for Python the below technique will work:
-> Use beam.io.ReadFromTextWithFilename for reading the wildcard path from GCS
-> As per the document, ReadFromTextWithFilename returns the file's name and the file's content.
Below is the code snippet:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class GetFileNameFromWildcard(beam.DoFn):
    def process(self, element, *args, **kwargs):
        file_path, content = element
        schema = ["id", "name", "mob", "email", "dept", "store"]
        store_name = file_path.split("/")[-2]
        content_list = content.split(",")
        content_list.append(store_name)
        out_dict = dict(zip(schema, content_list))
        print(out_dict)
        yield out_dict


def run():
    pipeline_options = PipelineOptions()
    with beam.Pipeline(options=pipeline_options) as p:
        # saving main session so that it can load global namespace on the Cloud Dataflow worker
        init = (p
                | 'Begin Pipeline With Initiator' >> beam.Create(["pcollection initializer"])
                | 'Read From GCS' >> beam.io.ReadFromTextWithFilename(
                    "gs://<bkt-name>/20220826/*/dlp*", skip_header_lines=1)
                | beam.ParDo(GetFileNameFromWildcard())
                | beam.io.WriteToText('df_out.csv'))

Neo4J, SDN and running Cypher spatial queries

I am new to Neo4J and I am trying to build a proof of concept for High Availability spatial temporal based querying.
I have a setup with 2 standalone Neo4J Enterprise servers and a single Java application running with an embedded HA Neo4J server.
Everything was simple to set up, and basic queries are easy to write and efficient. Additionally, performing the queries derived from the Neo4j SpatialRepository works as expected.
What I am struggling to understand is how to use SDN to make a spatial query in combination with any other WHERE clauses. As a trivial example, how could I write: find all places a User called X has been within Y miles of a lat/lon? Because the SpatialRepository is not part of the regular Spring Repository class tree, I do not believe that there are any naming conventions that I can use; is the intention that I perform the spatial query and then filter the results?
I have traced the code through to a LegacyIndexSearcher (which has a name that scares me!) and cannot see any mechanism for extending the search. I have also had a look at the IndexProviderTest on GitHub which could provide a manual mechanism for performing the query against the index, except that I think there may be two indexes in play.
It might be helpful if I understood how to construct a Cypher query that I could use within an @Query annotation. Whilst I have been able to use the console to perform a simple REST query using:
:POST /db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance
{
  "layer": "location",
  "pointX": 0.0,
  "pointY": 51.526256,
  "distanceInKm": 100
}
This does not work:
start n=node:location('withinDistance:[51.526256,0.0,100.0]') return n;
The error is:
Index `location` does not exist
Neo.ClientError.Schema.NoSuchIndex
The index was (possibly naively) created using Spring:
@Indexed(indexType = IndexType.POINT, indexName = "location")
String wkt;
If I run index --indexes in the console I can see that there is no index named location, but that there is one named location__neo4j-spatial__LayerNodeIndex__internal__spatialNodeLookup__.
Am I required to create the Index manually? If so, could someone point me in the direction of the documentation and I'll get on with it.
Assuming that it is just ignorance that has stopped me getting the simple Cypher query to run, is it as simple as adding a regular Cypher WHERE clause to the query to perform the combination of Spatial and property based querying?
Added more index detail
Having run :GET /db/data/index/node/ from the console I could see two possibly useful indexes (other indexes removed):
{
  "location__neo4j-spatial__LayerNodeIndex__internal__spatialNodeLookup__": {
    "template": "/db/data/index/node/location__neo4j-spatial__LayerNodeIndex__internal__spatialNodeLookup__/{key}/{value}",
    "provider": "lucene",
    "type": "exact"
  },
  "GeoTemporalThing": {
    "template": "/db/data/index/node/GeoTemporalThing/{key}/{value}",
    "provider": "lucene",
    "type": "exact"
  }
}
So perhaps this is the correct format for the query I was trying:
start n=node:GeoTemporalThing('withinDistance:[51.526256,0.0,100.0]') return n;
But that gives me this error (which I am now Googling)
org.apache.lucene.queryParser.ParseException: Cannot parse 'withinDistance:[51.526256,0.0,100.0]': Encountered " "]" "] "" at line 1, column 35.
Was expecting one of:
"TO" ...
...
...
Update
Having decided that my index didn't exist and that it should, I used the REST interface to create an index with the name that I expected SDN to create, like this:
:POST /db/data/index/node
{
  "name": "location",
  "config": {
    "provider": "spatial",
    "geometry_type": "point",
    "wkt": "wkt"
  }
}
And now everything seems to work just fine. So my question is: should I have to create that index manually? If I look at the code in org.springframework.data.neo4j.support.index.IndexType, it looks as if it should use exactly the settings I used above, but it had only created the long-named Lucene index:
public enum IndexType
{
    @Deprecated
    SIMPLE { public Map getConfig() { return LuceneIndexImplementation.EXACT_CONFIG; } },
    LABEL { public Map getConfig() { return null; } public boolean isLabelBased() { return true; }},
    FULLTEXT { public Map getConfig() { return LuceneIndexImplementation.FULLTEXT_CONFIG; } },
    POINT { public Map getConfig() { return MapUtil.stringMap(
            IndexManager.PROVIDER, "spatial", "geometry_type", "point", "wkt", "wkt"); } }
    ;

    public abstract Map getConfig();

    public boolean isLabelBased() { return false; }
}
I did clear down the system and the behaviour was the same; is there a step I have missed?
Software details:
Java:
neo4j 2.0.1
neo4j-ha 2.0.1
neo4j-spatial 0.12-neo4j-2.0.1
spring-data-neo4j 3.0.0.RELEASE
Standalone Servers:
neo4j-enterprise-2.0.1
neo4j-spatial-0.12-neo4j-2.0.1-server-plugin
I'm not sure if this is a bug in Spring Data when setting up the index, but manually creating the index using the REST interface worked:
:POST /db/data/index/node
{
  "name": "location",
  "config": {
    "provider": "spatial",
    "geometry_type": "point",
    "wkt": "wkt"
  }
}
I can now perform queries with minimal effort using Cypher in an @Query annotation (more parameters coming obviously):
@Query(value = "start n=node:location('withinDistance:[51.526256,0.0,100.0]') MATCH user-[wa:WAS_HERE]-n WHERE wa.ts > {ts} return user")
Page findByTimeAtLocation(@Param("ts") long ts);
