This use case should be reasonably common, but I can't seem to figure it out or find any way to solve it on the web. I would like to search an employee database using natural language, i.e. an unstructured string of words.
Let's say you have a Core Data entity with three attributes as follows:
firstname string
lastname string
Employee_id string
And you have the following managed objects
Jane | Smith | 1
Jane | Smiley | 2
Jane | Doe | 3
John | Doe | 4
Richard | Doe | 5
How would you get the employee-id for the string 'Jane Doe' or 'Doe Jane' or 'Doe Jane accounting'?
If you knew for sure that there were just two words and that the order was firstname then lastname, you could do:
NSArray *words = [string componentsSeparatedByString:@" "];
NSPredicate *pred = [NSPredicate predicateWithFormat:@"firstname BEGINSWITH[cd] %@ AND lastname BEGINSWITH[cd] %@", words[0], words[1]];
But in this case, I don't know the order or number of words.
Thanks in advance for any suggestions.
You can use NSCompoundPredicate to combine a bunch of NSPredicates covering the various ordering possibilities, like the one you made in your example, but that will obviously be limited by how many word combinations you are willing to write predicates for.
(Probably obvious, but you are creating a series of predicates like: (stringA and stringB) or (stringB and stringA) or (stringA and stringC) or (stringC and stringA) or (stringB and stringC) or (stringC and stringB).)
You can create these predicates relatively cleanly by writing a predicate template with variables and then calling predicateWithSubstitutionVariables: repeatedly with a different dictionary of variable -> word mappings to get the various arrangements.
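A minimal sketch of that approach, in Swift for brevity; the firstname/lastname attribute names follow the question, while the example search string and the pairing loop are my own assumptions rather than anything from the original answer:

import Foundation

let searchString = "Doe Jane accounting"
let words = searchString
    .components(separatedBy: .whitespaces)
    .filter { !$0.isEmpty }

// Template predicate with named variables to substitute into.
let template = NSPredicate(
    format: "firstname BEGINSWITH[cd] $FIRST AND lastname BEGINSWITH[cd] $LAST")

// One predicate per ordered pair of distinct words.
var alternatives: [NSPredicate] = []
for i in words.indices {
    for j in words.indices where j != i {
        alternatives.append(template.withSubstitutionVariables(
            ["FIRST": words[i], "LAST": words[j]]))
    }
}

// Match if any arrangement of the words fits firstname/lastname.
let predicate = NSCompoundPredicate(orPredicateWithSubpredicates: alternatives)

That compound predicate can then be attached to the fetch request for the entity as usual; note that it still produces n*(n-1) subpredicates for n words, which is where the point below about full text search comes in.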
The trick is, at some point you are trying to do free-form full-text searching across structured data without a full-text index. Here's a decent blog post (though old) on the challenges of doing that.
An alternative is to design your user interface so that the user gives you the data in a form that is easier to deal with. For example, give the user a query form with the valid fields to fill in, or at least guide them with prompt text on your single free-entry text field, something like "Enter the person's first and last name" or whatever makes sense.
Related
I want to use Sumo Logic to count how often different APIs are called, and to display the results as a table with the API call name and its count. My current query is like this:
_sourceCategory="my_category"
| parse regex "GET.+443 (?<getUserByUserId>/user/v1/)\d+" nodrop
| parse regex "GET.+443 (?<getUserByUserNumber>/user/v1/userNumber)\d+"
| count by getUserByUserId, getUserByUserNumber
This gets the correct values, but they go into different columns. When I have more variables, the table becomes very wide and hard to read.
I figured it out: I need to use the same group name for all regexes, like this:
_sourceCategory="my_category"
| parse regex "GET.+443 (?<endpoint>/user/v1/)\d+" nodrop
| parse regex "GET.+443 (?<endpoint>/user/v1/userNumber)\d+"
| count by endpoint
I am currently using this formula to get all the data from everyone whose first name is "Peter", but my problem is that if someone is called "Simon Peter", their data also shows up in the formula output.
=QUERY('Data'!1:1000,"select * where B contains 'Peter'")
I know that for other formulas this issue is resolved if I add an * to the string, but the same logic does not apply to the QUERY formula.
Does someone know the correct syntax or a workaround?
How about classic SQL syntax
=QUERY('Data'!1:1000,"select * where B like 'Peter %'")
The LIKE keyword allows the % wildcard, which stands for any sequence of characters around the known parts of the searched string.
See the query reference: developers.google.com/chart/interactive/docs/querylanguage You could split firstname and lastname into separate columns, then only search for firstnames exactly equal to 'Peter'. You may also want to handle case differences (where lower(B) contains 'peter') and whitespace in unexpected places (trim()). You could also search only for values that start with Peter by using starts with instead of contains, or use a regular expression with matches. – Brian D
It seems that for my case using 'starts with' is a perfect fit. Thank you!
I have a PFObject subclass which stores an array of strings as one of its properties. I would like to query for all objects of this class where one or more of these strings start with a provided substring.
An example might help:
I have a Person class which stores a firstName and lastName. I would like to submit a PFQuery that searches for Person objects that match on name. Specifically, a person should be considered a match if any ‘component’ of either the first or last name starts with the provided search term.
For example, the name "Mary Beth Smith-Jones" should be considered a match for beth and bet, but not eth.
To assist with this, I have a beforeSave trigger for the Person class that breaks down the person's first and last names into separate components (and also lowercases them). This means that my "Mary Beth Smith-Jones" record looks like this:
firstName: “Mary Beth”
lastName: “Smith-Jones”
searchTerms: [“mary”, “beth”, “smith”, “jones”]
The closest I can get is to use whereKey:equalTo:, which will actually return matches when run against an array:
let query = Person.query()
query?.whereKey("searchTerms", equalTo: "beth")
query?.findObjectsInBackgroundWithBlock({ (people, error) -> Void in
    // Mary Beth is returned successfully
})
However, this only matches on full string equality; query?.whereKey("searchTerms", equalTo: "bet") does not return the record in question.
I suppose I could explode the names and store every possible sequential component as a search term (b, e, t, h, be, et, th, bet, eth, beth, and so on), but that is far from scalable.
Any suggestions for pulling these records from Parse? I am open to changing my approach if necessary.
Have you tried whereKey:hasPrefix: for this? I am not sure if this can be used on array values.
https://parse.com/docs/ios/guide#queries-queries-on-string-values
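For what it's worth, a rough sketch of what that suggestion looks like, reusing the query shape from the question; whether the prefix constraint is applied per element of the searchTerms array is exactly the open question here, so treat this as something to verify against real data:

let query = Person.query()
// hasPrefix builds a prefix constraint on the key.
query?.whereKey("searchTerms", hasPrefix: "bet")
query?.findObjectsInBackgroundWithBlock({ (people, error) -> Void in
    if error == nil {
        // Hoping to see Mary Beth here for the partial term "bet".
        print("Matched \(people?.count ?? 0) people")
    }
})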
I have a database where one attribute actually carries two values (a string separated by "; "). Take a look at the following example:
Example
This is my Database A (source), which has a table like this:
In fact, this all seems fine, but once you allow the author field to hold one or more authors, you will have a record like this:
document (id 1, author "John Matt; Mary Knight", abstract "Lorem ipsum...", year 2015)
So what I intend to do in Database B (target) is something like this:
where a_id from table Authors is a foreign key that references author_id on table Document.
First, I make sure that I fill in all the authors (there is no problem with that), and then I assign the group of authors to the respective document (which is the problem), because I can end up with this situation:
Authors (id 1, name "John Matt")
(id 2, name "John Matt")
(id 2, name "Mary Knight")
Notice that id 2 will be the one to be inserted, according to the previous example.
Question:
How can this procedure be done as an ETL process using Kettle?
Is this a good practice, or is a string attribute separated by "; " good enough?
If I have understood your question correctly, you have a database with rows like
document (id 1, author "John Matt; Mary Knight", abstract "Lorem ipsum...", year 2015)
Now you need to extract the multiple authors which are separated by ;.
PDI provides a step called Split field to rows. Set ; as the separator and you will get one output row per author. You can then look up or assign a unique id from the author table.
Alternatively, you can use the Modified Java Script Value step to split the rows, but I recommend the first approach.
Hope this helps :)
I have a 5M-record dataset in this basic format:
FName LName UniqueID DOB
John Smith 987678 10/08/1976
John Smith 987678 10/08/1976
Mary Martin 567834 2/08/1980
John Smit 987678 10/08/1976
Mary Martin 768987 2/08/1980
The DOB is always unique, but I have cases where:
Same ID, different name spellings
Different ID, same name
I got as far as making SPSS recognize that John Smit and John Smith with the same DOB are the same person, and I used AGGREGATE to show, next to the name, how many times each spelling was used (John Smith, 10; John Smit, 5).
Case 1:
What I would like to do is to loop through all the records for the people identified to be the same person, and get the most common spelling of the person's name and use that as their standard name.
Case 2:
If I have multiple IDs for the same person, take the lowest one and make that the standard.
I am comfortable using basic syntax to clean my data, but this is the only thing that I'm stuck on.
If UniqueID is a real unique ID of individuals in the population, and you want to find variations of name spellings (within groupings of these IDs) and assign the modal spelling, then something like this would work:
STRING FirstLastName (A99).
COMPUTE FirstLastName = CONCAT(FName," ",LName).
* Count how often each spelling occurs within each ID.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES /BREAK=UniqueID FirstLastName /Count=N.
* Find the highest spelling count per ID.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES /BREAK=UniqueID /MaxCount=MAX(Count).
* Blank out every spelling except the modal one (string variables cannot be system-missing).
IF (Count<>MaxCount) FirstLastName="".
* Spread the modal spelling to all records of each ID.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES OVERWRITE=YES /BREAK=UniqueID /FirstLastName=MAX(FirstLastName).
You could then also overwrite the FName and LName fields, but then more assumptions would have to be made, for example if FName or LName can contain space characters, etc.