Neo4j - Comparing two node properties with apoc.text.phonetic

Within a graph of Person nodes, some of the nodes are connected by a SAME_AS relationship:
(p1 {name:'m.Verena von Habsburg-Laufenburg'})-[:SAME_AS]-(p2 {name:'2m: 9.2.1354 Verena von Habsburg-Laufenburg'})
In this first example the two persons really are the same, but we have other examples such as:
(p1 {name:'m.Gf Antal Pejácsevich de Verõcze (+1838)'})-[:SAME_AS]-(p2 {name:'2m: Budapest 5.7.1880 Gf Arthur Pejácsevich de Verõcze'})
Is there a way to decide this with apoc.text.phonetic?

You can judge for yourself.
Your first example:
WITH [
"m.Verena von Habsburg-Laufenburg",
"2m: 9.2.1354 Verena von Habsburg-Laufenburg"
] AS texts
UNWIND texts AS text
CALL apoc.text.phonetic(text) YIELD value
RETURN text, value
The results are the same:
text value
"m.Verena von Habsburg-Laufenburg" "M000V650V500H121L151"
"2m: 9.2.1354 Verena von Habsburg-Laufenburg" "M000V650V500H121L151"
Your second example:
WITH [
"m.Gf Antal Pejácsevich de Verõcze (+1838)",
"2m: Budapest 5.7.1880 Gf Arthur Pejácsevich de Verõcze"
] AS texts
UNWIND texts AS text
CALL apoc.text.phonetic(text) YIELD value
RETURN text, value
The results are not the same:
text value
"m.Gf Antal Pejácsevich de Verõcze (+1838)" "M000G100A534P200C120D000V600C000"
"2m: Budapest 5.7.1880 Gf Arthur Pejácsevich de Verõcze" "M000B312G100A636P200C120D000V600C000"
Conclusion
It works for this example, but I'm not sure you can treat it as a generic rule. Matching records like this is complex to achieve, and you have no guarantee of being 100% correct.
But apoc.text.phonetic can definitely help you achieve your goal.
Update
Your query should look like this:
MATCH (n1:Person)-[r:SAME_AS]->(n2:Person)
CALL apoc.text.phonetic(n1.name) YIELD value AS n1Phonetic
CALL apoc.text.phonetic(n2.name) YIELD value AS n2Phonetic
WITH r, n1Phonetic, n2Phonetic
WHERE n1Phonetic = n2Phonetic
SET r.samePhonetic = true
Here I set the property samePhonetic to true if the phonetics are the same.
Moreover, there is another procedure, apoc.text.phoneticDelta, that can help you do this. With it you can define a threshold, or directly store the delta as a property on your relationship, like this:
MATCH (n1:Person)-[r:SAME_AS]->(n2:Person)
CALL apoc.text.phoneticDelta(n1.name, n2.name) YIELD delta
WITH r, delta
SET r.phoneticDelta=delta
A score of 4 means that your two strings are very similar.
A score of 0 means that your two strings are very different.
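For example, to flag only the pairs whose delta reaches a chosen threshold (the cut-off of 3 below is an arbitrary choice of mine, not something APOC prescribes), a query along these lines should work:
MATCH (n1:Person)-[r:SAME_AS]->(n2:Person)
CALL apoc.text.phoneticDelta(n1.name, n2.name) YIELD delta
WITH r, delta
WHERE delta >= 3
SET r.samePhonetic = true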

Related

Language detection using pycld2

I am trying to use the pycld2 package to detect multiple languages in text. This is the example I am testing out:
import pycld2 as cld2
text = '''The universal connection with an additional advantage: Push-in connection. Terminate solid and stranded (Class B 7 strands or less), as well as ferruled conductors, by simply pushing them in – no tools required. La connessione universale con un ulteriore vantaggio: Connessione push-in. Terminare solido e incagliato (trefoli di classe B 7 o meno), così come i conduttori a puntale, semplicemente spingendoli in – nessun attrezzo richiesto. Der universelle Anschluss mit zusätzlichem Vorteil: Push-in-Anschluss Vollständig und verseilt abschließen (Klasse B 7 Stränge oder weniger), sowie Aderendhülsen durch einfaches Aufschieben in – kein Werkzeug erforderlich.'''
reliable, index, top_3_choices,vecs = cld2.detect(text, returnVectors=True)
The top 3 detected languages are the following:
print(top_3_choices)
(('GERMAN', 'de', 34, 1089.0), ('ITALIAN', 'it', 33, 355.0), ('ENGLISH', 'en', 32, 953.0))
According to the documentation, the confidence score is the fourth element of each tuple, and the third element corresponds to the percentage of the original text detected as the respective language. I am struggling, though, with how to interpret the score so that I can flag the confidence of the detected language. Can I somehow normalize the scores to get some form of interpretable probabilities?
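For what it's worth, one heuristic (and only a heuristic, since pycld2's scores are not documented as probabilities) is to rescale the returned scores so they sum to 1:
import pycld2 as cld2
text = "No tools required. Nessun attrezzo richiesto. Kein Werkzeug erforderlich."
reliable, bytes_found, top_3_choices = cld2.detect(text)
# drop the zero-score "Unknown" slots, then rescale the raw scores into
# relative weights; these are comparable numbers, not calibrated probabilities
scores = {lang: score for lang, code, percent, score in top_3_choices if score > 0}
total = sum(scores.values())
weights = {lang: score / total for lang, score in scores.items()}
print(weights)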

Substitute character for integer

I am working on a project where I need to find the integer value substituted for a suffix character, which is either 'K', 'M', or 'B'.
I am trying to find the best way for a user to input a string such as "1k", "12k", or "100k" and to receive the value back in the appropriate form, e.g. the user enters "12k" and I receive "12000".
I am new to Lua, and I am not that great at string patterns. This is what I have so far:
if (string.match(text, 'k')) then
print(text)
local test = string.match(text, '%d+')
print(test)
end
local text = "1k"
print(tonumber((text:lower():gsub("[kmb]", {k="e3"}))))
I don't know what factors you use for M and B. What are they supposed to be? Million and billion?
I suggest you use the international standard instead: kilo (k), mega (M), giga (G).
Then it would look like this:
local text = "1k"
print(tonumber((text:gsub("[mkMG]", {m = "e-3", k="e3", M="e6", G="e9"})))) --and so on
You can match the pattern as something like (%d+)([kmb]) to get the number and the suffix separately. Then just tonumber the first part and map the latter to a factor (using a table, for example) and multiply it with your result.
local factors = { k = 1e3, m = 1e6 } -- and so on
local num, suffix = string.match(text, '(%d+)([kmb])')
local result = tonumber(num) * factors[suffix]
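Putting those pieces together, a minimal runnable sketch (the factors for m and b and the helper name expand are my own choices here):
-- parse strings like "12k" into 12000; plain numbers pass through unchanged
local factors = { k = 1000, m = 1000000, b = 1000000000 }
local function expand(text)
  local num, suffix = string.match(text:lower(), '^(%d+)([kmb])$')
  if not num then return tonumber(text) end
  return tonumber(num) * factors[suffix]
end
print(expand("12k")) --> 12000
print(expand("100")) --> 100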

Firebase using queryEqualTo() two times

I have the following data structure in my database:
I do understand how to query for questions in one language like de:
let ref = FIRDatabase.database().reference().child("madeGames")
ref.queryOrderedByChild("de").queryEqualToValue(true).observeSingleEventOfType(FIRDataEventType.Value) { (snap: FIRDataSnapshot) in
print("Data:", snap.value)
}
Problem:
It seems impossible for me to get only the games that are in both de AND en from the server.
Of course I know I could first download all the de questions and then loop through them offline, but that doesn't seem very smart to me.
Does anyone know how to do this more efficiently?
One possible option is to change your structure by combining the children properties into a single property, like this:
made_games
-jjjjsd903
de_en: true
de_fr: true
-j090iQ0ss
de: true
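With that structure, a single query on the combined key is enough; a sketch using the same (old) Firebase API as in the question:
ref.queryOrderedByChild("de_en").queryEqualToValue(true).observeSingleEventOfType(FIRDataEventType.Value) { (snap: FIRDataSnapshot) in
    print("Games in both de and en:", snap.value)
}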
If you have a lot of languages you could also do this:
made_games
-jjjjsd903
language_available: 0 //just de
-j090iQ0ss
language_available: 3 //de_en
languages_available
0: de
1: en
2: fr
3: de_en
4: de_en_fr
Edit - and another option I just thought of... go binary!
00000001 = en
00000010 = fr
00000100 = de
00001000 = du
00010000 = ge
Then to get any combination of languages, perform a bitwise OR on the two (or more) you want.
So to get en_de, it's 00000101. To get en_du_ge it would be 00011001. If you have 8 languages, they can be represented by 8 bits, i.e. 256 combinations. With 11 languages, that would be 2^11 = up to 2048 combinations.
Using this technique you can avoid storing en_fr etc. entirely and just search for the bitwise value.
You could store decimal versions of the binary numbers in Firebase... just showing the binary here for clarity.
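As a sketch of the bitmask idea in Swift (the Languages type and its values are illustrative, not part of any Firebase API):
struct Languages: OptionSet {
    let rawValue: Int
    static let en = Languages(rawValue: 1 << 0) // 00000001
    static let fr = Languages(rawValue: 1 << 1) // 00000010
    static let de = Languages(rawValue: 1 << 2) // 00000100
}
let enDe: Languages = [.en, .de]
print(enDe.rawValue) // 5, i.e. 00000101: the decimal value you would store in Firebase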

How do I transform text into TF-IDF format using Weka in Java?

Suppose, I have following sample ARFF file with two attributes:
(1) sentiment: positive [1] or negative [-1]
(2) tweet: text
@relation sentiment_analysis
@attribute sentiment {1, -1}
@attribute tweet string
@data
-1,'is upset that he can\'t update his Facebook by texting it... and might cry as a result School today also. Blah!'
-1,'#Kenichan I dived many times for the ball. Managed to save 50\% The rest go out of bounds'
-1,'my whole body feels itchy and like its on fire '
-1,'#nationwideclass no, it\'s not behaving at all. i\'m mad. why am i here? because I can\'t see you all over there. '
-1,'#Kwesidei not the whole crew '
-1,'Need a hug '
1,'#Cliff_Forster Yeah, that does work better than just waiting for it In the end I just wonder if I have time to keep up a good blog.'
1,'Just woke up. Having no school is the best feeling ever '
1,'TheWDB.com - Very cool to hear old Walt interviews! ? http://blip.fm/~8bmta'
1,'Are you ready for your MoJo Makeover? Ask me for details '
1,'Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur '
1,'happy #charitytuesday #theNSPCC #SparksCharity #SpeakingUpH4H '
I want to convert the values of second attribute into equivalent TF-IDF values.
By the way, I tried the following code, but its output ARFF file doesn't contain the first attribute for the positive (1) instances.
// Set the tokenizer
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(1);
tokenizer.setNGramMaxSize(1);
tokenizer.setDelimiters("\\W");
// Set the filter
StringToWordVector filter = new StringToWordVector();
filter.setAttributeIndicesArray(new int[]{1});
filter.setOutputWordCounts(true);
filter.setTokenizer(tokenizer);
filter.setWordsToKeep(1000000);
filter.setDoNotOperateOnPerClassBasis(true);
filter.setLowerCaseTokens(true);
filter.setTFTransform(true);
filter.setIDFTransform(true);
// Set the input format only after all options; options set afterwards are not applied
filter.setInputFormat(inputInstances);
// Filter the input instances into the output ones
outputInstances = Filter.useFilter(inputInstances, filter);
Sample output ARFF file:
@data
{0 -1,320 1,367 1,374 1,397 1,482 1,537 1,553 1,681 1,831 1,1002 1,1033 1,1112 1,1119 1,1291 1,1582 1,1618 1,1787 1,1810 1,1816 1,1855 1,1939 1,1941 1}
{0 -1,72 1,194 1,436 1,502 1,740 1,891 1,935 1,1075 1,1256 1,1260 1,1388 1,1415 1,1579 1,1611 1,1818 2,1849 1,1853 1}
{0 -1,374 1,491 1,854 1,873 1,1120 1,1121 1,1197 1,1337 1,1399 1,2019 1}
{0 -1,240 1,359 2,369 1,407 1,447 1,454 1,553 1,1019 1,1075 3,1119 1,1240 1,1244 1,1373 1,1379 1,1417 1,1599 1,1628 1,1787 1,1824 1,2021 1,2075 1}
{0 -1,198 1,677 1,1379 1,1818 1,2019 1}
{0 -1,320 1,1070 1,1353 1}
{0 -1,210 1,320 2,477 2,867 1,1020 1,1067 1,1075 1,1212 1,1213 1,1240 1,1373 1,1404 1,1542 1,1599 1,1628 1,1815 1,1847 1,2067 1,2075 1}
{179 1,1815 1}
{298 1,504 1,662 1,713 1,752 1,1163 1,1275 1,1488 1,1787 1,2011 1,2075 1}
{144 1,785 1,1274 1}
{19 1,256 1,390 1,808 1,1314 1,1350 1,1442 1,1464 1,1532 1,1786 1,1823 1,1864 1,1908 1,1924 1}
{84 1,186 1,320 1,459 1,564 1,636 1,673 1,810 1,811 1,966 1,997 1,1094 1,1163 1,1207 1,1592 1,1593 1,1714 1,1836 1,1853 1,1964 1,1984 1,1997 2,2058 1}
{9 1,1173 1,1768 1,1818 1}
{86 1,935 1,1112 1,1337 1,1348 1,1482 1,1549 1,1783 1,1853 1}
As you can see, the first few instances are okay (they contain the -1 class along with the other features), but the remaining instances don't contain the positive class attribute (1).
I mean, there should have been {0 1,...} as the very first entry in those last instances in the output ARFF file, but it is missing.
You have to specify your class attribute explicitly in the Java program because, when you apply the StringToWordVector filter, your input gets divided among the specified n-grams. Hence the class attribute's location changes once StringToWordVector vectorizes the input. You can just use the Reorder filter, which will place the class attribute at the last position, since Weka picks the last attribute as the class attribute by default.
More info about Reordering in Weka can be found at http://weka.sourceforge.net/doc.stable-3-8/weka/filters/unsupervised/attribute/Reorder.html. Also example 5 at http://www.programcreek.com/java-api-examples/index.php?api=weka.filters.unsupervised.attribute.Reorder may help you in doing reordering.
Hope it helps.
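A sketch of that reordering, reusing outputInstances from the question's code and assuming the class is the first attribute after filtering:
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Reorder;
// move attribute 1 (the class) to the end, then tell Weka it is the class
Reorder reorder = new Reorder();
reorder.setAttributeIndices("2-last,1");
reorder.setInputFormat(outputInstances);
Instances reordered = Filter.useFilter(outputInstances, reorder);
reordered.setClassIndex(reordered.numAttributes() - 1);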
Your process for obtaining TF-IDF seems correct.
According to my experiments, if you have n classes, Weka writes explicit labels for records of n-1 classes, and records of the remaining class are implied.
In your case you have the two classes -1 and 1, so Weka shows the label in records with class -1, while records with label 1 are implied. (The underlying reason is the sparse ARFF format: values equal to the first declared nominal value, which has index 0, are omitted, and here that first value is 1.)

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited and represent multi-select questions (i.e. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6, , , , 3, 1, 1, 0, 0, 0
I imagine this is a relatively common issue to deal with, but I have not been able to find it covered for R. Any ideas how to do this in R after importing the .csv? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is that it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of responses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this. My advice is thus more general, so just feel free to clarify and I will try to give a more concrete response.
1) You say that you are coding the survey yourself, which is great because it means you have influence over your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in your second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers, and the next person only checks 1 (i.e. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to run a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many other things) would be a good reason to consider saving your data to a MySQL database instead of .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Beyond all the personal opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys, but it gives a lot of R code and examples, which should be a practical start.
To avoid re-inventing the wheel, you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. The TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (I only tested pbsurvey).
Multiple-choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one a binary variable (0/1). I see that you have the values 3 5 99 for the Q4_M variable in the first example. Does that mean you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame of 0s, so what you need to do is get 1s into the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you may want to reconsider how you create the data.frame of _M variables, so that you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
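A tiny sketch of that logical-matrix assignment (the dimensions and indices are made up for illustration):
# dtf: data.frame of 0s, one column per alternative; m: logical matrix,
# TRUE wherever a respondent chose that alternative
dtf <- as.data.frame(matrix(0, nrow = 2, ncol = 5))
m <- matrix(FALSE, nrow = 2, ncol = 5)
m[1, c(3, 5)] <- TRUE # respondent 1 picked alternatives 3 and 5
dtf[m] <- 1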
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly, but it is what I have to work with (the survey is coded and going into use next week). This is what I came up with from all the responses. I am sure it is not the most elegant or efficient way to do it, but I think it should work.
colnums <- grep("_M", colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
  vec <- as.vector(dat[, i])                  # turn into vector
  b <- lapply(strsplit(vec, " "), as.numeric) # split up and turn into numeric
  vals <- sort(unique(unlist(b)))             # which values were used
  newcolnames <- paste(colnames(dat)[i], "_", vals, sep = "") # column names
  e <- matrix(nrow = responses, ncol = length(vals)) # new matrix for indicators
  colnames(e) <- newcolnames
  # next loop looks for responses and puts indicators in the correct places
  for (j in 1:responses) {
    e[j, ] <- ifelse(vals %in% b[[j]], 1, 0)
  }
  dat <- cbind(dat, e)
}
Suggestions for improvement are welcome.
