Jena: reasoning over 2 datasets and adding new entities in the rule consequence

I have a result binding in a Jena QuerySolution (say sol1: user=u1, location=loc1, locationType=type1), and I want to extend my result bindings with a set of Jena rules applied over a dataset. In fact, given sol1, and given
loc1 location:ispartof loc2
loc2 rdf:type loc2Type
in my dataset, I want to add this new solution to my result set:
sol2: user=u1, location=loc2, locationType=loc2Type
To do that, I need to add my solution set to my dataset, write a rule like
@prefix pre: <http://jena.hpl.hp.com/prefix#>.
[rule1: (?sol pre:user ?a) (?sol pre:location ?b) (?sol pre:locationType ?c) (?b location:ispartof ?d) (?d rdf:type ?type) -> (sol2 pre:user ?a) (sol2 pre:location ?d) (sol2 pre:locationType ?type) ]
and do inference based on the rule above. Afterwards, to extract all solutions from the dataset, I need to query it with
PREFIX pre: <http://jena.hpl.hp.com/prefix#>
SELECT * WHERE { ?sol pre:user ?a . ?sol pre:location ?b . ?sol pre:locationType ?c . }
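For concreteness, a rough sketch of this pipeline with the Jena rule engine might look as follows. This is only a sketch: the org.apache.jena package names are the modern ones, the data files and rules.txt are placeholders, and the union model is one way of running the rule without materialising the solutions into the big dataset (relevant to question 1 below).

import java.util.List;

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.Reasoner;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;
import org.apache.jena.riot.RDFDataMgr;

public class ExtendSolutions {
    public static void main(String[] args) {
        // placeholder files: the big dataset and the materialised solution bindings
        Model data = RDFDataMgr.loadModel("data.ttl");
        Model solutions = RDFDataMgr.loadModel("solutions.ttl");

        // reason over a union so neither source model is modified on its own
        Model union = ModelFactory.createUnion(solutions, data);

        // rules.txt would contain rule1 from above
        List<Rule> rules = Rule.rulesFromURL("file:rules.txt");
        Reasoner reasoner = new GenericRuleReasoner(rules);
        InfModel inf = ModelFactory.createInfModel(reasoner, union);

        String q = "PREFIX pre: <http://jena.hpl.hp.com/prefix#> "
                 + "SELECT * WHERE { ?sol pre:user ?a . "
                 + "                 ?sol pre:location ?b . "
                 + "                 ?sol pre:locationType ?c . }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, inf)) {
            ResultSetFormatter.out(System.out, qe.execSelect());
        }
    }
}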
Now my problems are:
1) Is there any way to prevent adding my solutions to my big dataset, by writing a reasoning rule over 2 datasets?
2) How can I individually name each new solution in the rule consequence?
Thanks.

Related

Inferring Inverse Property in Protege

I have created the relationship A 'is functional parent of' B and defined 'has functional parent' as the inverse of 'is functional parent of'. 'A' and 'B' are both subclasses of 'chemical entity'.
I want Protege to infer B 'has functional parent' A. The query 'has functional parent' some A fails.
Error #1: Not understanding open world
I realized that some implies that not all B have the relationship 'has functional parent' with 'A'. However, the query 'chemical entity' and 'has functional parent' still fails.
My ontology has no instances. I was hoping the query would find subclasses.
Turtle File
@prefix : <http://www.semanticweb.org/michaelchary/ontologies/2020/8/untitled-ontology-10#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@base <http://www.semanticweb.org/michaelchary/ontologies/2020/8/untitled-ontology-10> .
<http://www.semanticweb.org/michaelchary/ontologies/2020/8/untitled-ontology-10> rdf:type owl:Ontology .
#################################################################
# Object Properties
#################################################################
### http://www.semanticweb.org/michaelchary/ontologies/2020/8/untitled-ontology-10#hasFunctionalParent
:hasFunctionalParent rdf:type owl:ObjectProperty ;
                     owl:inverseOf :isFunctionalParentOf .
### http://www.semanticweb.org/michaelchary/ontologies/2020/8/untitled-ontology-10#isFunctionalParentOf
:isFunctionalParentOf rdf:type owl:ObjectProperty .
#################################################################
# Classes
#################################################################
### http://www.semanticweb.org/michaelchary/ontologies/2020/8/untitled-ontology-10#A
:A rdf:type owl:Class ;
   rdfs:subClassOf :Z ,
                   [ rdf:type owl:Restriction ;
                     owl:onProperty :isFunctionalParentOf ;
                     owl:someValuesFrom :B
                   ] .
### http://www.semanticweb.org/michaelchary/ontologies/2020/8/untitled-ontology-10#B
:B rdf:type owl:Class ;
   rdfs:subClassOf :Z .
### http://www.semanticweb.org/michaelchary/ontologies/2020/8/untitled-ontology-10#Z
:Z rdf:type owl:Class .
### Generated by the OWL API (version 4.5.9.2019-02-01T07:24:44Z) https://github.com/owlcs/owlapi
From the axioms you stated in your ontology, there is absolutely nothing from which the reasoner can derive that B hasFunctionalParent A.
To understand why this is the case, it is helpful to think in terms of individuals, even though your ontology does not include any explicit individuals. When the reasoner runs, it tries to generate a model based on the axioms in the ontology. A model consists of generated individuals that adhere to the axioms of your ontology.
For illustration purposes, let us assume the universe of individuals consists of the following numbers:
Domain = {0, 1, 2, 3, 4, 5, 6, 7},
Z = {1, 2, 3, 5, 6, 7},
A = {5, 7} and
B = {2, 3, 6}
Then you have an object property hasFunctionalParent with its inverse. For short I will refer to hasFunctionalParent as R and its inverse as invR. What do R and invR mean? They basically state that when 2 individuals in our domain are related via R, they are also related via invR. That is, if we have R(1, 2), then invR(2, 1) also holds.
Stating that A subClassOf invR some B implies that each individual of A is related via invR to at least 1 individual of B. Thus, if we have invR(5, 2) and invR(7, 3), we will also have R(2, 5) and R(3, 7). However, this says nothing about the class B in general. It is completely possible that R(6, 0) holds. Therefore the reasoner cannot infer that B hasFunctionalParent A.
To get B and Z for the query "find the super classes of hasFunctionalParent some B" (that means "superclasses" must be ticked in Protege when doing the query) you have to state that isFunctionalParentOf has domain A and range B. This states that whenever 2 individuals x and y are related via isFunctionalParentOf, we can assume x is an instance of A and y is an instance of B.
Lastly, you will note that you will need to use the DL query tab in Protege to get to this inference. In particular it is not shown as part of the inferences after reasoning. Why is that? That is because Protege only shows inferences of named classes. hasFunctionalParent some B is an anonymous class, therefore this inference is not shown. A trick to make this show in Protege is to add an arbitrary concept say X that you set as equivalent to hasFunctionalParent some B. If you now run the reasoner, Protege will infer that X subClassOf B.
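A sketch of those two additions in the question's own Turtle might look like this (the class name :X is arbitrary, as noted above):

### suggested domain and range axioms
:isFunctionalParentOf rdf:type owl:ObjectProperty ;
                      rdfs:domain :A ;
                      rdfs:range :B .

### a named class equivalent to the anonymous class, so the inference
### becomes visible after reasoning
:X rdf:type owl:Class ;
   owl:equivalentClass [ rdf:type owl:Restriction ;
                         owl:onProperty :hasFunctionalParent ;
                         owl:someValuesFrom :B
                       ] .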

The tensor product ti() in the mgcv GAM package gives incorrect results

I am surprised to find that it is somewhat difficult to obtain a correct fit of an interaction function from gam().
To be more specific, I want to estimate the additive model
y = m_1(x) + m_2(z) + m_12(x,z) + u,
where m_1(x) = x^2, m_2(z) = z^2 and m_12(x,z) = x*z. The following code generates this model:
test1 <- function(x, z, sx = 1, sz = 1) {
  #--m1(x) function
  m.x <- x^2
  m.x <- m.x - mean(m.x)
  #--m2(z) function
  m.z <- z^2
  m.z <- m.z - mean(m.z)
  #--m12(x,z) function
  m.xz <- x * z
  m.xz <- m.xz - mean(m.xz)
  m <- m.x + m.z + m.xz
  return(list(m = m, m.x = m.x, m.z = m.z, m.xz = m.xz))
}
n <- 1000
a=0
b=2
x <- runif(n,a,b)/20
z <- runif(n,a,b)
u <- rnorm(n,0,0.5)
model<-test1(x,z)
y <- model$m + u
So I fit the model with gam() (from the mgcv package) as
library(mgcv)
b3 <- gam(y ~ ti(x) + ti(z) + ti(x, z))
vis.gam(b3);title("tensor anova")
#---extracting basis matrix
B.f3<-model.matrix.gam(b3)
#---extracting series estimator
b3.hat<-b3$coefficients
Question: when I plot the estimated functions from gam() above against the true functions, I end up with
par(mfrow=c(1,3))
#---m1(x)
B.x<-B.f3[,c(2:5)]
b.x.hat<-b3.hat[c(2:5)]
plot(x,B.x%*%b.x.hat)
points(x,model$m.x,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
#---m2(z)
B.z<-B.f3[,c(6:9)]
b.z.hat<-b3.hat[c(6:9)]
plot(z,B.z%*%b.z.hat)
points(z,model$m.z,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
#---m12(x,z)
B.xz<-B.f3[,-c(1:9)]
b.xz.hat<-b3.hat[-c(1:9)]
plot(x,B.xz%*%b.xz.hat)
points(x,model$m.xz,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
However, the estimate of m_1(x) is largely different from x^2, and the estimate of the interaction m_12(x,z) is also largely different from the x*z defined in test1 above. The results are the same if I use predict(b3).
I really can't figure it out. Can anybody explain why the results end up like this? Greatly appreciated!
First, the problem above is not caused by the package. It is closely related to the identification conditions of the smooth functions. One common practice is to impose the assumptions E(m_j(.)) = 0 for every individual function j = 1, ..., d, and E(m_ij(x_i, x_j) | x_i) = E(m_ij(x_i, x_j) | x_j) = 0 for i not equal to j. These conditions require centered basis functions in the series estimator, which mgcv already uses. However, in the case above, the interaction m_12(x,z) = x*z defined in test1 does not satisfy these identification assumptions, since the integral of x*z with respect to either x or z is not zero when x and z range from zero to two.
Furthermore, a series estimator also identifies the individual and interaction functions if one imposes m_j(0) = 0 or m_ij(0, x_j) = m_ij(x_i, 0) = 0, which can readily be achieved by centering the basis functions around zero. I have tried both cases, and they work well whenever the DGP satisfies the identification conditions.
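For illustration, here is a small sketch of a data-generating process that does satisfy those conditions (x and z symmetric around zero, so E(xz | x) = E(xz | z) = 0); the ti() fits should then track the true components much more closely:

library(mgcv)
set.seed(1)
n <- 1000
x <- runif(n, -1, 1)            # symmetric around zero
z <- runif(n, -1, 1)
m.x  <- x^2 - mean(x^2)         # centred main effects
m.z  <- z^2 - mean(z^2)
m.xz <- x * z                   # conditional means are zero on [-1, 1]
y <- m.x + m.z + m.xz + rnorm(n, 0, 0.5)
b <- gam(y ~ ti(x) + ti(z) + ti(x, z))
plot(b, pages = 1)              # compare the fitted smooths with the true components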

GeoSPARQL functions and spatial reference systems (SRS)

I am trying to represent in an ontology a few geometric objects (polygon, lines, points, etc.) and calculate their spatial/topological relations, through the adoption of GeoSPARQL relevant functions (sfTouches, sfEquals, sfContains, etc.). I am using GraphDB, with the GeoSPARQL plugin enabled.
I have seen that in the WKT representation of the geometric object, GeoSPARQL uses the concept of a default spatial reference system (i.e. the <http://www.opengis.net/def/crs/OGC/1.3/CRS84> URI which corresponds to the WGS84 coordinate reference system (CRS)). However, in my use case, the coordinates of the geometrical objects actually correspond to values in a 2D Cartesian coordinate system.
I found in the EPSG Geodetic Parameter Registry the proper CRS for representing Cartesian coordinates and I attached the proper URI in the WKT representation, but the GeoSPARQL functions do not return any result or error.
My question is the following: "Do the GeoSPARQL functions operate properly when representing spatial objects in any other type of CRS, apart from the default one?".
Thank you in advance.
Currently GDB does not support alternative CRS in WKT literals but supports them in GML literals (issue GDB-3142). GML literals are slightly more complex but still easy enough to generate, let us know if you need help with that.
However, I question your assertion that you have Cartesian coordinates. On one hand, any pair (lat, long) or (northing, easting) is a Cartesian coordinate. On the other hand, since the Earth is not flat, any CRS or projection method is only an approximation, and many of them are tuned for specific localities.
So please tell us which EPSG CRS you picked, and a bit about the locality of your data.
Your example, slightly reformatted, and using normal turtle shortenings:
ex:polygon_ABCD rdf:type ex:ExampleEntity ;
geo:hasGeometry ex:geometry_polygon_ABCD .
ex:geometry_polygon_ABCD a geo:Geometry, sf:Polygon ;
geo:asWKT "<opengis.net/def/cs/EPSG/0/4499> Polygon((389.0 1052.0, 563.0 1052.0, 563.0 1280.0, 389.0 1280.0, 389.0 1052.0))"^^geo:wktLiteral .
ex:point_E rdf:type ex:ExampleEntity ;
geo:hasGeometry ex:geometry_point_E .
ex:geometry_point_E a geo:Geometry, sf:Point ;
geo:asWKT "<opengis.net/def/cs/EPSG/0/4499> Point(400.0 1100.0)"^^geo:wktLiteral ; .
You must use a specific URL for the CRS and cannot omit http:, so the correct URL is http://www.opengis.net/def/crs/EPSG/0/4499.
But you can see from the returned description that this CRS is applicable to "China - onshore and offshore between 120°E and 126°E". I'm not an expert in geo projections, so I can't guarantee whether this CRS will satisfy your need to "leave my coordinates alone, they are just meters". I'd look for a UK (Ordnance Survey) CRS with easting and northing coordinates.
To learn how to format GML:
see the GeoSPARQL spec (OGC 11-052r4) p. 18, which gives an example of gml:Point.
then google for gml:Polygon. There are many links but one that gives examples is http://www.georss.org/gml.html
Armed with this knowledge, we can reformat your example to GML:
ex:polygon_ABCD rdf:type ex:ExampleEntity ;
geo:hasGeometry ex:geometry_polygon_ABCD .
ex:geometry_polygon_ABCD a geo:Geometry, sf:Polygon ;
  geo:asGML """
    <gml:Polygon xmlns:gml="http://www.opengis.net/gml" srsName="http://www.opengis.net/def/crs/EPSG/0/TODO">
      <gml:exterior>
        <gml:LinearRing>
          <gml:posList>
            389.0 1052.0, 563.0 1052.0, 563.0 1280.0, 389.0 1280.0, 389.0 1052.0
          </gml:posList>
        </gml:LinearRing>
      </gml:exterior>
    </gml:Polygon>
  """^^geo:gmlLiteral.
ex:point_E rdf:type ex:ExampleEntity ;
geo:hasGeometry ex:geometry_point_E .
ex:geometry_point_E a geo:Geometry, sf:Point ;
  geo:asGML """
    <gml:Point xmlns:gml="http://www.opengis.net/gml" srsName="http://www.opengis.net/def/crs/EPSG/0/TODO">
      <gml:pos>
        400.0 1100.0
      </gml:pos>
    </gml:Point>
  """^^geo:gmlLiteral.
The """ (long quote) allows us to use " inside the literal without quoting
replace TODO with the better CRS you picked
the documentation http://graphdb.ontotext.com/documentation/master/enterprise/geosparql-support.html#geosparql-examples gives an example similar to yours but it cheats a bit because all coordinates are in the range (-90,+90) so it can just use WGS.
after you debug using geof: topology functions, turn on indexing and switch to geo: predicates, because the functions are slow (they check every geometry against every other) while the predicates use the special geo index
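For example, a hypothetical pair of queries illustrating the two styles (prefixes and the geo:hasGeometry/geo:asGML property path are assumptions, not taken from your data):

PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

# debugging version: the function is evaluated for every pair of geometry literals
SELECT ?a ?b WHERE {
  ?a geo:hasGeometry/geo:asGML ?ga .
  ?b geo:hasGeometry/geo:asGML ?gb .
  FILTER (?a != ?b && geof:sfContains(?ga, ?gb))
}

# indexed version: the geo: predicate is answered from the spatial index
SELECT ?a ?b WHERE {
  ?a geo:sfContains ?b .
  FILTER (?a != ?b)
}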
Let me know how it goes!

What's the good practice for programming with dynamic inputs in dplyr 0.3?

My original intention in doing this is to integrate dplyr with Shiny.
Prior to 0.3 I used the eval(parse(text = ...)) and do.call() approaches.
In 0.3, I saw two more options, for example:
var <- c('disp','hp')
select_(mtcars,.dots = as.lazy_dots(var))
select(mtcars,one_of(var))
But which one is better? I intend to pass selectInput values from a Shiny app to dplyr to do data transformations.
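For what it's worth, a minimal sketch of how either form could sit inside a Shiny server function (the input id "vars" and the table output are hypothetical):

library(shiny)
library(dplyr)
library(lazyeval)
# assumes the UI has selectInput("vars", ..., choices = names(mtcars), multiple = TRUE)
server <- function(input, output) {
  output$tbl <- renderTable({
    cols <- input$vars
    select_(mtcars, .dots = as.lazy_dots(cols))
    # or equivalently: select(mtcars, one_of(cols))
  })
}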
Another question: what is the right way to join two different datasets on dynamic but differently named key columns? Is there anything I can leverage in 0.3?
For example, col_a and col_b are the key variables to join on from datasets a and b:
left_join(dataset_a,dataset_b, by=c(col_a=col_b))
Thanks.
After a few attempts, here is my solution for the 2nd question: use a function to create a named vector, and then feed it to left_join.
joinCol_a = xxx
joinCol_b = xxx
f <- function(a, b) {
  vec <- c(b)        # value: the key column name in dataset_b
  names(vec) <- a    # name: the key column name in dataset_a
  return(vec)
}
left_join(dataset_a,dataset_b,by=f(joinCol_a,joinCol_b))
I know it's not the best solution but this is what I can think of so far.
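A slightly more compact equivalent, since base R's setNames builds the same named vector that f() returns:

# same effect as f(joinCol_a, joinCol_b)
left_join(dataset_a, dataset_b, by = setNames(joinCol_b, joinCol_a))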

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
processed_q2
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of reponses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this, so my advice is more general – just feel free to clarify and I will try to give a more concrete response.
1) You say that you are coding the survey yourself, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers, and the next person only checks 1 (i.e. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many other things) would be a good reason to think about saving your data to a MySQL database instead of .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys, but it gives a lot of R code and examples, which should be a practical start.
To prevent re-inventing the wheel you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides, the TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (I have only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have the values 3 5 99 for the Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
now you have a data.frame of 0s, so what you need to do is put 1s in the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you have to change spaces to commas and convert the character value to a numeric vector
# (here x is one cell of a multi-select column, e.g. "3 5 99")
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix of logicals and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame and m is the matrix of logicals.
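A rough sketch of that logical-matrix idea, with illustrative names: resp is the list of numeric response vectors (one element per respondent, as returned by strsplit/as.numeric) and dtf_ins is the all-zero indicator data frame created above, whose j-th column corresponds to alternative j.

# one row per respondent, one column per alternative
m <- t(vapply(resp, function(r) seq_len(ncol(dtf_ins)) %in% r, logical(ncol(dtf_ins))))
dtf_ins[m] <- 1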
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
vec <- as.vector(dat[,i]) #turn into vector
b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
c <- sort(unique(unlist(b))) #which values were used
newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
colnames(e) <- newcolnames
#next loop looks for responses and puts indicators in the correct places
for (i in 1:responses) {
e[i,] <- ifelse(c %in% b[[i]],1,0)
}
dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.
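One possible tightening in the "avoid for loops" spirit of the earlier answer: inside the outer loop, the inner loop and the matrix pre-allocation could be replaced by a single vapply call (variable names as in the block above):

e <- t(vapply(b, function(resp) as.integer(vals %in% resp), integer(length(vals))))
colnames(e) <- newcolnames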
