I have the following structure of, let's say, products in a store. Each product has a rating. In addition, users can rate each product themselves. Let's say there are 300k products plus 50k user ratings (for each product).
Question 1: Are subdocuments the right choice? I am adding everything with SolrJ; I did not find any other suitable way of doing that.
For the sake of an example, you could copy and paste the following into your collection:
<add>
<doc>
<field name="id">1</field>
<field name="title" >Product Title LOLO</field>
<field name="content_type" >parent</field>
<field name="rating_f" >7</field>
<doc>
<field name="id">1</field>
<field name="user_id_s" >123</field>
<field name="userrating_f" >1.2</field>
</doc>
</doc>
<doc>
<field name="id">2</field>
<field name="title" >Product Title LULU</field>
<field name="content_type" >parent</field>
<field name="rating_f" >2</field>
</doc>
<doc>
<field name="id">3</field>
<field name="title" >Product Title LALA</field>
<field name="content_type" >parent</field>
<field name="rating_f" >1.4</field>
<doc>
<field name="id">1</field>
<field name="user_id_s" >123</field>
<field name="userrating_f" >5</field>
</doc>
</doc>
</add>
Question 2 (The Important one): How can I query this index now, so that the documents are scored with a boost on the user-rating first (if one exists) and then by the product rating (and then by other fields, like the creation date, views, buys, ...)? Is that even possible?
I was looking into something like this:
{!parent which="content_type:parent"}(user_id_s:123 AND _val_:userrating_f)^2.0 _val_:rating_f^2.0 *:*
That should return the documents in this order (ids): 3, 1, 2
But instead it returns:
{
"responseHeader": {
"status": 500,
"QTime": 1,
"params": {
"indent": "true",
"q": "{!parent which=\"content_type:parent\"}(user_id_s:123 AND _val_:userrating_f)^2.0 _val_:rating_f^2.0 *:*",
"_": "1421996862814",
"wt": "json"
}
},
"error": {
"msg": "child query must only match non-parent docs, but parent docID=3 matched childScorer=class org.apache.lucene.search.DisjunctionSumScorer",
"trace": "java.lang.IllegalStateException: child query must only match non-parent docs, but parent docID=3 matched childScorer=class org.apache.lucene.search.DisjunctionSumScorer\n\tat org.apache.lucene.search.join.ToParentBlockJoinQuery$BlockJoinScorer.nextDoc(ToParentBlockJoinQuery.java:344)\n\tat org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)\n\tat org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)\n\tat org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)\n\tat org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)\n\tat org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)\n\tat org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:209)\n\tat org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1619)\n\tat org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)\n\tat org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:514)\n\tat org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:485)\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)\n\tat org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:368)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)\n\tat org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat java.lang.Thread.run(Thread.java:745)\n",
"code": 500
}
}
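SolrInputDocument keeps its children in a collection (addChildDocument() to add them, getChildDocuments() to read them back). Use that.
To propagate the score from the child query to the parent documents you need https://issues.apache.org/jira/browse/SOLR-5882
The problem is that function queries match every document, parents included, so the child query is no longer limited to non-parent documents, and that causes the exception. Intersect the child query with +content_type:child (the child documents need to be indexed with that field for this to work).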
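A minimal sketch of that last point, written in Ruby only to assemble the request parameters (with SolrJ the same parameter names would be set on the query object). It assumes the child documents carry content_type:child, which the sample data above does not do yet, and the parameter name childq is made up for illustration. It is only meant to avoid the exception; how the child score reaches the parent still depends on SOLR-5882.

```ruby
require 'uri'

# Child clause, restricted to child documents so that the function query
# (which matches every document) can no longer match a parent.
# Assumption: child documents are indexed with content_type:child.
def child_query(user_id)
  "+content_type:child +user_id_s:#{user_id} +_val_:userrating_f"
end

# "childq" is a hypothetical parameter name; the {!parent} parser picks it
# up through the v=$childq local-param reference.
def solr_params(user_id)
  {
    'q'      => '_query_:"{!parent which=content_type:parent v=$childq}" _val_:rating_f *:*',
    'childq' => child_query(user_id),
    'wt'     => 'json'
  }
end

params = solr_params(123)
query_string = URI.encode_www_form(params)
puts query_string
```

Keeping the child clause in its own parameter makes it easier to see that only the part inside {!parent} is evaluated against child documents, while _val_:rating_f and *:* stay at the parent level.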
I am trying to process some XML that comes from an application called TeleForm. This is form-scanning software that captures the data and writes it to XML. This is a snippet of the XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Records>
<Record>
<Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
<Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
<Field id="Withdrew" type="string" length="1"></Field>
</Record>
<Record>
<Field id="ImageFilename" type="string" length="14"><Value>00000022000001</Value></Field>
<Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
<Field id="Withdrew" type="string" length="1"></Field>
</Record>
</Records>
I've dealt with this in another system, probably using a custom parser we wrote. I figured it would be no problem in Rails, but I was wrong.
Parsing this with Hash.from_xml or with Nokogiri does not give me the results I expected. I get:
{"Records"=>{"Record"=>[{"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]},
{"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]}]}}
After spending way too much time on this, I discovered that if I gsub out the type and length attributes, I get what I expected (the result below is still wrong, because I only removed them on the first Record node):
{"Records"=>{"Record"=>[{"Field"=>[{"id"=>"ImageFilename", "Value"=>"00000022000000"},
{"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, {"id"=>"Withdrew"}]},
{"Field"=>["", {"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>"3"}, ""]}]}}
Not being well versed in XML, I assume this style of XML, with its type and length attributes, is meant to drive a conversion to data types. In that case, I can understand why the "Withdrew" field showed up as empty, but I don't understand why "ImageFilename" was empty, since it is a 14-character string.
I've got the workaround with gsub, but is this invalid XML? Would adding a DTD (which TeleForm should have provided) give me different results?
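On the validity question: the snippet is well-formed XML, and type and length are ordinary attributes as far as XML itself is concerned. One way to check is to run it through a parser that does no type casting. The sketch below uses REXML instead of Hash.from_xml (a swapped-in parser, not what the question uses); the helper name records_from and the shortened sample (XML declaration omitted) are only for illustration.

```ruby
require 'rexml/document'

# Shortened copy of the snippet above, without the XML declaration.
SAMPLE_XML = <<~XML_DOC
  <Records>
    <Record>
      <Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
      <Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
      <Field id="Withdrew" type="string" length="1"></Field>
    </Record>
  </Records>
XML_DOC

# Returns one { field id => value or nil } hash per <Record>.
# Values stay strings; no casting based on the type attribute happens here.
def records_from(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('/Records/Record').map do |record|
    record.get_elements('Field').each_with_object({}) do |field, row|
      value = field.elements['Value']
      row[field.attributes['id']] = value && value.text
    end
  end
end

records = records_from(SAMPLE_XML)
p records
```

If the parse succeeds and ImageFilename comes back with its 14 characters, the empty strings in the Hash.from_xml output are more likely a result of how that method treats the type attribute than of a problem in the XML.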
EDIT
I'll provide a possible answer to my own question with some code as an edit. The code follows some of the features in the one answer I did receive from Mark Thomas, but I decided against Nokogiri for the following reasons:
The XML is consistent and always contains the same tags (/Records/Record/Field) and attributes.
There can be several hundred records in each XML file, and Nokogiri seems a little slow with only 26 records.
I figured out how to get Hash.from_xml to give me what I expected (it does not like type="string"), and I only use the hash to populate a class.
An expanded version of the XML with one complete record:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Records>
<Record>
<Field id="ImageFilename" type="string" length="14"><Value>00000022000000</Value></Field>
<Field id="DocID" type="string" length="15"><Value>731192AIINSC</Value></Field>
<Field id="FormID" type="string" length="6"><Value>AIINSC</Value></Field>
<Field id="Availability" type="string" length="18"><Value>M T W H F S</Value></Field>
<Field id="Criterion_1" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_2" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_3" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_4" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_5" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_6" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_7" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_8" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_9" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_10" type="number" length="2"><Value>3</Value></Field>
<Field id="Criterion_11" type="number" length="2"><Value>0</Value></Field>
<Field id="Criterion_12" type="number" length="2"><Value>0</Value></Field>
<Field id="Criterion_13" type="number" length="2"><Value>0</Value></Field>
<Field id="Criterion_14" type="number" length="2"><Value>0</Value></Field>
<Field id="Criterion_15" type="number" length="2"><Value>0</Value></Field>
<Field id="DayTraining" type="string" length="1"><Value>Y</Value></Field>
<Field id="SaturdayTraining" type="string" length="1"></Field>
<Field id="CitizenStageID" type="string" length="12"><Value>731192</Value></Field>
<Field id="NoShow" type="string" length="1"></Field>
<Field id="NightTraining" type="string" length="1"></Field>
<Field id="Withdrew" type="string" length="1"></Field>
<Field id="JobStageID" type="string" length="12"><Value>2292</Value></Field>
<Field id="DirectHire" type="string" length="1"></Field>
</Record>
</Records>
I am only experimenting with a workflow prototype to replace an aging system written in 4D and Active4D. This area of processing TeleForm data was implemented as a batch operation, and it may still revert to that. I am just trying to merge some of the old viable concepts into a new Rails implementation. The XML files are on a shared server and will probably have to be moved into the web root, with some trigger set to process the files.
I am still in the defining stage, but my module/classes to handle the InterviewForm is looking like this and may change (with little error trapping, still trying to get into testing and my Ruby is not as good as it should be after playing with Rails for about 5 years!):
module Teleform
  module InterviewForm
    class Form < Prawn::Document
      # Not relevant to this question, but this class generates the forms from a fillable PDF template and
      # relevant model data.
      # These forms, when completed, are what TeleForm processes to produce the XML.
    end

    class RateForms
      attr_accessor :records, :results

      def initialize(xml_path)
        fields = []
        xml = File.read(xml_path)
        # Hash.from_xml does not like a type of "string"
        hash = Hash.from_xml(xml.gsub(/type="string"/, 'type="text"'))
        hash["Records"]["Record"].each do |record|
          # extract the fields from each record
          fields << record["Field"]
        end
        @records = []
        fields.each do |field|
          # build the records for the form
          @records << Record.new(field)
        end
        @results = rate_records
      end

      def rate_records
        # not relevant to the question, but this is where the data is processed and a bunch of stuff takes place
        return "Any errors"
      end
    end

    class Record
      attr_accessor(*[:image_filename, :doc_id, :form_id, :availability, :criterion_1, :criterion_2,
                      :criterion_3, :criterion_4, :criterion_5, :criterion_6, :criterion_7, :criterion_8,
                      :criterion_9, :criterion_10, :criterion_11, :criterion_12, :criterion_13, :criterion_14, :criterion_15,
                      :day_training, :saturday_training, :citizen_stage_id, :no_show, :night_training, :withdrew, :job_stage_id, :direct_hire])

      def initialize(fields)
        fields.each do |field|
          # "number" fields are converted to integers; everything else stays a string (or nil)
          if field["type"] == "number"
            try("#{field["id"].underscore}=", field["Value"].to_i)
          else
            try("#{field["id"].underscore}=", field["Value"])
          end
        end
      end
    end
  end
end
Thanks for adding the additional information that this is a rating for an interviewee. Using this domain information in your code will likely improve it. You haven't posted any code, but generally using domain objects leads to more concise and more readable code.
I recommend creating a simple class representing a Rating, rather than transforming data from XML to a data structure.
class Rating
  attr_accessor :image_filename, :criterion_1, :withdrew
end
Using the above class, here's one way to extract the fields from the XML using Nokogiri.
require 'nokogiri'

doc = Nokogiri::XML(xml)
ratings = []
doc.xpath('//Record').each do |record|
  rating = Rating.new
  # at_xpath returns nil when a Field has no Value child; nil.to_s gives ""
  rating.image_filename = record.at_xpath('Field[@id="ImageFilename"]/Value/text()').to_s
  rating.criterion_1 = record.at_xpath('Field[@id="Criterion_1"]/Value/text()').to_s
  rating.withdrew = record.at_xpath('Field[@id="Withdrew"]/Value/text()').to_s
  ratings << rating
end
Now, ratings is a list of Rating objects, each with methods to retrieve the data. This is a lot cleaner than delving into a deep data structure. You could even improve on the Rating class further, for example by creating a withdrew? method that returns true or false.
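A minimal sketch of such a predicate. It assumes that a ticked box arrives as a non-blank value (the expanded XML in the question shows "Y" for DayTraining) and an unticked one as an empty string or nil; that encoding is an assumption about the TeleForm output, not something the XML guarantees.

```ruby
class Rating
  attr_accessor :image_filename, :criterion_1, :withdrew

  # Assumption: any non-blank value (for example "Y") means the box was ticked;
  # an empty string or nil means it was not.
  def withdrew?
    !withdrew.to_s.strip.empty?
  end
end

rating = Rating.new
rating.withdrew = ''
puts rating.withdrew?

rating.withdrew = 'Y'
puts rating.withdrew?
```

Calling code can then ask rating.withdrew? instead of comparing strings everywhere.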
It appears XmlSimple (by maik) is better suited for this task than the unreliable and inconsistent Hash.from_xml implementation.
It is a port of the tried and tested Perl module of the same name, and it has several notable advantages:
It is consistent, whether you find one or many occurrences of a node.
It does not choke and garble the results.
It is able to distinguish between attributes and node content.
Running the same XML document from above through the parser:
XmlSimple.xml_in xml
This produces the following result:
{"Record"=>
[{"Field"=>
[{"id"=>"ImageFilename", "type"=>"string", "length"=>"14", "Value"=>["00000022000000"]},
{"id"=>"DocID", "type"=>"string", "length"=>"15", "Value"=>["731192AIINSC"]},
{"id"=>"FormID", "type"=>"string", "length"=>"6", "Value"=>["AIINSC"]},
{"id"=>"Availability", "type"=>"string", "length"=>"18", "Value"=>["M T W H F S"]},
{"id"=>"Criterion_1", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_2", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_3", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_4", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_5", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_6", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_7", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_8", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_9", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_10", "type"=>"number", "length"=>"2", "Value"=>["3"]},
{"id"=>"Criterion_11", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"Criterion_12", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"Criterion_13", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"Criterion_14", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"Criterion_15", "type"=>"number", "length"=>"2", "Value"=>["0"]},
{"id"=>"DayTraining", "type"=>"string", "length"=>"1", "Value"=>["Y"]},
{"id"=>"SaturdayTraining", "type"=>"string", "length"=>"1"},
{"id"=>"CitizenStageID", "type"=>"string", "length"=>"12", "Value"=>["731192"]},
{"id"=>"NoShow", "type"=>"string", "length"=>"1"},
{"id"=>"NightTraining", "type"=>"string", "length"=>"1"},
{"id"=>"Withdrew", "type"=>"string", "length"=>"1"},
{"id"=>"JobStageID", "type"=>"string", "length"=>"12", "Value"=>["2292"]},
{"id"=>"DirectHire", "type"=>"string", "length"=>"1"}]
}]
}
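Since the result is only used to populate objects, a structure of this shape can be flattened into one id-to-value hash per record. The sketch below works on a shortened hand-written literal in the same shape as the output above (not a fresh parser run), and the helper name flatten_record is made up for illustration.

```ruby
# Shortened literal in the shape shown above.
parsed = {
  'Record' => [
    { 'Field' => [
      { 'id' => 'ImageFilename', 'type' => 'string', 'length' => '14', 'Value' => ['00000022000000'] },
      { 'id' => 'Criterion_1',   'type' => 'number', 'length' => '2',  'Value' => ['3'] },
      { 'id' => 'Withdrew',      'type' => 'string', 'length' => '1' }
    ] }
  ]
}

# Flattens one record into { id => value }.
# Fields without a "Value" key become nil; "number" fields become Integers.
def flatten_record(record)
  record['Field'].each_with_object({}) do |field, row|
    raw = Array(field['Value']).first
    row[field['id']] = (field['type'] == 'number' && raw) ? raw.to_i : raw
  end
end

rows = parsed['Record'].map { |record| flatten_record(record) }
p rows
```

Because every Value is an array and missing values are simply absent keys, the same code path handles filled and empty fields without special cases.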
I am contemplating fixing the problem and providing Hash with a working implementation for from_xml and was hoping to find some feedback from others who reached the same conclusion. Surely we are not the only ones with these frustrations.
In the meantime we may find solace in knowing there is something lighter than Nokogiri and its full kitchen sink for this task.
nJoy!