Solr edismax with multiple function queries and boosting query in a request - solr4

I am looking for a way to introduce boosts on popularity (an integer value) and recency (a year in YYYY format) under the EDisMax handler, and I am getting confused by the bq, bf, boost and !boost syntax.
Also, when I use the !boost syntax, the df field becomes mandatory, and that restricts my searching on the qf fields configured in solrconfig.xml.
The cookbook4 doesn't offer any help on this either, so it was not worth buying :(
Can you please help me with this?
Thanks,
Manu


Unexpected result with apoc.text.sorensenDiceSimilarity?

I'm a bit confused about string similarity using Sorensen-Dice.
Apparently the order in which the parameters are passed makes a difference.
WITH
apoc.text.sorensenDiceSimilarity("+46xxxxx2260", "+46xxxxx2226") as score1,
apoc.text.sorensenDiceSimilarity("+46xxxxx2226", "+46xxxxx2260") as score2
RETURN
score1, score2
One of these scores (i.e. similarity coefficients) will say 1.0, the other 0.909090...
Does not make sense to me, but perhaps there's something with the algorithm I'm not aware of?
Any insight is appreciated.
P.S. "Neo4j Kernel", "3.5.9", "community"
This is definitely a bug and a good catch!
As an alternative, you can use the query below, which relies on the APOC functions toSet and intersection and on the text function split. The query contains a small hack to round to 4 decimal places: it multiplies by 10^4, applies ROUND, and then divides by 10^4. If you like my answer, please vote and accept it. Thanks.
WITH apoc.coll.toSet(split("+46xxxxx2260","")) as set1, apoc.coll.toSet(split("+46xxxxx2226","")) as set2
WITH set1, set2, apoc.coll.intersection(set1, set2) as common
RETURN ROUND(2*size(common)*10^4/(size(set1)+size(set2)))/10^4 as sorensenDiceSimilarity
Result:
0.9091
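For anyone who wants to check the arithmetic outside Neo4j, here is a minimal Python sketch of the same workaround. It assumes, as the query above does, that the coefficient is computed over the sets of distinct characters; it is not a reimplementation of apoc.text.sorensenDiceSimilarity, which may tokenise its input differently. The function name is made up for illustration.

```python
def dice_on_char_sets(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over the sets of distinct characters."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    common = set_a & set_b
    return 2 * len(common) / (len(set_a) + len(set_b))


# Same inputs as the question, passed in both orders.
score1 = dice_on_char_sets("+46xxxxx2260", "+46xxxxx2226")
score2 = dice_on_char_sets("+46xxxxx2226", "+46xxxxx2260")
print(round(score1, 4), round(score2, 4))
```

Because set intersection and the sum of the set sizes are both symmetric, this version gives the same score regardless of argument order.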

Best ways in text parsing for Job description (JD parsing)?

I am a developer with little knowledge of text parsing.
I need to parse job descriptions and extract the following fields from them:
Job Responsibilities,
Qualification,
Specialization,
Domain,
Skills Required,
Job Description,
Work Experience Min,
Work Experience Max,
Industry,
Occupation,
Functional Area,
Currency,
Salary,
Salary Type,
Employment Type,
Work Authorisation,
Required Visa Status,
Required English Level,
Country,
State,
City,
Zipcode,
Address of Job.
To accomplish this, I am using regex pattern matching, but the accuracy of the output is often low. It sometimes requires an exact pattern to identify a field, so it fails frequently.
I found other ways too.
Named Entity Recognition:
Using Stanford NLP, I am able to get the location and address. But I don't know how to train the model for the other fields, or whether that is possible at all.
Fuzzy logic:
Did some research on fuzzy logic to validate the results.
My questions are,
1. What are the approaches to accomplish the JD parsing?
2. How effective is NER?
3. Is there any sensible way to use fuzzy logic in JD text parsing?
Any help would be really appreciated.
I think you can try dependency parsing if regex doesn't work accurately. NER will not cover all the fields you need. Employment type is something I would like to learn about from you as well.
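To make the regex limitation from the question concrete, here is a small Python sketch that pulls minimum and maximum work experience out of free text. The pattern and the function name are assumptions for illustration only; real job descriptions use many more phrasings than this covers, which is exactly why pure pattern matching tends to fail.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: "3-5 years", "3 to 5 yrs", "5+ years", "2 years".
EXPERIENCE_RE = re.compile(
    r"(?P<min>\d+)\s*(?:-|to)\s*(?P<max>\d+)\s*(?:years?|yrs?)"
    r"|(?P<only>\d+)\s*\+?\s*(?:years?|yrs?)",
    re.IGNORECASE,
)


def extract_experience(text: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (min_years, max_years) or None if no phrase matched."""
    match = EXPERIENCE_RE.search(text)
    if match is None:
        return None
    if match.group("min") is not None:
        return int(match.group("min")), int(match.group("max"))
    return int(match.group("only")), None  # open-ended or single value


print(extract_experience("Requires 3-5 years of experience in Java"))
print(extract_experience("5+ years experience preferred"))
print(extract_experience("Experience: three to five years"))
```

The last call returns None because the numbers are spelled out, which shows how quickly an exact pattern breaks on ordinary variation.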

Google spreadsheets - how to handle duration: store and use in calculations?

I've got a lot of "duration" values - basically a race duration - in a format m:ss.millis [4:03.810 for example].
Currently GS handles it as text, but I would like to use those values for comparison and create some statistics.
Is this possible? I have read this: How to format a duration as HH:mm in the new Google sheets, but even though I have created custom formats, neither of them lets me use those values in calculations. GS always complains about the values being stored as text.
I guess I'm just doing something wrong, but I definitely need to be able to provide values in this format and be able to use them in calculations.
How can I do that?
I regret that Duration seems to be a useless abomination and Sheets appears to offer no relatively easy way to convert text representation to values. Pending a better answer I suggest you convert all your durations as below:
=(left(A1,find(":",A1)-1)+right(A1,6)/60)/1440
format as Number:
mm:ss.000
and then apply your formulae.
(Change , to ; if required by your locale.)
A shorter formula might be used to cajole TIMEVALUE to work by including an hour value of 0:
=TIMEVALUE("00:"&A1)
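As a cross-check of what the first formula computes, here is a minimal Python sketch that parses the same m:ss.millis text and returns both the duration and its value as a fraction of a day, which is how spreadsheet time values are stored. The function names are made up for illustration, and the sketch assumes the input always has exactly one colon.

```python
from datetime import timedelta


def parse_race_time(text: str) -> timedelta:
    """Parse 'm:ss.millis' (for example '4:03.810') into a timedelta."""
    minutes, seconds = text.strip().split(":")
    return timedelta(minutes=int(minutes), seconds=float(seconds))


def as_day_fraction(duration: timedelta) -> float:
    """Express a duration as a fraction of one day (1440 minutes)."""
    return duration / timedelta(days=1)


lap = parse_race_time("4:03.810")
print(lap.total_seconds())   # total seconds of the parsed duration
print(as_day_fraction(lap))  # same value as (4 + 3.810/60) / 1440

# Once parsed, durations compare and sort like numbers.
laps = [parse_race_time(t) for t in ["4:03.810", "3:59.120", "4:10.005"]]
fastest = min(laps)
```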
A happy coincidence brought me back here because only very recently have I found that Google Sheets does offer a way to convert Text to Number (or I was having another aberration when I claimed otherwise). However, this still seems not to apply to Duration. Maybe there is yet hope though.

How should I present a cost field to the user, and store it in the database?

Right now I have two fields for cost. One for dollars and one for cents. This works, but it is a bit ugly. It also doesn't allow the user to enter the term "free" or "no cost" if they want. But if I only have one field, I might have to make my parser a bit smarter. What do you think?
On the server side, I combine dollars and cents to store them as decimals in my database. Mainly so that I can gather statistics (cost averages, etc.) quickly.
Do you think it is better to store the cost as a string? Then whenever I actually use the cost for stats or other purposes, I would convert it to a decimal at that point. Or am I on the right track?
There is a rule in database design that states that "atomic data" should not be split. By this rule a price, or cost is such an example of atomic data and therefore it should never be split among multiple columns just like you shouldn't split a phone number among multiple columns (unless you really have a very good reason for it - very rare)
Use a DECIMAL data type. Something like DECIMAL(8,3) should work and it's supported by all ANSI SQL compliant database products!
You can consult Joe Celko's "Thinking In Sets" book for a discussion of this topic. See section 1.6.2, pages 21-22.
EDIT -
It seems from your question that you are also concerned with how to accept user's input in a form that resembles the price (xxxx.xx) - hence the two input boxes, for the whole dollars, and the pennies.
I recommend using a single input box and then validating the input with a regular expression that matches your format (something like ^[0-9]+(\.[0-9]{1,3})?$ would probably work but could be improved; note that the dot must be escaped, otherwise it matches any character). You could then parse the validated string to a Decimal type in your language, or just pass it as a string to your database; SQL will know how to cast it to a DECIMAL type.
Keep the whole cost as decimal. If it's free, then keep the cost as 0. In presentation if cost is zero - write "free" instead of 0.
I generally store the cost as the lowest unit (pennies) and then convert it to whole dollars later.
So a cost of $4.50 gets stored as 450. Free items would be -1 pennies. You could store free things as 0 pennies as well, this gives you the flexibility to use 0 and -1 to mean two slightly different things (free vs no sale?).
It also makes it easier to support countries that don't use cents if you choose to go that route.
As for presenting the data entry field, I personally don't like it when I have to keep switching fields for tiny things (like when they break up phone numbers into 3 fields, or IP addresses into 4). I'd present one field, and let the users type the decimal point in themselves. That way, your users don't have to tab (or click, if they are unfamiliar with tab) to the next field.
Use cents: store $4.50 as 450. This will save you from the problems that arise very often from the fact that floating point operations are not exact. Just try the following expression in irb:
0.4 - 0.3 == 0.1
It returns false, all because of floating point representation inaccuracies.
In my models I'm always using:
# price is an integer column holding the amount in cents
def price_with_cents
  price / 100.0
end

def price_with_cents=(num)
  # round instead of truncating, so float error cannot drop a cent
  self.price = (num.to_f * 100).round
end
And the name of column is just price and integer type.
I don't have much experience with decimal columns and their representation in Ruby (which can be a float, and that is problematic, as I've shown at the beginning).
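The same idea can be sketched in Python, for readers who want to see the float problem and the integer-cents approach side by side. This is only an illustration under my own assumptions: the helper names are hypothetical, amounts are assumed to be non-negative, and half-up rounding is one possible choice.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_cents(amount: str) -> int:
    """Convert a decimal string such as '4.50' to integer cents."""
    cents = (Decimal(amount) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)


def format_cents(cents: int) -> str:
    """Format non-negative integer cents as dollars, e.g. 450 -> '$4.50'."""
    dollars, remainder = divmod(cents, 100)
    return f"${dollars}.{remainder:02d}"


# Binary floats cannot represent 0.4, 0.3 or 0.1 exactly.
print(0.4 - 0.3 == 0.1)

# Integer cents compare exactly.
print(to_cents("0.40") - to_cents("0.30") == to_cents("0.10"))

print(format_cents(to_cents("4.50")))
```

Parsing from a string through Decimal avoids ever holding the amount as a binary float, which is where truncation errors such as int(4.35 * 100) come from.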
Don't allow garbage to make it into your database. If you're expecting a dollar amount in a field, then make sure it's valid before it gets in there. This will let you report better on the data and allow simpler formatting on output.
I suggest making this a single field with validation on update or insert.
if field != SpecialFreeTag then
    try to convert to decimal
    if fail then report to user
    otherwise accept value
Use try parse or regular expressions to help with the validation.
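The pseudocode above could look roughly like this in Python. It is a sketch under my own assumptions: the set of "free" words, the allowed format (digits with up to three decimals) and the function name are all hypothetical and should be adapted to your form.

```python
import re
from decimal import Decimal

# Hypothetical stand-ins for SpecialFreeTag.
FREE_TAGS = {"free", "no cost"}

# Digits, optionally followed by a dot and one to three decimals.
AMOUNT_RE = re.compile(r"[0-9]+(\.[0-9]{1,3})?")


def parse_cost(raw: str) -> Decimal:
    """Validate one cost field; raise ValueError so the caller can report it."""
    text = raw.strip().lower()
    if text in FREE_TAGS:
        return Decimal("0")
    if not AMOUNT_RE.fullmatch(text):
        raise ValueError(f"not a valid cost: {raw!r}")
    return Decimal(text)


print(parse_cost("4.50"))
print(parse_cost(" Free "))
```

Anything that is neither a free tag nor a well-formed amount is rejected before it reaches the database, and the stored value is always a decimal.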
I would store the cost as decimal with the scale being no less than 2 and maybe even 3-5. If something is bought in bulk the unit cost could easily include fractions of a cent. Free items have a cost of 0. If the cost is unknown then allow null values also.

.NET Currency exponent ISO_4217

I'm developing something for international use. I'm wondering whether the CultureInfo class has support for finding currency exponents for particular countries, or whether I need to feed this data in at the database level.
I can't see any property that represents this at the moment, so I'd like to know definitively whether it exists before I look for it elsewhere or buy it from ISO.
Currency Exponent is the minor units of the currency.
http://en.wikipedia.org/wiki/ISO_4217 - e.g. UK is "2"
Take a look at this blog post on getting CultureInfo for a region. Basically, Windows and .NET know about the user's region but not their currency. A region implies a currency, but a country can have more than one currency. For example, a person in Cambodia would more than likely want to enter and use USD rather than Riel. If possible, when capturing any currency amount in a multi-currency system you should capture the currency ISO code.
If you just want to make a quick guess, you can create a CultureInfo object and use its NumberFormat.CurrencyDecimalDigits property (NumberDecimalDigits is the equivalent for plain numbers). This also creates a problem when countries switch currencies. For example, if Belarus joins the EU, then its currency would change from BYR to EUR. Its currency symbol and exponent will be out of date.
I looked at this question and provided a solution which may or may not meet your needs here: http://www.codeproject.com/KB/recipes/MoneyTypeForCLR.aspx#CurrencyType
The short of it: I implemented the ISO spec as a custom type using the spec itself to generate the values. Obviously this would need to be regularly updated in production...
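To illustrate the "feed the data in yourself" route, here is a tiny Python sketch of an exponent lookup keyed by ISO 4217 code. The table holds only a handful of entries typed in by hand for illustration; a real system would generate it from the published ISO 4217 list and keep it updated. The names are hypothetical.

```python
from decimal import Decimal

# Illustrative subset of ISO 4217 minor-unit exponents.
MINOR_UNITS = {"USD": 2, "GBP": 2, "EUR": 2, "JPY": 0, "KWD": 3}


def to_minor_units(amount: Decimal, currency: str) -> int:
    """Convert an amount to an integer count of the currency's minor unit."""
    exponent = MINOR_UNITS[currency]  # KeyError for currencies not in the table
    scaled = amount * (10 ** exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has too many decimals for {currency}")
    return int(scaled)


print(to_minor_units(Decimal("4.50"), "GBP"))
print(to_minor_units(Decimal("1200"), "JPY"))
print(to_minor_units(Decimal("1.234"), "KWD"))
```

Keying the table by currency code rather than by culture or region matches the advice above to capture the ISO code with every amount.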
