I'm using Code Climate on one of my projects, and I'm getting an error for having code that is "too complex". I'm not sure how to make the code it's calling out less complex. Here it is:
Method:
def apply_json
{
total_ticket_count: payment_details.tickets.count,
subtotal: payment_details.subtotal.to_f,
discount: payment_details.discount.to_f,
fees: payment_details.fees_total.to_f,
total: payment_details.total_in_dollars.to_f,
coupon: {
amount: payment_details.coupon.amount,
type: payment_details.coupon.coupon_type,
name: payment_details.coupon.name,
valid: payment_details.coupon.valid_coupon?,
}
}
end
It's just JSON that I tucked away in a model. Everything on my branch is great except for this, and I'm not sure what to do. Any ideas on how I can make this less complex?
I wouldn't worry too much if Code Climate thinks something is too complex but it is actually easy to understand. Code Climate should help you write better, easier-to-read code, but it doesn't provide hard rules.
If you really want to change something, you might move the generation of the coupon sub-hash into the Coupon model, because it depends only on values provided by the coupon association:
def apply_json
{
total_ticket_count: payment_details.tickets.count,
subtotal: payment_details.subtotal.to_f,
discount: payment_details.discount.to_f,
fees: payment_details.fees_total.to_f,
total: payment_details.total_in_dollars.to_f,
coupon: payment_details.coupon.as_json
}
end
# in coupon.rb
def as_json(_options = nil)
{
amount: amount,
type: coupon_type,
name: name,
valid: valid_coupon?
}
end
A similar refactoring could be done with payment_details, but it is not clear where that attribute comes from or whether it is an associated model.
Please just ignore the complexity warnings.
They are misguided.
These warnings are based on fake science.
Cyclomatic complexity was proposed in 1976 in an academic journal and has, alas, been adopted by tool builders because it is easy to implement.
But that original research is flawed.
The original paper proposes a simple algorithm to compute complexity for Fortran code but gives no evidence that the computed number actually correlates with the readability and understandability of code. Nada, niente, zero, zilch.
Here is their abstract:
This paper describes a graph-theoretic complexity measure and illustrates how it can be used to manage and control program complexity. The paper first explains how the graph-theory concepts apply and gives an intuitive explanation of the graph concepts in programming terms. The control graphs of several actual Fortran programs are then presented to illustrate the correlation between intuitive complexity and the graph-theoretic complexity. Several properties of the graph-theoretic complexity are then proved which show, for example, that complexity is independent of physical size (adding or subtracting functional statements leaves complexity unchanged) and complexity depends only on the decision structure of a program.

The issue of using non structured control flow is also discussed. A characterization of non-structured control graphs is given and a method of measuring the "structuredness" of a program is developed. The relationship between structure and reducibility is illustrated with several examples.

The last section of this paper deals with a testing methodology used in conjunction with the complexity measure; a testing strategy is defined that dictates that a program can either admit of a certain minimal testing level or the program can be structurally reduced.
Source: http://www.literateprogramming.com/mccabe.pdf
As you can see, only anecdotal evidence is given "to illustrate the correlation between intuitive complexity and the graph-theoretic complexity", and the only proof is that code can be rewritten to have a lower complexity number as defined by this metric. Which is a pretty nonsensical proof for a complexity metric, and sadly common for the quality of research from that time. This paper would not be publishable by today's standards.
The authors of the paper did no user research, and their algorithm is not grounded in any actual evidence. No research since has been able to prove a link between cyclomatic complexity and code comprehension. Not to mention that this complexity metric was proposed for Fortran rather than for modern high-level languages.
The best way to ensure code comprehension is code review. Simply ask another person to read your code and fix whatever they don't understand.
So just turn these warnings off.
You're trying to describe a transformation of data from one complex structure to another using code and that creates a lot of "complexity" in the eyes of a review tool like Code Climate.
One thing that might help is to describe the transformation in terms of data:
PAYMENT_DETAILS_PLAN = {
total_ticket_count: [ :tickets, :count ],
subtotal: [ :subtotal, :to_f ],
discount: [ :discount, :to_f ],
fees: [ :fees_total, :to_f ],
total: [ :total_in_dollars, :to_f ],
coupon: {
amount: [ :coupon, :amount ],
type: [ :coupon, :coupon_type ],
name: [ :coupon, :name ],
valid: [ :coupon, :valid_coupon? ]
}
}
This might not seem like a huge change, and really it isn't, but it offers some measurable benefits. The first is that you can reflect on it: you can inspect the configuration using code. The other is that once you've laid down this format, you might be able to write a DSL to manipulate it, to filter or augment it, among other things. In other words: it's flexible.
Interpreting that "plan" isn't that hard:
def distill(obj, plan)
plan.map do |name, call|
case (call)
when Array
[ name, call.reduce(obj) { |o, m| o.send(m) } ]
when Hash
[ name, distill(obj, call) ]
end
end.to_h
end
When you put that into action this is what you get:
def apply_json
distill(payment_details, PAYMENT_DETAILS_PLAN)
end
Maybe that approach helps in other situations where you're doing largely the same thing.
You could extract the coupon sub-hash into a method that returns it. That will reduce the complexity as Code Climate measures it, but it's not really necessary. Some people believe a method must be five lines or fewer, and they are right in most use cases, but it's totally up to you to decide.
def apply_json
{
total_ticket_count: payment_details.tickets.count,
subtotal: payment_details.subtotal.to_f,
discount: payment_details.discount.to_f,
fees: payment_details.fees_total.to_f,
total: payment_details.total_in_dollars.to_f,
coupon: subhash
}
end
def subhash
{
amount: payment_details.coupon.amount,
type: payment_details.coupon.coupon_type,
name: payment_details.coupon.name,
valid: payment_details.coupon.valid_coupon?,
}
end
Background
I'm writing a Swift application that requires the classification of user events by categories. These categories can be things like:
Athletics
Cinema
Food
Work
However, I have a fixed list of these categories, and do not wish to add more than the minimal number I believe is needed to classify any type of event.
Question
Is there a machine learning (NLP) procedure that does the following?
Takes a block of text (in my case, a description of an event).
Creates a "percentage match" to each possible classification.
For instance, suppose the description of an event is as follows:
Fun, energetic bike ride for people of all ages.
The algorithm in which this description would be passed in would return an object that looks something like this:
{
athletics: 0.8,
cinema: 0.1,
food: 0.06,
work: 0.04
}
where the value of each key in the object is a confidence score.
If anyone can guide me in the right direction (or even send some general resources or solutions specific to iOS dev), I'd be super appreciative!
You are talking about a typical classification model. I believe iOS offers APIs to do this inside your app; look for the natural language processing (NLP) bits.
Also, you are probably being downvoted because this forum typically looks to solve specific programming queries rather than generic ones (this is an assumption; there could be another reason for the downvotes).
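If it helps to see the "percentage match" idea in code, here is a minimal sketch in Python using scikit-learn. It is a generic illustration rather than anything iOS-specific (on iOS you would look at Create ML's text classifier or the NaturalLanguage framework), and the training sentences and labels below are invented placeholders.

# Minimal sketch, assuming scikit-learn is available. The training sentences
# and labels are invented placeholders; a real classifier needs many labelled
# event descriptions per category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "5k charity run through the park",
    "pickup basketball game at the gym",
    "late night screening of a classic film",
    "new superhero movie premiere and discussion",
    "wine tasting and tapas evening",
    "street food festival downtown",
    "quarterly planning meeting with the team",
    "networking mixer for software engineers",
]
train_labels = ["athletics", "athletics", "cinema", "cinema",
                "food", "food", "work", "work"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

description = "Fun, energetic bike ride for people of all ages."
confidences = dict(zip(model.classes_, model.predict_proba([description])[0]))
print(confidences)  # one probability per category, summing to 1.0

With this little training data the numbers will not be meaningful, but the point is that predict_proba returns exactly the per-category confidence structure described in the question.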
{
"blogid": 11,
"blog_authorid": 2,
"blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn",
"blog_timestamp": "2018-03-17 00:00:00",
"blog_title": "Amazon India Fashion Week: Autumn-",
"blog_subtitle": "",
"blog_featured_img_link": "link to image",
"blog_intropara": "Introductory para to article",
"blog_status": 1,
"blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"",
"blog_type": "Blog",
"blog_tags": "1,4,6",
"blog_uri": "Amazon-India-Fashion-Week-Autumn",
"blog_categories": "1",
"blog_readtime": "5",
"ViewsCount": 0
}
Above is one sample blog from my API. I have a JSON array of such blogs.
I am trying to predict 3 similar blogs based on a blog's properties (e.g. tags, categories, author, keywords in the title/subtitle) and contents. I have no user data, i.e. there is no logged-in user data (such as ratings or reviews). I know that without user data it will not be accurate, but I'm just getting started with data science and ML. Any suggestion/link is appreciated. I prefer using Java, but Python, PHP or any other language also works for me. I need an easy-to-implement model as I am a beginner. Thanks in advance.
My intuition is that this question might not be at the right address.
BUT
I would do the following:
Create a dataset of sites to serve as the inventory from which to predict. For each site you will need to list one or more features: number of tags, number of posts, average time between posts in days, etc.
Since this sounds like it is for training and you are not too worried about accuracy, numeric features should suffice.
Work back from a k-NN algorithm. Don't worry about the classifiers. Instead of classifying a blog, you list the 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.
Your algorithm should be a step or two shorter than k-NN, which is considered one of the simpler ML algorithms and a good place to start.
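For illustration only, here is a rough Python sketch of that "list the 3 closest neighbours" idea using scikit-learn; the blog ids and feature values are made up, and in practice each row would come from your own inventory.

# Rough sketch, assuming scikit-learn. The blog ids and feature values are
# invented; in practice each row would be built from your inventory
# (number of tags, number of posts, average days between posts, ...).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

blog_ids = [11, 12, 13, 14, 15]
features = np.array([
    [3, 40, 2.0],   # tags, posts, avg days between posts
    [4, 35, 2.5],
    [1, 5, 14.0],
    [3, 42, 1.5],
    [6, 120, 0.5],
])

# Scale the features so no single column dominates the distance.
scaled = StandardScaler().fit_transform(features)

# k = 4 because each blog's nearest neighbour is itself.
knn = NearestNeighbors(n_neighbors=4).fit(scaled)
_, indices = knn.kneighbors(scaled[0:1])           # neighbours of blog 11
similar = [blog_ids[i] for i in indices[0][1:]]    # drop the blog itself
print(similar)                                     # the 3 most similar blog ids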
Good luck.
EDIT:
You want to build a recommender engine using text, tags, numeric and maybe time-series data. This is a broad request. Just like you, when faced with this request, I'd need to dive into the data and research the best approach. Different approaches require different sets of data, e.g. collaborative vs. content-based filtering.
A few things may have been missed on the user side that can be used as a sort of rating: you do not need a login feature to get this information; a cookie ID or IP-based DMA, geo data and viewing duration should be available to the web server.
On the blog side, you need to process the texts to identify related terms; I gave examples of other blog features above.
I am aware that this is a lot of hand-waving, but there is no actual code question here. To reiterate, my intuition is that this question might not be at the right address.
I really want to help but this is the best I can do.
EDIT 2:
If I understand your new comments correctly, each blog has the following for each other blog:
A Jaccard similarity coefficient.
A set of TF-IDF-generated words with scores.
A Euclidean distance based on numeric data.
I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the words-scores TF-IDF output. You can treat those (over a certain score) as tags and run another similarity analysis, or count overlap.
You already started down this path, and this answer assumes you will continue. IMO the best path is to see whether a dedicated recommender engine can help you without constructing statistics piecemeal (numeric with Euclidean distance, tags with Jaccard, text with TF-IDF).
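As a sketch of what constructing those statistics piecemeal could look like before reaching for a dedicated engine, here is a Python example combining the three signals into one weighted score; the weights, field names and sample values are placeholders, not recommendations.

# Sketch only: tags via Jaccard, text via TF-IDF cosine similarity, numeric
# fields via a Euclidean-distance-based similarity, combined with weights.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(tags_a, tags_b):
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def text_similarity(text_a, text_b):
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def numeric_similarity(vec_a, vec_b):
    # Map Euclidean distance into (0, 1]: identical vectors score 1.0.
    return 1.0 / (1.0 + float(np.linalg.norm(np.asarray(vec_a) - np.asarray(vec_b))))

def combined_score(blog_a, blog_b, w_tags=0.4, w_text=0.4, w_numeric=0.2):
    return (w_tags * jaccard(blog_a["tags"], blog_b["tags"])
            + w_text * text_similarity(blog_a["text"], blog_b["text"])
            + w_numeric * numeric_similarity(blog_a["numeric"], blog_b["numeric"]))

a = {"tags": ["1", "4", "6"],
     "text": "Amazon India Fashion Week autumn highlights",
     "numeric": [5, 0]}        # e.g. read time, views count
b = {"tags": ["1", "4"],
     "text": "Autumn looks from the fashion week runway",
     "numeric": [4, 10]}
print(combined_score(a, b))    # higher means more similar

Ranking every other blog by combined_score against the target blog and taking the top 3 would give the recommendations; a dedicated recommender engine does essentially this with better-tuned statistics.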
Edit: I tried a standalone Spark application (instead of PredictionIO) and my observations are the same. So this is not a PredictionIO issue, but still confusing.
I am using PredictionIO 0.9.6 and the Recommendation template for collaborative filtering. The ratings in my data set are numbers between 1 and 10. When I first trained a model with defaults from the template (using ALS.train), the predictions were horrible, at least subjectively. Scores ranged up to 60.0 or so but the recommendations seemed totally random.
Somebody suggested that ALS.trainImplicit did a better job, so I changed src/main/scala/ALSAlgorithm.scala accordingly:
val m = ALS.trainImplicit( // instead of ALS.train
ratings = mllibRatings,
rank = ap.rank,
iterations = ap.numIterations,
lambda = ap.lambda,
blocks = -1,
alpha = 1.0, // also added this line
seed = seed)
Scores are much lower now (below 1.0) but the recommendations are in line with the personal ratings. Much better, but also confusing. PredictionIO defines the difference between explicit and implicit this way:
explicit preference (also referred as "explicit feedback"), such as
"rating" given to item by users. implicit preference (also referred
as "implicit feedback"), such as "view" and "buy" history.
and:
By default, the recommendation template uses ALS.train() which expects explicit rating values which the user has rated the item.
source
Is the documentation wrong? I still think that explicit feedback fits my use case. Maybe I need to adapt the template with ALS.train in order to get useful recommendations? Or did I just misunderstand something?
A lot of it depends on how you gathered the data. Ratings that seem explicit can often actually be implicit. For instance, assume you give users the option of rating items that they have purchased or used before. The very fact that they spent their time evaluating that particular item suggests the item is of high quality; items of poor quality are often not rated at all because people do not even bother to use them. So even though the dataset is intended to be explicit, you may get better results if you treat it as implicit. Again, this varies significantly based on how the data is obtained.
Explicit data (like ratings) normally comes with bias: people go and rate a product because they like it! Think about your experience shopping and then rating on Amazon.com :-)
On the contrary, implicit signals can often truly reflect a user's preference for a product, like viewing duration, comment length, etc. Even a like/dislike is better than a rating because it provides a very simple 'bad' option without making the user wonder, "Should I give a 3, 3.5, or 4?"
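For comparison, here is a sketch of the same explicit-vs-implicit switch using Spark's Python API (pyspark.ml.recommendation.ALS). The user/item ids, ratings and parameter values are illustrative and not taken from the PredictionIO template; the point is only where implicitPrefs and alpha come in.

# Sketch of the explicit-vs-implicit switch in Spark's Python API.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-explicit-vs-implicit").getOrCreate()
ratings = spark.createDataFrame(
    [(1, 10, 8.0), (1, 20, 3.0), (2, 10, 9.0), (2, 30, 5.0)],
    ["userId", "itemId", "rating"],
)

# Explicit feedback: the model tries to reproduce the rating values themselves,
# so predictions live roughly on the original rating scale.
explicit_als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
                   rank=10, maxIter=10, regParam=0.1)

# Implicit feedback: the rating column is interpreted as confidence that the
# user prefers the item, and predictions become small preference scores, which
# is why the numbers drop below 1.0.
implicit_als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
                   rank=10, maxIter=10, regParam=0.1,
                   implicitPrefs=True, alpha=1.0)

model = implicit_als.fit(ratings)
model.recommendForAllUsers(3).show(truncate=False)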
Suppose I have a whole set of recipes in text format, with nothing else about them being known in advance. I must divide this data into 'recipes for baked goods' and 'other recipes'.
For a baked good, an excerpt from the recipe might read thusly:
"Add the flour to the mixing bowl followed by the two beaten eggs, a pinch of salt and baking powder..."
These have all been written by different authors, so the language and vocabulary are not consistent. I need an algorithm or, better still, an existing machine learning library (implementation language is not an issue) that I can 'teach' to distinguish between these two types of recipe.
For example I might provide it with a set of recipes that I know are for baked goods, and it would be able to analyse these in order to gain the ability to make an estimate as to whether a new recipe it is presented with falls into this category.
Getting the correct answer is not critical, but the classification should be reasonably reliable. Having researched this problem, it is clear to me that my AI/ML vocabulary is not extensive enough to let me refine my search.
Can anyone suggest a few libraries, tools or even concepts/algorithms that would allow me to solve this problem?
What you are looking for is anomaly / outlier detection.
In your example, "baked goods" is the data you are interested in, and anything that doesn't look like what you have seen before (not a baked good) is an anomaly / outlier.
scikit-learn has a limited number of methods for this. Another common method is to compute the average distance between data points, and then treat anything new that is more than the average + c * standard deviation away as an outlier.
More sophisticated methods exist as well.
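As a concrete sketch of that "average distance + c * standard deviation" rule, something like the following could work on TF-IDF vectors of the known baked-goods recipes; the recipe texts and the constant c are made up for illustration.

# Sketch of the "average distance + c * standard deviation" rule, applied to
# TF-IDF vectors of known baked-goods recipes. Texts and c are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

baked_recipes = [
    "add the flour to the mixing bowl with two beaten eggs and baking powder",
    "cream the butter and sugar then fold in the flour and bake in the oven",
    "knead the dough, let it rise, then bake until golden brown",
]

vectorizer = TfidfVectorizer()
baked_vectors = vectorizer.fit_transform(baked_recipes).toarray()

# Pairwise distances between the known baked-goods recipes set the baseline.
pairwise = euclidean_distances(baked_vectors)
distances = pairwise[np.triu_indices_from(pairwise, k=1)]
threshold = distances.mean() + 2.0 * distances.std()   # c = 2 is arbitrary

def looks_like_baked_good(recipe_text):
    vec = vectorizer.transform([recipe_text]).toarray()
    mean_distance = euclidean_distances(vec, baked_vectors).mean()
    return mean_distance <= threshold                  # farther than this = outlier

print(looks_like_baked_good("whisk the eggs with flour and sugar and bake for 20 minutes"))
print(looks_like_baked_good("marinate the steak and grill it over high heat"))

scikit-learn's dedicated outlier/novelty detectors (for example OneClassSVM or IsolationForest) follow the same train-on-normal-data idea with more robust statistics.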
You can try case based reasoning.
Extract specific words or phrases that would put a recipe into the baked goods category. If none are present, the recipe goes into "other recipes".
You can get clever and add word sets {} so you don't need to look for an exact phrase.
Add a weight to each word, and if the total gets over a threshold, put the recipe into "baked".
So {"oven" => 10, "flour" => 5, "eggs" => 3}
My reasoning is that if it is going in the "oven" it is likely to be getting baked. If you are going to distinguish between baking a cake and roasting a joint, this needs adjusting. Likewise "flour" is associated with something that is going to be baked, as are eggs.
Add pairs {("beaten", "eggs") => 5}; notice this is different from a phrase {"beaten eggs" => 10} in that the words in the pair can appear anywhere in the recipe.
Negatives: {"chill in the fridge" => -10}
Negators: {"dust with flour" => "-flour"}
Absolutes: {"bake in the oven" => 10000} is just a way of saying {"bake in the oven" => "it is a baked good"} by making the number so high it will be over the threshold on its own.
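A rough Python sketch of that scoring scheme might look like the following; all the words, phrases, weights and the threshold are illustrative and would need tuning against real recipes.

# Rough sketch of the weighted-keyword scheme described above.
import re

WORDS = {"oven": 10, "flour": 5, "eggs": 3}
PAIRS = {("beaten", "eggs"): 5}              # both words anywhere in the text
PHRASES = {"beaten eggs": 10, "chill in the fridge": -10}
ABSOLUTES = {"bake in the oven": 10000}      # so high it decides on its own
NEGATORS = {"dust with flour": "flour"}      # phrase cancels a word's score
THRESHOLD = 10

def baked_goods_score(recipe_text):
    text = recipe_text.lower()
    tokens = set(re.findall(r"[a-z]+", text))
    cancelled = {word for phrase, word in NEGATORS.items() if phrase in text}
    score = sum(w for word, w in WORDS.items()
                if word in tokens and word not in cancelled)
    score += sum(w for pair, w in PAIRS.items()
                 if all(word in tokens for word in pair))
    score += sum(w for phrase, w in PHRASES.items() if phrase in text)
    score += sum(w for phrase, w in ABSOLUTES.items() if phrase in text)
    return score

def is_baked_good(recipe_text):
    return baked_goods_score(recipe_text) >= THRESHOLD

print(is_baked_good("Add the flour and two beaten eggs, then bake in the oven."))  # True
print(is_baked_good("Chill in the fridge before serving."))                        # False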
I am working on a Naive Bayes classifier that takes a bunch of user profile data such as:
Name
City
State
School
Email Address
URLS { ... }
The last bit is a bunch of URLs that are search results gathered by a Google search for the user by name. The objective is to decide whether a search result is accurate (i.e. it is about the person) or inaccurate. To do this, each piece of the profile data is searched for within each link in the URL array, and a binary value is assigned per attribute if that profile data (e.g. City) is matched on the page. The results are then represented as a vector of binaries (e.g. 1 0 0 0 1 means Name and Email Address were matched on the URL).
My questions revolve around creating the optimal training set. If a person's profile has incomplete information (such as a missing email address), should that profile be used in my training set? Should I only train on profiles with full information? Would it make sense to make different training sets (one for each combination of complete profile attributes) and then, when I am given a user's URL to test with, determine which training set to use based on how much of the user's profile is on record for the test person? How can I go about this?
In general, there is no "should". However you create a model, the only thing that matters is its performance.
However, it is highly unlikely you'd be able to create a proper model with a hand-picked training set. The simple idea is that you should train your model on data that looks exactly like live data. Will live data have missing values, incomplete profiles, etc.? If so, your model needs to know what to do in such situations, and therefore such profiles should be in the training set.
Yes, certainly, you can make a model composed of several sub-models. However, you might run into problems with not having enough training data and with overfitting; you'll have to create multiple good models for it to work, which is harder. I suppose it would be better to leave this reasoning to the model itself rather than trying to hand-hold it in the right direction; this is what machine learning is for: saving you the trouble. But there is really no way to say without trying it on your data set. Again, whatever works in your particular case is right.
Because you're using Naive Bayes as your model (and only because of that), you can benefit from the independence assumption: use every piece of data you have available, and only consider the features present in the new sample.
You have features f_1 ... f_n, some of which may or may not be present in any given entry. The posterior probability p(relevant | f_1 ... f_n) decomposes as:
p(relevant | f_1 ... f_n) ∝ p(relevant) * p(f_1 | relevant) * p(f_2 | relevant) * ... * p(f_n | relevant)
p(irrelevant | f_1 ... f_n) is similar. If some particular f_i isn't present, just drop its terms from the two posteriors; given that they're defined over the same feature space, the probabilities are comparable and can be normalised in the standard way. All you then need is to estimate the terms p(f_i | relevant): this is simply the fraction of the relevant links where the i-th feature is 1 (possibly smoothed). To estimate this parameter, simply use the set of relevant links where the i-th feature is defined.
This is only going to work if you implement it yourself, as I don't think you can do this with a standard package, but given how easy it is to implement, I wouldn't be concerned.
Edit: an example
Imagine you have the following features and data (they're binary, since you say that's what you have, but the extension to categorical or continuous is not difficult, I hope):
D = [ {email: 1, city: 1, name: 1, RELEVANT: 1},
{city: 1, name: 1, RELEVANT: 0},
{city: 0, email: 0, RELEVANT: 0},
{name: 1, city: 0, email: 1, RELEVANT: 1} ]
where each element of the list is an instance, and the target variable for classification is the special RELEVANT field (note that some of these instances have some variables missing).
You then want to classify the following instance, missing the RELEVANT field since that's what you're hoping to predict:
t = {email: 0, name: 1}
The posterior probability
p(RELEVANT=1 | t) = [p(RELEVANT=1) * p(email=0|RELEVANT=1) * p(name=1|RELEVANT=1)] / evidence(t)
while
p(RELEVANT=0 | t) = [p(RELEVANT=0) * p(email=0|RELEVANT=0) * p(name=1|RELEVANT=0)] / evidence(t)
where evidence(t) is just a normaliser obtained by summing the two numerators above.
To get each of the parameters of the form p(email=0|RELEVANT=1), look at the fraction of training instances where RELEVANT=1 which have email=0:
p(email=0|RELEVANT=1) = count(email=0,RELEVANT=1) / [count(email=0,RELEVANT=1) + count(email=1,RELEVANT=1)].
Notice that this term simply ignores instances for which email is not defined.
In this instance, the posterior probability of relevance goes to zero because the count(email=0,RELEVANT=1) is zero. So I would suggest using a smoothed estimator where you add one to every count, so that:
p(email=0|RELEVANT=1) = [count(email=0,RELEVANT=1)+1] / [count(email=0,RELEVANT=1) + count(email=1,RELEVANT=1) + 2].
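If you do implement it yourself, a minimal Python sketch of this scheme (priors, per-feature conditionals estimated only from the instances that define the feature, add-one smoothing) could look like this, using the toy data set D from above.

# Minimal sketch of the scheme above; D is the toy data set from the example.
D = [
    {"email": 1, "city": 1, "name": 1, "RELEVANT": 1},
    {"city": 1, "name": 1, "RELEVANT": 0},
    {"city": 0, "email": 0, "RELEVANT": 0},
    {"name": 1, "city": 0, "email": 1, "RELEVANT": 1},
]

def prior(data, relevant):
    return sum(1 for d in data if d["RELEVANT"] == relevant) / len(data)

def conditional(data, feature, value, relevant):
    # Use only training instances where the feature is defined; add-one smoothing.
    defined = [d for d in data if feature in d and d["RELEVANT"] == relevant]
    matching = sum(1 for d in defined if d[feature] == value)
    return (matching + 1) / (len(defined) + 2)

def posterior(data, instance):
    scores = {}
    for relevant in (0, 1):
        p = prior(data, relevant)
        for feature, value in instance.items():   # only features present in t
            p *= conditional(data, feature, value, relevant)
        scores[relevant] = p
    evidence = sum(scores.values())                # normalise over both classes
    return {k: v / evidence for k, v in scores.items()}

t = {"email": 0, "name": 1}
print(posterior(D, t))   # {0: p(RELEVANT=0 | t), 1: p(RELEVANT=1 | t)}

With the smoothed estimate, p(email=0|RELEVANT=1) comes out as 1/4 rather than 0, matching the last formula above.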