How to deal with rasa nlu data imbalance problem? - embedding

I have 12 intents to identify, but the amount of training data per intent is very uneven. Intents like meeting scheduling and reminders have thousands of examples, while intents like greetings and thanks have very few samples, maybe only a few dozen.
How do I deal with this data imbalance problem?
My config.yml file content is as follows:
language: en
pipeline:
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CountVectorsFeaturizer"
  analyzer: char_wb
  min_ngram: 2
  max_ngram: 5
  stop_words: "english"
- name: "CRFEntityExtractor"
- name: "extractor.regex.RegexEntityExtractor"
- name: "EmbeddingIntentClassifier"
  epochs: 100
  num_neg: 2
- name: "DucklingHTTPExtractor"
  url: "http://localhost:8000"
  dimensions: ["time", "duration", "phone-number", "distance"]
policies:
- name: MemoizationPolicy
- name: EmbeddingPolicy
  epochs: 20
- name: FormPolicy
- name: MappingPolicy
- name: FallbackPolicy
  fallback_action_name: "action_default_fallback"

I'm not sure I have properly understood your question, but as far as I understand it, you don't have to worry that intents like greet or deny have few examples while others have thousands.
The problem occurs when you deal with multiple intents that differ from each other only in small ways. In a situation like that, if you do not provide proper and distinct data to Rasa, it will get confused and may give wrong output. What you should focus on is making the data for each intent distinct, so Rasa is less confused and you get the right output.

Here is the Rasa documentation; I hope you find what you need:
Classification algorithms often do not perform well if there is a large class imbalance, for example if you have a lot of training data for some intents and very little training data for others. To mitigate this problem, Rasa's supervised_embeddings pipeline uses a balanced batching strategy.
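For intuition, here is a rough Python sketch of what a balanced batching strategy does. This is an illustrative toy, not Rasa's actual implementation: rare intents are oversampled (with replacement) so that every batch contains all intents and the large classes don't dominate.

```python
import random
from collections import defaultdict

def balanced_batches(examples, labels, batch_size, seed=0):
    """Yield batches in which every intent is equally represented,
    oversampling rare intents so large classes don't dominate."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    per_label = max(1, batch_size // len(by_label))
    # Enough batches to cycle through the largest class once.
    n_batches = -(-max(len(v) for v in by_label.values()) // per_label)
    for _ in range(n_batches):
        batch = []
        for y, xs in by_label.items():
            # Sample with replacement, so tiny intents still fill their quota.
            batch.extend((x, y) for x in rng.choices(xs, k=per_label))
        rng.shuffle(batch)
        yield batch
```

With 100 "meeting" examples and 5 "greet" examples, every batch still contains both intents, which is the point of the strategy.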


Experiment tracking for multiple ML independent models using WandB in a single main evaluation

Can you recommend, from your experience, a convenient experiment-tracking and versioning tool for the "multiple independent models, one input -> multiple models -> one output" setup, so that I get a single main evaluation and can conveniently compare sub-evaluations? See the project example in the diagram.
I have tried W&B, MLflow, DVC, Neptune.ai, DagsHub, and TensorBoard for a single model, but I'm not sure which of them is convenient for multiple independent models. I also couldn't find anything on Google for phrases like "ML experiment tracking and management for multiple models".
Disclaimer: I'm a co-founder at Iterative; we are the authors of DVC. My response doesn't come from experience with all the tools mentioned above. I took this as an opportunity to try to build a template for this use case in the DVC ecosystem and share it in case it's useful for anyone.
Here is the GitHub repo I've built (note: it's a template, not a real ML project; the scripts are artificially simplified to show the essence of multi-model evaluation):
DVC Model Ensemble
I've put together an extensive README with a few videos of CLI, VS Code, Studio tools.
The core part of the repo is this DVC pipeline, which "trains" multiple models, collects their metrics, and then runs an evaluation stage to "reduce" those metrics into the final one.
stages:
  train:
    foreach:
      - model-1
      - model-2
    do:
      cmd: python train.py
      wdir: ${item}
      params:
        - params.yaml:
      deps:
        - train.py
        - data
      outs:
        - model.pkl:
            cache: false
      metrics:
        - ../dvclive/${item}/metrics.json:
            cache: false
      plots:
        - ../dvclive/${item}/plots/metrics/acc.tsv:
            cache: false
            x: step
            y: acc
  evaluate:
    cmd: python evaluate.py
    deps:
      - dvclive
    metrics:
      - evaluation/metrics.json:
          cache: false
It describes how to build and connect different things in the project, also makes the project "runnable" and reproducible. It can scale to any number of models (the first foreach clause).
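For illustration, a hypothetical evaluate.py for the pipeline's final stage might "reduce" the per-model metrics by averaging them. The file layout matches the pipeline above; the averaging rule is an assumption, not what the template actually does:

```python
import json
from pathlib import Path

def reduce_metrics(dvclive_dir="dvclive", out_file="evaluation/metrics.json"):
    """Collect each model's metrics.json and average every metric
    across models into one final metrics file."""
    totals = {}
    for f in sorted(Path(dvclive_dir).glob("*/metrics.json")):
        for name, value in json.loads(f.read_text()).items():
            totals.setdefault(name, []).append(value)
    reduced = {name: sum(vs) / len(vs) for name, vs in totals.items()}
    Path(out_file).parent.mkdir(parents=True, exist_ok=True)
    Path(out_file).write_text(json.dumps(reduced, indent=2))
    return reduced
```

Any other reduction (weighted average, max, an ensemble metric) would slot in the same way; DVC only cares that the stage reads the per-model files and writes the declared metrics output.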
Please let me know if that fits your scenario and/or you have more requirements; happy to learn more and iterate on it :)

NLP Categorizing Details with Confidence Values

Background
I'm writing a Swift application that requires the classification of user events by categories. These categories can be things like:
Athletics
Cinema
Food
Work
However, I have a set list of these categories, and do not wish to make any more than the minimal amount I believe is needed to be able to classify any type of event.
Question
Is there a machine learning (nlp) procedure that does the following?
Takes a block of text (in my case, a description of an event).
Creates a "percentage match" to each possible classification.
For instance, suppose the description of an event is as follows:
Fun, energetic bike ride for people of all ages.
The algorithm in which this description would be passed in would return an object that looks something like this:
{
athletics: 0.8,
cinema: 0.1,
food: 0.06,
work: 0.04
}
where the values of each key in the object is a confidence.
If anyone can guide me in the right direction (or even send some general resources or solutions specific to iOS dev), I'd be super appreciative!
You are talking about a typical classification model. I believe iOS offers APIs to do this inside your app; look for the natural language processing (NLP) part of the documentation.
Also, you are probably being downvoted because this forum typically looks to solve specific programming queries rather than generic ones (this is an assumption; there could be another reason for the downvotes).
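To make the desired output shape concrete, here is a toy classifier sketch (in Python, not iOS-specific). The keyword lists are made-up assumptions; a softmax over keyword-match counts turns the scores into confidences that sum to 1, which is the "percentage match" object the question asks for:

```python
import math
from collections import Counter

# Illustrative, hand-picked keyword lists -- not a trained model.
CATEGORY_KEYWORDS = {
    "athletics": {"bike", "ride", "run", "race", "energetic"},
    "cinema": {"movie", "film", "screening", "premiere"},
    "food": {"dinner", "restaurant", "tasting", "cook"},
    "work": {"meeting", "deadline", "office", "planning"},
}

def classify(text):
    """Return a confidence per category, summing to ~1 (softmax)."""
    words = Counter(text.lower().replace(",", " ").replace(".", " ").split())
    raw = {cat: sum(words[w] for w in kws)
           for cat, kws in CATEGORY_KEYWORDS.items()}
    exp = {cat: math.exp(score) for cat, score in raw.items()}
    total = sum(exp.values())
    return {cat: round(v / total, 2) for cat, v in exp.items()}
```

A real solution would replace the keyword counts with a trained text classifier (Apple's Create ML / Natural Language frameworks expose exactly this kind of label-with-confidence output), but the returned dictionary has the same shape as the JSON in the question.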

best algorithm to predict 3 similar blogs based on a blog props and contents only

{
  "blogid": 11,
  "blog_authorid": 2,
  "blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn",
  "blog_timestamp": "2018-03-17 00:00:00",
  "blog_title": "Amazon India Fashion Week: Autumn-",
  "blog_subtitle": "",
  "blog_featured_img_link": "link to image",
  "blog_intropara": "Introductory para to article",
  "blog_status": 1,
  "blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"",
  "blog_type": "Blog",
  "blog_tags": "1,4,6",
  "blog_uri": "Amazon-India-Fashion-Week-Autumn",
  "blog_categories": "1",
  "blog_readtime": "5",
  "ViewsCount": 0
}
Above is one sample blog from my API; I have a JSON array of such blogs.
I am trying to predict 3 similar blogs based on a blog's properties (e.g. tags, categories, author, keywords in the title/subtitle) and contents. I have no user data, i.e. no logged-in user data such as ratings or reviews. I know that without user data it will not be accurate, but I'm just getting started with data science and ML. Any suggestion or link is appreciated. I prefer Java, but Python, PHP, or any other language also works for me. I need an easy-to-implement model as I am a beginner. Thanks in advance.
My intuition is that this question might not be at the right address.
BUT
I would do the following:
Create a dataset of sites to serve as the inventory from which to predict. For each site, list one or more features: number of tags, number of posts, average time between posts in days, etc. It sounds like this is for training and you are not worried too much about accuracy, so numeric features should suffice.
Work back from a k-NN algorithm. Don't worry about the classifiers: instead of classifying a blog, you list the 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.
Your algorithm should be a step or two shorter than k-NN, which is considered to be among the simpler ML algorithms and a good place to start.
Good luck.
EDIT:
You want to build a recommender engine using text, tags, numeric and maybe time series data. This is a broad request. Just like you, when faced with this request, I’d need to dive in the data and research best approach. Some approaches require different sets of data. E.g. Collaborative vs Content-based filtering.
A few things may have been missed on the user side that could be used as a sort of rating. You do not need a login feature to get information: cookie-ID- or IP-based DMA, geo, and viewing duration should be available to the web server.
On the Blog side: you need to process the texts to identify related terms. Other blog features I gave examples above.
I am aware that this is a lot of hand-waving, but there's no actual code question here. To reiterate, my intuition is that this question might not be at the right address.
I really want to help but this is the best I can do.
EDIT 2:
If I understand your new comments correctly, each blog has the following for each other blog:
A Jaccard similarity coefficient.
A set of TF-IDF-generated words with scores.
A Euclidean distance based on numeric data.
I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the words-scores TF-IDF output. You can treat those (over a certain score) as tags and run another similarity analysis, or count overlap.
You already started on this path, and this answer assumes you are to continue. IMO best path is to see which dedicated recommender engines can help you without constructing statistics piecemeal (numeric w/ Euclidean, tags w/ Jaccard, Text w/ TF-IDF).
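The weighted-heuristic idea from EDIT 2 can be sketched as follows. The field names (`tags`, `numeric`) and the weights are hypothetical; the point is that each pairwise statistic (Jaccard on tags, Euclidean on numeric features) is mapped into [0, 1] and combined with adjustable weights, then the 3 highest-scoring blogs are returned:

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two tag sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(blog_a, blog_b, w_tags=0.6, w_numeric=0.4):
    """Weighted blend of tag similarity and numeric closeness."""
    tag_sim = jaccard(blog_a["tags"], blog_b["tags"])
    dist = math.dist(blog_a["numeric"], blog_b["numeric"])
    numeric_sim = 1.0 / (1.0 + dist)  # map a distance into (0, 1]
    return w_tags * tag_sim + w_numeric * numeric_sim

def three_similar(target, blogs):
    """Return the 3 blogs most similar to target (target excluded)."""
    ranked = sorted((b for b in blogs if b is not target),
                    key=lambda b: similarity(target, b), reverse=True)
    return ranked[:3]
```

Adjusting `w_tags` / `w_numeric` (and adding a TF-IDF term the same way) is the "allow the process to adjust the importance of each statistic" part.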

Code Climate - Too Complex Error

I'm using Code Climate on one of my projects, and I'm getting an error saying my code is "too complex". I'm not sure how to make the code it's calling out less complex. Here it is:
Method:
def apply_json
  {
    total_ticket_count: payment_details.tickets.count,
    subtotal: payment_details.subtotal.to_f,
    discount: payment_details.discount.to_f,
    fees: payment_details.fees_total.to_f,
    total: payment_details.total_in_dollars.to_f,
    coupon: {
      amount: payment_details.coupon.amount,
      type: payment_details.coupon.coupon_type,
      name: payment_details.coupon.name,
      valid: payment_details.coupon.valid_coupon?
    }
  }
end
It's just JSON that I tucked away in a model. Everything on my branch is great except for this. I'm not sure what to do; any ideas on how I can make this less complex?
I wouldn't care too much if Code Climate thinks something is too complex but it is actually easy to understand. Code Climate should help you write better, easier-to-read code, but it doesn't provide hard rules.
If you really want to change something, you might want to move the generation of the coupon sub-hash into the Coupon model, because it only depends on values provided by the coupon association:
def apply_json
  {
    total_ticket_count: payment_details.tickets.count,
    subtotal: payment_details.subtotal.to_f,
    discount: payment_details.discount.to_f,
    fees: payment_details.fees_total.to_f,
    total: payment_details.total_in_dollars.to_f,
    coupon: payment_details.coupon.as_json
  }
end

# in coupon.rb
def as_json
  {
    amount: amount,
    type: coupon_type,
    name: name,
    valid: valid_coupon?
  }
end
A similar refactoring could be done with payment_details, but it is not clear where that attribute comes from and whether it is an associated model.
Please just ignore the complexity warnings. They are misguided.
These warnings are based on weak science. Cyclomatic complexity was proposed in 1976 in an academic journal and has, alas, been adopted by tool builders because it is easy to implement.
But the original research is flawed. The original paper proposes a simple algorithm to compute complexity for Fortran code but does not give any evidence that the computed number actually correlates with the readability and understandability of code. Nada, niente, zero, zilch.
Here is the abstract:
This paper describes a graph-theoretic complexity measure and illustrates how it can be used to manage and control program complexity. The paper first explains how the graph-theory concepts apply and gives an intuitive explanation of the graph concepts in programming terms. The control graphs of several actual Fortran programs are then presented to illustrate the correlation between intuitive complexity and the graph-theoretic complexity. Several properties of the graph-theoretic complexity are then proved which show, for example, that complexity is independent of physical size (adding or subtracting functional statements leaves complexity unchanged) and complexity depends only on the decision structure of a program.
The issue of using non structured control flow is also discussed. A characterization of non-structured control graphs is given and a method of measuring the "structuredness" of a program is developed. The relationship between structure and reducibility is illustrated with several examples.
The last section of this paper deals with a testing methodology used in conjunction with the complexity measure; a testing strategy is defined that dictates that a program can either admit of a certain minimal testing level or the program can be structurally reduced
Source: http://www.literateprogramming.com/mccabe.pdf
As you can see, only anecdotal evidence is given "to illustrate the correlation between intuitive complexity and the graph-theoretic complexity", and the only proof is that code can be rewritten to have a lower complexity number as defined by this metric. That is a pretty nonsensical proof for a complexity metric, and very common for the quality of research from that time. This paper would not be publishable by today's standards.
The authors of the paper did no user research, and their algorithm is not grounded in any actual evidence. No research since has been able to prove a link between cyclomatic complexity and code comprehension. Not to mention that this complexity metric was proposed for Fortran rather than modern high-level languages.
The best way to ensure code comprehension is code review. Just simply ask another person to read your code and fix whatever they don't understand.
So just turn these warning off.
You're trying to describe a transformation of data from one complex structure to another using code, and that creates a lot of "complexity" in the eyes of a review tool like Code Climate.
One thing that might help is to describe the transformation in terms of data:
PAYMENT_DETAILS_PLAN = {
  total_ticket_count: [ :tickets, :count ],
  subtotal: [ :subtotal, :to_f ],
  discount: [ :discount, :to_f ],
  fees: [ :fees_total, :to_f ],
  total: [ :total_in_dollars, :to_f ],
  coupon: {
    amount: [ :coupon, :amount ],
    type: [ :coupon, :coupon_type ],
    name: [ :coupon, :name ],
    valid: [ :coupon, :valid_coupon? ]
  }
}
This might not seem like a huge change, and really it isn't, but it offers some measurable benefits. The first is that you can reflect on it, you can inspect the configuration using code. The other is once you've laid down this format you might be able to write a DSL to manipulate it, to filter or augment it, among other things. In other words: It's flexible.
Interpreting that "plan" isn't that hard:
def distill(obj, plan)
  plan.map do |name, call|
    case call
    when Array
      # Walk the method chain: obj.send(:coupon).send(:amount), etc.
      [ name, call.reduce(obj) { |o, m| o.send(m) } ]
    when Hash
      # Recurse into the sub-plan (not the whole plan).
      [ name, distill(obj, call) ]
    end
  end.to_h
end
When you put that into action this is what you get:
def apply_json
  distill(payment_details, PAYMENT_DETAILS_PLAN)
end
Maybe that approach helps in other situations where you're doing largely the same thing.
You may extract the coupon sub-hash into a method returning that sub-hash. It will reduce the code complexity (as far as Code Climate is concerned), but it's not really necessary. However, some people believe that a method must be 5 lines or less, and they are right in most use cases. Still, it's totally up to you to decide.
def apply_json
  {
    total_ticket_count: payment_details.tickets.count,
    subtotal: payment_details.subtotal.to_f,
    discount: payment_details.discount.to_f,
    fees: payment_details.fees_total.to_f,
    total: payment_details.total_in_dollars.to_f,
    coupon: subhash
  }
end

def subhash
  {
    amount: payment_details.coupon.amount,
    type: payment_details.coupon.coupon_type,
    name: payment_details.coupon.name,
    valid: payment_details.coupon.valid_coupon?
  }
end

In Weka, make an ARFF file from a text file

With a naive Bayes classifier, I want to find the accuracy from my train and test sets. But my train set looks like this:
Happy: absolution abundance abundant accolade accompaniment accomplish accomplished achieve achievement acrobat admirable admiration adorable adoration adore advance advent advocacy aesthetics affection affluence alive allure aloha
Sad: abandon abandoned abandonment abduction abortion abortive abscess absence absent absentee abuse abysmal abyss accident accursed ache aching adder adrift adultery adverse adversity afflict affliction affront aftermath aggravating
Angry: abandoned abandonment abhor abhorrent abolish abomination abuse accursed accusation accused accuser accusing actionable adder adversary adverse adversity advocacy affront aftermath aggravated aggravating aggravation aggression aggressive aggressor agitated agitation agony alcoholism alienate alienation
For test set
data: Dec 7, 2014 ... This well-known nursery rhyme helps children practice emotions, like happy, sad, scared, tired and angry. If You're Happy and You Know It is ...
Now the problem is: how do I convert them into an ARFF file?
Your training set is not appropriate for training a model in Weka; however, this information can be used for feature extraction.
Your test set can be converted into an ARFF file. From every message, extract basic features like:
1. Whether any form of the word 'happy' is present
2. Whether any form of the word 'sad' is present
3. Whether any form of the word 'angry' is present
4. TF-IDF
etc.
Then for some messages (say 70%) assign one class from {Happy, Sad, Angry} manually, and test the remaining 30% with your model.
More about arff file is given here:
http://www.cs.waikato.ac.nz/ml/weka/arff.html
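As a sketch of steps 1-3, here is how such word-presence features could be written out in ARFF format. The keyword sets, attribute names, and file name are illustrative assumptions, not Weka tooling:

```python
# Toy keyword sets standing in for the word lists in the train set.
HAPPY = {"happy", "fun", "joy"}
SAD = {"sad", "cry", "tired"}
ANGRY = {"angry", "rage", "mad"}

def to_arff(messages, labels, path="emotions.arff"):
    """Write one ARFF row per message: three presence flags + class."""
    lines = [
        "@relation emotions",
        "@attribute has_happy {0,1}",
        "@attribute has_sad {0,1}",
        "@attribute has_angry {0,1}",
        "@attribute class {Happy,Sad,Angry}",
        "@data",
    ]
    for text, label in zip(messages, labels):
        words = set(text.lower().split())
        row = [int(bool(words & s)) for s in (HAPPY, SAD, ANGRY)]
        lines.append(",".join(map(str, row)) + "," + label)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

The resulting file loads directly into the Weka Explorer; richer features (TF-IDF, n-grams) would just add more `@attribute` lines and columns.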
Where to start ;).
As written before, your "training data" is not real training data. Training data should be texts similar to the data you are using for testing; in your example it is merely a list of words. My gut feeling is that you would be better off avoiding Weka: count the number of occurrences from each category's word list and take the category with the most matches.
In case you want to use Weka, I'd recommend the KNIME toolbox (https://www.knime.org), which integrates nicely with Weka.
You should then convert your data into a bag-of-words representation. This basically means you use the number of times each word occurs in each of the texts as features.
KNIME also has a nice package for this: http://www.tech.knime.org/files/KNIME-TextProcessing-HowTo.pdf
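A bag-of-words representation can be sketched in a few lines (an illustrative toy, not KNIME's implementation): each text becomes a vector of word counts over a shared, sorted vocabulary.

```python
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count-vectors) for a list of texts."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors
```

Those vectors (plus a class column) are exactly the kind of numeric feature table that can be exported to ARFF and fed to a naive Bayes classifier.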
