My main question is in the data mining and machine learning area: what is the objective of the data collection process?
Any help?
Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
Data collection is usually not considered part of data mining itself. The data is assumed to already be there (the mountain), and your job is to find the gold nuggets in it.
Related
Is it necessary to repeat similar template data? The meaning and context are the same, but the smaller details vary. If I remove these redundancies, the dataset is very small (hundreds of samples), but if I include them, it easily crosses thousands. Which is the right approach?
SAMPLE DATA
This is actually not a question suited for Stack Overflow, but I'll answer anyway:
You have to think about how the emails (or whatever your data is) will look in real-life usage. Do you want to detect any kind of spam, or only spam similar to what your sample data shows? If the former, your dataset is simply not suited to the problem, since it does not contain enough varied samples. When you think about it, all of the sentences are essentially the same, because the company name isn't really valuable information and will probably not be learned as a feature by your RNN. So the information content is nearly identical, and since every input sample runs through the network multiple times (once per epoch), having almost the same sample many times doesn't really help.
So you shouldn't have one kind of almost-identical data sample dominating your dataset.
But as I said: if you primarily want to filter out "Dear customer, we wish you a ..." you can try it with this dataset, although you wouldn't really need an RNN to detect that. If you want to detect all kinds of spam, you should look for a new dataset, since ~100 unique samples are not enough. One way to check how much unique information you actually have is to collapse the templates, as in the sketch below. I hope that was helpful!
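As a rough illustration (not from the original question), here is a minimal Python sketch of collapsing near-duplicate template messages so you can see how many genuinely unique samples you have. The message texts and the COMPANY_NAMES list are hypothetical stand-ins for the variable parts of your templates:

    import re

    # Hypothetical variable parts of the templates (e.g. company names).
    COMPANY_NAMES = ["Acme Corp", "Globex", "Initech"]

    def normalize(message):
        # Replace the variable parts with a placeholder token so messages
        # differing only in those parts map to the same template.
        for name in COMPANY_NAMES:
            message = message.replace(name, "<COMPANY>")
        return re.sub(r"\s+", " ", message).strip().lower()

    messages = [
        "Dear customer, we wish you a happy new year! - Acme Corp",
        "Dear customer, we wish you a happy new year! - Globex",
        "Your invoice from Initech is attached.",
    ]

    unique_templates = {normalize(m) for m in messages}
    print(len(unique_templates))  # 2 -- the two greetings collapse into one

If the count of unique templates is tiny compared to the raw dataset size, that confirms the redundancy problem described above.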
To address the problem of non-IID data in federated learning, I read a paper that adds a new node with a different data domain and transfers knowledge from the decentralized nodes. My question is: what is the information transferred, the model updates or the data itself?
In layman's terms, non-IID means that the class labels are not distributed evenly between clients for training. For obvious reasons, in a federated environment it is not feasible for every client to hold and train on IID data. Regarding your specific query about how it works in the paper you mention, please share a link to the paper.
I am looking to implement machine learning for problems built on small data sets related to approvals of expenses in a specific supply chain domain. Typically, labelled data is unavailable.
I was looking to build a model on one data set where I do have labelled data, and then use that model in similar contexts where the feature set is very similar, but not identical. The expectation is that this provides a starting point for recommendations while labelled data is gathered in the new context.
I understand this is the essence of transfer learning. Most of the examples I have read in this domain deal with image data sets. Any guidance on how this can be leveraged for small data sets using standard tree-based classification algorithms?
I can't really speak to tree-based algos; I don't know how to do transfer learning with them. But for deep learning models, the customary method for transfer learning is to load a pretrained model, retrain its last layer on your new data, and then fine-tune the rest of the network.
If you don’t have much data to go on, you might look into creating synthetic data.
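For concreteness, here is a minimal Keras sketch of that recipe (train a new last layer first, then unfreeze and fine-tune the rest). The random arrays are hypothetical stand-ins for a real labelled dataset, and MobileNetV2 is just one example of a pretrained backbone:

    import numpy as np
    import tensorflow as tf

    # Hypothetical tiny labelled set standing in for real data.
    x = np.random.rand(32, 224, 224, 3).astype("float32")
    y = np.random.randint(0, 2, size=32)

    base = tf.keras.applications.MobileNetV2(
        include_top=False, pooling="avg", input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pretrained backbone

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(2, activation="softmax"),  # new task head
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x, y, epochs=1)  # step 1: train only the new head

    base.trainable = True  # step 2: unfreeze and fine-tune everything
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # small LR
                  loss="sparse_categorical_crossentropy")
    model.fit(x, y, epochs=1)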
raghu, I believe you are looking for a kernel method when you say "abstraction layer in deep learning". Several ML algorithms support kernel functions, so you might be able to do it with them, but using kernel functions might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using a decision tree.
Sorry, I want to add a comment, but they won't allow me, so I posted a new answer.
OK, with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another, similar dataset. All you would need to do is change the terms/nodes in the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree at your accounting firm, so instead of workboots you change that term to laptops, because accountants should be buying their own.
Does that make sense, and is that helpful to you?
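To make the idea concrete, here is a minimal scikit-learn sketch of that workboots-to-laptops mapping. All column names, categories, and numbers are made up for illustration:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Construction-company expenses: workboots are always denied.
    construction = pd.DataFrame({
        "category": ["workboots", "cement", "workboots", "scaffolding"],
        "amount":   [120, 900, 950, 150],
        "approved": [0, 1, 0, 1],
    })
    codes = {"workboots": 0, "cement": 1, "scaffolding": 2}
    X = construction[["category", "amount"]].assign(
        category=construction["category"].map(codes))
    tree = DecisionTreeClassifier().fit(X, construction["approved"])

    # At the accounting firm, "laptops" play the role workboots played.
    accounting = pd.DataFrame({"category": ["laptops", "software"],
                               "amount": [1100, 300]})
    analog = {"laptops": "workboots", "software": "cement"}
    X_new = accounting.assign(
        category=accounting["category"].map(analog).map(codes))
    print(tree.predict(X_new))  # [0 1]: laptops denied, software approved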
After some research, we have decided to proceed with random forest models, with the intuition that the trees in the original model that use common features will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees comprising (a) only new features and (b) combinations of old and new features.
This has provided reasonable results in initial trials.
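For what it's worth, here is a minimal sketch of one way to approximate that strategy with scikit-learn's warm_start mechanism, which keeps the original trees and appends new trees trained on the newly labelled data. The random arrays are hypothetical placeholders:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_old, y_old = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
    X_new, y_new = rng.normal(size=(40, 5)), rng.integers(0, 2, 40)

    forest = RandomForestClassifier(n_estimators=100, warm_start=True,
                                    random_state=0)
    forest.fit(X_old, y_old)  # trees 1-100: original context

    forest.n_estimators = 120  # grow 20 extra trees...
    forest.fit(X_new, y_new)   # ...trained only on the new-context labels

    print(len(forest.estimators_))  # 120: old trees kept, new ones appended

Note this appends rather than replaces trees, so it is only an approximation of the replacement scheme described above.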
My problem statement is this: the user is facing a problem and inputs a basic description of it. Based on keywords in that input, the program asks more specific questions to find out more about the problem. Then, based on the answers, the program asks questions more specific to the keywords provided in the second answer.
In the end the program will be able to produce a detailed description of the problem that the user is facing.
My question is that I am trying to do this via machine learning and want to know what category it falls under.
Is it neural networks, or supervised or unsupervised learning, or something else completely?
This sounds related to Natural Language Processing (NLP), the field that deals with text understanding and human-computer interaction using natural language. There are both deep learning methods that use neural networks and statistical methods. Your task sounds similar to a chatbot, I guess. With a quick search I found this:
https://uxplanet.org/nlp-vs-ci-who-is-the-king-of-chatbot-2f9d2e09f085. Hope this helps.
The question is not very clear, but I will try to give an answer based on my understanding.
My general guess is that you are talking about a chatbot that tries to generate a problem description by carrying on a conversation with a user. This falls under the category of natural language processing (NLP).
Q. Is it Neural networks or supervised or unsupervised or completely something else?
It depends on how you model your problem in terms of data. If you have data, you can formulate it as a supervised problem and use a neural network to solve it.
But I suspect you don't have a dataset for this particular task. In that case, you can look into the many existing question-answering datasets. Otherwise, the problem can also be solved in an algorithmic way, in which case no data is required.
As far as I can understand, a user is interacting with a bot. If that is the case, it is related to Natural Language Processing (NLP), since the data is in unstructured form, i.e. text. This type of problem can be tackled with supervised learning, including neural networks, depending on the complexity of the data and your objective.
Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
That may be minimal error, but it is also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this creates some false positives'? There doesn't seem to be much discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data; I think this is the best approach. You could go out into the real world and gather more data for the minority class (e.g. find more pictures of zebras). You could also generate more data by making copies, or by duplicating samples with transformations (e.g. flipping them horizontally).
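As a small illustration of the duplicate-with-transformation idea, here is a sketch that horizontally flips minority-class images to double their count. The random array is a hypothetical stand-in for real zebra photos:

    import numpy as np

    # Placeholder minority class: (n, height, width, channels).
    zebra_images = np.random.rand(50, 64, 64, 3)

    flipped = zebra_images[:, :, ::-1, :]  # mirror along the width axis
    augmented = np.concatenate([zebra_images, flipped])
    print(augmented.shape)  # (100, 64, 64, 3) -- twice as many zebras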
You could also pick a classifier and evaluate it with an alternative performance metric instead of the usual one, accuracy. Look at precision, recall, and the F1 score.
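Here is a short sketch of why accuracy is misleading here, using made-up labels for an imbalanced horse (0) / zebra (1) split:

    from sklearn.metrics import accuracy_score, classification_report

    y_true = [0] * 95 + [1] * 5  # 95 horses, 5 zebras
    y_pred = [0] * 100           # the degenerate "always a horse" model

    print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
    print(classification_report(y_true, y_pred, zero_division=0))
    # ...but recall and F1 for class 1 are 0.0: it never finds a zebra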
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
With this type of unbalanced data problem, it is a good approach to learn the patterns associated with each class, as opposed to simply comparing classes. This can be done via unsupervised learning first (such as with autoencoders); a good article on this is available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion: after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (i.e. many zebra errors).
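As a concrete example of that last suggestion, here is a minimal sketch with hypothetical labels (0 = horse, 1 = zebra) showing how the confusion matrix reveals where the errors concentrate:

    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 0, 1, 1]

    print(confusion_matrix(y_true, y_pred))
    # [[4 0]
    #  [2 2]]  -- two zebras misread as horses: gather more zebra examples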