I have an Excel sheet with three columns: x1, x2 and x3. x1 and x2 contain questions and x3 contains the answers, row by row: the questions in the first row of x1 and x2 are answered by the first row of x3. x1 and x2 hold a mixture of numerical and text data, and some values are NA.
I have to use NLP techniques to solve this: if I type the x1 and x2 questions, the system should return the x3 answer. The questions will not be typed as full statements but as a few selected words, so a few selected keywords should also be enough to return the answer. Where and how should I start? Any guidance or suggestions would be appreciated.
It sounds (your question is a bit unclear) as though you have a bunch of mixed data types, and you only want to process x1 = some text1 + x2 = some text2 -> x3 = some answer text.
I would recommend first cleaning up your data. You can easily remove NAs or NaNs by loading your data into a pandas DataFrame (I'm unsure what language you're using). If you're using Python, you can also easily remove the numeric information with the str.isdigit method.
I'm not entirely sure what you're trying to do, so I can't really recommend things to do after cleaning your data. Posting 2 examples of a proper and improper x1, x2 and x3 might be helpful.
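The cleaning step described above could look roughly like the sketch below. The sample rows and the helper name strip_numeric_tokens are made up for illustration, and dropping incomplete rows is only one option (you might prefer to fill NA with an empty string instead).

```python
import pandas as pd

# Hypothetical sample data standing in for the Excel sheet.
df = pd.DataFrame({
    "x1": ["reset password 123", None, "billing cycle"],
    "x2": ["account 42", "refund policy", None],
    "x3": ["Use the reset link.", "Refunds take 5 days.", "Billed monthly."],
})

# Drop rows where either question column is missing.
clean = df.dropna(subset=["x1", "x2"]).copy()

def strip_numeric_tokens(text):
    """Remove whitespace-separated tokens that consist only of digits."""
    return " ".join(tok for tok in str(text).split() if not tok.isdigit())

for col in ["x1", "x2"]:
    clean[col] = clean[col].map(strip_numeric_tokens)

print(clean)
```

With a real file you would replace the hand-built DataFrame with pd.read_excel on your sheet.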
Given a data set, such as:
(FirstName, LastName, Sex, DateOfBirth, HairColor, EyeColor, Height, Weight, Location)
that some model can train on, what kind of Machine Learning paradigm can be used to predict missing values if only given some of them?
Example:
Given:
(FirstName: John, LastName: Doe, Sex: M, Height: (5,10))
What model could predict the missing values?
(DateOfBirth, HairColor, EyeColor, Weight, Location)
In other words, the model should be able to take any of the fields as inputs, and "fill in" any that are missing.
And what type of ML/DL is this even called?
If you're looking to fill missing values with an algorithm, this is called imputing missing data. If you're using Python, the scikit-learn library has a number of imputation algorithms that you can explore in the docs.
One nice algorithm is KNNImputer, which finds the n_neighbors observations most similar to the current observation and fills each missing value with the mean of that column over those neighbors. Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html
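As a minimal sketch of how KNNImputer is used, here is a small numeric example. The array is made up; categorical fields such as HairColor would need to be encoded as numbers first, which this sketch does not cover.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data with two missing entries (np.nan).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that column
# over the 2 nearest rows that have the value observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Observed entries are left unchanged; only the NaN cells are filled.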
If a row has a lot of missing values, first ask whether keeping it adds value to your problem. If not, drop the rows that have many missing values.
One way to handle this: set the target variable aside. Using the features that have no missing values, predict the columns that do, filling those values with an ML algorithm. Then use the newly imputed columns as inputs to predict the remaining missing values.
E.g., if the features and target are: X1, X2, X3, X4, Y
Suppose X1 and X2 have no missing values, while X3 and X4 do.
First, keep aside Y. Using X1 and X2, fill missing values in X3 with the help of ML algorithms. Again, using X1, X2, X3 fill missing values in X4. Then finally predict the target values (Y).
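The chained filling of X3 and then X4 can be sketched as follows. This assumes plain least-squares regression as a stand-in for "an ML algorithm", and the synthetic data and the helper name fill_column are hypothetical; any regressor could take its place.

```python
import numpy as np

def fill_column(data, target, predictors):
    """Fill NaNs in column `target` with a least-squares fit on `predictors`."""
    missing = np.isnan(data[:, target])
    # Design matrix: predictor columns plus an intercept column.
    A = np.column_stack([data[:, predictors], np.ones(len(data))])
    coef, *_ = np.linalg.lstsq(A[~missing], data[~missing, target], rcond=None)
    data[missing, target] = A[missing] @ coef
    return data

# Synthetic features: X3 and X4 depend linearly on the others (an assumption).
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = 2 * x1 - x2
x4 = x1 + x3
data = np.column_stack([x1, x2, x3, x4])
truth = data.copy()

# Knock out some values in X3 and X4.
data[:5, 2] = np.nan
data[3:10, 3] = np.nan

fill_column(data, target=2, predictors=[0, 1])      # X3 from X1, X2
fill_column(data, target=3, predictors=[0, 1, 2])   # X4 from X1, X2, X3
```

The order matters: X4 is filled only after X3 is complete, so the imputed X3 values can serve as inputs.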
I have used this method in hackathons and got good results. Before applying this, first, try to get a good understanding of the data. The approach might be slightly different from what you have asked, but this is a decent approach for such problems.
Understanding Polynomial Regression.
I understand that we use polynomial regression for non-linear data sets, to fit a curve. I know the equation for polynomial regression with a single independent variable, but I don't really understand how the equation is constructed for 2 variables:
y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1^2 + a5 * x2^2
What will the equation for polynomial regression be if we have 3 or more variables? What is the logic behind constructing these polynomial equations for more than one variable?
You can choose whatever you want, but the general 'formula' (to the best of my own experience and knowledge) is:
Powers (so x1, x1^2, x1^3 etc) up to whichever number you choose (many stop at 2).
Cross products (x1 * x2, x1 * x3 etc)
Combinations (x1^2 * x2, x1 * x2^2 etc) and then you can even add higher combinations (x1 * x2 * x3, and you can even add powers here).
But this gets quickly out of hand, and you can end up with too many features.
I would stick to powers up to 2 and cross products (pairs only) with no powers, much like your example. If you have three variables, you could add the product of all three, but with more than three I wouldn't bother with triplets.
The idea with polynomials is that you model the complex relationship between the features; polynomials are sometimes a good approximation to more complex relationships (that are not really polynomial in nature).
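To see which terms appear for 3 variables, here is a small sketch that enumerates every monomial up to a chosen degree. The function name polynomial_terms is made up for illustration; each term it lists would get its own coefficient in the regression (plus an intercept).

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree):
    """List all monomials of the given variables with total degree 1..degree."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

terms = polynomial_terms(["x1", "x2", "x3"], 2)
print(terms)
```

For two variables and degree 2 this gives the five terms in the question's equation (x1, x2, x1*x1, x1*x2, x2*x2). The count grows quickly with the degree and the number of variables, which is why it makes sense to cap both.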
I hope this is what you meant and that this can help you.
I took a sentence from someone's notes, and I am now wondering how this statement can be valid:
In constructing a decision tree for noise-free data, if a good feature
has not been selected for root, we still can create a consistent
hypothesis.
It doesn't make sense to me: why can we still create a consistent decision tree in this condition?
Remark: if f is the target function, we say hypothesis h is consistent if it agrees with f on all examples
Unless I have misunderstood the question, I would say that statement is true.
It is not the simplest thing to come up with a quick example but here it is (sorry for my poor paint skills).
In this picture we can see how choosing as root the feature x2 returns the max information gain and allows us to find a consistent hypothesis with a minimal decision tree.
In fact h(x) = "cross" if x2 > 1.
This does not prevent us from finding a consistent hypothesis when choosing the worst feature, x1, as root.
Going this way we would have an initial branching over x1:
x1 < 1: from here we would then branch on x2.
x1 > 1: from here we would do the same.
This way we would obtain h(x) = "cross" if x1 < 1 && x2 > 1. h(x) = "circle" otherwise. Thus a consistent hypothesis again.
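Since the picture is not reproduced here, the following sketch uses a hypothetical set of points chosen so that both hypotheses above agree with every example. It only illustrates the argument: a poor root costs extra depth, not consistency.

```python
# Hypothetical toy data (x1, x2, label); not the points from the original picture.
data = [
    (0.5, 1.5, "cross"),
    (0.3, 2.0, "cross"),
    (0.5, 0.5, "circle"),
    (1.5, 0.5, "circle"),
    (1.8, 0.2, "circle"),
]

def tree_root_x2(x1, x2):
    """Minimal tree: a single split on the informative feature x2."""
    return "cross" if x2 > 1 else "circle"

def tree_root_x1(x1, x2):
    """Deeper tree: split on x1 first, then on x2 where needed."""
    if x1 < 1:
        return "cross" if x2 > 1 else "circle"
    return "circle"  # every example in this branch is a circle

def is_consistent(h, examples):
    """A hypothesis is consistent if it matches the label of every example."""
    return all(h(x1, x2) == label for x1, x2, label in examples)

print(is_consistent(tree_root_x2, data), is_consistent(tree_root_x1, data))
```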
I am working on a data mining project and I have to design the following model.
I am given 4 features x1, x2, x3 and x4 and four functions defined on these
features such that each function depends on some subset of the available features.
e.g.
F1(x1, x2) =x1^2+2x2^2
F2(x2, x3) =2x2^2+3x3^3
F3(x3, x4) =3x3^3+4x4^4
This means F1 is a function that depends on features x1 and x2, F2 is a function that depends on x2 and x3, and so on.
Now I have a training data set in which the values of x1, x2, x3, x4 and the sum (F1+F2+F3) are known (I know the total sum but not the individual value of each function).
Using this training data I have to build a model that can correctly predict the total sum of all the functions, i.e. (F1+F2+F3).
I am new to data mining and machine learning, so I apologize in advance if this question is too trivial or wrong. I have tried to model it in many ways but I have no clear idea how to approach it. I would appreciate any help.
Your problem is non-linear regression. You have features
x1 x2 x3 x4 S where (S = sum(F1+F2+F3) )
You would like to predict S from x1..x4, but S is a non-linear function of them.
Since S is non-linear, you need a non-linear regression algorithm for this problem. Standard non-linear regression may solve it, or you may choose other approaches, for example tree regression or MARS (Multivariate Adaptive Regression Splines). These are well-known algorithms and you can find both commercial and open-source implementations.
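As a simpler illustration than tree regression or MARS, here is a sketch that assumes the functional forms from the question's example are known (x1^2, x2^2, x3^3, x4^4) and only the coefficients are not. Under that assumption S is linear in the transformed features, so ordinary least squares on those features is enough. The training data is synthetic.

```python
import numpy as np

# Synthetic training data: feature values are observed, only the total S is known.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 4))
x1, x2, x3, x4 = X.T

# Total of the example functions F1 + F2 + F3 from the question.
S = (x1**2 + 2 * x2**2) + (2 * x2**2 + 3 * x3**3) + (3 * x3**3 + 4 * x4**4)

# Assumed basis: the powers that appear in the example functions.
basis = np.column_stack([x1**2, x2**2, x3**3, x4**4])
coef, *_ = np.linalg.lstsq(basis, S, rcond=None)
print(coef)  # combined coefficients of x1^2, x2^2, x3^3, x4^4

predicted = basis @ coef
```

The fit recovers only the combined coefficient of each term, not how it is split between F1, F2 and F3, which is all that is needed to predict the total. If the functional forms are unknown, a generic non-linear regressor such as the ones named above is the more appropriate choice.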
I'm referring to this paper: An Efficient Boosting Algorithm for Combining Preferences (http://www.ai.mit.edu/projects/jmlr/papers/volume4/freund03a/freund03a.pdf)
and implementing Section 3.3, "An efficient implementation for bipartite feedback".
Now I'm confused about the distribution v when I have data from more than one query, e.g. Q1 returns some docs (A, B, C) out of (A ... Z) as 1 (relevant) and the others as 0, while Q2 returns (D, N, V) as relevant. Hence the X0 and X1 sets differ from query to query.
So I'm confused about the initialization of v, the update of v, and calculating the potential (pi). If anyone can suggest something, that would be great.
EDIT:
I was thinking of making the distribution (v over x0 and x1) span all queries such that it sums to 1.0, and updating the distribution in a similar way ... please share your thoughts!
EDIT2:
If anyone is coming here: EDIT1 works!