Cleaning text data for NLP tasks - machine-learning

This morning I've been trying to train a chatbot on the Cornell Movie--Dialogs Corpus dataset, but I'm having trouble cleaning the text data to feed into my algorithm.
Here is a snippet from the text file:
L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
I am only interested in the dialog text at the end of each line.
How can I clean this file and convert it to a CSV document?
Dataset link:
http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

Iterate over the lines of the file, treating each one as a string.
Let's say you have:
line = "+++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
and you want the output "They do not!".
Do this:
line.split("+++$+++ ")[-1]
This returns the text after the last separator, which is the desired output. Once you have the output strings, write them line by line to your .csv file.
Hope this helps.
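A minimal sketch of the split-and-write approach described above, using only the standard library. The function names and the "utterance" column header are illustrative assumptions, not part of the corpus or of the answer.

```python
import csv
import io

SEPARATOR = " +++$+++ "  # field separator seen in the snippet from the question


def extract_utterance(line):
    """Return the text after the last separator, without the trailing newline."""
    return line.rstrip("\n").split(SEPARATOR)[-1]


def write_utterances(lines, out_file):
    """Write one utterance per CSV row; csv.writer quotes commas when needed."""
    writer = csv.writer(out_file)
    writer.writerow(["utterance"])  # hypothetical column name
    for line in lines:
        writer.writerow([extract_utterance(line)])


# Two lines copied from the question, written to an in-memory buffer.
sample = [
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n",
    "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.\n",
]
buffer = io.StringIO()
write_utterances(sample, buffer)
print(buffer.getvalue())
```

With the real file, an open file object can be passed as lines and an output file opened with newline="" as out_file. The question does not state the file's encoding, so it may need to be given explicitly when opening it.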

Well, you can do this using a simple regex.
Code snippet:
import re
string = "+++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
cleaned = " ".join(re.findall("[a-zA-Z]+", string))
print(cleaned)
Output:
u m BIANCA They do not
Note that this keeps every run of letters, so the letters of the IDs and the speaker name stay in the result, while digits and punctuation are dropped.
To apply it to every line, I suggest you load your data into a pandas DataFrame and use the .apply() method to do the cleaning.
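Since the character-class regex above also keeps the IDs and the speaker name, here is a hedged alternative sketch that splits on the separator itself and keeps each field. The field names in FIELD_NAMES are assumptions made for illustration; the question does not name the columns.

```python
import re

# The separator, escaped for use in a regex, with optional surrounding whitespace.
FIELD_SEP = re.compile(r"\s*\+\+\+\$\+\+\+\s*")

# Assumed field names; they are not given in the question.
FIELD_NAMES = ("line_id", "character_id", "movie_id", "character_name", "text")


def parse_line(line):
    """Split one corpus line into a dict of its five fields."""
    parts = FIELD_SEP.split(line.strip(), maxsplit=len(FIELD_NAMES) - 1)
    return dict(zip(FIELD_NAMES, parts))


record = parse_line("L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!")
print(record["text"])
```

Limiting the split to four separators means any later occurrence of the separator inside the dialog text would stay in the text field rather than create an extra column.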

Try this library for basic cleaning: https://pypi.org/project/textcleaner/
It has a function named remove_symbols(); you can also pass a list instead of a file as the argument.
Below is the documentation link for this function:
https://yugantm.github.io/textcleaner/documentation.html#remove_symbols
The library has many other functions for cleaning text data.
I hope this helps :)

You can separate the text into columns using '+++$+++' as the separator. The separator is passed as a regex, so its special characters are escaped in a raw string:
import pandas as pd
df = pd.read_csv('training_data/movie_lines.txt', sep=r'\+\+\+\$\+\+\+', engine='python', header=None)
You will get something like the output below; then you can drop the columns that you don't want to use.
       0   1   2        3             4
0  L1045  u0  m0   BIANCA  They do not!
1  L1044  u2  m0  CAMERON   They do to!
2   L985  u0  m0   BIANCA    I hope so.
3   L984  u2  m0  CAMERON     She okay?
4   L925  u0  m0   BIANCA     Let's go.
Now, to drop the unnecessary columns, use:
df.drop(df.columns[0:4], axis=1, inplace=True)
This leaves only the text column:
              4
0  They do not!
1   They do to!
2    I hope so.
3     She okay?
4     Let's go.

The pattern is right there! +++$+++. Split it on that and you will get CSV data.

Related

Analytic solution to an equation including the error function in Maxima

Maxima does not seem to come up with an analytic solution to this equation which includes the error function. The independent variable here is "p" and the dependent variable to be solved for is "x".
See the linked illustration of the equation.
(%i3) solveexplicit:true$ ratprint:false$ fpprintprec:6$
(%i4) eqn: (sqrt(%pi)*(25*2^(3/2)*p-25*sqrt(2))*erf(1/(25*2^(3/2)*x))*x+1)/(25*p) = 0.04;
(%i5) solve (eqn, x);
(%o5) []
(%i6) eqn, [p=2,x=0.00532014],numer;
(%o6) 0.04=0.04
Any help or pointing in the right direction is appreciated.
As far as I know, Maxima can't solve equations containing erf. You can get a numerical result via find_root:
(%i5) find_root (eqn, x, 0.001, 0.999), p=2;
(%o5) 0.005320136894034347
As for symbolic solutions, I worked with the equation a little bit. One can get it into the form erf(something/x)*x = otherstuff, or equivalently erf(y) = somethingelse*y where y = something/x and somethingelse = otherstuff/something if I'm not mistaken. I don't know anything in particular about equations of that form, but perhaps you can find something.
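For readers who want to check the numerical root outside Maxima, here is a small sketch that applies plain bisection to the same equation with p = 2 on the same bracket. It is not how find_root works internally; the helper names residual and bisect are made up for this example.

```python
import math


def residual(x, p):
    """Left-hand side minus right-hand side of the equation from the question."""
    c = 25 * 2 ** 1.5
    lhs = (math.sqrt(math.pi) * (c * p - 25 * math.sqrt(2))
           * math.erf(1 / (c * x)) * x + 1) / (25 * p)
    return lhs - 0.04


def bisect(f, lo, hi, tol=1e-14, max_iter=200):
    """Plain bisection; f(lo) and f(hi) must have opposite signs."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo * f_hi > 0:
        raise ValueError("root is not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0 or hi - lo < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid  # sign change is in the lower half
        else:
            lo, f_lo = mid, f_mid  # sign change is in the upper half
    return 0.5 * (lo + hi)


# Same bracket as the find_root call: x in [0.001, 0.999], p = 2.
root = bisect(lambda x: residual(x, p=2), 0.001, 0.999)
print(root)
```

The result should agree with the find_root value quoted above to many digits, which is a quick sanity check that the equation was transcribed correctly.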
Yes, solve can only handle polynomials. I used the series expansion of erf for small values of x, and the accuracy is good enough.
(%i11) seriesE: 1$
termE: erf(x)$
for p: 1 unless p > 3 do
    (termE: diff (termE, x)/p,
     seriesE: seriesE + subst (x=0, termE)*x^p)$
seriesE;
(%o11) -(2*x^3)/(3*sqrt(%pi))+(2*x)/sqrt(%pi)+1
However, the "Expression longer than allowed by the configuration setting!"

Maxima does not solve the system sqrt(x)=1, y=1 with the solve function

I'm trying to solve a system of equations containing square roots in Maxima, for example:
solve([sqrt(x) = 1, y = 1], [x,y]);
But Maxima says that this system has no solutions. On the other hand, Maxima is able to solve this equation:
solve([sqrt(x) = 1], [x]);
Can I solve systems like the one above in Maxima?
The built-in solve has serious limitations. The add-on function to_poly_solve can solve equations which contain radicals; I don't know what its limitations are.
(%i2) load (to_poly_solve);
(%o2) /usr/local/share/maxima/5.40.0/share/to_poly_solve/to_poly_solve.mac
(%i3) to_poly_solve ([sqrt(x) = 1, y = 1], [x,y]);
(%o3) %union([x = 1, y = 1])
%union means a union of solutions. Since there is only one solution found, %union could be simplified away; its presence is perhaps a little inconvenient, but not incorrect.

Minimize an equation using OpenCV

I need to solve the following equation:
I know the matrix G. How can I find the matrix p subject to ||p|| = 1?
Currently I am solving it in OpenCV as follows:
Mat w, u, EigenVectors;
SVD::compute(A, w, u, EigenVectors);
Mat p = EigenVectors.row(EigenVectors.rows-1);
I want to know how I can ensure the condition ||p|| = 1.
I also want to know the significance and meaning of the other rows/columns of EigenVectors (transposed).
I believe you can use cv::SVD::solveZ(). It finds a unit-length solution x of a singular linear system A * x = 0.
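The equation itself is missing from the question, so the sketch below assumes the usual form of this problem: minimize ||G p|| subject to ||p|| = 1. Under that assumption the minimizer is the right singular vector for the smallest singular value, which is also the row the question's code picks. The example uses NumPy rather than OpenCV, and the matrix G here is random placeholder data.

```python
import numpy as np

# Hypothetical 5x3 matrix standing in for G; the real G is not shown in the question.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))

# Rows of vt are the right singular vectors, ordered by decreasing singular value.
u, w, vt = np.linalg.svd(G)
p = vt[-1]  # right singular vector for the smallest singular value

print(np.linalg.norm(p))      # 1 up to rounding, because vt is an orthogonal matrix
print(np.linalg.norm(G @ p))  # equals the smallest singular value w[-1]
```

Because the rows of vt are orthonormal, the constraint ||p|| = 1 holds without any extra normalization step. The other rows are the remaining right singular vectors: row i is a unit vector v with ||G v|| = w[i], so earlier rows correspond to directions that G stretches more.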
It looks like you need to use the method of Lagrange multipliers.
As far as I know, OpenCV has no ready-to-use tools for that.
A good example for MATLAB: Lagrange Multipliers

More compact Solution in Maxima

I have the following code:
for n:1 thru 11 do for j:1 thru 21 do v[n,j]:1/sqrt(dp)*
(sum(eigenfunctionsort[n,j]*exp(%i*2*%pi*m*x/dp),m,-10,10));
Here eigenfunctionsort is defined earlier, x is a variable I will integrate over later, and I am summing over m.
When I print, say, v[1,1], I get a big, long, nasty equation. How can I have Maxima boil this down into something meaningful so I can check my results?
Best,
Ben
Try the 'trigsimp' function or maybe map(trigsimp, your_expression). Not sure it will help, but it's worth a try. Also look at 'demoivre'.
I don't know what your vector eigenfunctionsort or your "big long nasty equation" looks like, but I often get complex eigenvalues and eigenvectors from Maxima even when I know they should be simple and real.
For example,
(%i1) A : matrix([1, 4, 1], [4, 1, 9], [1, 9, 1]);
(%i2) eigenvalues(A);
makes a mess. It can be simplified by applying rectform to transform the output to Cartesian form, followed by trigreduce to reduce the imaginary part of the result. Finally, you can convert the result into floating point:
(%i3) rectform(%)$
(%i4) trigreduce(%)$
(%i5) float(%);

Curve fitting on Lua

I'm looking for algorithms to do curve fitting from tabular XY data to a Gaussian function (a.k.a. bell curve). By googling I can find a few Gaussian fitting algorithms for Matlab; here are a couple of them:
https://ccrma.stanford.edu/~jos/sasp/Fitting_Gaussian_Data.html
http://jila.colorado.edu/bec/BEC_for_everyone/matlabfitting.htm
One seems to use Matlab's 'polyfit' function for the job.
Has anyone seen a ready-made algorithm for Lua (either Gaussian or polyfit)? If not, I would greatly appreciate help creating or porting such an algorithm, as it would probably take me a day with my limited Lua skills.
Here's my attempt to extract a Gaussian fit from noisy measurement data.
require 'gsl'
require 'math'
-- x = x coordinates, y = y coordinates
-- clip = relative clip/ignore level 0..1 (e.g. 0.1 removes values below 10% of max amplitude)
-- removeoffset = set to true if the y data offset should be removed
function gaussianFit( x, y, clip, removeoffset )
    local xx = {}
    local yy = {}
    local yoffset = 0
    if removeoffset then -- remove y data offset
        yoffset = gsl.Vector(y):min()
    end
    local ymax = gsl.Vector(y):max() - yoffset
    -- pick only data points whose y coordinate is larger than the clip level
    for i = 1, #x do
        if (y[i] - yoffset) > (clip * ymax) then
            table.insert(xx, x[i])
            -- take the log so the Gaussian becomes a polynomial in x
            table.insert(yy, math.log(y[i] - yoffset))
        end
    end
    local xvect = gsl.Vector(xx)
    local yvect = gsl.Vector(yy)
    -- fit to polynomial
    local poly3 = gsl.fit.poly(3) -- a third degree polynomial
    local fit = gsl.lsfit({xvect, poly3}, yvect, nil, "fmulti") -- fits xx and yy with poly3
    -- convert to Gaussian coefficients: ln(y) = A0 + A1*x + A2*x^2
    local A2 = fit:coeffs()[3]
    local A1 = fit:coeffs()[2]
    local A0 = fit:coeffs()[1]
    local sigma = math.sqrt(-1 / (2 * A2))
    local mu = A1 * math.pow(sigma, 2)
    local A = math.exp(A0 + math.pow(mu, 2) / (2 * math.pow(sigma, 2)))
    return sigma, mu, A
end
xx = {1, 2, 3, 4, 5, 6, 7, 8, 9}
yy = {1, 2, 4, 6, 4, 3, 2, 1, 1}
sigma, mu, A = gaussianFit(xx, yy, 0.1, false)
print(sigma.." "..mu.." "..A)
--prints 2.2829275461334 4.6387484511153 4.201115115886
You can rearrange the equation to a linear form, and then use the methods described in Linear Regression by Paul Bourke (a little down the page).
If you need, I can demonstrate the rearranging process for you.
If you really need, I can provide an implementation of the line-of-best-fit algorithm in Lua.
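To make the rearrangement concrete: taking the logarithm of y = A * exp(-(x - mu)^2 / (2 * sigma^2)) gives ln(y) = a0 + a1*x + a2*x^2, which is linear in the coefficients and can be fitted by ordinary least squares. The sketch below shows this in Python with NumPy rather than Lua. The function name gaussian_fit and the synthetic test parameters are assumptions chosen for illustration, and it skips the clipping and offset handling of the Lua attempt above.

```python
import numpy as np


def gaussian_fit(x, y):
    """Fit y = A * exp(-(x - mu)**2 / (2 * sigma**2)) via a quadratic fit to ln(y).

    Requires every y to be positive. Returns (sigma, mu, A).
    """
    # np.polyfit returns the highest-degree coefficient first.
    a2, a1, a0 = np.polyfit(x, np.log(y), 2)  # ln y = a0 + a1*x + a2*x**2
    if a2 >= 0:
        raise ValueError("quadratic term must be negative for a peak-shaped curve")
    sigma = np.sqrt(-1.0 / (2.0 * a2))
    mu = a1 * sigma ** 2
    amplitude = np.exp(a0 + mu ** 2 / (2.0 * sigma ** 2))
    return sigma, mu, amplitude


# Noise-free synthetic data with assumed parameters sigma=1.5, mu=4.0, A=6.0.
x = np.arange(1.0, 10.0)
y = 6.0 * np.exp(-(x - 4.0) ** 2 / (2.0 * 1.5 ** 2))
sigma, mu, amplitude = gaussian_fit(x, y)
print(sigma, mu, amplitude)
```

On noisy data the logarithm amplifies errors in small y values, which is why clipping low-amplitude points before fitting, as the Lua code does, can help.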

Resources