I'm working with maltparser, nltk for process texts. Well i have a integration between maltparser and nltk that works fine. But since every time i execute the program nltk call java VE this take a lot of time... So i think make a webservice who takes conll .txt and return conll parsed by java app.
Well the problem come when i test examples from maltparser sources. I pick one from just initialize model and parser a array of tokens. I just change de model to the regular english one (engmalt.linear-1.7.mco). So execute and return the sentences just like input.
The code is this
public static void main(String[] args) {
// Loading the Swedish model swemalt-mini
ConcurrentMaltParserModel model = null;
try {
URL swemaltMiniModelURL = new File("inputs/engmalt.linear-1.7.mco").toURI().toURL();
System.out.println(swemaltMiniModelURL.getFile());
model = ConcurrentMaltParserService.initializeParserModel(swemaltMiniModelURL);
} catch (Exception e) {
e.printStackTrace();
}
// Creates an array of tokens, which contains the Swedish sentence 'Samtidigt får du högsta sparränta plus en skattefri sparpremie.'
// in the CoNLL data format.
String[] tokens = new String[5];
tokens[0] = "1\tThis\t_\tDT\tDT\t_\t0\ta\t_\t_";
System.out.println(tokens[0]);
tokens[1] = "2\tis\t_\tVBZ\tVBZ\t_\t0\ta\t_\t_";
System.out.println(tokens[1]);
tokens[2] = "3\ta\t_\tZ\tZ\t_\t0\ta\t_\t_";
System.out.println(tokens[2]);
tokens[3] = "4\ttest\t_\tNN\tNN\t_\t0\ta\t_\t_";
System.out.println(tokens[3]);
tokens[4] = "5\t.\t_\tFp\tFp\t_\t0\ta\t_\t_";
System.out.println(tokens[4]);
try {
String[] outputTokens = model.parseTokens(tokens);
ConcurrentUtils.printTokens(outputTokens);
} catch (Exception e) {
e.printStackTrace();
}
}
and the output is:
/home/tomas/workspace/PruebaMalt/inputs/engmalt.linear-1.7.mco
1 This _ DT DT _ 0 a _ _
2 is _ VBZ VBZ _ 0 a _ _
3 a _ Z Z _ 0 a _ _
4 test _ NN NN _ 0 a _ _
5 . _ Fp Fp _ 0 a _ _
1 This _ DT DT _ 0 a _ _
2 is _ VBZ VBZ _ 0 a _ _
3 a _ Z Z _ 0 a _ _
4 test _ NN NN _ 0 a _ _
5 . _ Fp Fp _ 0 a _ _
I try with others models and languages and the same... Any suggestions? ty!
I discovered by myself. The problem is that nlkt send to java this format:
1 This _ DT DT _ 0 a _ _
and return: 1 This _ DT DT _ 2 SUBJ _ _
But in java the format is a little different, the last 2 _ has to be removed. With that, it'll work!
input: 1 This _ DT DT _
return: 1 This _ DT DT _ 2 SUBJ _ _
I hope this help others.
Related
I built and ran Syntaxnet successfully on a set of 1400 tweets. I have difficulty in understanding what each parameter in the parsed file means. For example, I have the sentence:
Shoutout #Aetna for covering my doctor visit. Love you!
for which the parsed file contents are:
1 Shoutout _ NOUN NNP _ 9 nsubj _ _
2 # _ ADP IN _ 1 prep _ _
3 Aetna _ NOUN NNP _ 2 pobj _ _
4 for _ ADP IN _ 1 prep _ _
5 covering _ VERB VBG _ 4 pcomp _ _
6 my _ PRON PRP$ _ 8 poss _ _
7 doctor _ NOUN NN _ 8 nn _ _
8 visit. _ NOUN NN _ 5 dobj _ _
9 Love _ VERB VBP _ 0 ROOT _ _
10 you _ PRON PRP _ 9 dobj _ _
11 ! _ . . _ 9 punct _ _
What exactly do each of the columns mean? Why are there blanks and numbers other than the POS tags?
This type of format is called CoNLL Format. There are various versions available of it. The meaning of each column is described here
I have used Stanford Parser to parse some of my already tokenized and POS tagged (by Stanford POS tagger with Gate Twitter model). But the resulting conll 2007 formatted output does not include any punctuations. Why is that?
The command I have used:
java -mx16g -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -tokenized -tagSeparator § -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -escaper edu.stanford.nlp.process.PTBEscapingProcessor -outputFormat conll2007 edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ..test.tagged > ../test.conll
e.g.
Original tweet:
bbc sp says they don't understand why the tories aren't 8% ahead in the polls given the current economics stats ; bbc bias ? surely not ?
POS tagged tweet, used as input for Stanford parser:
bbc§NN sp§NN says§VBZ they§PRP don't§VBP understand§VB why§WRB the§DT tories§NNS aren't§VBZ 8%§CD ahead§RB in§IN the§DT polls§NNS given§VBN the§DT current§JJ economics§NNS stats§NNS ;§: bbc§NN bias§NN ?§. surely§RB not§RB ?§.
Resulting conll 2007 formatted parse:
1 bbc _ NN NN _ 2 compound _ _
2 sp _ NN NN _ 3 nsubj _ _
3 says _ VBZ VBZ _ 0 root _ _
4 they _ PRP PRP _ 5 nsubj _ _
5 don't _ VBP VBP _ 3 ccomp _ _
6 understand _ VB VB _ 5 xcomp _ _
7 why _ WRB WRB _ 10 advmod _ _
8 the _ DT DT _ 9 det _ _
9 tories _ NNS NNS _ 10 nsubj _ _
10 aren't _ VBZ VBZ _ 6 ccomp _ _
11 8% _ CD CD _ 12 nmod:npmod _ _
12 ahead _ RB RB _ 15 advmod _ _
13 in _ IN IN _ 15 case _ _
14 the _ DT DT _ 15 det _ _
15 polls _ NNS NNS _ 10 nmod _ _
16 given _ VBN VBN _ 15 acl _ _
17 the _ DT DT _ 19 det _ _
18 current _ JJ JJ _ 19 amod _ _
19 economics _ NNS NNS _ 16 dobj _ _
20 stats _ NNS NNS _ 19 dep _ _
22 bbc _ NN NN _ 23 compound _ _
23 bias _ NN NN _ 20 dep _ _
25 surely _ RB RB _ 26 advmod _ _
26 not _ RB RB _ 16 neg _ _
As you can see, Most of the punctuations are not included in the parse. But why?
I think adding "-parse.keepPunct" to your command will fix this issue. Please let me know if that doesn't work.
Finally, found the answer, use
-outputFormatOptions includePunctuationDependencies
Have contacted Stanford parser and corenlp support long time ago, no response at all
I have a function that I use a lot and hence the performance needs to be as good as possible. It takes data from excel and then sums, averages or counts over parts of the data based on whether the data is within a certain period and whether it is a peak hour (Mo-Fr 8-20).
The data is usually around 30,000 rows and 2 columns (hourly date, value). One important feature of the data is that the date column is chronologically ordered
I have three implementations, c# with extension methods (dead slow and I m not going to show it unless somebody is interested).
Then I have this f# implementation:
let ispeak dts =
let newdts = DateTime.FromOADate dts
match newdts.DayOfWeek, newdts.Hour with
| DayOfWeek.Saturday, _ | DayOfWeek.Sunday, _ -> false
| _, h when h >= 8 && h < 20 -> true
| _ -> false
let internal isbetween a std edd =
match a with
| r when r >= std && r < edd+1. -> true
| _ -> false
[<ExcelFunction(Name="aggrF")>]
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
let newd =
[0 .. (Array.length data) - 1]
|> List.map (fun i -> (data.[i], data2.[i]))
|> Seq.filter (fun (date, _) ->
let dateInRange = isbetween date std edd
match pob with
| "Peak" -> ispeak date && dateInRange
| "Offpeak" -> not(ispeak date) && dateInRange
| _ -> dateInRange)
match sac with
| 0 -> newd |> Seq.averageBy (fun (_, value) -> value)
| 2 -> newd |> Seq.sumBy (fun (_, value) -> 1.0)
| _ -> newd |> Seq.sumBy (fun (_, value) -> value)
I see two issues with this:
I need to prepare the data because both date and value are double[]
I do not utilize the knowledge that dates are chronological hence I do unnecessary iterations.
Here comes now what I would call a brute force imperative c# version:
public static bool ispeak(double dats)
{
var dts = System.DateTime.FromOADate(dats);
if (dts.DayOfWeek != DayOfWeek.Sunday & dts.DayOfWeek != DayOfWeek.Saturday & dts.Hour > 7 & dts.Hour < 20)
return true;
else
return false;
}
[ExcelFunction(Description = "Aggregates HFC/EG into average or sum over period, start date inclusive, end date exclusive")]
public static double aggrI(double[] dts, double[] vals, double std, double edd, string pob, double sumavg)
{
double accsum = 0;
int acccounter = 0;
int indicator = 0;
bool peakbool = pob.Equals("Peak", StringComparison.OrdinalIgnoreCase);
bool offpeakbool = pob.Equals("Offpeak", StringComparison.OrdinalIgnoreCase);
bool basebool = pob.Equals("Base", StringComparison.OrdinalIgnoreCase);
for (int i = 0; i < vals.Length; ++i)
{
if (dts[i] >= std && dts[i] < edd + 1)
{
indicator = 1;
if (peakbool && ispeak(dts[i]))
{
accsum += vals[i];
++acccounter;
}
else if (offpeakbool && (!ispeak(dts[i])))
{
accsum += vals[i];
++acccounter;
}
else if (basebool)
{
accsum += vals[i];
++acccounter;
}
}
else if (indicator == 1)
{
break;
}
}
if (sumavg == 0)
{
return accsum / acccounter;
}
else if (sumavg == 2)
{
return acccounter;
}
else
{
return accsum;
}
}
This is much faster (I m guessing mainly because of the exit of loop when period ended) but oviously less succinct.
My questions:
Is there a way to stop iterations in the f# Seq module for sorted series?
Is there another way to speed up the f# version?
can somebody think of an even better way of doing this?
Thanks a lot!
Update: Speed comparison
I set up a test array with hourly dates from 1/1/13-31/12/15 (roughly 30,000 rows) and corresponding values. I made 150 calls spread out over the date array and repeated this 100 times - 15000 function calls:
My csharp implementation above (with string.compare outside of loop)
1.36 secs
Matthews recursion fsharp
1.55 secs
Tomas array fsharp
1m40secs
My original fsharp
2m20secs
Obviously this is always subjective to my machine but gives an idea and people asked for it...
I also think one should keep in mind this doesnt mean recursion or for loops are always faster than array.map etc, just in this case it does a lot of unnecessary iterations as it doesnt have the early exit from iterations that the c# and the f# recursion method have
Using Array instead of List and Seq makes this about 3-4 times faster. You do not need to generate a list of indices and then map over that to lookup items in the two arrays - instead you can use Array.zip to combine the two arrays into a single one and then use Array.filter.
In general, if you want performance, then using array as your data structure will make sense (unless you have a long pipeline of things). Functions like Array.zip and Array.map can calculate the entire array size, allocate it and then do efficient imperative operation (while still looking functional from the outside).
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
let newd =
Array.zip data data2
|> Array.filter (fun (date, _) ->
let dateInRange = isbetween date std edd
match pob with
| "Peak" -> ispeak date && dateInRange
| "Offpeak" -> not(ispeak date) && dateInRange
| _ -> dateInRange)
match sac with
| 0 -> newd |> Array.averageBy (fun (_, value) -> value)
| 2 -> newd |> Array.sumBy (fun (_, value) -> 1.0)
| _ -> newd |> Array.sumBy (fun (_, value) -> value)
I also changed isbetween - it can be simplified into just an expression and you can mark it inline, but that does not add that much:
let inline isbetween r std edd = r >= std && r < edd+1.
Just for completeness, I tested this with the following code (using F# Interactive):
#time
let d1 = Array.init 1000000 float
let d2 = Array.init 1000000 float
aggrF d1 d2 0.0 1000000.0 "Test" 0
The original version was about ~600ms and the new version using arrays takes between 160ms and 200ms. The version by Matthew takes about ~520ms.
Aside, I spent the last two months at BlueMountain Capital working on a time series/data frame library for F# that would make this a lot simpler. It is work in progress and also the name of the library will change, but you can find it in BlueMountain GitHub. The code would look something like this (it uses the fact that the time series is ordered and uses slicing to get the relevant part before filtering):
let ts = Series(times, values)
ts.[std .. edd] |> Series.filter (fun k _ -> not (ispeak k)) |> Series.mean
Currently, this will not be as fast as direct array operations, but I'll look into that :-).
An immediate way to speed it up would be to combine these:
[0 .. (Array.length data) - 1]
|> List.map (fun i -> (data.[i], data2.[i]))
|> Seq.filter (fun (date, _) ->
into a single list comprehension, and also as the other matthew said, do a single string comparison:
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
let isValidTime = match pob with
| "Peak" -> (fun x -> ispeak x)
| "Offpeak" -> (fun x -> not(ispeak x))
| _ -> (fun _ -> true)
let data = [ for i in 0 .. (Array.length data) - 1 do
let (date, value) = (data.[i], data2.[i])
if isbetween date std edd && isValidTime date then
yield (date, value)
else
() ]
match sac with
| 0 -> data |> Seq.averageBy (fun (_, value) -> value)
| 2 -> data.Length
| _ -> data |> Seq.sumBy (fun (_, value) -> value)
Or use a tail recursive function:
let aggrF (data:float[]) (data2:float[]) std edd pob sac =
let isValidTime = match pob with
| "Peak" -> (fun x -> ispeak x)
| "Offpeak" -> (fun x -> not(ispeak x))
| _ -> (fun _ -> true)
let endDate = edd + 1.0
let rec aggr i sum count =
if i >= (Array.length data) || data.[i] >= endDate then
match sac with
| 0 -> sum / float(count)
| 2 -> float(count)
| _ -> float(sum)
else if data.[i] >= std && isValidTime data.[i] then
aggr (i + 1) (sum + data2.[i]) (count + 1)
else
aggr (i + 1) sum count
aggr 0 0.0 0
I want to use Stanford Parser to create a .conll file for further processing.
So far I managed to parse the test sentence with the command:
stanford-parser-full-2013-06-20/lexparser.sh stanford-parser-full-2013-06-20/data/testsent.txt > output.txt
Instead of a txt file I would like to have a file in .conll. I'm pretty sure it is possible, at it is mentioned in the documentation (see here). Can I somehow modify my command or will I have to write Javacode?
Thanks for help!
If you're looking for dependencies printed out in CoNLL X (CoNLL 2006) format, try this from the command line:
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz stanford-parser-full-2013-06-20/data/testsent.txt >testsent.tree
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile testsent.tree -conllx
Here's the output for the first test sentence:
1 Scores _ NNS NNS _ 4 nsubj _ _
2 of _ IN IN _ 0 erased _ _
3 properties _ NNS NNS _ 1 prep_of _ _
4 are _ VBP VBP _ 0 root _ _
5 under _ IN IN _ 0 erased _ _
6 extreme _ JJ JJ _ 8 amod _ _
7 fire _ NN NN _ 8 nn _ _
8 threat _ NN NN _ 4 prep_under _ _
9 as _ IN IN _ 13 mark _ _
10 a _ DT DT _ 12 det _ _
11 huge _ JJ JJ _ 12 amod _ _
12 blaze _ NN NN _ 15 xsubj _ _
13 continues _ VBZ VBZ _ 4 advcl _ _
14 to _ TO TO _ 15 aux _ _
15 advance _ VB VB _ 13 xcomp _ _
16 through _ IN IN _ 0 erased _ _
17 Sydney _ NNP NNP _ 20 poss _ _
18 's _ POS POS _ 0 erased _ _
19 north-western _ JJ JJ _ 20 amod _ _
20 suburbs _ NNS NNS _ 15 prep_through _ _
21 . _ . . _ 4 punct _ _
I'm not sure you can do this through command line, but this is a java version:
for (List<HasWord> sentence : new DocumentPreprocessor(new StringReader(filename))) {
Tree parse = lp.apply(sentence);
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
GrammaticalStructure.printDependencies(gs, gs.typedDependencies(), parse, true, false);
}
There is a conll2007 output, see the TreePrint documentation for all options.
Here is an example using the 3.8 version of the Stanford parser. It assumes an input file of one sentence per line, output in Stanford Dependencies (not Universal Dependencies), no propagation/collapsing, keep punctuation, and output in conll2007:
java -Xmx4g -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -outputFormat conll2007 -originalDependencies -outputFormatOptions "basicDependencies,includePunctuationDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz input.txt
I was reading up on functional languages and i wondered how i would implement 'tries' in a pure functional language. So i decided to try to do it in F#
But i couldnt get half of the basics. I couldnt figure out how to use a random number, how to use return/continue (at first i thought i was doing a multi statement if wrong but it seems like i was doing it right) and i couldnt figure out how to print a number in F# so i did it in the C# way.
Harder problems is the out param in tryparse and i still unsure how i'll do implement tries without using a mutable variable. Maybe some of you guys can tell me how i might correctly implement this
C# code i had to do last week
using System;
namespace CS_Test
{
class Program
{
static void Main(string[] args)
{
var tries = 0;
var answer = new Random().Next(1, 100);
Console.WriteLine("Guess the number between 1 and 100");
while (true)
{
var v = Console.ReadLine();
if (v == "q")
{
Console.WriteLine("you have quit");
return;
}
int n;
var b = Int32.TryParse(v, out n);
if (b == false)
{
Console.WriteLine("This is not a number");
continue;
}
tries++;
if (n == answer)
{
Console.WriteLine("Correct! You win!");
break;
}
else if (n < answer)
Console.WriteLine("Guess higher");
else if (n > answer)
Console.WriteLine("Guess lower");
}
Console.WriteLine("You guess {0} times", tries);
Console.WriteLine("Press enter to exist");
Console.ReadLine();
}
}
}
The very broken and wrong F# code
open System;
let main() =
let tries = 0;
let answer = (new Random()).Next 1, 100
printfn "Guess the number between 1 and 100"
let dummyWhileTrue() =
let v = Console.ReadLine()
if v = "q" then
printfn ("you have quit")
//return
printfn "blah"
//let b = Int32.TryParse(v, out n)
let b = true;
let n = 3
if b = false then
printfn ("This is not a number")
//continue;
//tries++
(*
if n = answer then
printfn ("Correct! You win!")
//break;
elif n < answer then
printfn ("Guess higher")
elif n>answer then
printfn ("Guess lower")
*)
dummyWhileTrue()
(Console.WriteLine("You guess {0} times", tries))
printfn ("Press enter to exist")
Console.ReadLine()
main()
Welcome to F#!
Here's a working program; explanation follows below.
open System
let main() =
let answer = (new Random()).Next(1, 100)
printfn "Guess the number between 1 and 100"
let rec dummyWhileTrue(tries) =
let v = Console.ReadLine()
if v = "q" then
printfn "you have quit"
0
else
printfn "blah"
let mutable n = 0
let b = Int32.TryParse(v, &n)
if b = false then
printfn "This is not a number"
dummyWhileTrue(tries)
elif n = answer then
printfn "Correct! You win!"
tries
elif n < answer then
printfn "Guess higher"
dummyWhileTrue(tries+1)
else // n>answer
printfn "Guess lower"
dummyWhileTrue(tries+1)
let tries = dummyWhileTrue(1)
printfn "You guess %d times" tries
printfn "Press enter to exit"
Console.ReadLine() |> ignore
main()
A number of things...
If you're calling methods with multiple arguments (like Random.Next), use parens around the args (.Next(1,100)).
You seemed to be working on a recursive function (dummyWhileTrue) rather than a while loop; a while loop would work too, but I kept it your way. Note that there is no break or continue in F#, so you have to be a little more structured with the if stuff inside there.
I changed your Console.WriteLine to a printfn to show off how to call it with an argument.
I showed the way to call TryParse that is most like C#. Declare your variable first (make it mutable, since TryParse will be writing to that location), and then use &n as the argument (in this context, &n is like ref n or out n in C#). Alternatively, in F# you can do like so:
let b, n = Int32.TryParse(v)
where F# lets you omit trailing-out-parameters and instead returns their value at the end of a tuple; this is just a syntactic convenience.
Console.ReadLine returns a string, which you don't care about at the end of the program, so pipe it to the ignore function to discard the value (and get rid of the warning about the unused string value).
Here's my take, just for the fun:
open System
let main() =
let answer = (new Random()).Next(1, 100)
printfn "Guess the number between 1 and 100"
let rec TryLoop(tries) =
let doneWith(t) = t
let notDoneWith(s, t) = printfn s; TryLoop(t)
match Console.ReadLine() with
| "q" -> doneWith 0
| s ->
match Int32.TryParse(s) with
| true, v when v = answer -> doneWith(tries)
| true, v when v < answer -> notDoneWith("Guess higher", tries + 1)
| true, v when v > answer -> notDoneWith("Guess lower", tries + 1)
| _ -> notDoneWith("This is not a number", tries)
match TryLoop(1) with
| 0 -> printfn "You quit, loser!"
| tries -> printfn "Correct! You win!\nYou guessed %d times" tries
printfn "Hit enter to exit"
Console.ReadLine() |> ignore
main()
Things to note:
Pattern matching is prettier, more concise, and - I believe - more idiomatic than nested ifs
Used the tuple-return-style TryParse suggested by Brian
Renamed dummyWhileTrue to TryLoop, seemed more descriptive
Created two inner functions doneWith and notDoneWith, (for purely aesthetic reasons)
I lifted the main pattern match from Evaluate in #Huusom's solution but opted for a recursive loop and accumulator instead of #Hussom's (very cool) discriminate union and application of Seq.unfold for a very compact solution.
open System
let guessLoop answer =
let rec loop tries =
let guess = Console.ReadLine()
match Int32.TryParse(guess) with
| true, v when v < answer -> printfn "Guess higher." ; loop (tries+1)
| true, v when v > answer -> printfn "Guess lower." ; loop (tries+1)
| true, v -> printfn "You won." ; tries+1
| false, _ when guess = "q" -> printfn "You quit." ; tries
| false, _ -> printfn "Not a number." ; loop tries
loop 0
let main() =
printfn "Guess a number between 1 and 100."
printfn "You guessed %i times" (guessLoop ((Random()).Next(1, 100)))
Also for the fun of if:
open System
type Result =
| Match
| Higher
| Lower
| Quit
| NaN
let Evaluate answer guess =
match Int32.TryParse(guess) with
| true, v when v < answer -> Higher
| true, v when v > answer -> Lower
| true, v -> Match
| false, _ when guess = "q" -> Quit
| false, _ -> NaN
let Ask answer =
match Evaluate answer (Console.ReadLine()) with
| Match ->
printfn "You won."
None
| Higher ->
printfn "Guess higher."
Some (Higher, answer)
| Lower ->
printfn "Guess lower."
Some (Lower, answer)
| Quit ->
printfn "You quit."
None
| NaN ->
printfn "This is not a number."
Some (NaN, answer)
let main () =
printfn "Guess a number between 1 and 100."
let guesses = Seq.unfold Ask ((Random()).Next(1, 100))
printfn "You guessed %i times" (Seq.length guesses)
let _ = main()
I use an enumeration for state and Seq.unfold over input to find the result.