I am trying to understand how to add entity classes to the Named Entity Recognizer. The example code has a structure that looks like this:
ner = EntityRecognizer(nlp.vocab, entity_types=[... ENTITIES ...])
for itn in range(NUMBER_OF_ITERATIONS):
    for raw_text, entities in training_examples:
        # ... some data handling ...
        ner.update(doc, gold)
However, the next example (for BILUO tags) calls ner.update() only once, i.e. there are no for-loops that make update() see the training data multiple times.
I have read this question, whose answers seem to tell me I should call update() more than once for each training example; but then again, they could just be following the examples.
Because of the following sentence (from the end of the documentation page)...
The costs are then used to calculate the gradient of the loss, to train the model.
... I am guessing that the answer to my question is "yes, I should train it by iterating 'several' times through the training data"; but if that is the case, does anyone have a suggestion on how many times is "enough"? (The example code uses 5, but if I think that is too few, can I end up iterating "too many" times? I.e., does it "overfit"?)
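To make concrete what I mean by iterating "several" times, here is a rough sketch of the loop I have in mind, reusing the structure from the example code. make_doc_and_gold, dev_examples, and evaluate are just placeholders for whatever data handling and scoring one uses, not actual spaCy API:

import random

best_score = 0.0
for itn in range(NUMBER_OF_ITERATIONS):
    random.shuffle(training_examples)            # reshuffle the data each pass
    for raw_text, entities in training_examples:
        doc, gold = make_doc_and_gold(raw_text, entities)   # placeholder data handling
        ner.update(doc, gold)
    score = evaluate(ner, dev_examples)          # score on held-out data (placeholder)
    if score <= best_score:
        break                                    # stop when the dev score stops improving
    best_score = score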
I have been struggling with parallel and async constructs in F# for the last couple of days and am not sure where to go at this point. I have been programming with F# for about 4 months - certainly no expert - and I currently have a series of calculations that are implemented in F# (ASP.NET 4.5) and work correctly when executed sequentially. I am running the calculations on a multi-core server, and since there are millions of inputs to perform the same calculation on, I am hoping to take advantage of parallelism to speed it up.
The calculations are extremely data parallel - basically the exact same calculation on different input data. I have tried a number of different avenues and I continually run into the same issue - it seems as if the parallel looping never gets to the end of the input data set. I have tried TPL, ConcurrentQueues, and Array.Parallel.map/iter, all with the same result: the program starts out fine and then somewhere in the middle (indeterminate) it just hangs and never completes. For simplicity I actually removed the calculation from the program and am just calling a print method. Here is where the code currently stands:
let runParallel =
    let ids = query { for c in db.CustTable do select c.id } |> Seq.take 5
    let customerInputArray = getAllObservations ids
    Array.Parallel.iter (fun c -> testParallel c) customerInputArray
    let key = System.Console.ReadKey()
    0
A few points...
I limited the results above to only 5 just for debugging. The actual program does not apply the Seq.take 5.
The testParallel method is just a printfn "test".
The customerInputArray is a complex data type. It is a tuple of lists that contain records. So I am pretty sure my problem must be there... but I added exception handling and no exception is getting raised, so I have no idea how to go about finding the problem.
Any help is appreciated. Thanks in advance.
EDIT: Thanks for the advice... I think it is definitely a deadlock. When I remove all of the printfn, sprintfn, and string concatenation operations, it completes. (Of course, I need those things in there.)
Are printfn, sprintfn, and string operations not thread-safe?
Another EDIT: Iteration always stops on the last item. So if my input array has 15 items, the processing stops on item 14, or seems to never get to item 15, and then everything just hangs. It does not matter what the size of the input array is. Any ideas what could be causing this? I even switched over to Parallel.ForEach (instead of Array.Parallel) and saw the same behavior.
Update on the situation and how I resolved this issue.
I was unable to upload code from my example due to my company's firewall policy, so in the end my question did not have enough details. I failed to mention that I was using a type provider, which was important information in this situation. But here is what I figured out.
I am using the F# type provider for SQL Server and was passing around its ServiceTypes, which I suspect are not thread-safe. When I replaced the ServiceTypes with plain old F# records, the code worked fine: no more deadlocks, and everything completed without error.
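For anyone who hits the same thing, the fix looked roughly like the sketch below. I cannot post the real code, so CustomerObservation, the field names, and getProviderRows are made-up placeholders; the point is only that the provider rows get copied into plain records before any parallel work starts:

// Plain F# record standing in for the provider-generated service type
// (type and field names are placeholders, not my real schema).
type CustomerObservation =
    { Id      : int
      Name    : string
      Balance : decimal }

let customerInputArray =
    ids
    |> getProviderRows                       // still sequential; provider types stay here
    |> Seq.map (fun row ->
        { Id = row.Id; Name = row.Name; Balance = row.Balance })
    |> Seq.toArray

// Only plain, immutable records cross thread boundaries now.
Array.Parallel.iter testParallel customerInputArray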
I am currently trying to do stream reasoning using Jena, so I want to be able to reason over a certain set of triples that have occurred in a particular window of time, also taking into account some background static knowledge.
My problem is that I have an ontology that I read from several files; however, I want the triples I obtain to carry time stamps recording when I receive them, which I thought I could do by applying labels to the triples (I am just giving them all random time stamps for the moment, as this is only a test).
While I didn't think that this would be a problem, I am struggling at the initial step of just applying a label to an existing triple and selecting it. I cannot seem to access triples from the OntModel without transforming it into a Graph, and while I could then create quads with the extra value being some literal for time, I can't find a way to then reason over this graph.
Any light that people can shed on this issue would help. I hope I am being clear.
I'm not sure exactly how you're putting labels on your triples, but you can get Statements from an OntModel, and Statement implements FrontsTriple through which you can access a corresponding Triple.
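For example, something along these lines, assuming you have an OntModel in a variable called ontModel (depending on your Jena version the package prefix may be com.hp.hpl.jena rather than org.apache.jena, and timestampFor is just a placeholder for however you produce your timestamps):

import org.apache.jena.graph.Triple;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import java.util.HashMap;
import java.util.Map;

// Side table: triple -> arrival time, kept outside the model itself.
Map<Triple, Long> timestamps = new HashMap<>();

StmtIterator it = ontModel.listStatements();
while (it.hasNext()) {
    Statement stmt = it.nextStatement();
    Triple triple = stmt.asTriple();             // via FrontsTriple
    timestamps.put(triple, timestampFor(stmt));  // placeholder timestamp source
}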
given:
-record(foo, {a, b, c}).
I do something like this:
Thing = #foo{a={1,2}, b={3,4}, c={5,6}},
Thing1 = Thing#foo{a={7,8}}.
From a semantic view, Thing and Thing1 are unique entities. However, from a language implementation standpoint, making a full copy of Thing to generate Thing1 would be intensely wasteful. For example, if the record were a megabyte in size and I made a thousand "copies," each modifying a couple of bytes, I've just burned a gigabyte. If the internal structure kept track of a representation of the parent structure and each derivative marked up that parent in a way that indicated its own change but preserved everyone else's versions, the derivatives could be created with a minimum of memory overhead.
My question is this: is Erlang doing anything clever internally to keep the overhead of the usual Erlang scribble...
Thing = #ridiculously_large_record,
Thing1 = make_modified_copy(Thing),
Thing2 = make_modified_copy(Thing1),
Thing3 = make_modified_copy(Thing2),
Thing4 = make_modified_copy(Thing3),
Thing5 = make_modified_copy(Thing4)
...to a minimum?
I ask because there would be a number of changes to the way that I did cross-process communications if this were the case.
The exact workings of the garbage collection and memory allocation are only known to a few. Thankfully, they are very happy to share their knowledge, and the following is based on what I have learnt from the erlang-questions mailing list and from discussions with OTP developers.
When messaging between processes, the content is always copied as there is no shared heap between processes. The only exception is binaries bigger than 64 bytes, where only a reference is copied.
When executing code within one process, only the parts of a structure that change are newly created. Let's analyze tuples, as that is the example you provided.
A tuple is actually a structure that keeps references to the actual data somewhere on the heap (except for small integers and maybe one more data type which I can't remember). When you update a tuple, using for example setelement/3, a new tuple is created with the given element replaced, however for all other elements only the reference is copied. There is one exception which I have never been able to take advantage of.
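A throwaway shell example (not from the question) of what that sharing looks like:

T0 = {a, [1,2,3], {x, y}},
T1 = setelement(1, T0, b).
%% T1 is a brand-new 3-tuple, but the list [1,2,3] and the tuple {x,y} are not
%% copied: T1's second and third slots point at the same heap data as T0's.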
The garbage collector keeps track of each tuple and understands when it is safe to reclaim any tuple that is no longer used. It might be that the data referenced by the tuple is still in use, in which case the data itself is not collected.
As always, Erlang gives you some tools to understand exactly what is going on. The Efficiency Guide details how to use erts_debug:size/1 and erts_debug:flat_size/1 to understand the size of a data structure when used internally in a process and when copied. The trace tools also allow you to understand when, what, and how much was garbage collected.
The record foo is of arity four (holding four words), but the whole structure is 14 words in size. Any immediate (pids, ports, small integers, atoms, catch and nil) can be stored directly in the tuple's array. Any other term which can't fit into a word, such as another tuple, is not stored directly but is referenced by a boxed pointer (a boxed pointer is an Erlang term with a forwarding address to the real eterm ... just internals).
In your case, a new tuple of the same arity is created, and the atom foo and all the pointers are copied from the previous tuple, except for index two, a, which points to the new tuple {7,8}, which constitutes 3 words. In all, 5 + 3 new words are created on the heap and only 3 words are copied from the old tuple; the other 9 words are not touched.
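You can check those numbers with erts_debug:flat_size/1 and erts_debug:size/1; a rough sketch from the shell (sizes are in words, and the comments show what I would expect to see):

rd(foo, {a, b, c}).                  %% shell command to define the record
Thing = #foo{a = {1,2}, b = {3,4}, c = {5,6}}.
Thing1 = Thing#foo{a = {7,8}}.
erts_debug:flat_size(Thing).         %% 14 words, as described above
erts_debug:flat_size(Thing1).        %% also 14 when measured flat
erts_debug:size({Thing, Thing1}).    %% noticeably less than 3 + 14 + 14, because
                                     %% the {3,4} and {5,6} tuples are shared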
Excessively large tuples are not recommended. When updating a tuple, the whole tuple, i.e. the array and not the deep content, needs to be copied and then updated in order to preserve a persistent data structure. This also generates more garbage, forcing the garbage collector to heat up, which also hurts performance. The dict and array modules avoid using large tuples for this reason and use a shallow tuple tree instead.
I can definitely verify what people have already pointed out:
a record is just a tuple with the record name as the first element and all the fields as the following tuple elements
when an element of a tuple is changed (updating a field in a record, in your case), only the top-level tuple is new; all the elements are just reused
This works just because we have immutable data. So in your example, each time you update a value in a #foo record, none of the data in the elements is copied; only a new 4-element tuple (5 words) is created. Erlang never does a deep copy in this type of operation or when passing arguments in function calls.
In conclusion:
Thing = #foo{a={1,2}, b={3,4}, c={5,6}},
Thing1 = Thing#foo{a={7,8}}.
Here, if Thing is not used again, it will probably be updated in place and copying of the tuple will be avoided, as the Efficiency Guide says. (Tuple and record syntax is compiled into something like setelement, I think.)
Thing = #ridiculously_large_record,
Thing1 = make_modified_copy(Thing),
Thing2 = make_modified_copy(Thing1),
...
Here the tuples are actually copied every time.
I guess it would be theoretically possible to make an interesting optimization here. If the compiler could perform escape analysis on the return value of make_modified_copy and detect that the only reference to it is the one returned, it could save this information about the function. When it encounters a call to that function, it would know that it is safe to modify the return value in place.
This would only be possible for intra-module calls, though, because of the code-replacement feature.
Maybe one day we will have it.
Which is the more efficient way?
FUserRecords[I].CurrentInput:=FUserRecords[I].CurrentInput+typedWords;
or
var userRec: TUserRec;
...
userRec:=FUserRecords[I];
userRec.CurrentInput:=userRec.CurrentInput+typedWords;
FUserRecords[I]:=userRec;
In the case you have described, the first example will be the more efficient.
Records are value types, so when you do this line:
userRec:=FUserRecords[I];
You are in fact copying the contents of the record in the array into the local variable. And when you do the reverse, you are copying the information again. If you are iterating over the array, this can end up being quite slow if your array and the records are quite large.
If you want to go down the second path, to speed things up you can manipulate the record in the array directly using a pointer to the record, like this:
type
  TUserRec = record
    CurrentInput: string;
  end;
  PUserRec = ^TUserRec;
var
  LUserRec: PUserRec;
...
LUserRec := @FUserRecords[I];
LUserRec^.CurrentInput := LUserRec^.CurrentInput + typedWords;
(As discussed in the comments, the carets (^) are optional; I only added them here for completeness.)
Of course I should point out that you really only need to do this if there is in fact a performance problem. It is always good to profile your application before you hand code these sorts of optimisations.
EDIT: One caveat to all the discussion on the performance issue you are looking at: if most of the data in your records are strings, then most of any performance loss in the example you have shown will be in the string concatenation, not in the record manipulation.
Also, the strings are basically stored in the record as pointers, so the record will actually be the size of any integers etc., plus the size of the pointers for the strings. So when you concatenate to the string, it will not increase the record size. You can basically treat the string manipulation part of your code as a separate issue from the record manipulation.
You have mentioned in your comments below that this is for storing input from multiple keyboards; I can't imagine that you will have more than 5-10 items in the array. At that size, taking these steps to optimize the code is not really going to boost the speed that much.
I believe that both your first example and the pointer-using code above will end up with approximately the same performance. You should use whichever is easiest for you to read and understand.
Faster and more readable:
with FUserRecords[I] do
CurrentInput := CurrentInput + typedWords;