spaCy: optimizing tokenization - machine-learning

I'm currently trying to tokenize a text file where each line is the body text of a tweet:
"According to data reported to FINRA, short volume percent for $SALT clocked in at 39.19% on 12-29-17 http://www.volumebot.com/?s=SALT"
"#Good2go #krueb The chart I posted definitely supports ng going lower. Gobstopper' 2.12, might even be conservative."
"#Crypt0Fortune Its not dumping as bad as it used to...."
"$XVG.X LOL. Someone just triggered a cascade of stop-loss orders and scooped up morons' coins. Oldest trick in the stock trader's book."
The file is 59,397 lines long (a day's worth of data) and I'm using spaCy for pre-processing/tokenization. It's currently taking me around 8.5 minutes and I was wondering if there were any way of optimising the following code to be quicker as 8.5 minutes seems awfully long for this process:
def token_loop(path):
store = []
files = [f for f in listdir(path) if isfile(join(path, f))]
start_time = time.monotonic()
for filename in files:
with open("./data/"+filename) as f:
for line in f:
tokens = nlp(line.lower())
tokens = [token.lemma_ for token in tokens if not token.orth_.isspace() and token.is_alpha and not token.is_stop and len(token.orth_) != 1]
store.append(tokens)
end_time = time.monotonic()
print("Time taken to tokenize:",timedelta(seconds=end_time - start_time))
return store
Although it says files, it's currently only looping over 1 file.
Just to note, I only need this to tokenize the content; I don't need any extra tagging etc.

It sounds like you haven't optimised the pipeline yet. You'll get a significant speed up from disabling the pipeline components you don't need, like so:
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
This should get you down to about the two-minute mark, or better, on its own.
If you need a further speed up, you can look at multi-threading using nlp.pipe. Docs for multi-threading are here:
https://spacy.io/usage/processing-pipelines#section-multithreading

You can use nlp.pipe(all_lines) instead of nlp(line) for a faster processing
see Spacy's documentation - https://spacy.io/usage/processing-pipelines

Related

`knitr_out, `file_out` and `vis_drake_graph` usage in R:drake

I'm trying to understand how to use knitr_out, file_out and vis_drake_graph properly in drake.
I have two questions.
Q1: Usage of knitr_out and file_out to create markdown reports
While a code like this works correctly for one of my smaller projects:
make_hyp_data_aggregated_report <- function() {
render(
input = knitr_in("rmd/hyptest-is-data-being-aggregated.Rmd"),
output_file = file_out("~/projectname/reports/01-hyp-test.html"),
quiet = TRUE
)
}
plan <- drake_plan(
...
...
hyp_data_aggregated_report = make_hyp_data_aggregated_report()
...
...
)
Exactly similar code in my large project (with ~10+ reports) doesn't work exactly right. i.e., while the reports get built, the knitr_in objects don't get displayed as the blue squares in the graph using drake::vis_drake_graph() in my large project.
Both projects use the drake::loadd(....) within the markdown to get the objects from cache.
Is there some code in vis_drake_graph that removes these squares once the graph gets busy?
Q2: file_out objects in vis_drake_graph
Is there a way to display the file_out objects themselves as circles/squares in vis_drake_graph?
Q3: packages showing up in vis_drake_graph
Is there a way to avoid vis_drake_graph from printing the packages explicitly? (Basically anything with the ::)
Q1
Every literal file path needs its own knitr_in() or file_out(). If you have one function with one knitr_in(), even if you use the function multiple times, that still only counts as one file path. I recommend writing these keywords at the plan level, e.g.
plan <- drake_plan(
r1 = render(knitr_in("report1.Rmd"), output_file = file_out("report1.html")),
r2 = render(knitr_in("report2.Rmd"), output_file = file_out("report2.html")),
r3 = render(knitr_in("report3.Rmd"), output_file = file_out("report3.html"))
)
Q2
They should appear unless you set show_output_files = FALSE in vis_drake_graph().
Q3
No, but if it's any consolation, I do regret the decision to track namespaced functions and objects at all in drake. drake's approach is fundamentally suboptimal for tracking packages, and I plan to get rid of it if there ever comes time for a round of breaking changes. Otherwise, there is no way to get rid of it except vis_drake_graph(targets_only = TRUE), which also gets rid of all the imports in the graph.

Lua: Working with the Modbus TCP/IP Protocol

This question is based off a previous question I asked concerning a similar topic: Lua: Working with Bit32 Library to Change States of I/O's . I'm trying to use a Lua program that, when a PLC changes the state of a coil at a given address (only two addresses will be used) then it triggers a reaction in another piece of equipment. I have some code that is basically the exact same as my previous topic. But this has to do with what this code is actually doing and not so much the bit32 library. Usually I run code I don't in understand in my Linux IDE and slowly make changes until I finally can make sense of it. But this is producing some weird reactions that I can't make sense of.
Code example:
local unitId = 1
local funcCodes = {
readCoil = 1,
readInput = 2,
readHoldingReg = 3,
readInputReg = 4,
writeCoil = 5,
presetSingleReg = 6,
writeMultipleCoils = 15,
presetMultipleReg = 16
}
local function toTwoByte(value)
return string.char(value / 255, value % 255)
end
local coil = 1
local function readCoil(s, coil)
local req = toTwoByte(0) .. toTwoByte(0) .. toTwoByte(6) .. string.char(unitId, funcCodes.readCoil) .. toTwoByte(coil - 1) .. toTwoByte(1)
s:write(req) --(s is the address of the I/O module)
local res = s:read(10)
return res:byte(10) == 1 -- returns true or false if the 10th bit is ==1 I think??? Please confirm
end
The line that sets local req is the part I'm truly not making sense of. Because of my earlier post, I understand fully about the toTwoByte function and was quickly refreshed on bits & byte manipulation (truly excellent by the way). But that particular string is the reason for this confusion. If I run this in the demo at lua.org I get back an error "lua number has no integer representation". If I separate it into the following I am given back ascii characters that represent those numbers (which I know string.char returns the ascii representation of a given digit). If I run this in my Linux IDE, it displays a bunch of boxes, each containing four digits; two on top of the other two. Now it is very hard to distinguish all of the boxes and their content as they are overlapping.
I know that there is a modbus library that I may be able to use. But I would much rather prefer to understand this as I'm fairly new to programming in general.
Why do I receive different returned results from Windows vs Linux?
What would that string "local req" look like when built at this point to the I/O module. And I don't understand how this req variable translates into the proper string that contains all of the information used to read/write to a given coil or register.
If anyone needs better examples or has further questions that I need to answer, please let me know.
Cheers!
ETA: This is with the Modbus TCP/IP Protocol, not RTU. Sorry.

Reading and parsing a large .dat file

I am trying to parse a huge .dat file (4gb). I have tried with R but it just takes too long. Is there a way to parse a .dat file by segments, for example every 30000 lines? Any other solutions would also be welcomed.
This is what it looks like:
These are the first two lines with header:
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167| <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
This is an option to read data faster in R by using the fread function in the data.table package.
EDIT
I removed all <br/> new-line tags. This is the edited dataset
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167|
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
Then I matched variables with classes. You should use nrows ~ 100.
colclasses = sapply(read.table(edited_data, nrows=1, sep="|", header=T),class)
Then I read the edited data.
your_data <- fread(edited_data, sep="|", sep2=NULL, nrows=-1L, header=T, na.strings="NA",
stringsAsFactors=FALSE, verbose=FALSE, autostart=30L, skip=-1L, select=NULL,
colClasses=colclasses)
Everything worked like a charm. In case you have problems removing the tags, use this simple Python script (it will take some time for sure):
original_file = file_path_to_original_file # e.g. "/Users/User/file.dat"
edited_file = file_path_to_new_file # e.g. "/Users/User/file_edited.dat"
with open(original_file) as inp:
with open(edited_file, "w") as op:
for line in inp:
op.write(line.replace("<br/>", "")
P.S.
You can use read.table with similar optimizations, but it won't give you nearly as much speed.

Why is this RegExp taking 16 minutes to process on Rails?

I've written a function to remove email addresses from my data using gsub. The code is below. The problem is that it takes a total of 27 minutes to execute the function on a set of 10,000 records. (16 minutes for the first pattern, 11 minutes for the second). Elsewhere in the code I process about 20 other RegExp's using a similar flow (iterating through data.each) and they all finish in less than a second. (BTW, I recognize that my RegExp's aren't perfect and may catch some strings that aren't email addresses.)
Is there something about these two RegExp's that is causing the processing time to be so high? I've tried it on seven different data sources all with the same result, so the problem isn't peculiar to my data set.
def remove_email_addresses!(data)
email_patterns = [
/[[:graph:]]+#[[:graph:]]+/i,
/[[:graph:]]+ +at +[^ ][ [[:graph:]]]{0,40} +dot +com/i
]
data.each do |row|
email_patterns.each do |pattern|
row[:title].gsub!(pattern,"") unless row[:title].blank?
row[:description].gsub!(pattern,"") unless row[:description].blank?
end
end
end
Check that your faster code isn't just doing var =~ /blah/ matching, rather than replacement: that is several orders of magnitude faster.
In addition to reducing backtracking and replacing + and * with ranges for safety, as follows...
email_patterns = [
/\b[-_.\w]{1,128}#[-_.\w]{1,128}/i,
/\b[-_.\w]{1,128} {1,10}at {1,10}[^ ][-_.\w ]{0,40} {1,10}dot {1,10}com/i
]
... you could also try "unrolling your loop", though this is unlikely to cause any issues unless there is some kind of interaction between the iterators (which there shouldn't be, but...). That is:
data.each do |row|
row[:title].gsub!(patterns[0],"") unless row[:title].blank?
row[:description].gsub!(patterns[0],"") unless row[:description].blank?
row[:title].gsub!(patterns[1],"") unless row[:title].blank?
row[:description].gsub!(patterns[1],"") unless row[:description].blank?
end
Finally, if this causes little to no speedup, consider profiling with something like ruby-prof to find out whether the regexes themselves are the issue, or whether there's a problem in the do iterator or the unless clauses instead.
Could it be that the data is large enough that it causes issues with paging once read in? If so, might it be faster to read the data in and parse it in chunks of N entries, rather than process the whole lot at once?

Erlang: What is most-wrong with this trie implementation?

Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.
At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.
This is the trie:add implementation:
add([], Trie) ->
dict:store(stop, true, Trie);
add([Ch|Rest], Trie) ->
% setdefault(Key, Default, Dict) ->
% case dict:find(Key, Dict) of
% { ok, Val } -> { Dict, Val }
% error -> { dict:new(), Default }
% end.
{ NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
NewSubTrie = add(Rest, SubTrie),
dict:store(Ch, NewSubTrie, NewTrie).
And you can see the rest, along with an example of how it's used (at the bottom), here:
http://gist.github.com/263513
Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.
So, what am I doing most-wrong? And how might I make it a bit less-wrong?
You could implement this functionality by simply storing the words in an ets table:
% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).
> ets:insert(words, [{"zebra"}]).
% check if word exists
> ets:lookup(words, "zed").
[{"zed"}]
% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]
If trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.
If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).
I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.
Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.
An alternative way to solve the problem is going through the word list and see if the word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code. (optimizing and concurrency)
Look into DAWGs. They're much more compact than tries.
I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts.
http://www.erlang.org/doc/man/digraph.html
If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.

Resources