How to cluster sentences based on same word? - machine-learning

Here is my sample data :
maybe add a higher-level description
min of spare daemons
data in the appropriate order
the compiled max daemons
an iovec to store the trailer sent after the file
data in the wrong order
an iovec to store the headers sent before the file
return err maybe add a higher-level desc
if a user manually creates a data file
I want to conduct a cluster approach and automatically put these data into categories based
on same word appear in the sentence, so what I am trying to achieve is like this:
add
maybe add a higher-level description
return err maybe add a higher-level desc
damons
min of spare daemons
the compiled max daemons
iovec
an iovec to store the headers sent before the file
an iovec to store the trailer sent after the file
data
data in the wrong order
data in the appropriate order
if a user manually creates a data file
Could anyone give me some help? Thank you a lot!

Sounds as if you want to find the most frequent words?
Not really hard to do (and not "clustering", just counting and grouping by the frequent word), what have you tried, where are you stuck?

I think that what you're specifically looking for is a minimal covering. Each sentence could be "covered" by any of the words in it, and you want a set of words that will cover each sentence at least once, correct?
You can read specifically about this kind of problem at https://en.wikipedia.org/wiki/Set_cover_problem -- and in fact, to do so perfectly in NP Complete.
One way would be a simple greedy algorithm, looking for a word that covers the most sentences (most frequent word in the set, no double-counting sentences), then taking that group and moving on to whatever's left.
There are plenty of cases where this is far from optimal, especially if you want the group to be of similar size. You might actually want to throw out words that cover too many -- depending on the set, for instance, "program" could appear in many items without being particularly relevant to most of them.
In this case, it becomes a problem of finding what's relevant. Maybe it would make sense to have some sort of parameter, A, for the number of groups, and then you program can aim for words that each give roughly N/A sentences? Count word frequencies, look for that point (a frequency of N/A), and then slowly add words until everything is covered. And then at the end a post-phase in order to try to combine subsets of a common set in order to make it overall cleaner.

Related

replicating trees between ACID RDB using CRDT

I'm interested in replicating "hierachies" of data say similar to addresses.
Area
District
Sector
Unit
but you may have different pieces of data associated to each layer, so you may know the area of Sectors, but not of units, and you may know the population of a unit, basically its not a homogenious tree.
I know little about replication of data except brushing Brewers theorem/CAP, and some naive intuition about what eventual consistency is.
I'm looking for SIMPLE mechanisms to replicate this data from an ACID RDB, into other ACID RDBs, systemically the system needs to eventually converge, and obviously each RDB will enforce its own local consistent view, but any 2 nodes may not match at any given time (except 'eventually').
The simplest way to approach this is to simple store all the data in a single message from some designated leader and distribute it...like an overnight dump and load process, but thats too big.
So the next simplest thing (I thought) was if something inside an area changes, I can export the complete set of data inside an area, and load it into the nodes, thats still quite a coarse algorithm.
The next step was if, say an 'object' at any level changed, was to send all the data in the path to that 'object', i.e. if something in a sector is amended, you would send the data associated to the sector, its parent the district, and its parent the sector (with some sort of version stamp and lets say last update wins)....what i wanted to do was to ensure that any replication 'update' was guaranteed to succeed (so it needs the whole path, which potentially would be created if it didn't exist).
then i stumbled on CRDTs and thought....ah...I'm reinventing the wheel here, and the algorithms are allegedly easy in principle, but tricky to get correct in practice
are there standards accepted patterns to do this sort of thing?
In my use case the hierarchies are quite shallow, and there is only a single designated leader (at this time), I'm quite attracted to state based CRDTs because then I can ignore ordering.
Simplicity is the key requirement.
Actually it appears I've reinvented (in a very crude naive way) the SHELF algorithm.
I'll write some code and see if I can get it to work, and try to understand whats going on.

max-series-per-database limit exceeded clarification needed / how to calculate number of series in use

We recently started to encounter this error:
{"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=1"}
When writing metric data like this:
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1103,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
I know that Influx recommends you keep your series cardinality low, and our impression was that series cardinality would mean keeping each tag individually to a small number of values. e.g. we felt comfortable sending instance_id=1103 as a tag, because we know that there will never be more than 2000 distinct instance_id tag values.
But after running into this error... I'm afraid maybe I was mistaken here. Do we actually need to keep the cardinality of all possible combinations of all tags low? e.g. do these two things count as two separate series towards the 1,000,000 default max, because the instance_id is different?
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1111,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=2222,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
If those count as two separate series... then is there a better way to structure this data in Influx? 1,000,000 total seems like a tiny amount if each separate combination of tags is a separate series...
Does InfluxDB 2.x help with this?
Is there a better tool that can handle a large number of tags and not bump into limits like this?
There is no way to figure out what data was not recorded. Update the max-series-per-database configuration to be more than 1M in order to stop dropping data.
This can be an indication that you are creating a lot of series. i saw some documentation on why that isn't great.
Hope this helps!

Best way to store processed text data for streaming to gensim?

I've got several hundred pandas data frames, each of which has a column of very long strings that need to be processed/sentencized and finally tokenized before modeling with word2vec.
I can store them in any format on the disk, before I build a stream to pass them to gensim's word2vec function.
What format would be best, and why? The most important criterion would be performance vis-a-vis training (which will take many days), but coherent structure to the filesystem would also be nice.
Would it be crazy to store several million or maybe even a few billion text files containing one sentence each? Or perhaps some sort of database? If this was numerical data I'd use hdf5. But it's text. The cleanest would be to store them in the original data frames, but that seems less ideal from an i/o perspective, because I'd have to load each data frame (largish) every epoch.
What makes the most sense here?
As you do your preprocessing/tokenization of all the source data that you want to be part of a single training session, append the results to a single plain-text file.
Use space-separated words, and end each 'sentence' (or any other useful text-chunk that's less than 10,000 words long) with a newline.
Then you can use the corpus_file option for specifying your pre-tokenized training data, and will get the maximum possible multithreading benefit. (That mode will direct each thread to open its own view into a range of the single file, so there's no blocking on any distributor thread.)

Apache-camel Xpathbuilder performance

I have following question. I set up an camel -project to parse certain xml files. I have to selecting take out certain nodes from a file.
I have two files 246kb and 347kb in size. I am extracting a parent-child pair of 250 nodes in the above given example.
With the default factory here are the times. For the 246kb file respt 77secs and 106 secs. I wanted to improve the performance so switched to saxon and the times are as follows 47secs and 54secs. I was able to cut the time down by at least half.
Is it possible to cut the time further, any other factory or optimizations I can use will be appreciated.
I am using XpathBuilder to cut the xpaths out. here is an example. Is it possible to not to have to create XpathBuilder repeatedly, it seems like it has to be constructed for every xpath, I would have one instance and keep pumping the xpaths into it, maybe it will improve performance further.
return XPathBuilder.xpath(nodeXpath)
.saxon()
.namespace(Consts.XPATH_PREFIX, nameSpace)
.evaluate(exchange.getContext(), exchange.getIn().getBody(String.class), String.class);
Adding more details based on Michael's comments. So I am kind of joining them, will become clear with my example below. I am combining them into a json.
So here we go, Lets say we have following mappings for first and second path.
pData.tinf.rexd: bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:ReqdExctnDt/text()
pData.tinf.pIdentifi.instId://bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:CdtTrfTxInf[{1}]/bm:PmtId/bm:InstrId/text()
This would result in a json as below
pData:{
tinf: {
rexd: <value_from_xml>
}
pIdentifi:{
instId: <value_from_xml>
}
}
Hard to say without seeing your actual XPath expression, but given the file sizes and execution time my guess would be that you're doing a join which is being executed naively as a cartesian product, i.e. with O(n*m) performance. There is probably some way of reorganizing it to have logarithmic performance, but the devil is in the detail. Saxon-EE is quite good at optimizing join queries automatically; if not, there are often ways of doing it manually -- though XSLT gives you more options (e.g. using xsl:key or xsl:merge) than XPath does.
Actually I was able to bring the time down to 10 secs. I am using apache-camel. So I added threads there so that multiple files can be read in separate threads. Once the file was being read, it had serial operation to based on the length of the nodes that had to be traversed. I realized that it was not necessary to be serial here so introduced parrallelStream and that now gave it enough power. One thing to guard agains is not to have a proliferation of threads since that can degrade the performance. So I try to restrict the number of threads to twice or thrice the number of cores on the operating machine.

How to count occurrences of a substring within string fast with Ruby

I have a text file sized 300MB, I want to count the occurrences of each 10,000 substrings in the file. I want to know how to do it fast.
Now, I use the following code:
content = IO.read("path/to/mytextfile")
Word.each do |w|
w.occurrence = content.scan(w.name).size
w.save
end
Word is an ActiveRecord class.
It took me almost 1 day to finish the counting. Is there anyway to do it faster? Thanks.
Edit1:
Thank you again. I am running rails 2.3.9. The name filed of words table contains what I am searching for, and it contains only unique values. Instead of using Word.each, I use batch(1000 rows a time) load. It should help.
I rewrited the whole code with the idea from bpaulon. Now it only took a few hours to finish the counting.
I profiled the new version code, now the largest time costing methods are utf8 encode supported string truncating code
def truncate(n)
self.slice(/\A.{0,#{n}}/m)
end
and characters counting code
def utf8_length
self.unpack('U*').size
end
Any other faster methods to replace them?
Your use of scan creates an array, counts the size of it, then throws it away. If you have a lot of occurrences of the substring inside a big file, you will create a big array temporarily, potentially burning up CPU time with memory management, but that should still run pretty quickly, even with 300MB.
Because Word is an ActiveRecord class, it is dependent on the schema and any indexes in your database, plus any issues your database server might be having. If the database is not optimized or is responding slowly or the query used to retrieve the data is not efficient, then the iteration will be slow. You might find it a lot faster to grab groups of Word so they are in RAM, then iterate over them.
And, if the database and your code are running on the same machine, you could be suffering from resource constraints like having only one drive, not enough RAM, etc.
Without knowing more about your environment and hardware it's hard to say.
EDIT:
I can grab the substrings into an array/hash first, then add the count results to the array or hash, and write the results back to database after all the counting is done. You think it be faster, right?
No, I doubt that will help a lot, and, without knowing where the problem lies all you might do is make the problem worse because you'll have to load 10,000 records as objects from the database, then build a 10,000 element hash or array which will also be in memory along with the DB records, then write them out.
Ruby will only use a single core, currently, but you can gain speed by using Ruby 1.9+. I'd recommend installing RVM and letting it manage your Ruby. Be sure to read the instructions on that page, then run rvm notes and follow those directions.
What is your Word model and the underlying schema and indexes look like? Is the database on the same machine?
EDIT: From looking at your table schema, you have no indexes except for id which really won't help much for normal look-ups. I'd recommend presenting your schema on Stack Overflow's sibling site https://dba.stackexchange.com/ and explain what you want to do. At a minimum I'd add a key to the text fields to help avoid full table scans for any searches you do.
What might help more is to read: Retrieving Multiple Objects in Batches from "Active Record Query Interface".
Also, look at the SQL being emitted when your Word.each is running. Is it something like "select * from word"? If so, Rails is pulling in 10,000 records to iterate over them one by one. If it is something like "select * from word where id=1" then for every record you have a database read followed by a write when you update the count. That is the scenario that the "Retrieving Multiple Objects in Batches" link will help fix.
Also, I am guessing that content is the text you are searching for, but I can't tell for sure. Is it possible you have duplicated text values causing you to do scans more than once for the same text? If so, select your records using a unique condition on that field and then update your counts for all matching records at one time.
Have you profiled your code to see if Ruby itself can help you pinpoint the problem? Modify your code a little to process 100 or 1000 records. Start the app with the -r profile flag. When the app exits profiler will output a table showing where time was spent.
What version of Rails are you running?
I think you could approach this problem differently
You do not need to scan the file this many times, you could create a db, like in mongo or mysql, and for each word you find, you fetch the db for it and then adds on some "counter" field.
You could ask me "but then I will have to scan my database a lot and it could take a lot more". Well, sure you wouldn't ask this, but it won't take more time because databases are focused in IO, besides you could always index it.
EDIT: There is no way to delimit at all?? Let's say that where you have the a Word.name string you really holds a (not simple) regex. Could the regex contain the \n? Well, if the regex can contain any value, you should estimate the maximum size of string the regex can fetch, double it, and scan the file by that ammount of chars but moving the cursor by that number.
Lets say your estimate of the maximum your regex could fetch it is like 20 chars nad your file has from 0 to 30000 chars. You pass each regex you have from 0 to 40 chars, then again from 20 to 60, from 40 to 80, etc...
You should also hold the position you found of your smaller regex so it wouldn't repeat it.
Finally, this solution seems to be not worth the effort, your problem may have a greater solution based on what that regexes are, but it will be faster than invoke scan Words.count times your your 300Mb string.
You could load your entire "Word" table into a Trie, then do back-tracking since you said there are no delimiters in the text.
So for each character in the text, go down the Trie of words. If you hit a word, increment its count. "Going down the trie" involves three cases:
There's no node at this character. (If you're mid-search, pop the back-tracking stack)
There's a node at this character. (But it's not a Word)
There's a node at this character. (It's a Word - increment and "dirty")
Back-tracking is just keeping track of places you want to go after you've exhausted this "search" of the Trie, which is when you run out of nodes to visit. This will probably be each character you visit that is a root of the Trie.
After you've done this, you can then visit all the nodes you changed and just update the records they represent.
This will take some time to implement, but will surely be faster than each & scan.

Resources