Similarity between LDA results over two different numbers of topics? - machine-learning

Say we run LDA with 20 topics and then run it again with 30 topics. My question is: will the 30-topic result contain (or at least intersect with) those original 20 topics and produce similar results?

Short answer - no. LDA is fit with approximate inference over the latent topic assignments (typically a collapsed Gibbs sampler or variational Bayes, drawing on Dirichlet priors over the document-topic and topic-word distributions). The resulting allocations will differ between runs, both because of sampling randomness and allocation uncertainty, unless you fix an explicit random seed and keep the same number of topics k. Take a look at the original paper (Blei et al., 2003) to see how k is defined.
UPDATE (with regard to the comment): Hierarchical LDA (hLDA) tries to solve the problem of retaining topics and subtopics by constructing levels of topics following the (nested) Chinese restaurant process, but it's still not perfect.
Flat LDA, however, looks at documents rather than topics to produce further results. Say you get topic 0 (the first table in the restaurant) and all documents try to sit there, but there isn't really enough space, so you create another topic 1 where some docs feel more comfortable, and so on. You are right about how these tables are created, but there is one critical point: topic 0 CHANGES when you create a new table/topic 1, because some documents have left the first table and took their words (or rather the probabilities of their co-occurrence) with them to the new table, and all words in topic 0 get reshuffled given the new situation. The same happens when you create more tables/topics: all the previous ones are also re-estimated. Hence, you will never get the same 20 topics when rerunning with 30.
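To see this in practice, here is a minimal sketch using gensim's LdaModel (gensim uses variational inference rather than a Gibbs sampler, but the point about re-estimation is the same); the topic counts 2 and 3 stand in for your 20 and 30, and the toy corpus is made up:

from gensim import corpora, models

docs = [["heat", "shock", "protein"], ["gene", "expression", "array"],
        ["protein", "folding", "stress"], ["expression", "profiling", "gene"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# same corpus, same seed, different k
lda_a = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=1)
lda_b = models.LdaModel(bow, num_topics=3, id2word=dictionary, random_state=1)

# the word distribution of "topic 0" differs between the two models,
# because every topic is re-estimated when k changes
print(lda_a.show_topic(0))
print(lda_b.show_topic(0))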

Related

Build vocab in doc2vec

I have a list of approximately 500 abstracts and articles in a CSV file; each document contains roughly 800 to 1000 words. Whenever I build the vocabulary and print it, I get None. Why is that, and how can I improve my results?
import string
import gensim
from nltk.tokenize import word_tokenize

lst_doc = doc.translate(str.maketrans('', '', string.punctuation))
target_data = word_tokenize(lst_doc)

train_data = list(read_data())
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
train_vocab = model.build_vocab(train_data)
print(train_vocab)

model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)
Output:
None
A call to build_vocab() only builds the vocabulary inside the model, for further usage. That function call doesn't return anything, so your train_vocab variable will be Python None.
So, the behavior you're seeing is as expected, and you should say more about what your ultimate aims are, and what you'd want to see as steps towards those aims, if you're stuck.
If you want to see progress reporting from your calls to build_vocab() or train(), you can set the logging level to INFO. This is usually a good idea when learning a new library: even if the copious output is initially hard to understand, reviewing it will show you the various internal steps, counts, timings, etc., that hint whether things are going well or poorly.
You can also examine the state of the model and its various internal properties after the code has run.
For example, after build_vocab() the model.wv property contains a Gensim KeyedVectors structure holding all the untrained, ready-for-training vectors. You can ask for its length (len(model.wv)) or examine the discovered list of active words (model.wv.index_to_key).
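As a minimal sketch (assuming a fresh model and the train_data from the question, and Gensim 4.x attribute names), enabling INFO logging and inspecting model.wv looks like this:

import logging
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

model.build_vocab(train_data)        # logs progress as it scans the corpus; returns None
print(len(model.wv))                 # number of surviving vocabulary words
print(model.wv.index_to_key[:20])    # the first 20 discovered words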
Other comments:
It's not clear that the assignments into lst_doc and target_data affect anything further, since it's unclear what read_data() might be doing to fill train_data.
Often low min_count values worsen results, by including more words that have so few usage examples that they're little more than noise during training.
Only 500 documents is rather small compared to most published work showing impressive results with this algorithm, which typically uses tens of thousands of documents (if not millions). So keep in mind that results on such a small dataset may be unrepresentative of what's possible with a larger corpus, in terms of quality, optimal parameters, etc.

Applying MACHINE learning in biological text data

I am trying to solve the following question - Given a text file containing a bunch of biological information, find out the one gene which is {up/down}regulated. Now, for this I have many such (60K) files and have annotated some (1000) of them as to which gene is {up/down}regulated.
Conditions -
Many sentences in the file have some gene name mention and some of them also have neighboring text that can help one decide if this is indeed the gene being modulated.
Some files also have NO gene modulated. But these still have gene mentions.
Given this, I wanted to ask (having absolutely no background in ML), what sequence learning algorithm/tool do I use that can take in my annotated (training) data (after probably converting the text to vectors somehow!) and can build a good model on which I can then test more files?
Example data -
Title: Assessment of Thermotolerance in preshocked hsp70(-/-) and
(+/+) cells
Organism: Mus musculus
Experiment type: Expression profiling by array
Summary: From preliminary experiments, HSP70 deficient MEF cells display moderate thermotolerance to a severe heatshock of 45.5 degrees after a mild preshock at 43 degrees, even in the absence of hsp70 protein. We would like to determine which genes in these cells are being activated to account for this thermotolerance. AQP has also been reported to be important.
Keywords: thermal stress, heat shock response, knockout, cell culture, hsp70
Overall design: Two cell lines are analyzed - hsp70 knockout and hsp70 rescue cells. 6 microarrays from the (-/-)knockout cells are analyzed (3 Pretreated vs 3 unheated controls). For the (+/+) rescue cells, 4 microarrays are used (2 pretreated and 2 unheated controls). Cells were plated at 3k/well in a 96 well plate, covered with a gas permeable sealer and heat shocked at 43degrees for 30 minutes at the 20 hr time point. The RNA was harvested at 3hrs after heat treatment
Here my main gene is hsp70 and it is down-regulated (deducible from hsp(-/-) or HSP70 deficient). Many other gene names are also there like AQP.
There could be another file with no gene modulated at all. In fact, more files have no actual gene modulation than those that do, and all of them contain gene name mentions.
Any idea would be great!!
If you have no background in ML, I suggest buying a product like this one, this one or this one. These products were in development for decades, with team budgets in the millions.
What you are trying to do is not that simple. For example, a lot of papers contain negative statements: they first cite the original statement from another paper and then negate it. In your example, how are you going to handle this:
AQP has also been reported to be important by Doe et al. However, this study suggest that this might not be the case.
Also, if you are looking at a large corpus of biomedical research papers (or, for that matter, any corpus of research papers), you will find tons of papers suggesting something, for example that a gene is up-regulated or not, and then one paper published in Cell claiming that all the previous research was mistaken.
To make matters worse, gene/protein names are not that stable. Besides a few famous ones like P53, there is a bunch of run-of-the-mill ones that are initially thought to be one gene, but later turn out to be two different things. When this happens, the community handles it in one of two ways: either both genes get new names (usually with some designator at the end), or, if the split is uneven, the larger class retains the original name and the second one gets a new name. To compound the problem, after such a split not all researchers get the memo instantly, so there is still a stream of publications using the old name.
These are just two simple problems; there are hundreds more.
If you are doing this for personal enrichment, here are some suggestions:
Build a language model on biomedical papers. Existing language models are usually built from news-wire sources or from social media data. All three of these corpora claim to be written in the English language, but in reality they are three different languages with their own grammar and vocabulary.
Look into things like embeddings and word2vec (see the sketch after this list).
Look into Kaggle competitions; this is a somewhat popular topic there.
Subscribe to the KDD and BIBM proceedings or find them in a nearby library. There are hundreds of papers on this subject.
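As an illustration of the word2vec suggestion, here is a minimal sketch (gensim assumed, with a made-up toy corpus of tokenized sentences standing in for your annotated files):

from gensim.models import Word2Vec

sentences = [["hsp70", "deficient", "cells", "display", "thermotolerance"],
             ["aqp", "reported", "important", "heat", "shock", "response"],
             ["hsp70", "knockout", "cells", "heat", "shock", "response"]]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=2)
print(model.wv.most_similar("hsp70", topn=3))  # nearest neighbours in the embedding space

On a real corpus of 60K files you would raise min_count and feed in all the tokenized documents.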

Finding features for classifying document into printable or non-printable

I would like to perform a binary classification of documents (.txt, .pdf, .jpeg, .img, etc.) into two categories: printable and non-printable. Essentially our school runs a free printing service for clubs, but the reality is that many clubs abuse the free printing and end up printing their homework, papers, etc., which amounts to thousands of dollars in ink and paper. Thus we would like to take some unsupervised methods to help limit this by determining whether a document is with high probability not club related (e.g. Biophysics paper, there is no biophysics club!).
So this is a very simple binary classification problem. I am not looking for low-level implementation details or which ML algorithms I should use, but rather how I should discover the relevant features that will then be fed to the training, etc.
My first idea was to gather all the documents that students print in the library. The idea is that if you have actual club printing, you'll do it for free at the club printing center rather than pay for it at the library. That would be a massive dataset, assuming every document printed at the library is assigned to the non-printable (non-club material) category. Unfortunately, the school is very liberal and opposed to allowing this due to privacy concerns, so it is not really an option without legal risks.
A similar-minded option would be to collect documents that are tied to courses / school work, e.g. course syllabi, available course documents online (homeworks, papers, etc.) and do feature extraction / selection on these. The assumption is that students would be abusing the printing to generally print material relevant to their studies.
While for .pdf and .txt based documents this approach should have reasonable performance, I am at a loss as to how to classify image-based documents, besides perhaps using the title of the document and other metadata. A clever violator could simply convert all their text documents to image format to circumvent this system. However, that is outside the scope of this question and should be saved for a future question / research. For now the scope is just text-based documents.
Note that there are previous questions on topics similar to this, but mine is very specific and I believe it may pose challenges that something like movie review classification might not have to face.
I just wanted to leave a comment, but it ended up way longer than I imagined.
While this is an interesting problem, I'm not sure ML will get you what you need easily.
Firstly, your classification problem is of the type "A vs. the world", and A isn't strictly defined. Unless you know exactly what kind of material the clubs print, you can't really say whether new material belongs to that class or not.
This will prove particularly difficult when you need to assemble a training set large enough to cover whatever can or cannot be printed. Such a task will be extremely tedious, and as you said, you won't have access to what the clubs usually print, so at best you will have a large class imbalance in your training set.
As the goal is to make the system automated (if there is human interaction anyway, it's faster to check what will be printed than to build an ML algorithm whose score a human has to investigate anyway), the number of false positives and false negatives will also be problematic. There will be cases where the clubs won't be able to print things they have the right to.
As you said, you could greatly simplify the problem by classifying Course Material vs. Not Course Material. For that, I would look towards BoW (bag of words), because some words are more frequent than others in papers or course material (anything remotely technical). The number of words, as well as the overall size of the file, seem like sensible things to extract. The structure is often also distinctive: it might be a good idea to extract things like "number of lines with less than x words", "number of lines per page", "number of pictures" (if that's something you can extract from the file), and so on.
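A minimal sketch of that feature extraction (scikit-learn assumed; the two example texts are made up):

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

texts = ["Lecture 3: dynamic programming. Homework due Friday.\nProblem set attached.",
         "Chess club flyer!\nWeekly meeting, room 204.\nFree pizza."]

bow = CountVectorizer(max_features=5000).fit_transform(texts).toarray()
n_words = np.array([[len(t.split())] for t in texts])                      # total word count
short_lines = np.array([[sum(len(l.split()) < 5 for l in t.splitlines())]  # lines with < 5 words
                        for t in texts])
features = np.hstack([bow, n_words, short_lines])                          # feed this to a classifier
print(features.shape)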
For pictures, the main thing to check would be whether it's a scan of something (often students will scan and print course-related material, I guess); the file format of the image is already a good indication, but I don't see other features that would be particularly "course related".
So for me, if you can't precisely define one of your two classes, don't go with classification, or reduce the problem to something you can really define (course-related material).
If you are able to compile a "black list" of documents students are not allowed to print, you can then implement a several layers rejection mechanism.
I would suggest these 3 levels (a minimal sketch of level 1 follows the list):
compare the md5 of the file they want to print with a database of the md5 hashes of all the black-listed documents.
if 1) is passed, repeat 1) but at the page level rather than at the document level (perhaps they want to print just a few pages rather than the entire document).
if 2) is passed, compare the page they want to print with the pages of the black-listed documents using an image similarity method, like SSIM. If you get a high score between the page they want to print and one of the black-listed items, do not print, and update your md5 database accordingly.
if 3) is passed: print!
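A minimal sketch of level 1 (Python's hashlib; the black-list paths are hypothetical):

import hashlib

def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# hypothetical paths - replace with your actual black list
blacklist = {md5_of(p) for p in ["blacklist/homework1.pdf", "blacklist/paper2.pdf"]}

def allowed_to_print(path):
    return md5_of(path) not in blacklist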
A few words about SSIM: this method is quite robust to noise, so even a smart student who adds some sort of noise to the image will be caught (a small sketch follows the caveats below).
However:
you have to find a proper way to extract a region of interest (ROI) from the page and from the documents in the database (if the two ROIs are in two different areas of the page, SSIM will be negative).
SSIM might be slow! A C implementation is definitely needed here.
I think SSIM is not rotation invariant, hence the check will fail if they print the page upside down (unless you have a smart way to rotate the page).
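For reference, a minimal sketch of the SSIM comparison in level 3 (scikit-image assumed; the file names and the threshold are made up and would need tuning):

from skimage.io import imread
from skimage.transform import resize
from skimage.metrics import structural_similarity

page = imread("submitted_page.png", as_gray=True)         # hypothetical files
reference = imread("blacklisted_page.png", as_gray=True)
reference = resize(reference, page.shape)                  # SSIM needs images of equal shape

score = structural_similarity(page, reference, data_range=1.0)
if score > 0.9:                                            # threshold is a guess; tune on real data
    print("Looks like a black-listed page - do not print")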

How to quantify these features so they can be analysed upon using Logistic Regression?

I have a very small question which has been baffling me for a while. I have a dataset with interesting features, but some of them are dimensionless quantities (I've tried using z-scores on them, but that made things worse). These are:
Timestamps (like YYYYMMDDHHMMSSMis). I am taking the last 9 characters from these.
User IDs (in a hash form). How do I extract meaning from them?
IP addresses (you know what those are). I only extract the first 3 characters.
City (has an ID like 1, 15, 72). How do I extract meaning from this?
Region (same as city). Should I extract meaning from this or just leave it?
The rest of the features are prices, widths and heights, which I understand. Any help or insight would be much appreciated. Thank you.
Timestamps can be transformed into Unix timestamps, which are plain numeric values.
User IDs / cities / regions are nominal values, which have to be encoded somehow. The most common approach is to create as many "dummy" dimensions as there are possible values. So if you have 100 cities, you create 100 dimensions and put a "1" only in the one representing a particular city (and a 0 in the others).
IP addresses should rather be removed, or transformed into a small number of groups (based on DNS/network identification, followed by the same nominal-to-dummy transformation as above).
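As a minimal sketch of the dummy-coding and timestamp transformation (pandas assumed; the data is made up):

import pandas as pd

df = pd.DataFrame({
    "timestamp": ["20230105123000123", "20230106080000456"],  # YYYYMMDDHHMMSS + ms
    "city": [1, 15],
    "region": [3, 3],
    "price": [9.99, 4.50],
})

# timestamp -> Unix time in seconds (the millisecond tail is dropped for simplicity)
df["unix_time"] = (pd.to_datetime(df["timestamp"].str[:14], format="%Y%m%d%H%M%S")
                     .astype("int64") // 10**9)

# one dummy (0/1) column per observed city/region value
df = pd.get_dummies(df, columns=["city", "region"])
print(df.drop(columns=["timestamp"]))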

Is there a cleverer Ruby algorithm than brute-force for finding correlation in multidimensional data?

My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
# Assumes `people` and `user` are already loaded, and that each person
# responds to #id and #ratings (a hash of item_id => score); these helpers
# are hypothetical stand-ins for the actual models.
matches = Hash.new(0)
(people - [user]).each do |person|
  person.ratings.each do |item_id, score|
    next unless user.ratings.key?(item_id)
    matches[person.id] += (score - user.ratings[item_id]).abs
  end
end
# lowest values in matches are the best matches for user
The problem is that as the number of items, ratings, and people increases, this code takes a very long time to run, and (ignoring caching for now) it has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!
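To make the idea concrete, here is a minimal sketch (in Python with SciPy for brevity; the same idea ports to Ruby) of treating each person's ratings as a point and querying a k-d tree:

import numpy as np
from scipy.spatial import cKDTree

# each row is one person's ratings for the same 4 items
# (0 = not rated, which is exactly the missing-value caveat raised below)
ratings = np.array([[3, 4, 0, 5],
                    [3, 5, 0, 4],
                    [0, 1, 5, 2]], dtype=float)

tree = cKDTree(ratings)
dist, idx = tree.query(ratings[0], k=2)   # nearest neighbours of person 0
print(idx)                                # idx[0] is person 0 itself, idx[1] the closest match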
Technically your task is matching long strings made out of characters of a six-letter alphabet (the ratings 0-5). This kind of problem is researched extensively in computational biology (typically with four-letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198, you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
Is your data sparse? With ratings, most of the time not every user rates every object.
Naively comparing each object to every other is O(n*n*d), where d is the number of dimensions (items). However, a key trick of the Hadoop-style solutions is to transpose the matrix and work only on the non-zero values in each column. Assuming your sparsity is s = 0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 out of 100, the computation will in theory be 10000 times faster.
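To make the trick concrete, here is a minimal sketch (in Python with SciPy for brevity; the same column-wise idea ports to Ruby) that only touches the non-zero entries of each item column:

import numpy as np
from scipy.sparse import csc_matrix
from collections import defaultdict

# ratings as (person, item, value) triples; missing ratings are simply absent
people = np.array([0, 0, 1, 2, 2])
items  = np.array([0, 1, 0, 1, 2])
values = np.array([3, 5, 4, 2, 1])
R = csc_matrix((values, (people, items)), shape=(3, 3))

diffs = defaultdict(int)
for item in range(R.shape[1]):
    col = R.getcol(item)
    raters, scores = col.indices, col.data            # only the people who rated this item
    for i in range(len(raters)):
        for j in range(i + 1, len(raters)):
            diffs[(raters[i], raters[j])] += abs(int(scores[i]) - int(scores[j]))

print(dict(diffs))   # summed differences only for pairs sharing at least one rated item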
Note that the resulting data will still be an O(n*n) distance matrix, so strictly speaking the problem remains quadratic.
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched AFAICT.

Resources