I'm studying deep learning and TensorBoard, and almost all the example code uses summaries.
I wonder why I need to use variable summaries.
There are many types of data to summarize, like min, max, mean, variance, etc.
What should I use in a typical situation?
How do I analyze these summary graphs, and what can I learn from them?
Thank you :D
There is an awesome video tutorial (https://www.youtube.com/watch?v=eBbEDRsCmv4) on Tensorboard that describes almost everything about Tensorboard (Graph, Summaries etc.)
Variable summaries (scalar, histogram, image, text, etc.) help you track your model through the learning process. For example, tf.summary.scalar('v_loss', validation_loss) will add one point to the loss curve each time you call the summary op, giving you a rough idea of whether the model has converged and when to stop.
It depends on your variable type. For values like loss, tf.summary.scalar shows the trend across epochs. For variables like the weights in a layer, it is better to use tf.summary.histogram, which shows how the entire distribution of weights changes. I typically use tf.summary.image and tf.summary.text to check the images / texts my model generates over different epochs.
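To make the min / max / mean / variance question concrete, here is a minimal sketch of the statistics a typical "variable summaries" helper records for one weight tensor. It uses only the standard library; the helper name variable_summaries and the shrinking toy weights are assumptions for illustration, not TensorFlow API.

```python
import random
import statistics


def variable_summaries(values):
    """Return the scalar statistics commonly logged for a weight tensor."""
    return {
        "mean": statistics.mean(values),
        "stddev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical weights that shrink a little at every training step.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
for step in range(3):
    weights = [0.9 * w for w in weights]
    stats = variable_summaries(weights)
    # In TensorFlow, each entry would be passed to tf.summary.scalar and the
    # raw values to tf.summary.histogram, so TensorBoard can plot them per step.
    print(step, {name: round(value, 3) for name, value in stats.items()})
```

A mean that drifts, a stddev that collapses towards zero, or a min / max that explodes are the kinds of patterns these curves are meant to expose.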
The graph shows your model structure and the size of tensors flowing through each op. I found it hard at the beginning to organise ops nicely in the graph presentation, and I learnt a lot about variable scope from that. The other answer provides a link for a great tutorial for beginners.
Take the following sentence:
I'm going to change the light bulb
Here change means replace, as in someone is going to replace the light bulb. This could easily be solved by using a dictionary API or something similar. However, consider the following sentences:
I need to go to the bank to change some currency
You need to change your screen brightness
In the first sentence, change no longer means replace; it means exchange. In the second sentence, change means adjust.
If you were trying to understand the meaning of change in this situation, what techniques would someone use to extract the correct definition based off of the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most people's computers.
People need to change the brightness to have healthier eyes.
Is not what I'm trying to solve, because you can use the previous sentence to set the context. Also this would be for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight on this problem. If this is your answer, how do you interpret the word embedding that is returned? These arrays can be upwards of 500+ in length which isn't practical to interpret.
What you're trying to do is called Word Sense Disambiguation. It's been a subject of research for many years, and while probably not the most popular problem it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print(answer)
Synset('depository_financial_institution.n.01')
>>> print(answer.definition())
a financial institution that accepts deposits and channels the money into lending activities
The methods are mostly kind of old and I can't speak for their quality but it's a good starting point at least.
Word senses are usually going to come from WordNet.
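To show the idea behind the Lesk family of methods without installing anything, here is a rough sketch of simplified Lesk: pick the sense whose gloss shares the most words with the sentence. The SENSES inventory, its glosses, and the stopword list are made up for illustration; a real system would take senses and glosses from WordNet, and this is not pywsd's implementation.

```python
import re

# Hypothetical mini sense inventory (real glosses would come from WordNet).
SENSES = {
    "change": {
        "replace": "put a new thing in place of an old or broken one such as a light bulb",
        "exchange": "give money and receive the same amount in another currency or in coins at a bank",
        "adjust": "alter a setting such as volume or screen brightness to a different level",
    }
}

STOPWORDS = {"a", "an", "the", "to", "of", "in", "or", "and", "i", "you",
             "your", "as", "at", "such", "some", "need", "go"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(sentence, word):
    """Return the sense of `word` whose gloss overlaps most with the sentence."""
    context = tokenize(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & tokenize(gloss))
        if overlap > best_overlap:  # on ties, the first listed sense wins
            best_sense, best_overlap = sense, overlap
    return best_sense


for sentence in ("I'm going to change the light bulb",
                 "I need to go to the bank to change some currency",
                 "You need to change your screen brightness"):
    print(sentence, "->", simplified_lesk(sentence, "change"))
```

Listing the most common sense first makes the tie-break fall back to the "most frequent sense" baseline mentioned above.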
I don't know how useful this is, but from my POV, word vector embeddings are naturally separated, and a word's position in the sample space is closely related to its different uses. However, as you said, a word may often be used in several contexts.
To solve this, encoding techniques that utilise the context, like continuous bag-of-words or continuous skip-gram models, are generally used to classify the usage of a word in a particular context, e.g. change meaning either exchange or adjust. The same idea is applied in LSTM-based architectures and RNNs, where the context is preserved over input sequences.
The interpretation of word-vectors isn't practical from a visualisation point of view, but only from 'relative distance' point of view with other words in the sample space. Another way is to maintain a matrix of the corpus with contextual uses being represented for the words in that matrix.
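As a small illustration of the "relative distance" reading, the sketch below compares vectors with cosine similarity instead of inspecting individual dimensions. The 4-dimensional vectors are invented for the example; real embeddings would come from a trained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, for illustration only.
vectors = {
    "change": [0.9, 0.1, 0.4, 0.0],
    "exchange": [0.8, 0.2, 0.5, 0.1],
    "banana": [0.0, 0.9, 0.0, 0.7],
}


def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(nearest("change"))
```

The individual numbers carry no readable meaning; only the ranking of neighbours does.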
In fact, there's a neural network that utilises a bidirectional language model: it first predicts the upcoming word, then at the end of the sentence goes back and tries to predict the previous word. It's called ELMo. You should go through the ELMo paper and this blog.
Naturally, the model learns from representative examples. So the better your training set covers the diverse uses of the same word, the better the model can learn to utilise context to attach meaning to the word. This is often how people solve their specific cases: by using domain-centric training data.
I think this could be helpful:
Efficient Estimation of Word Representations in Vector Space
Pretrained language models like BERT could be useful for this as mentioned in another answer. Those models generate a representation based on the context.
The recent pretrained language models use wordpieces but spaCy has an implementation that aligns those to natural language tokens. There is a possibility then for example to check the similarity of different tokens based on the context. An example from https://explosion.ai/blog/spacy-transformers
import spacy
import torch
import numpy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782
I have a class whose features differ only slightly from another class:
e.g. this image has a buckle in it (consider it as one class): https://6c819239693cc4960b69-cc9b957bf963b53239339d3141093094.ssl.cf3.rackcdn.com/1000006329245-822018-Black-Black-1000006329245-822018_01-345.jpg
But this image is quite similar to it and has no buckle:
https://sc01.alicdn.com/kf/HTB1ASpYSVXXXXbdXpXXq6xXFXXXR/latest-modern-classic-chappal-slippers-for-men.jpg
I am a little confused about which model to use in these kinds of cases, one that actually learns pixel-to-pixel values.
Any thoughts would be appreciated.
Thanks!!
I have already tried Inception, ResNet, etc. models.
With a small volume of training data (around 300-400 images per class), can we reach a good recall/precision/F1 score?
You might want to look into transfer learning because of the small dataset. What you can do is use a pretrained ResNet model as a feature extractor and run a YOLO (You Only Look Once) algorithm on top of it: look through each window (see the sliding-window implementation using ConvNets) to detect a belt buckle, and classify the image based on that.
Based on my understanding of your dataset, to follow the above approach you will need to re-annotate your dataset as per the requirements of the YOLO algorithm.
To look at an example of the above approach, visit https://mc.ai/implementing-yolo-using-resnet-as-feature-extractor/
Edit: If you have an XML-annotated dataset and need to convert it to CSV to follow the above example, use https://github.com/datitran/raccoon_dataset
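For the re-annotation step, here is a small sketch of converting one pixel-coordinate bounding box (as found in Pascal VOC style XML) into the normalized "class x_center y_center width height" line used by Darknet-style YOLO labels. Whether your particular YOLO implementation expects exactly this format is an assumption to verify; the function name and the example box are hypothetical.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert corner pixel coordinates to normalized center/size values."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


# Hypothetical buckle box in a 1000x800 image, with class id 0 for "buckle".
box = voc_to_yolo(100, 200, 300, 400, img_w=1000, img_h=800)
print("0 " + " ".join(f"{value:.6f}" for value in box))
```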
Happy modelling.
I am confused about how linear regression works in supervised learning. I want to generate an evaluation function for a board game using linear regression, so I need both input data and output data. The input data is my board position, and I need the corresponding value for that position, right? But how can I get this expected value? Do I need to write an evaluation function myself first? I thought I was supposed to generate the evaluation function by using linear regression, so I'm a little confused about this.
It's supervised-learning after all, meaning: you will need input and output.
Now the question is: how to obtain these? And this is not trivial!
Candidates are:
historical-data (e.g. online-play history)
some form of self-play / reinforcement-learning (more complex)
But then a new problem arises: which output is available, and what kind of input will you use?
If there were some a-priori implemented AI, you could just take its scores. But with historical data, for example, you only get -1, 0, 1 (A wins, draw, B wins), which makes learning harder (and this touches on the Credit Assignment problem: there might be one move which made someone lose, and it's hard to tell which of 30 moves led to the result of 1). This is also related to the input. Take chess for example, and take a random position from some online game: it is possible that this position is unique across 10 million games (or at least does not occur often), which hurts the expected performance of your approach. I assumed here that the input is the full board position. This changes for other inputs, e.g. chess material, where the input is just a histogram of pieces (3 of these, 2 of these). Now there are far fewer unique inputs and learning will be easier.
Long story short: it's a complex task with a lot of different approaches and most of this is somewhat bound by your exact task! A linear evaluation-function is not super-uncommon in reinforcement-learning approaches. You might want to read some literature on these (this function is a core-component: e.g. table-lookup vs. linear-regression vs. neural-network to approximate the value- or policy-function).
I might add that your task points to the self-learning approach to AI, which is very hard. It's a topic that has gained additional popularity in recent years (there was success before: see Backgammon AI). But all of these approaches are highly complex, and a good understanding of RL and of mathematical basics like Markov decision processes is important.
For more classic hand-made evaluation-function based AIs, a lot of people used an additional regressor for tuning / weighting already implemented components. Some overview at chessprogramming wiki. (the chess-material example from above might be a good one: assumption is: more pieces better than less; but it's hard to give them values)
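To make the chess-material idea concrete, here is a minimal sketch that fits a linear evaluation function to game outcomes with plain gradient descent. The features (material differences from player A's point of view), the tiny DATA set, and the learning-rate settings are all invented for illustration; real data would come from game histories or self-play.

```python
# Hypothetical data: (pawn_diff, minor_piece_diff, rook_diff) -> outcome,
# where the outcome is 1 (A wins), 0 (draw) or -1 (B wins).
DATA = [
    ((2, 0, 0), 1), ((-2, 0, 0), -1),
    ((0, 1, 0), 1), ((0, -1, 0), -1),
    ((0, 0, 1), 1), ((0, 0, -1), -1),
    ((1, 0, 0), 0), ((-1, 0, 0), 0),
    ((3, -1, 0), 0), ((-3, 1, 0), 0),
]


def evaluate(weights, features):
    """Linear evaluation: weighted sum of the material differences."""
    return sum(w * f for w, f in zip(weights, features))


def fit(data, lr=0.1, steps=2000):
    """Minimize the mean squared error with batch gradient descent."""
    weights = [0.0] * len(data[0][0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for features, outcome in data:
            error = evaluate(weights, features) - outcome
            for i, f in enumerate(features):
                grad[i] += 2 * error * f / len(data)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


weights = fit(DATA)
print([round(w, 3) for w in weights])
print(evaluate(weights, (1, 1, 0)))  # score for a position up a pawn and a minor piece
```

The learned weights play the role of piece values: the regression answers "how much is each piece worth" from results alone, without a hand-written evaluation function.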
I have a dataset whose two classes overlap a lot. So far my results with SVM are not good. Do you have any recommendations for a model that may be able to distinguish between these 2 classes?
Scatter plot from both classes
It is easy to fit the dataset by interpolating one of the classes and predicting the other one otherwise. The problem with this approach, though, is that it will not generalize well. The question you have to ask yourself is whether you can predict the class of a point given its attributes. If not, then every ML algorithm will also fail to do so.
Then the only reasonable thing you can do is to collect more data and more attributes for every point. Maybe by adding a third dimension you can separate the data more easily.
If the data is overlapping so much, both should be of the same class, but we know they are not. So, there is/are some feature(s) or variable(s) that is/are separating these data points into two classes. Try to add more features for data.
And sometimes, just transforming the data into a different scale can help.
Both the classes need not be equally distributed, as skewed data distribution can be handled separately.
First of all, what is your criterion for "good results"? What style of SVM did you use? Simple linear will certainly fail for most concepts of "good", but a seriously convoluted Gaussian kernel might dredge something out of the handfuls of contiguous points in the upper regions of the plot.
I suggest that you run some basic statistics on the data you've presented, to see whether they're actually as separable as you'd want. I suggest a t-test for starters.
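As a starting point for that check, here is a small sketch that computes Welch's t statistic for one feature across the two classes, using only the standard library. The two samples are made-up numbers; for a p-value you would normally reach for a statistics package, e.g. scipy.stats.ttest_ind(a, b, equal_var=False).

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic: difference of means over its standard error."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical values of one feature for each class.
class_a = [1, 2, 3, 4, 5]
class_b = [2, 3, 4, 5, 6]
print(welch_t(class_a, class_b))
```

A statistic close to zero for every feature suggests that no single feature separates the classes on its own, which supports looking for additional dimensions.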
If you have other dimensions, I strongly recommend that you use them. Start with the greatest amount of input you can handle, and reduce from there (principal component analysis). Until we know the full shape and distribution of the data, there's not much hope of identifying a useful algorithm.
That said, I'll make a pre-emptive suggestion that you look into spectral clustering algorithms when you add the other dimensions. Some are good with density, some with connectivity, while others key on gaps.
Consider the text classification problem of spam or not spam with the Naive Bayes algorithm.
The question is the following:
How do you make predictions about a document W if, in that set of words, you see a new word wordX that was not seen at all by your model (so you do not even have a Laplace-smoothed probability estimated for it)?
Is the usual thing to do just to ignore wordX, even though it was seen in the current text, because it has no probability associated with it? I.e., I know Laplace smoothing is sometimes used to try to solve this problem, but what if the word is definitely new?
Some of the solutions that I've thought of:
1) Just ignore those words when estimating a classification (the simplest option, but sometimes wrong...? However, if the training set is large enough, this is probably the best thing to do, as I think it's reasonable to assume your features were selected well enough if you have, say, 1M or 20M data points).
2) Add that word to your model and change your model completely, because the vocabulary changed, so probabilities have to change everywhere (this does have a problem though, since it could mean that you have to update the model frequently, especially if you analyze 1M documents, say).
I've done some research on this, read some of the Dan Jurafsky NLP and NB slides and watched some videos on coursera and looked through some research papers but I was not able to find something I found useful. It feels to me this problem is not new at all and there should be something (a heuristic..?) out there. If there isn't, it would be awesome to know that too!
Hope this is a useful post for the community and Thanks in advance.
PS: to make the issue a little more explicit, one of the solutions I've seen is this: say that we see an unknown new word wordX in a spam, then for that word we can use 1 / (count(spams) + |Vocabulary + 1|). The issue I have with doing something like that is: does that mean we change the size of the vocabulary, so that every new document we classify brings a new feature and vocabulary word? This video seems to attempt to solve that issue, but I'm not sure whether 1, that's a good thing to do, or 2, maybe I have misunderstood it:
https://class.coursera.org/nlp/lecture/26
From a practical perspective (keeping in mind this is not all you're asking), I'd suggest the following framework:
Train a model using an initial train set, and start using it for classification
Whenever a new word (with respect to your current model) appears, use some smoothing method to account for it. e.g. Laplace smoothing, as suggested in the question, might be a good start.
Periodically retrain your model using new data (usually in addition to the original train set), to account for changes in the problem domain, e.g. new terms. This can be done at preset intervals, e.g. once a month; after some number of unknown words have been encountered; or in an online manner, i.e. after each input document.
This retrain step can be done manually, e.g. collect all documents containing unknown terms, manually label them, and retrain; or using semi-supervised learning methods, e.g. automatically add the highest scored spam/ non spam documents to the respective models.
This will ensure your model stays updated and accounts for new terms - by adding them to the model from time to time, and by accounting for them even before that (simply ignoring them is usually not a good idea).
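The first two steps of this framework can be sketched as a tiny multinomial Naive Bayes with add-one (Laplace) smoothing, where one extra vocabulary slot is reserved for unknown words, as in the 1 / (count + |Vocabulary + 1|) formula from the question. The class name, the toy documents, and the labels are assumptions for illustration; retraining would simply mean calling train again with more labelled documents.

```python
import math
from collections import Counter


class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # label -> Counter of word occurrences
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, docs):
        """Add labelled documents, given as (list_of_words, label) pairs."""
        for words, label in docs:
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)

    def log_prob(self, word, label):
        """Add-one smoothed log P(word | label); unseen words get count 0."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        # The extra +1 in the denominator is the slot reserved for unknown words.
        return math.log((counts[word] + 1) / (total + len(self.vocab) + 1))

    def predict(self, words):
        """Return the label with the highest posterior log score."""
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / n_docs)
            score += sum(self.log_prob(w, label) for w in words)
            scores[label] = score
        return max(scores, key=scores.get)


model = NaiveBayes()
model.train([
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
])
print(model.predict(["free", "money", "wordX"]))  # wordX was never seen in training
print(model.predict(["lunch", "wordX"]))
```

Note that the unknown word still contributes a slightly different probability to each class (because the class totals differ), which is why smoothing is not quite the same as ignoring the word.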