TFF: How to create a non-IID dataset - tensorflow-federated

I have 2 classes, each with 140 examples, and I have 4 clients. I would like to create a non-IID dataset as in McMahan's paper. How do I divide the examples into fragments?

Note: there are many notions of "non-IID-ness" that may be interesting to explore.
Label non-IID: you might want to make the distribution of labels very unbalanced across clients. Even when the number of examples is distributed evenly, we can still get a non-IID label distribution such as [(35, 35), (10, 60), (50, 20), (45, 25)]. The McMahan 2016 paper takes a similar approach, but for a problem with 10 classes, giving most clients examples from only two classes (the exact method is on page 5 of the paper); a small partitioning sketch in that spirit follows below.
Amount of data: you might want to give some clients more data than others. With 280 examples, perhaps the split is (180, 80, 10, 10) examples (ignoring how the labels are distributed). The StackOverflow dataset in TensorFlow Federated also exhibits this: some clients have tens of thousands of examples, while others have only 100.
Feature non-IID: if there are patterns in the feature space, it may be useful to restrict certain patterns to certain users. For example, in an image recognition task, perhaps some camera had a different white balance, rotation, or color saturation than the others (even though its images cover most or all labels). Instead of shuffling these randomly across synthetic clients, grouping similar feature patterns into a single client can give a different form of non-IID.
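For the concrete 2-class, 4-client case in the question, here is a minimal NumPy sketch of such a label-unbalanced split. The feature array x is a placeholder and the per-client counts are just the example numbers above; each client's arrays could then be wrapped into a tf.data.Dataset for use with TFF.

import numpy as np

# Partition 2 classes x 140 examples across 4 clients with unbalanced labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(280, 10))              # placeholder features
y = np.repeat([0, 1], 140)                  # 140 examples per class

counts = [(35, 35), (10, 60), (50, 20), (45, 25)]  # (class 0, class 1) per client

idx0 = list(rng.permutation(np.where(y == 0)[0]))
idx1 = list(rng.permutation(np.where(y == 1)[0]))

client_data = []
for n0, n1 in counts:
    take = [idx0.pop() for _ in range(n0)] + [idx1.pop() for _ in range(n1)]
    client_data.append((x[take], y[take]))

for i, (cx, cy) in enumerate(client_data):
    print(f"client {i}: {len(cy)} examples, label counts {np.bincount(cy)}")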

Related

Is splitting a long document of a dataset for BERT considered bad practice?

I am fine-tuning a BERT model on a labeled dataset with many documents longer than the 512 token limit set by the tokenizer.
Since truncating would lose a lot of data I would rather use, I started looking for a workaround. However, I noticed that simply splitting the documents after 512 tokens (or using another heuristic) and creating new entries in the dataset with the same label was never mentioned.
In this answer, someone mentioned that you would need to recombine the predictions; is that necessary when splitting the documents?
Is this generally considered bad practice or does it mess with the integrity of the results?
You have not mentioned whether your intention is to classify, but given that you refer to an article on classification, I will describe an approach where you classify the whole text.
The main question is which part of the text is the most informative for your purpose, or, in other words, whether it makes sense to use more than the first or last split of the text.
When considering long passages of text, it is frequently enough to consider the first (or last) 512 tokens to correctly predict the class in a substantial majority of cases (say 90%). Even though you may lose some precision, you gain on speed and performance of the overall solution, and you get rid of the nasty problem of figuring out the correct class from a set of per-piece classifications. Why?
Consider an example of a text 2100 tokens long. You split it into pieces of 512 tokens, obtaining 512, 512, 512, 512, 52 (notice the small last piece: should you even consider it?). Your target class for this text is, say, A, but you get the following predictions on the pieces: A, B, A, B, C. So you now have the headache of figuring out the right method to determine the class. You can:
use majority voting, but it is not conclusive here (two votes for A, two for B, one for C).
weight the predictions by the length of the piece. Again, not conclusive.
check the prediction of the last piece: it is class C, but barely above the threshold, and class C is close to A, so you lean towards A.
re-classify, starting the split from the end. In the same order as before you get A, B, C, A, A, so clearly A. You also get A when you take a majority vote over all of the classifications (forward and backward splits combined).
consider the confidence of the classifications, e.g. A: 80%, B: 70%, A: 90%, B: 60%, C: 55%; on average 85% for A vs. 65% for B (a small sketch of this kind of aggregation appears after this list).
manually recheck the label of the last piece: if it turns out to be B, that changes all of the above.
train an additional network to classify from the raw per-piece classifications, getting into trouble again over what to do with particularly long sequences or inconclusive combinations of predictions, which result in poor confidence from the additional classification layer.
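For illustration, here is a rough sketch of splitting a tokenized document into 512-token pieces and combining the per-piece predictions with a length-weighted confidence average. classify_chunk is a hypothetical function (not from any particular library) that returns per-class probabilities for one chunk, e.g. the softmax of your fine-tuned BERT's logits.

import numpy as np

def predict_long_document(token_ids, classify_chunk, chunk_size=512):
    # Split the token id sequence into consecutive chunks of up to chunk_size.
    chunks = [token_ids[i:i + chunk_size]
              for i in range(0, len(token_ids), chunk_size)]
    probs = np.array([classify_chunk(c) for c in chunks])  # (n_chunks, n_classes)
    # Weight each chunk's probabilities by its length, then average.
    weights = np.array([len(c) for c in chunks], dtype=float)
    weights /= weights.sum()
    combined = (probs * weights[:, None]).sum(axis=0)
    return combined.argmax(), combined

With the pieces above (512, 512, 512, 512, 52), the tiny last piece contributes little to the weighted average, which softens the "should you even consider it?" problem.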
It turns out that there is no easy way. You will also notice that text is strange classification material, exhibiting all of the above issues (and more), while the difference in agreement with the annotation between the first-piece prediction and the ultimate, perfect classifier is typically slim at best.
So spare the effort, strive for simplicity, performance, and a good heuristic... and clip it!
For details on best practices, you should probably refer to the article from this answer.

Find points to split n-dimensional time series based on labeled data

I'm at the starting point of a bigger project to identify common patterns in time series.
The goal is to automatically find split points in time series which splits the series into commonly used patterns. (Later I want to split the time series based on the split points to use the time series in between independently.)
One time series consists of:
n data series based on a fixed time interval as input
The x-axis represents the interval indices from 0 to m
The y-axis represents the values of the specific time series
For example, it could look like this:
pos_x,pos_y,pos_z,force_x,force_y,force_z,speed,is_split_point
2, 3, 4, 0.4232, 0.4432, 0, 0.6, false
2, 3, 4, 0.4232, 0.4432, 0, 0.6, false
2, 3, 4, 0.4232, 0.4432, 0, 0.6, true
My best bet is to solve this problem with Machine Learning, because I need a general approach to detect the patterns based on the user's selections made beforehand.
Therefore I have a lot of labeled data where the split points are already set manually by the user.
Currently, I have two ideas to solve this problem:
Analyzing the data around the split points in the labeled data to derive a correlation between the different data dimensions and use this as new features on unlabeled data.
Analyzing the patterns between two keyframes to find similar patterns in unlabeled data.
I prefer 1. because I think it's more important to find out what defines a split point.
I'm curious whether neural networks are well suited for this task.
I'm not asking the question to get a complete solution to the problem; I just want a second opinion on this. I'm relatively new to Machine Learning, which is why it's a bit overwhelming to find a good starting point for this problem. I'd be very happy with any ideas, techniques, and useful resources that cover this problem and can give me a good starting point.
Whooo, that is a great question.
As a matter of fact, I have some ideas to offer; some of them were tested and worked on different time-based problems with anomaly events that I have encountered.
First, analyzing the data is always a great approach for better understanding the problem, regardless of what solution you will use. This way you ensure that you don't feed your models garbage. Tools for this analysis can be peaks in a truncated past window, derivatives, etc.
Then you can plot the data using t-SNE and see if there is some kind of separation in the data.
However, simply using neural networks can be problematic, since you have a small number of split points and a large number of non-split points.
You can use LSTMs and train them in a many-to-one configuration, where you create a balanced number of positive and negative examples. The LSTMs will help you overcome the varying lengths of the examples and give more meaning to the time domain.
Going in that direction, you can use a truncated past window as an example, with is_split_point as the label, and use an i.i.d. model by pulling samples in a balanced way (see the sketch below). DNNs also work in that configuration.
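As a concrete illustration of the truncated-past-window idea under the column names in the CSV excerpt above, here is a small sketch; the DataFrame df, the window length, and the balancing strategy are all assumptions for illustration. The resulting windows could feed a many-to-one LSTM or a plain DNN.

import numpy as np
import pandas as pd

def make_windows(df: pd.DataFrame, window: int = 50):
    # Each example is the last `window` rows of the series, labelled by the
    # is_split_point flag of the current row.
    feats = df[["pos_x", "pos_y", "pos_z",
                "force_x", "force_y", "force_z", "speed"]].to_numpy()
    labels = df["is_split_point"].to_numpy().astype(int)
    xs, ys = [], []
    for t in range(window, len(df)):
        xs.append(feats[t - window:t])     # shape (window, n_features)
        ys.append(labels[t])
    return np.stack(xs), np.array(ys)

def balance(x, y, rng=np.random.default_rng(0)):
    # Subsample the (many) negatives to match the (few) split points.
    pos = np.where(y == 1)[0]
    neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
    keep = np.concatenate([pos, neg])
    return x[keep], y[keep]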
All of the above are approaches I have experimented with and found useful.
I hope it helps. GOOD LUCK!

Training GAN on small dataset of images

I have created a DCGAN and already trained it on the CIFAR-10 dataset. Now I would like to train it on a custom dataset.
I have already gathered around 1200 images; it is practically impossible to gather more. What should I do?
We are going to post a paper in the coming week(s) about stochastic deconvolutions for the generator, which can improve stability and variety for such a problem. If you are interested, I can send you the current version of the paper right now. But generally speaking, the idea is simple:
Build a classic GAN
For deep layers of the generator (let's say for half of them), use stochastic deconvolutions (sdeconv)
sdeconv is just a normal deconv layer, but its filters are selected on the fly, randomly, from a bank of filters. So your filter bank shape can be, for instance, (16, 128, 3, 3), where 16 is the number of banks, 128 the number of filters in each, and 3x3 the kernel size. Your selection of a filter set at each training step is [random uniform 0-16, :, :, :]. Unselected filters remain untrained at that step. In TensorFlow you want to select different filter sets for different images in the batch, as TF keeps training variables even if they are not asked for (we believe it is a bug: TF uses the last known gradients for all variables even if they are not used in the current dynamic sub-graph, so you have to utilize as many variables as you can).
That's it. With 3 sdeconv layers of 16 sets in each bank, you practically have 16x16x16 = 4096 combinations of different internal routes to produce an output (a rough sketch of such a layer is shown below).
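As a rough TensorFlow sketch of the idea described above (not the paper's code): keep a bank of transposed-convolution filter sets and pick one at random on each forward pass. The shapes and names here are illustrative assumptions, and for brevity a single filter set is picked per step rather than per image in the batch.

import tensorflow as tf

num_banks, k, out_ch, in_ch = 16, 3, 128, 256
# Bank of filter sets; each set uses the conv2d_transpose layout (k, k, out_ch, in_ch).
filter_bank = tf.Variable(
    tf.random.normal([num_banks, k, k, out_ch, in_ch], stddev=0.02),
    name="sdeconv_filter_bank")

def sdeconv(x, stride=2):
    # Randomly select one filter set from the bank for this forward pass.
    idx = tf.random.uniform([], minval=0, maxval=num_banks, dtype=tf.int32)
    filters = filter_bank[idx]
    batch = tf.shape(x)[0]
    out_shape = tf.stack(
        [batch, tf.shape(x)[1] * stride, tf.shape(x)[2] * stride, out_ch])
    return tf.nn.conv2d_transpose(
        x, filters, out_shape, strides=[1, stride, stride, 1], padding="SAME")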
How does it help on a small dataset? Usually small datasets have relatively large "topic" variance, but the dataset is generally of one nature (photos of cats: all are realistic photos, but with different types of cats). On such datasets a GAN collapses very quickly; however, with sdeconv:
the upper, normal deconv layers learn how to reconstruct the style "realistic photo"
the lower sdeconv layers learn sub-distributions: "dark cat", "white cat", "red cat", and so on.
The model can be seen as an ensemble of weak generators; each sub-generator is weak and can collapse, but it will be "supported" by another sub-generator that temporarily outperforms the discriminator.
MNIST is a great example of such a dataset: high "topics" variance, but the same style of digits.
GAN + weight norm + PReLU (collapsed after 1000 steps, died after 2000, can only describe one "topic"):
GAN + weight norm + PReLU + sdeconv, 4388 steps (local variety degradation of sub-topics is visible, but it has not collapsed globally and the global visual variety is preserved):

Improving Article Classifier Accuracy

I've built an article classifier based on Wikipedia data that I fetch, which spans 5 total classifications.
They are:
Finance (15 articles) [1,0,0,0,0]
Sports (15 articles) [0,1,0,0,0]
Politics (15 articles) [0,0,1,0,0]
Science (15 articles) [0,0,0,1,0]
None (15 random articles not pertaining to the others) [0,0,0,0,1]
I went to wikipedia and grabbed about 15 pretty lengthy articles from each of these categories to build my corpus that I could use to train my network.
After building a lexicon of about 1000 words gathered from all of the articles, I converted each article to a word vector, along with the correct classifier label.
The word vector is a binary (multi-hot) array, while the label is a one-hot array.
For example, here is the representation of one article:
[
[0,0,0,1,0,0,0,1,0,0,... > 1000], [1,0,0,0,0] # this maps to Finance
]
So, in essence, I have this randomized list of word vectors mapped to their correct classifiers.
My network is a 3-layer deep neural net with 500 nodes in each layer. I train it for 30 epochs and then just display how accurate my model is at the end.
Right now, I'm getting about 53% to 55% accuracy. My question is, what can I do to get this up into the 90s? Is it even possible, or am I going to go crazy trying to train this thing?
Perhaps additionally, what is my main bottleneck so to speak?
Neural networks aren't really designed to run best on single machines; they work much better if you have a cluster, or at least a production-grade machine. It's very common to eliminate the "long tail" of a corpus: if a term appears only once in one document, you may want to eliminate it. You may also want to apply some stemming so that you don't capture multiple forms of the same word. I strongly advise you to try applying a TF-IDF transformation to your corpus before pruning.
Network size optimization is a field unto itself. Basically, you try adding more or fewer nodes and see where that gets you. See the following for a technical discussion.
https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
It is impossible to know without seeing the data.
Things to try:
Transform your word vectors to TF-IDF. Are you removing stop words? You can add bi-grams/tri-grams to your word vectors (see the sketch after this list).
Add more articles; it could be difficult to separate the classes in such a small corpus. The length of a specific document doesn't necessarily help; you want more articles.
30 epochs feels very low to me.
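To make the TF-IDF, stop-word, and n-gram suggestions concrete, here is a small scikit-learn sketch. The variables articles and labels are assumed to hold the raw Wikipedia texts and their category names from the question, and the linear model is just a cheap baseline to sanity-check the features before returning to the deep net.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

vectorizer = TfidfVectorizer(
    stop_words="english",   # drop common stop words
    ngram_range=(1, 2),     # unigrams and bigrams
    min_df=2,               # prune the "long tail" of one-off terms
    max_features=5000)
X = vectorizer.fit_transform(articles)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
print("cross-validated accuracy:", scores.mean())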

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning and just found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a useful form? I mean, how can I use it in practical life (an application)?
Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i-th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes each entire genre, and in this way reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people might like movies only in their preferred genres.
However, it's not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
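A toy sketch of that movie/genre reduction, with made-up data: the genre assignment genre_of_movie and the "likes at least half of the genre" rule are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(1000, 100))      # 1000 people x 100 movies
genre_of_movie = rng.integers(0, 5, size=100)     # each movie assigned one of 5 genres

reduced = np.zeros((1000, 5), dtype=int)
for g in range(5):
    in_genre = genre_of_movie == g
    # Mark genre g as "liked" if the person likes at least half of its movies.
    reduced[:, g] = (likes[:, in_genre].mean(axis=1) >= 0.5).astype(int)

print(likes.shape, "->", reduced.shape)            # (1000, 100) -> (1000, 5)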
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis (PCA), which does something similar (and, incidentally, plotting its results was my first real-world programming task).
It's a neat but clever technique that is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables: to analyse the relationship between these, one obviously plots in two dimensions and might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resulting multidimensional graph to find the set of two or three axes within the graph that contain the largest amount of information. For example, the first principal coordinate will be a composite axis (i.e. at some angle through n-dimensional space) that carries the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second-largest amount of information, and so on.
Plotting the resulting graph in 2D or 3D will typically give you a visualization of the data that contains a significant amount of the information in the original dataset. For the technique to be considered valid, it is usual to look for a representation that captures around 70% of the variance in the original data: enough to visualize, with some confidence, relationships that would otherwise not be apparent in the raw statistics. Notice that the technique requires that all factors have the same weight, but even with that caveat it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone).
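A small scikit-learn sketch of that idea: project many correlated measurements onto a few principal axes and check how much of the original variance those axes retain. The data here is synthetic and the variable names are made up for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 30))                # 200 samples, 30 measured factors
factors[:, :10] += 3 * rng.normal(size=(200, 1))    # inject some shared structure

pca = PCA(n_components=3)
coords = pca.fit_transform(factors)                 # coordinates on the first 3 principal axes
# Fraction of the original variance retained; the answer's rule of thumb is around 70%.
print(pca.explained_variance_ratio_.sum())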
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimensionality reduction algorithm.
Others include LDA, matrix-factorization-based methods, etc.
Here's a simple example. You have a lot of text files, and each file consists of some words. These files can be classified into two categories. You want to visualize each file as a point in 2D/3D space so that you can see the distribution clearly, so you need to do dimensionality reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y, and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring longitude, latitude, and height measurements to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get given a certain quantity of fuel, it would be far easier to work with the 1-dimensional data than with the 3-dimensional version.
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays), so any data with more than 3 dimensions needs to somehow be compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed, so we could say that the database is going to have many dimensions.
As a matter of fact, each database record will include a measure of the person's IQ and their shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes are easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data until later. We would still be able to estimate IQs using shoe sizes, because the two measures are correlated.
We would be using a very simple form of practical dimensionality reduction by leaving IQ out of the records initially. Principal component analysis, various forms of factor analysis, and other methods are extensions of this simple idea.
