How to determine the best Lambda to fill in missing values with external data - time-series

I have precipitation data from real weather stations. Unfortunately the data has a lot of missing values. I downloaded satellite data and want to use it to fill the missing values. My approach is to fill each missing station value with lambda times the corresponding satellite value.
How would I determine the best lambda? Preferably in R or Python. One way I can think of is fitting a linear regression model of the form station = lambda * satellite, with the intercept set to 0 (a rough sketch follows below).
Are there more sophisticated estimation methods that are still general (there might be some methods specifically designed for precipitation data, but I am more interested in the general mathematical approach)?
I have looked for papers on similar cases.
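To make the idea concrete, here is a minimal Python sketch of the zero-intercept fit (assuming station and satellite are aligned NumPy arrays of precipitation, with NaN marking the gaps; the variable names are just for illustration):

    import numpy as np

    # Use only the time steps where both the station and the satellite report a value.
    mask = ~np.isnan(station) & ~np.isnan(satellite)
    x, y = satellite[mask], station[mask]

    # Least-squares fit of y = lambda * x with no intercept has the closed form
    # lambda = sum(x*y) / sum(x*x).
    lam = np.sum(x * y) / np.sum(x * x)

    # Fill the gaps in the station series with scaled satellite values.
    filled = np.where(np.isnan(station), lam * satellite, station)

Whether that lambda is really "best" could then be judged, for example, by holding out some observed station values and comparing the filled-in estimates against them.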

Related

Deep learning - Find patterns combining images and bios data

I was wondering if it is possible to combine images and some "bios" data to find patterns. For example, say I want to know whether an image is a cat or a dog and I have:
Enough image data to train my model
Enough "bios" data like:
size of the animal
size of the tail
weight
height
Thanks!
Are you looking for a simple yes or no answer? In that case, yes. You are in complete control over building your models, which includes what data you make them process and what predictions you get.
If you actually wanted to ask how to do it, that will depend on the specific datasets and application, but one way would be to have two models: one specialized in determining the output label (cat or dog) from the image, so perhaps some kind of simple CNN, and another that processes the "bios" data and finds patterns in that. Then, at the end, you could either have a non-AI evaluator that naively combines these two predictions into one, or feed the outputs of both models into a simple neural network that learns patterns from them (a rough sketch of that second option follows).
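A very rough Keras sketch of that second option, purely for illustration (the image size, the number of "bios" features, and the layer sizes are placeholder assumptions, not values from the question):

    from tensorflow.keras import layers, Model

    # Branch 1: a small CNN that turns the image into a feature vector.
    image_in = layers.Input(shape=(64, 64, 3))            # placeholder image size
    x = layers.Conv2D(16, 3, activation="relu")(image_in)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # Branch 2: a dense layer over the tabular "bios" features
    # (size, tail length, weight, height -> 4 numeric inputs).
    bios_in = layers.Input(shape=(4,))
    t = layers.Dense(16, activation="relu")(bios_in)

    # Merge both branches and learn the final cat/dog decision on top.
    merged = layers.concatenate([x, t])
    out = layers.Dense(1, activation="sigmoid")(merged)

    model = Model(inputs=[image_in, bios_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit([images, bios], labels, ...)  # both inputs must be aligned per sample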
That is just one way to possibly do it though, and, as I said, the exact implementation will depend on a lot of other factors. How are both of the datasets labeled? Are the data connected to each other? Meaning: for each picture, do you have some textual data that is for that specific image, or do you just have a separate dataset of pictures and a separate dataset of biological information?
There is also the consideration you'll probably want to make about the necessity of this approach. Current models can predict categories from images with super-human precision. Unless this is an exercise in creating a more complex model, this seems like overkill.
PS: I wouldn't use the term "bios" in this context. I believe it is not a very common usage, and here on SO it will mostly confuse people into thinking you mean the actual BIOS.

Machine learning data modeling

I am a beginner in machine learning. I have seen videos that teach machine learning, but my question is: how can we model our data?
Mostly we get unstructured data. What is the best way to convert that unstructured data into a structured format, so that we can find the most useful information in the data?
Any help with books or links would be appreciated.
As a machine learning engineer, you will be responsible for preprocessing your data in a way that makes it acceptable to the model.
There is no single best way to do this; it depends on what type of data you have, such as CSV datasets, text datasets, or files (image and audio).
In the real world, not all data comes in a structured form. When we get the data, the very first things to find out are:
1. What the data is all about.
2. What its features are and what the output is. For example, in a dataset to predict a person's height you might have information such as country, weight, gender, hair color, etc.; these are what we usually call features in machine learning.
3. What form the features take, e.g. text or numerical. We need to pre-process the data before we do any analysis. For example, if a feature is review text, you need to remove all the special characters and build a corpus from your data.
4. How the model accepts the data and what parameters it has, and how we can improve the data (we can do some feature engineering to improve the models, etc.).
There is no hard and fast rule that you need to do it the same way.
First, you need to learn about preprocessing and feature extraction. If you build a model in Python, libraries like Pandas or scikit-learn are very useful. As a first step, try to create sentences like "when x occurs, then my output y becomes ...".
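As a small illustrative sketch of that kind of preprocessing with Pandas and scikit-learn (the file name and column names are invented for the height-prediction example above; your data will differ):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical CSV with mixed column types: numeric, categorical, and free text.
    df = pd.read_csv("people.csv")

    preprocess = ColumnTransformer([
        ("numeric", StandardScaler(), ["weight"]),                        # scale numbers
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["country", "gender"]),
        ("text", TfidfVectorizer(), "review"),                            # vectorize raw text
    ])

    X = preprocess.fit_transform(df)   # a structured numeric matrix, ready for a model
    y = df["height"]                   # the target column we want to predict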
Before modeling, the data has to be cleaned. There are several methods to clean the data. Go through this link on how to convert unstructured data to structured data:
https://www.geeksforgeeks.org/how-to-convert-unstructured-data-to-structured-data-using-python/

Do you have any suggestions for a Machine Learning method that may actually learn to distinguish these two classes?

I have a dataset in which the two classes overlap a lot. So far my results with an SVM are not good. Do you have any recommendations for a model that may be able to distinguish between these two classes?
[Scatter plot of both classes]
It is easy to fit the dataset by interpolating one of the classes and predicting the other class otherwise. The problem with this approach, though, is that it will not generalize well. The question you have to ask yourself is whether you can predict the class of a point given its attributes. If you cannot, then every ML algorithm will also fail to do so.
In that case, the only reasonable thing you can do is to collect more data and more attributes for every point. Maybe by adding a third dimension you can separate the data more easily.
If the data overlap so much, the points should all be of the same class, but we know they are not. So there must be some feature(s) or variable(s) that separate these data points into two classes. Try to add more features to the data.
And sometimes, just transforming the data into a different scale can help.
The two classes need not be equally distributed; a skewed class distribution can be handled separately.
First of all, what is your criterion for "good results"? What style of SVM did you use? A simple linear kernel will certainly fail for most concepts of "good", but a seriously convoluted Gaussian kernel might dredge something out of the handfuls of contiguous points in the upper regions of the plot.
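For example, a Gaussian (RBF) kernel SVM with a grid search over C and gamma could look like this in scikit-learn; this is only a sketch, assuming X and y hold the plotted points and their class labels:

    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV

    # C and gamma control how convoluted the RBF decision boundary is allowed to become.
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = GridSearchCV(
        pipe,
        param_grid={"svc__C": [0.1, 1, 10, 100],
                    "svc__gamma": [0.01, 0.1, 1, 10]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)   # cross-validated accuracy of the best fit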
I suggest that you run some basic statistics on the data you've presented, to see whether the classes are actually as separable as you'd want. I suggest a t-test for starters.
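A quick per-feature version of that check, assuming X is the feature matrix and y the class labels (the names are placeholders):

    from scipy import stats

    # Compare the two classes feature by feature; a small p-value suggests that
    # the feature on its own carries some separation between the classes.
    for i in range(X.shape[1]):
        t, p = stats.ttest_ind(X[y == 0, i], X[y == 1, i], equal_var=False)
        print(f"feature {i}: t = {t:.2f}, p = {p:.4f}")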
If you have other dimensions, I strongly recommend that you use them. Start with the greatest amount of input you can handle, and reduce from there (principal component analysis). Until we know the full shape and distribution of the data, there's not much hope of identifying a useful algorithm.
That said, I'll make a pre-emptive suggestion that you look into spectral clustering algorithms when you add the other dimensions. Some are good with density, some with connectivity, while others key on gaps.
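A minimal scikit-learn sketch of that direction, assuming X holds the points once the extra dimensions have been added:

    from sklearn.cluster import SpectralClustering
    from sklearn.preprocessing import StandardScaler

    X_scaled = StandardScaler().fit_transform(X)

    # A nearest-neighbour affinity keys on connectivity rather than raw distance,
    # which can pick up elongated or gap-separated structure.
    clusterer = SpectralClustering(n_clusters=2,
                                   affinity="nearest_neighbors",
                                   n_neighbors=10,
                                   assign_labels="kmeans",
                                   random_state=0)
    labels = clusterer.fit_predict(X_scaled)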

Detect hidden unknown patterns when visualization fails

I have a vast set of multi-dimensional, time-based data which I suspect contains patterns. I simplified the dataset to create a custom visualization.
Humans see patterns in the visualization, but the outcome of a pattern cannot be explained by the visualization. This is because of the simplification step: it hides data that is important.
I cannot put all my data in the visualization, because then humans can no longer see the possible patterns: too much data and too many dimensions would be visualized.
Is there a technique that can detect hidden, unknown patterns in a data set (without using visualization, and without me teaching the technique the patterns)?
One optional extra would be that the technique should somehow be able to "explain the patterns" to me, so that I can check whether they make sense.
[edit] I can give the technique a collection of small datasets (extracted from the big dataset; still very multi-dimensional) that I know contain patterns (from using my visualization). The technique then needs to analyze under what conditions a pattern produces result a or result b.
First off, how did you "simplify" the data? If you did it without any heuristics, you might go ahead and perform PCA. The very idea of PCA is to solve your problem: not losing "important" information while reducing dimensionality. You can visualize your principal components so that patterns can be detected by the human eye as well as by algorithms.
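A minimal PCA sketch with scikit-learn, assuming the dataset is already in a numeric array X with one row per time step:

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scale

    # Keep as many components as needed to explain 95% of the variance.
    pca = PCA(n_components=0.95, svd_solver="full")
    X_reduced = pca.fit_transform(X_scaled)

    print(pca.explained_variance_ratio_)              # variance kept by each component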
To your second question: yes, there are techniques that can detect hidden, unknown patterns in data. However, this is a huge field (machine learning), and which algorithm you'd use depends on your problem structure, so it is impossible to give a specific model name at this point. From what you have specified, neural networks in general seem fit for the job. After you have trained a network, you can visualize the activations or weights (e.g. as a Hinton diagram) to analyze which input data is treated "similarly".
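As one rough illustration of that idea (swapping in scikit-learn's MLPClassifier and a simple weight heatmap in place of a true Hinton diagram; X and y stand for the labeled small datasets mentioned in the edit):

    import matplotlib.pyplot as plt
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    X_scaled = StandardScaler().fit_transform(X)

    # y: the known "result a" / "result b" labels for the extracted small datasets.
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    net.fit(X_scaled, y)

    # coefs_[0] has shape (n_features, n_hidden): each column shows which input
    # dimensions a hidden unit weights most heavily.
    plt.imshow(net.coefs_[0], aspect="auto", cmap="coolwarm")
    plt.xlabel("hidden unit")
    plt.ylabel("input feature")
    plt.colorbar(label="weight")
    plt.show()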

How to include datetimes and other priority information for clustering?

I want to cluster text. I kinda understand the concept of clustering text-only content from Mahout in Action:
make a mapping (int -> term) of all terms in the input and store it in a dictionary
convert all input documents into normalized sparse vectors
do clustering
I want to cluster text as well as other information like date-time, location, people I was with. For example, I want documents made in a 10-day visit to a distant place to be placed into a distinct cluster.
I know I must write my own tool for making vectors from the date-time, location, tags and (natural) text. How do I approach this? Should I use built-in tools to vectorize the text and then integrate that output into my own vectors? What about weighting the dimensions?
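To sketch what I mean (in Python/scikit-learn terms rather than Mahout, just to make the idea concrete; the feature layout and weights are placeholders):

    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import StandardScaler

    # texts: one string per document
    # meta:  one row per document, e.g. [unix_timestamp, latitude, longitude]
    text_vecs = TfidfVectorizer().fit_transform(texts)       # sparse, L2-normalised rows
    meta_vecs = StandardScaler().fit_transform(meta)          # zero mean, unit variance

    # Explicit weights decide how strongly date/place dominate the distance measure.
    text_weight, meta_weight = 1.0, 3.0
    combined = hstack([text_weight * text_vecs,
                       csr_matrix(meta_weight * meta_vecs)]).tocsr()

    # 'combined' can now be fed to any clustering algorithm, e.g. KMeans.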
I can't give you full implementation details, as I'm not sure, but I can help you out with a piece of the puzzle. You will almost certainly need some context analysis to extract entities (such as locations, times/dates, and person names).
For this, take a look at OpenNLP:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
In particular, look at the POS tagger and the name finder.
Once you have extracted the relevant entities, you may be able to do something with them using Mahout classification (once you have extracted enough entities to train your model), but of this I am not sure.
good luck
