Mosek behavior: Which infeasibility certificate formula does Mosek use? - cvxpy

Mosek's infeasibility white paper mentions two formulae for extracting an infeasibility certificate, i.e.
Without tolerance:
With tolerance:
Which of these two formulae does Mosek actually use?
If it is the second one, how do we set the value of the ε tolerance?
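For the second part, here is a minimal sketch of how solver tolerances can be passed through cvxpy via the mosek_params argument. The specific parameter name below (MSK_DPAR_INTPNT_CO_TOL_INFEAS, the conic interior-point infeasibility tolerance) is an assumption about which tolerance is relevant; check the Mosek parameter reference for your problem class.

import cvxpy as cp

# Toy problem that is infeasible by construction.
x = cp.Variable()
prob = cp.Problem(cp.Minimize(x), [x >= 1, x <= 0])

# cvxpy forwards the entries of mosek_params directly to Mosek.
# Assumed parameter: the conic interior-point infeasibility tolerance.
prob.solve(solver=cp.MOSEK,
           mosek_params={"MSK_DPAR_INTPNT_CO_TOL_INFEAS": 1e-10})
print(prob.status)  # expected: 'infeasible'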

Related

Problems plotting time-series interactively with Altair

Description of the problem
My goal is quite basic: to plot time series in an interactive plot. After some research I decided to give a try to Altair.
There are already QGIS plugins for time-series visualisation but, as far as I'm aware, none for plotting time series at the vector level by interactively clicking on a map and selecting a polygon. That's why I decided to go for a self-made solution using Altair, maybe combining it with Folium to add functionality later on.
I'm totally new to the Altair library (as well as Vega and Vega-Lite), and quite new to data science and data visualisation as well... so apologies in advance for my ignorance!
There are already well-explained tutorials on how to plot time series with Altair (for example here, or on the official website). However, my study case has some particularities that, as far as I've seen, have not yet been addressed all together.
The data is produced using the Python API for Google Earth Engine and preprocessed with Python and the pandas/geopandas libraries:
In Google Earth Engine, a vegetation index (NDVI in the current case) is computed at pixel level for a certain region of interest (ROI). Then the function image.reduceRegions() is mapped across the ImageCollection to compute the mean NDVI in every polygon of a FeatureCollection element, whose features represent agricultural parcels. The resulting vector file is exported.
In a JupyterLab environment, the data is loaded into a geopandas GeoDataFrame object and preprocessed, transposing the DataFrame and creating a datetime column, among other steps, in order to have the data well shaped for time-series representation with Altair (a rough sketch of this step follows).
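Roughly, that reshaping step looks like the following sketch (the parcel IDs, dates, and values below are made up for illustration):

import pandas as pd

# Hypothetical export: one row per parcel, one column per image date.
raw = pd.DataFrame(
    {"2019-01-03": [0.41, 0.35], "2019-01-08": [None, 0.52]},
    index=["17811", "17812"],  # parcel IDs
)

# Transpose so each parcel becomes a column, then parse the former
# column labels into a proper datetime column.
df = raw.T.reset_index().rename(columns={"index": "date"})
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)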
Data overview after preprocessing:
My "final" goal would be to show, in the same graphic, an interactive line plot with a set of lines representing each one an agricultural parcel, with parcels categorized by crop types in different colours, e.g. corn in green, wheat in yellow, peer trees in brown... (the information containing the crop type of each parcel can be added to the DataFrame making a join with another DataFrame).
I am thinking of something looking more or less like the following example, with legend's years being the parcels coloured by crop types:
But so far I haven't managed to make my data look this way... at all.
As you can see, there are many nulls in the data (this is due to the application of a cloud-masking function and to the fact that there are several Sentinel-2 orbits intersecting the ROI). I would like to just omit the null values for each column/parcel, but I don't know if this data configuration can pose problems (any advice on that?).
So far I got:
The generation of the preceding graphic, for a single parcel, already takes around 23 seconds, which is something that maybe should/could be improved (how?).
And more importantly, the expected line representing the item/polygon/parcel's values (NDVI) is not even shown in the plot (note that I chose a parcel containing rather few non-null values).
For sure I am doing many things wrong. It would be great to get some advice on how to solve (some of) them.
Sample of the data and code to reproduce the issue
Here's a text sample of the data in JSON format, and the code used to reproduce the issue is the following:
import pandas as pd
import geopandas as gpd
import altair as alt

df = pd.read_json(r"path\to\json\file.json")
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)
df
Output:
lines = alt.Chart(df).mark_line().encode(
    x='date:O',
    y='17811:Q',
    color=alt.Color(
        '17811:Q', scale=alt.Scale(scheme='redyellowgreen', domain=(-1, 1)))
)
lines.properties(width=700, height=600).interactive()
Output:
Thanks in advance for your help!
If I understand correctly, it is mostly the format of your dataframe that needs to be changed from wide to long, which you can do either via .melt in pandas or .transform_fold in Altair. With melt, the default names for the melted columns are 'variable' (the previous column names) and 'value' (the value in each column):
alt.Chart(df.melt(id_vars='date'), width=500).mark_line().encode(
x='date:T',
y='value',
color=alt.Color('variable')
)
The gaps come from the NaNs; if you want Altair to interpolate across the missing values, you can drop the NaNs:
alt.Chart(df.melt(id_vars='date').dropna(), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
If you want to do it all in Altair, the following is equivalent to the last pandas example above (the transform uses 'key' instead of 'variable' as the name for the former columns). I also use an ordinal instead of a nominal type for the color encoding to show how to make the colors more similar to your example:
alt.Chart(df, width=500).mark_line().encode(
    x='date:T',
    y='value:Q',
    color=alt.Color('key:O')
).transform_fold(
    df.drop(columns='date').columns.tolist()
).transform_filter(
    'isValid(datum.value)'
)

General Linear Model - Repeated Measures with Covariates, Estimated Marginal Means are not adjusting? Bug?

I am running a Repeated Measures two-way ANCOVA. The model produces an Estimated Marginal Means table, but the values are exactly the same (to the hundredths decimal place) as the Means in the descriptive statistics, despite there being a note at the bottom of the EMM table indicating that "the covariates appearing in the model are evaluated at the following values:..."
Is this a bug, or could I be doing something wrong?
Update:
Responding to the question below: I used the drop-down menus to run the analysis; however, this is the syntax that is generated when I click 'Paste'.
DATASET ACTIVATE DataSet1.
GLM FT10 FT11 FT12 FT13 FT14 FT15 FT16 FT17 FT18 FT19 FT110 FT111 FT20 FT21 FT22 FT23 FT24 FT25 FT26 FT27 FT28 FT29 FT210 FT211 WITH SpatialScore FPSRTScore LDMean VGTotal
/WSFACTOR=Matching 2 Polynomial Trial 12 Polynomial
/METHOD=SSTYPE(3)
/PLOT=PROFILE(Trial*Matching) TYPE=LINE ERRORBAR=CI MEANREFERENCE=NO AXIS=AUTO
/EMMEANS=TABLES(OVERALL) WITH(SpatialScore=MEAN FPSRTScore=MEAN LDMean=MEAN VGTotal=MEAN)
/EMMEANS=TABLES(Matching) WITH(SpatialScore=MEAN FPSRTScore=MEAN LDMean=MEAN VGTotal=MEAN) COMPARE ADJ(BONFERRONI)
/EMMEANS=TABLES(Trial) WITH(SpatialScore=MEAN FPSRTScore=MEAN LDMean=MEAN VGTotal=MEAN) COMPARE ADJ(BONFERRONI)
/EMMEANS=TABLES(Matching*Trial) WITH(SpatialScore=MEAN FPSRTScore=MEAN LDMean=MEAN VGTotal=MEAN)
/PRINT=DESCRIPTIVE ETASQ
/CRITERIA=ALPHA(.05)
/WSDESIGN=Matching Trial Matching*Trial
/DESIGN=SpatialScore FPSRTScore LDMean VGTotal.
This is expected behavior. The reason that the EMMEANS don't differ from the observed means is that the covariate adjustment is done at the cell level in terms of between-subjects effects, and you have only one cell because you don't have any between-subjects factors.

Categorical PCA (CATPCA) in SPSS (23)

I am trying to conduct nonlinear principal component analysis using CATPCA in SPSS. I am following a tutorial (http://www.ncbi.nlm.nih.gov/pubmed/22176263) by Linting & Kooij (2012) and found that certain steps are not straightforward. For the time being, my questions are:
How do I get a scree plot within CATPCA? The authors describe it as a necessary step, but I can't seem to find it within the CATPCA drop-down menu.
Similarly, the tutorial describes the use of bootstrap confidence intervals to test the significance of the factor loadings, but the Bootstrap Confidence Ellipses option under the Save menu seems disabled (or I can't seem to activate it). What am I missing?
These are the most pressing questions that I encountered thus far. Thank you.
CATPCA does not produce a scree plot. You can create one manually by copying the eigenvalues out of the Model Summary table in the output, or (if you will need to create a lot of scree plots) you can use the SPSS Output Management System (OMS) to automate pulling the values out of the table and creating the plot.
In order to enable the Bootstrap Confidence Ellipses controls on the Save subdialog, you need to check "Perform bootstrapping" on the Bootstrap subdialog.
See the footnote in Linting & Kooij (2012, p. 20): "Eigenvalues are from the bottom row of the Correlations transformed variables table." You can create a scree plot from these eigenvalues.
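As a sketch, such a scree plot can be drawn with a few lines of Python once the eigenvalues have been copied out of the output (the values below are made up):

import matplotlib.pyplot as plt

# Hypothetical eigenvalues copied from the CATPCA output tables.
eigenvalues = [4.2, 2.1, 1.3, 0.8, 0.5, 0.3]

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, "o-")
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()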

How to find a function that fits a given data set?

The search algorithm is a breadth-first search. I'm not sure how to store terms from an equation in an open list. The function f(x) has the form ax^e1 + bx^e2 + cx^e3 + k, where a, b, c are coefficients and k is a constant. All exponents, coefficients, and constants are integers between 0 and 5.
The initial state of the problem-solving process should be any one of the terms ax^e1, bx^e2, cx^e3, k.
The algorithm gradually expands the number of terms at each level of the search.
I'm not sure how to add the terms to an equation from an open queue. That is the question.
The general problem that you are dealing with belongs to the area of regression analysis, and several techniques are available to find a function that fits a given data set, including the popular least squares method for finding the line of best fit (a brief starting point is the related page on Wikipedia, but if you want to deepen this topic, you should look at the research papers out there).
If you want to stick with the breadth-first search algorithm, although this kind of approach is not common for such a problem, first of all you need to define all the elements of a search problem (for more information, see Chapter 3 of Russell and Norvig, Artificial Intelligence: A Modern Approach):
Initial state: Some initial values for the different terms.
Actions: in your case it should be a change in the different terms. Note that you should discretize the changes in the values.
Transition function: function that determines the new states given a state and an action.
Goal test: a check to recognize whether a state is a goal state or not, and so to terminate the search. There are different ways to define this test in a regression problem. One way is to set a threshold for the sum of the squared errors.
Step cost: the cost of an action. In such an abstract problem, you can probably use a unit cost per action, i.e., the unweighted distance from the initial state in the search graph.
Note that you should carefully think about these elements, as, for example, they determine how efficient your search would be or whether you will have cycles in the search graph.
After you have defined all the elements of the search problem, you basically have to implement the following (a minimal sketch is given after the list):
Node, that contains information about the parent, the state, and the current cost;
Function to expand a given node that returns the successor nodes (according to the transition function, the actions, and the step cost);
Goal test;
The actual search algorithm. At the beginning, the queue holds only the node containing the initial state; afterwards, it is updated with the successor nodes.
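Here is a minimal Python sketch under those definitions. The state is the full tuple of term values rather than a single term, the action is incrementing one value, and the goal test compares the sum of squared errors against a threshold; parent pointers and step costs are omitted for brevity, and the data points and threshold are made up for illustration:

from collections import deque

# State: (a, e1, b, e2, c, e3, k) for f(x) = a*x**e1 + b*x**e2 + c*x**e3 + k,
# with every entry an integer in 0..5.
DATA = [(0, 1), (1, 3), (2, 5)]   # hypothetical (x, y) pairs to fit
THRESHOLD = 1e-9                  # goal test: sum of squared errors below this

def f(state, x):
    a, e1, b, e2, c, e3, k = state
    return a * x**e1 + b * x**e2 + c * x**e3 + k

def sse(state):
    return sum((f(state, x) - y) ** 2 for x, y in DATA)

def successors(state):
    # Actions: increment a single term value by 1 (a discretized change).
    for i in range(len(state)):
        if state[i] < 5:
            yield state[:i] + (state[i] + 1,) + state[i + 1:]

def bfs(start=(0, 0, 0, 0, 0, 0, 0)):
    frontier = deque([start])
    visited = {start}
    while frontier:
        state = frontier.popleft()
        if sse(state) <= THRESHOLD:   # goal test
            return state
        for nxt in successors(state):
            if nxt not in visited:    # avoid cycles in the search graph
                visited.add(nxt)
                frontier.append(nxt)
    return None

print(bfs())  # on the sample data, finds a state encoding f(x) = 2*x + 1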

How to encode dependency path as a feature for classification?

I am trying to implement relation extraction between verb pairs. I want to use dependency path from one verb to the other as a feature for my classifier (predicts if relation X exists or not). But I am not sure how to encode the dependency path as a feature. Following are some example dependency paths, as space separated relation annotations from StanfordCoreNLP Collapsed Dependencies:
nsubj acl nmod:from acl nmod:by conj:and
nsubj nmod:into
nsubj acl:relcl advmod nmod:of
It is important to keep in mind that these paths are of variable length and a relation could reappear without any restriction.
Two compromise ways of encoding this feature that come to my mind are:
1) Ignore the sequence, and just have one feature for each relation with its value being the number of times it appears in the path
2) Have a sliding window of length n, and have one feature for each possible pair of relations with the value being the number of times those two relations appeared consecutively. I suppose this is how one encodes n-grams. However, the number of possible relations is 50, which means I cannot really go with this approach.
Any suggestions are welcome.
We had a project that built a classifier based on dependency paths. I asked the group member who developed the system, and he said:
indicator feature for the whole path
So if you have the training data point (verb1 -e1-> w1 -e2-> w2 -e3-> w3 -e4-> verb2, relation1) the feature would be (e1-e2-e3-e4)
And he also did ngram sequences, so for that same data point, you would also have (e1), (e2), (e3), (e4), (e1-e2), (e2-e3), (e3-e4), (e1-e2-e3), (e2-e3-e4)
He also recommended collapsing appositive edges to make the paths smaller.
Also, I should note that he developed a set of high precision rules for each relation, and used this to create a large set of training data.
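Here is a minimal sketch of the whole-path indicator and n-gram features described above (the feature-name prefixes are arbitrary choices):

from collections import Counter

def path_features(path, max_n=3):
    # `path` is a list of relation labels, e.g. the first example above:
    # ['nsubj', 'acl', 'nmod:from', 'acl', 'nmod:by', 'conj:and'].
    feats = Counter()
    feats["path=" + "-".join(path)] += 1       # indicator for the whole path
    for n in range(1, max_n + 1):              # n-gram sequences
        for i in range(len(path) - n + 1):
            feats["ngram=" + "-".join(path[i:i + n])] += 1
    return feats

print(path_features("nsubj acl nmod:from acl nmod:by conj:and".split()))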
