Stata - Use gr_edit with name() - memory

I am getting into graphics in Stata. I have been using the gr_edit command to capture, as code, the changes I make in the graph editor. Something like this:
histogram Car, frequency
gr_edit .plotregion1.plot1.style.editstyle area(shadestyle(color(navy))) editcopy
gr_edit .plotregion1.plot1.style.editstyle area(linestyle(color(navy))) editcopy
gr_edit .plotregion1.plot1._set_type rbarm
graph export ...
I want to combine this histogram with another histogram. To do this, I want to store both of them in memory somehow. The way I have found to do this is to include the name() option in the initial graph call. But I think that if I do this, all of the gr_edit changes will be lost. Is there any way to use name() (or something like it) as a standalone command, rather than just as an option, in order to store the graph in memory?

Assuming that you actually need the gr_edit lines, this works:
sysuse sp500, clear
* first graph
histogram volume, frequency saving(first)
* second graph
histogram volume, frequency
gr_edit .plotregion1.plot1.style.editstyle area(shadestyle(color(navy))) editcopy
gr_edit .plotregion1.plot1.style.editstyle area(linestyle(color(navy))) editcopy
gr_edit .plotregion1.plot1._set_type rbarm
graph save second
* combined
graph combine first.gph second.gph
* erase original files
rm first.gph
rm second.gph
See also help tempfile.


Problems plotting time-series interactively with Altair

Description of the problem
My goal is quite basic: to plot time series in an interactive plot. After some research I decided to give a try to Altair.
There are already QGIS plugins for time-series visualisation but, as far as I'm aware, none for plotting time series at the vector level by interactively clicking on a map and selecting a polygon. So that's why I decided to go for a self-made solution using Altair, maybe combining it with Folium to add functionality later on.
I'm totally new to the Altair library (as well as Vega and Vega-Lite), and quite new to data science and data visualisation as well... so apologies in advance for my ignorance!
There are already well-explained tutorials on how to plot time series with Altair (for example here, or on the official website). However, my study case has some particularities that, as far as I've seen, have not yet been addressed together.
The data is produced using the Python API for Google Earth Engine and preprocessed with Python and the pandas/geopandas libraries:
In Google Earth Engine, a vegetation index (NDVI in the current case) is computed at pixel level for a certain region of interest (ROI). Then the function image.reduceRegions() is mapped across the ImageCollection to compute the mean NDVI in every polygon of a FeatureCollection element; the polygons represent agricultural parcels. The resulting vector file is exported.
Under a Jupyter-lab environment, the data is loaded into a geopandas GeoDataFrame object and preprocessed, transposing the DataFrame and creating a datetime column, among others, in order to have the data well-shaped for time-series representation with Altair.
Data overview after preprocessing:
My "final" goal would be to show, in the same graphic, an interactive line plot with a set of lines, each representing an agricultural parcel, with parcels categorized by crop type in different colours, e.g. corn in green, wheat in yellow, pear trees in brown... (the information on each parcel's crop type can be added to the DataFrame by joining with another DataFrame).
I am thinking of something that looks more or less like the following example, with the legend's years standing in for the parcels coloured by crop type:
But so far I haven't managed to make my data look this way... at all.
As you can see there are many nulls in the data (this is due to the application of a cloud-masking function and to the fact that several Sentinel-2 orbits intersect the ROI). I would like to simply omit the null values for each column/parcel, but I don't know whether this data configuration can pose problems (any advice on that?).
So far I got:
The generation of the preceding graphic, for a single parcel, already takes around 23 seconds, which is something that maybe should/could be improved (how?).
And more importantly, the expected line representing the item/polygon/parcel's values (NDVI) is not even shown in the plot (note that I chose a parcel containing rather few non-null values).
For sure I am doing many things wrong; it would be great to get some advice to solve (some of) them.
Sample of the data and code to reproduce the issue
Here's a text sample of the data in JSON format, and the code used to reproduce the issue is the following:
import pandas as pd
import geopandas as gpd
import altair as alt

df = pd.read_json(r"path\to\json\file.json")
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)
df
Output:
lines = alt.Chart(df).mark_line().encode(
    x='date:O',
    y='17811:Q',
    color=alt.Color(
        '17811:Q', scale=alt.Scale(scheme='redyellowgreen', domain=(-1, 1)))
)
lines.properties(width=700, height=600).interactive()
Output:
Thanks in advance for your help!
If I understand correctly, it is mostly the format of your dataframe that needs to be changed from wide to long, which you can do either via .melt in pandas or .transform_fold in Altair. With melt, the default names for the melted columns are 'variable' (the previous column names) and 'value' (the values from each column):
alt.Chart(df.melt(id_vars='date'), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
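If it helps to see what melt does on its own, here is a tiny synthetic frame (the column names are hypothetical stand-ins, not the question's data):

```python
import pandas as pd

# hypothetical wide-format frame: one column per parcel, NaN where cloud-masked
wide = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-11"]),
    "17811": [0.21, None],
    "17812": [0.55, 0.60],
})

# melt reshapes wide -> long; the defaults name the new columns
# 'variable' (the old column name) and 'value' (the old cell value)
long = wide.melt(id_vars="date")
print(long)
```

Each (date, parcel) pair becomes one row, which is the shape Altair's line mark expects.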
The gaps come from the NaNs; if you want Altair to interpolate across missing values, you can drop the NaNs:
alt.Chart(df.melt(id_vars='date').dropna(), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
If you want to do it all in Altair, the following is equivalent to the last pandas example above (the transform uses 'key' instead of 'variable' as the name for the former columns). I also use an ordinal instead of a nominal type for the color encoding, to show how to make the colors more similar to your example:
alt.Chart(df, width=500).mark_line().encode(
    x='date:T',
    y='value:Q',
    color=alt.Color('key:O')
).transform_fold(
    df.drop(columns='date').columns.tolist()
).transform_filter(
    'isValid(datum.value)'
)

Can Dask computational graphs keep intermediate data so re-compute is not necessary?

I am very impressed with Dask and I am trying to determine if it is the right tool for my problem. I am building a project for interactive data exploration where users can interactively change parameters of a figure. Sometimes these changes require re-computing the entire pipeline to make the graph (e.g. "show data from a different time interval"), but sometimes not. For instance, "change the smoothing parameter" should not require the system to reload the raw unsmoothed data, because the underlying data is the same; only the processing changes. The system should instead use the existing raw data that has already been loaded. I would like my system to be able to keep the intermediate data objects around and intelligently determine which tasks in the graph need to be re-run based on which parameters of the data visualization have changed. It looks like the caching system in Dask is close to what I need, but it was designed with a somewhat different use-case in mind. I see there is a persist method, but I'm not sure whether that would work either. Is there an easy way to accomplish this in Dask, or is there another project that would be more appropriate?
"change the smoothing parameter" should not require the system to reload the raw unsmoothed data
Two options:
The builtin functools.lru_cache will cache every unique input. Memory use is bounded by the maxsize parameter, which controls how many input/output pairs are stored.
Using persist in the right places will compute that object as mentioned at https://distributed.dask.org/en/latest/manage-computation.html#client-persist. It will not require re-running computation to get the object in later computation; functionally, it's the same as lru_cache.
For example, this code will read from disk twice:
>>> import dask.dataframe as dd
>>> df = dd.read_csv(...)
>>> # df = df.persist() # uncommenting this line → only read from disk once
>>> df[df.x > 0].mean().compute()
24.9
>>> df[df.y > 0].mean().compute()
0.1
Uncommenting that line means this code only reads from disk once, because the task graph for the CSV is computed once and the value is stored in memory. For your application, it sounds like persist should be used judiciously: https://docs.dask.org/en/latest/best-practices.html#persist-when-you-can
What if two smoothing parameters need to be visualized? In that case, I'd avoid calling compute repeatedly: https://docs.dask.org/en/latest/best-practices.html#avoid-calling-compute-repeatedly
lower, upper = dask.compute(df.x.min(), df.x.max())
This will share the task graph for min and max so unnecessary computation is not performed.
I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed.
Dask has a smart opportunistic caching ability: https://docs.dask.org/en/latest/caching.html#automatic-opportunistic-caching. Part of the documentation says:
Another approach is to watch all intermediate computations, and guess which ones might be valuable to keep for the future. Dask has an opportunistic caching mechanism that stores intermediate tasks that show the following characteristics:
Expensive to compute
Cheap to store
Frequently used
I think this is what you're looking for; it'll store values depending on those attributes.
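As a minimal sketch of the functools.lru_cache option mentioned earlier (the loader, its path argument, and the smoothing step are hypothetical stand-ins for an expensive pipeline stage, not the asker's actual code):

```python
from functools import lru_cache

load_count = 0  # track how often the expensive stage actually runs


@lru_cache(maxsize=8)
def load_raw(path):
    # stand-in for a slow disk/network load; cached per unique path
    global load_count
    load_count += 1
    return (1.0, 2.0, 3.0)


def smooth(data, window):
    # changing 'window' reuses the cached raw data instead of reloading it
    return [x / window for x in data]


a = smooth(load_raw("raw.csv"), window=2)
b = smooth(load_raw("raw.csv"), window=4)  # second call hits the cache
```

Changing the smoothing parameter here never re-triggers the load, which is exactly the "only the processing changes" behaviour the question asks for. Note that lru_cache requires hashable arguments and keeps whole return values in memory.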

Scale-building in SPSS

I'm using Cronbach's alpha to analyze data in order to build/refine a scale. This is a tedious process in SPSS, since it doesn't automatically optimize the scale, so I'm hoping there is a way to use syntax to speed it up.
So I start with a set of items, set up the OMS control panel to capture the item-total statistics table, and then run the alpha analysis. This pushes the item-total stats into a new dataset. Then I check the alpha value and use it in syntax to screen out items whose alpha-if-item-deleted value is greater.
Then I re-run the analysis with only the items that passed the screening, and I repeat until all the items pass the screening. Here is the syntax:
* First syntax sets up OMS, and then runs the alpha analysis.
* In the reliability syntax, I have to manually add the variables and the Scale name.
* OMS.
DATASET DECLARE alpha_worksheet.
OMS
/SELECT TABLES
/IF COMMANDS=['Reliability'] SUBTYPES=['Item Total Statistics']
/DESTINATION FORMAT=SAV NUMBERED=TableNumber_
OUTFILE='alpha_worksheet' VIEWER=YES.
RELIABILITY
/VARIABLES=
points_18618
points_18618
points_3286
points_3290
points_3583
points_4018
points_7775
points_7789
points_7792
points_18631
points_18652
/SCALE('2017 Fall CRN 4157 Exam 01 v. 1.0') ALL
/MODEL=ALPHA
/SUMMARY=TOTAL.
* Second syntax identifies any variables in the OMS dataset that are LTE the alpha value.
* I have to manually enter the alpha value...
DATASET ACTIVATE alpha_worksheet.
IF (CronbachsAlphaifItemDeleted <= .694) Keep =1.
EXECUTE.
SORT CASES BY Keep(D).
Ideally, instead of having to repeat this process over and over, I'd like syntax that would automate this process.
Hope that makes sense; if you have a solution, thanks in advance (this has been bugging me for years!). Cheers
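For reference, the screening loop being described can be sketched outside SPSS. This is a hypothetical Python version of the same rule (not SPSS syntax): compute alpha, drop items whose alpha-if-item-deleted exceeds it, and repeat until nothing changes.

```python
def cronbach_alpha(items):
    # items: list of item columns, each a list of scores per respondent
    k = len(items)
    n = len(items[0])

    def var(xs):
        # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(col) for col in items) / var(totals))


def screen(items):
    # iteratively drop items whose alpha-if-deleted exceeds the current
    # alpha; the small tolerance guards against floating-point ties
    while len(items) > 2:
        alpha = cronbach_alpha(items)
        kept = [col for j, col in enumerate(items)
                if cronbach_alpha(items[:j] + items[j + 1:]) <= alpha + 1e-12]
        if len(kept) == len(items):
            break
        items = kept
    return items
```

This mirrors the manual OMS workflow: cronbach_alpha plays the role of RELIABILITY, and the <= comparison matches the IF (CronbachsAlphaifItemDeleted <= ...) screening line above.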

Does autodiff in tensorflow work if I have a for loop involved in constructing my graph?

I have a situation where I have a batch of images, and in each image I have to perform some operation over a tiny patch of that image. Now the problem is that the patch size varies from image to image within the batch, so this implies that I cannot vectorize it. I could vectorize by considering the entire range of pixels in an image, but my patch is really a small fraction of each image, and I don't want to waste memory by performing the operation and storing the results for all the pixels in every image.
So in short, I need to use a loop. Now, I see that TensorFlow has only a while loop defined and no for loops. So my question is: if I use a plain Python-style for loop to perform operations over my tensor, will autodiff fail to calculate the gradients in my graph?
TensorFlow does not know (and thus does not care) how the graph has been constructed; you could even write each node by hand, as long as you use the proper functions to do so. So in particular, a Python for loop has nothing to do with TF. A TF while loop, on the other hand, gives you the ability to express dynamic computations inside the graph, so if you want to process data in a sequence and only keep the current element in memory, only a while loop can achieve that. If you create a huge graph by hand (through a Python loop), it will always be executed in full, and everything will be stored in memory; as long as this fits on your machine, you should be fine. Another issue is dynamic length: if you sometimes need to run a loop 10 times and sometimes 1000, you have to use tf.while_loop; you cannot do this with a Python for loop (unless you create separate graphs for each possible length).

setting threshold and batch processing in ImageJ (FIJI) macro

I know this has been posted elsewhere and that this is by no means a difficult problem, but I'm very new to writing macros in FIJI and am having a hard time even understanding the solutions described in various online resources.
I have a series of images all in the same folder and want to apply the same operations to them all, then save the resulting Excel files and images in an output folder. Specifically, I'd like to open each image, smooth it, do a max-intensity Z projection, then threshold the images to the same relative value.
This thresholding is the one step causing a problem. By relative value I mean that I would like to set the threshold so that the same percentage of the intensity histogram is included. Currently, in FIJI, if you go to Image > Adjust > Threshold you can move the sliders so that a certain percentage of the image is thresholded, and it will display that value for you in the open window. In my case 98% is what I am trying to achieve, i.e. thresholding all but the top 2% of the data.
Once the threshold is applied to the MIP, I convert it to binary and do particle analysis, then save the results (summary table, results, image overlay).
My approach has been to try to automate all the steps / do batch processing, but I have been having a hard time adapting what I have written based on instructions found online. Instead I've been just opening every image in the directory one by one, applying the macro that I wrote, and then saving the results manually. Obviously this is a tedious approach, so any help would be much appreciated!
What I have been using for my simple macro:
run("Smooth", "stack");
run("Z Project...", "projection=[Max Intensity]");
setAutoThreshold("Default");
//run("Threshold...");
run("Convert to Mask");
run("Make Binary");
run("Analyze Particles...", " show=[Overlay Masks] display exclude clear include summarize in_situ");
You can use the Process ▶ Batch ▶ Macro... command for this.
For further details, see the Batch Processing page of the ImageJ wiki.
