Assume I have the following data
DATA LIST FREE / sex (A) year.
BEGIN DATA
m 2011
m 2011
m 2012
f 2011
f 2011
f 2011
f 2011
f 2012
f 2012
END DATA.
How can I plot a line showing how the proportions of males and females change over the years?
Not the absolute values and not the total percentages, but the percentages per year.
I also need a crosstab where the percentages per year are shown.
A syntax would be nice, thank you.
The CROSSTABS syntax would simply be CROSSTABS /TABLES=year BY sex /CELLS=COUNT ROW. With year in the rows, the row percentages are the percentages within each year. You can build the graph you want through the GUI; to use the summary functions per year, though, you need to specify the year variable as either ordinal or nominal.
Here is the GGRAPH code the GUI printed out for me. Clean up as needed.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=year[LEVEL=ORDINAL] COUNT()[name="COUNT"] sex
MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: year=col(source(s), name("year"), unit.category())
DATA: COUNT=col(source(s), name("COUNT"))
DATA: sex=col(source(s), name("sex"), unit.category())
GUIDE: axis(dim(1), label("year"))
GUIDE: axis(dim(2), label("Percent"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label("sex"))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(summary.percent(year*COUNT, base.coordinate(dim(1)))),
color.interior(sex), missing.wings())
END GPL.
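As a check on what the chart and the crosstab should display, the per-year percentages for the nine sample cases can be worked out by hand. The following is a minimal Python sketch, independent of SPSS; the names rows and percent_by_year are made up for the illustration.

```python
from collections import Counter

# The nine cases from the DATA LIST example above, as (sex, year) pairs.
rows = [("m", 2011), ("m", 2011), ("m", 2012),
        ("f", 2011), ("f", 2011), ("f", 2011), ("f", 2011),
        ("f", 2012), ("f", 2012)]

def percent_by_year(rows):
    """Return {year: {sex: percent}}; percentages sum to 100 within each year."""
    cell_counts = Counter((year, sex) for sex, year in rows)
    year_totals = Counter(year for _, year in rows)
    result = {}
    for (year, sex), n in cell_counts.items():
        result.setdefault(year, {})[sex] = 100.0 * n / year_totals[year]
    return result

table = percent_by_year(rows)
for year in sorted(table):
    for sex in sorted(table[year]):
        print(year, sex, round(table[year][sex], 1))
```

In this sample both years come out to about one third male and two thirds female, so the two lines in the chart are flat.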
I want to generate boxplots for all my variables (90 in total).
This is the syntax I would use for one variable:
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=AnScol MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: AnScol =col(source(s), name("AnScol"))
DATA: id=col(source(s), name("$CASENUM"), unit.category())
COORD: rect(dim(1), transpose())
GUIDE: axis(dim(1), label("AnScol"))
ELEMENT: schema(position(bin.quantile.letter(AnScol)), label(id))
END GPL.
How can I do that for all my variables without changing each variable one by one?
Thank you in advance!
Maxime M.
Here I will illustrate a few different ways. First, let's make some fake data: ten numeric variables and one string variable that serves as the ID for the dataset.
*Make fake data.
MATRIX.
SAVE {UNIFORM(100,10)} /VARS = V1 TO V10 /OUTFILE = *.
END MATRIX.
DATASET NAME Sim.
COMPUTE MyId = $casenum.
FORMATS MyId (F3.0).
ALTER TYPE MyId (A3).
A simple solution is to use EXAMINE to draw the box plots for all the variables in one chart. The same chart is available through the legacy dialog graphs.
*If the variables all have the same scale.
EXAMINE VARIABLES=V1 TO V10
/COMPARE VARIABLE
/PLOT=BOXPLOT
/STATISTICS=NONE
/NOTOTAL
/ID=MyId
/MISSING=LISTWISE.
This works fine since all the variables were constructed to be on the same scale. It has no outliers - but if there were they would be labelled with the MyId variable in the above plot.
You can also use GGRAPH to accomplish something very similar. Here I add an outlier, though, and in this GGRAPH code you cannot easily label outliers with an ID variable.
*Make one variable not on the same scale and have outliers.
COMPUTE V1 = V1*100.
IF (LTRIM(MyId) = "5") V1 = 250.
EXECUTE.
*Similar chart with GGRAPH - cannot label outliers though.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=V1 TO V10
TRANSFORM=VARSTOCASES(SUMMARY="V" INDEX="Vars")
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Vars=col(source(s), name("Vars"), unit.category())
DATA: V=col(source(s), name("V"))
ELEMENT: schema(position(bin.quantile.letter(Vars*V)))
END GPL.
Because V1 is now on a different scale, you can't effectively visualize the other variables in the same plot. This is what will happen in realistic datasets. To make individual plots, you can take eli-k's advice and use Python to submit a separate plot for each variable. Here is an example of that.
*If they don't - and you want different scales, Python programmability can do that.
BEGIN PROGRAM Python.
import spss, string
#get the variable list and the variable type
varList = [(spss.GetVariableName(i),spss.GetVariableType(i)) for i in range(spss.GetVariableCount())]
#make a template to submit boxplot, see https://andrewpwheeler.wordpress.com/2015/02/22/string-substitution-in-python-continued/
c = string.Template("""*Boxplots.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=$var MyId MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: V=col(source(s), name("$var"))
DATA: MyId=col(source(s), name("MyId"), unit.category())
COORD: rect(dim(1), transpose())
GUIDE: axis(dim(1), label("$var"))
ELEMENT: schema(position(bin.quantile.letter(V)), label(MyId))
END GPL.
""")
#loop over the varlist and plot if numeric
for i in varList:
    if i[1] == 0:
        spss.Submit(c.substitute(var=i[0]))
END PROGRAM.
And now you can see that each variable gets its own box plot to check quickly (and has the ID labeled).
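The substitution step can be tried outside SPSS. The sketch below is plain Python with no spss module; the helper build_syntax and the shortened template are hypothetical stand-ins for the full GGRAPH template above, and the variable list is hard-coded rather than read from the active dataset.

```python
from string import Template

# Shortened stand-in for the GGRAPH template; $var is the placeholder.
BOXPLOT = Template(
    'GGRAPH\n'
    '  /GRAPHDATASET NAME="graphdataset" VARIABLES=$var MyId\n'
    '  /GRAPHSPEC SOURCE=INLINE.\n'
    'BEGIN GPL\n'
    '  SOURCE: s=userSource(id("graphdataset"))\n'
    '  DATA: V=col(source(s), name("$var"))\n'
    'END GPL.\n'
)

def build_syntax(variables):
    """Return one syntax string per numeric variable.

    variables is a list of (name, type) pairs, where type 0 means numeric,
    mirroring what spss.GetVariableType returns in the program above.
    """
    return [BOXPLOT.substitute(var=name) for name, vtype in variables if vtype == 0]

# Hypothetical variable list: two numeric variables and one 3-character string.
commands = build_syntax([("V1", 0), ("V2", 0), ("MyId", 3)])
print(len(commands))
print(commands[0])
```

Printing the generated strings before passing them to spss.Submit is a cheap way to check the template when a plot fails to render.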
Finally, a way to accomplish something similar is to reshape all the numeric variables into one column and then use SPLIT FILE.
*Varstocases and split file - need to know which variables are numeric to begin with.
VARSTOCASES /MAKE V FROM V1 TO V10 /INDEX VOrig.
SORT CASES BY VOrig.
SPLIT FILE BY VOrig.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=V MyId MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: V=col(source(s), name("V"))
DATA: MyId=col(source(s), name("MyId"), unit.category())
COORD: rect(dim(1), transpose())
GUIDE: axis(dim(1), label("V"))
ELEMENT: schema(position(bin.quantile.letter(V)), label(MyId))
END GPL.
SPLIT FILE OFF.
I am trying to draw a boxplot using only the Q1, Q3, Max, Min and Mean values, as I don't have the whole data. Can anyone help me with that?
Thanks
Well, it is not a box plot anymore (the whiskers in a traditional box plot are not set to the minimum and maximum values), so you want to be very clear in the notes about what this chart is showing. But given that information, one can build a similar-looking chart by superimposing the various elements. Example below:
DATA LIST FREE / Id Min Q1 Mean Q3 Max.
BEGIN DATA
1 1 2 3 4 5
2 1 3 5 7 9
3 1 5 8 8 10
END DATA.
FORMATS All (F2.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Id Min Q1 Mean Q3 Max
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Id=col(source(s), name("Id"), unit.category())
DATA: Min=col(source(s), name("Min"))
DATA: Q1=col(source(s), name("Q1"))
DATA: Mean=col(source(s), name("Mean"))
DATA: Q3=col(source(s), name("Q3"))
DATA: Max=col(source(s), name("Max"))
GUIDE: axis(dim(1), label("Id"))
GUIDE: axis(dim(2), label("Variable"))
ELEMENT: edge(position(Id*(Min+Max)))
ELEMENT: bar(position(region.spread.range(Id*(Q1+Q3))))
ELEMENT: point(position(Id*Mean), color.interior(color.grey), size(size."12"))
END GPL.
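To see why a chart drawn from the minimum and maximum is not a traditional box plot, the sketch below compares the two on a small made-up sample. It is plain Python and assumes the common Tukey convention (whiskers end at the most extreme observations within 1.5 times the interquartile range of the quartiles); SPSS computes its hinges somewhat differently, so treat the numbers as illustrative only.

```python
from statistics import quantiles

# Made-up sample with one large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences.
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("whiskers:", lower_whisker, upper_whisker)
print("min/max: ", min(data), max(data))
print("outliers:", outliers)
```

Here the upper whisker ends at 9 while the maximum is 30, which is the kind of difference the chart notes should spell out.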
Let's say you want to create a line graph which plots a line for the amount of money coming in, and a line for the amount of money going out.
The values for money coming in (moneyIn) are positive, like '30,000', but the values for money being spent (moneyOut) are negative, like '-19,000'.
When I use a line graph to plot these results against each other over time, one line is plotted way below in the negative numbers and the other way above in the positive numbers, so they're difficult to compare against one another.
Is there a way to change the negative values into positive ones JUST for the line graph, without computing a new variable or changing the database? I think it would essentially be a sum of (moneyOut*-1), but I don't know if this can be implemented JUST for the chart?
You can use the TRANS statement in inline GPL code to flip the sign. Example below.
DATA LIST FREE / In Out (2F5.0) Time (F1.0).
BEGIN DATA
1000 -1500 1
2000 -2500 2
3000 -3500 3
4000 -4500 4
END DATA.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Time In Out
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Time=col(source(s), name("Time"), unit.category())
DATA: In=col(source(s), name("In"))
DATA: Out=col(source(s), name("Out"))
TRANS: OutPos = eval(Out*-1)
GUIDE: axis(dim(1), label("Time"))
GUIDE: axis(dim(2), label("Values"))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(Time*In))
ELEMENT: line(position(Time*OutPos), color(color.blue))
END GPL.
Some background first:
I have got my data structured in SPSS in the following way.
I've got 20 variables (case_number, a_1, b_1, c_1, a_2, b_2, c_2, ....)
The variables are named this way because I took repeated measures (at different points of time, here named 1 and 2) with different devices (named a, b and c). All devices are supposed to measure the same thing.
What I want to do now is create a scatter plot for all devices and all points of time, e.g. I would like to have device a on the x-axis and devices b and c on the y-axis and then plot
(a_1, b_1)
(a_1, c_1)
(a_2, b_2)
(a_2, c_2)
and so on.
I would like all points that use device b on the y-axis to have the same color (e.g. green), points using device c should have another color (e.g. red).
I do NOT want to use different colors for different points of time, so both (a_1, b_1) and (a_2, b_2) should be green.
Your particular example is easier to construct if you have the data in long format as opposed to wide format. Below is an example.
*Make some fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP ID = 1 TO 50.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Sim.
VECTOR a_(3).
VECTOR b_(3).
VECTOR c_(3).
DO REPEAT v = a_1 TO c_3.
COMPUTE v = RV.NORMAL(0,1).
END REPEAT.
EXECUTE.
*Reshape from wide to long.
VARSTOCASES
/MAKE a FROM a_1 TO a_3
/MAKE b FROM b_1 TO b_3
/MAKE c FROM c_1 TO c_3
/INDEX Time.
FORMATS a b c Time (F2.0).
*Now make scatterplot.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=a b c
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: a=col(source(s), name("a"))
DATA: b=col(source(s), name("b"))
DATA: c=col(source(s), name("c"))
GUIDE: axis(dim(1), label("a"))
GUIDE: axis(dim(2), label("b and c"))
ELEMENT: point(position(a*b), color.interior(color.green))
ELEMENT: point(position(a*c), color.interior(color.red))
END GPL.
This produces the plot I believe you asked for.
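For readers unfamiliar with VARSTOCASES, the reshape step can be pictured as follows. This is a plain Python sketch of the idea, not SPSS code; the function wide_to_long and the sample record are invented for the illustration.

```python
def wide_to_long(record, devices=("a", "b", "c"), times=(1, 2, 3)):
    """Turn one wide record (a_1, b_1, ..., c_3) into one row per time point."""
    long_rows = []
    for t in times:
        row = {"ID": record["ID"], "Time": t}
        for d in devices:
            row[d] = record[f"{d}_{t}"]
        long_rows.append(row)
    return long_rows

# Invented wide record for a single case.
wide = {"ID": 1,
        "a_1": 0.1, "b_1": 0.2, "c_1": 0.3,
        "a_2": 1.1, "b_2": 1.2, "c_2": 1.3,
        "a_3": 2.1, "b_3": 2.2, "c_3": 2.3}

for row in wide_to_long(wide):
    print(row)
```

Each case turns into three rows, one per time point, so a single pair of columns (a, b) or (a, c) covers all the time points at once.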
In long format you have several other simple options as well, like constructing small multiples for each time period or using different symbols for each time period.
*Small multiple graphs.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=a b c Time
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: a=col(source(s), name("a"))
DATA: b=col(source(s), name("b"))
DATA: c=col(source(s), name("c"))
DATA: Time=col(source(s), name("Time"), unit.category())
COORD: rect(dim(1,2))
GUIDE: axis(dim(1), label("a"))
GUIDE: axis(dim(2), label("b and c"))
GUIDE: axis(dim(3), opposite())
ELEMENT: point(position(a*b*Time), color.interior(color.green))
ELEMENT: point(position(a*c*Time), color.interior(color.red))
END GPL.
*Different shapes for different time periods.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=a b c Time ID
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: a=col(source(s), name("a"))
DATA: b=col(source(s), name("b"))
DATA: c=col(source(s), name("c"))
DATA: Time=col(source(s), name("Time"), unit.category())
DATA: ID=col(source(s), name("ID"), unit.category())
COORD: rect(dim(1,2))
GUIDE: axis(dim(1), label("a"))
GUIDE: axis(dim(2), label("b and c"))
GUIDE: axis(dim(3), opposite())
ELEMENT: point(position(a*b), color.interior(color.green), shape(Time))
ELEMENT: point(position(a*c), color.interior(color.red), shape(Time))
END GPL.
Another option is to draw the traces of each individual. In this sample the data are quite disorderly, so the traces are not very informative, but most time series data will show smoother trends. Here is an example that draws the traces for the first 5 observations, each in its own small multiple. (See here for some discussion on these diagrams and nice examples.)
*Path traces.
TEMPORARY.
SELECT IF ID <= 5.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=a b c Time ID
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: a=col(source(s), name("a"))
DATA: b=col(source(s), name("b"))
DATA: c=col(source(s), name("c"))
DATA: Time=col(source(s), name("Time"), unit.category())
DATA: ID=col(source(s), name("ID"), unit.category())
COORD: rect(dim(1,2), wrap())
GUIDE: axis(dim(1), label("a"))
GUIDE: axis(dim(2), label("b and c"))
GUIDE: axis(dim(3), opposite())
ELEMENT: point(position(a*b*ID), color.interior(color.green), shape(Time))
ELEMENT: point(position(a*c*ID), color.interior(color.red), shape(Time))
ELEMENT: path(position(a*b*ID))
ELEMENT: path(position(a*c*ID))
END GPL.
EXECUTE.
The updated code in the comment meant to generate a legend works fine for me, with the exception of the inline template (which might conflict with my personal chart template). If you want to add a regression line to the plot see the smooth.linear function in the GPL reference guide.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=a b c
/GRAPHSPEC SOURCE=INLINE INLINETEMPLATE=["<addFitline type='linear' target='pair'/>"].
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: a=col(source(s), name("a"))
DATA: b=col(source(s), name("b"))
DATA: c=col(source(s), name("c"))
GUIDE: axis(dim(1), label("a"))
GUIDE: axis(dim(2), label("b and c"))
SCALE: cat(aesthetic(aesthetic.color.interior), map(("b", color.green), ("c", color.blue)))
ELEMENT: point(position(a*b), color.interior("b"))
ELEMENT: point(position(a*c), color.interior("c"))
END GPL.
I am trying to create a frequency chart to show the percentages from a multi-response set BDecideCX1 to BDecideCX9, but the percentages are based on the total number of codes rather than the total number of cases. I've tried using a BASE command to repercentage on a different base on the ELEMENT below, but with no success. Any help much appreciated.
TEMP.
SELECT IF BDecideCX1>=0.
MRSETS
/MDGROUP NAME=$temp
VARIABLES = BDecideCX1 to BDecideCX9
VALUE=1
LABEL="Who was involved in deciding how to spend the PE and sport premium?".
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=$temp RESPONSES() [NAME="RESPONSES"]
MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE TEMPLATE=["TCharts\FreqPurple.SGT"].
BEGIN GPL
SOURCE: s = userSource(id("graphdataset"))
DATA: temp=col(source(s), name("$temp"), unit.category())
DATA: responses=col(source(s), name("RESPONSES"))
SCALE: linear(dim(2), include(0))
GUIDE: text.title(label("Who was involved in deciding how to spend the PE and sport premium? FREQUENCY"))
GUIDE: axis(dim(2), label("%"))
GUIDE: axis(dim(1), label("Who was involved in deciding how to spend the PE and sport premium?"))
ELEMENT: interval(position(summary.percent(temp*responses)))
END GPL.
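The distinction in the question can be made concrete with a small calculation. The sketch below is plain Python on invented data (three respondents, three dichotomy items); it does not show how to change the base in GPL, only what the two bases mean.

```python
# Each row is one respondent; 1 means the option was ticked. Invented data.
cases = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]

n_cases = len(cases)
counts = [sum(col) for col in zip(*cases)]  # ticks per option
n_responses = sum(counts)                   # all ticks across all options

pct_of_responses = [100.0 * c / n_responses for c in counts]
pct_of_cases = [100.0 * c / n_cases for c in counts]

print(counts)
print([round(p, 1) for p in pct_of_responses])
print([round(p, 1) for p in pct_of_cases])
```

Percentages of responses always sum to 100, while percentages of cases can sum to more than 100 because one respondent may tick several options.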