Large file merging problems in SPSS

I have a large dataset of over 4000 cases with over 500 variables. I want to add this set of variables to another dataset containing most of the same cases but only around 10 variables.
Both of the datasets contain an ID variable that allows me to match the cases. The larger dataset is a keyed table because there are cases in there that aren't in the smaller set and are therefore of no interest to me.
I'm very comfortable with merging files, but my problem arises when I look at the new dataset. The variables are there, but all their values turn up missing. This only applies to the variables that were added to the active dataset. I checked whether the key variable had any duplicates, and it didn't.
Why does this happen, and is there a way to fix it?
I should add that I have done this many times before without this problem.
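For reference, the kind of keyed-table merge described usually looks like this in syntax (a minimal sketch; the dataset names SmallSet and LargeSet are placeholders, and both files must be sorted by the key):
DATASET ACTIVATE SmallSet.
SORT CASES BY ID.
DATASET ACTIVATE LargeSet.
SORT CASES BY ID.
DATASET ACTIVATE SmallSet.
MATCH FILES /FILE=* /TABLE=LargeSet /BY ID.
EXECUTE.
If this is already the form you ran, one thing worth checking is whether the ID values actually match between the files, e.g. string IDs that differ in case or padding, or codes stored with leading zeros in one file only; keys that never match leave the added variables system-missing.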

Related

Performance difference between measurement name and tags

I have an IoT application where all data comes from different sensors with a standard payload; all that changes is the variable ID, a four-digit hex string.
I currently use something like data.varID as my measurement name. The varID is also a tag, even if redundant. But this is somewhat inconvenient, as sometimes I want to be able to easily query data across more than one varID.
I have tried to find the answer to this question but cannot find it: what's the difference between
having lots of data.varID measurements
or
having a single data measurement with varID as a tag?
As I understand it, both would be equivalent in terms of the number of time series in the database, so is there any other consideration?
The types of queries I usually need are simple:
SELECT "value" FROM "db1"."autogen"."data.org1.global.5051" WHERE time > now() - 24h AND ("device"='d--0000-0000-0000-0acf' OR "device"='d--0000-0000-0000-0ace')
so basically getting data for a given variable across devices for a period of time. But in some cases I also want to get more than one variable at a time, which is why I would like to instead do something like:
SELECT "value" FROM "db1"."autogen"."data.org1" WHERE time > now() - 24h AND ("device"='d--0000-0000-0000-0acf' OR "device"='d--0000-0000-0000-0ace') AND ("variable"='5051' OR "variable"='5052')
but in that case, I would be putting everything in a single measurement, with "device", "variable" (and a couple of other things) as tags.
So, is there anything I need to consider before switching to a single measurement for my whole database?
Since nobody was able to answer this question, I will answer it as best I understand it.
There does not seem to be any performance difference between one large measurement vs. many smaller measurements.
But there is a critical difference, which in our case ended up forcing us into multiple measurements:
In our case, while the different measurements share the same core fields, some measurements can have additional fields.
The problem is that fields seem to be associated with the measurement itself, so if we write
data,device=0bd8,var=5053 value=10 1574173550390000
data,device=0bd8,var=5053 value=10 1574173550400000
data,device=0bd8,var=5054 foo=12,value=10 1574173550390000
data,device=0bd8,var=5055 bar=10,value=10 1574173550390000
the fact that var 5054 has a foo field and 5055 has a bar field means that when you query any variable, you will get both foo and bar (set to None where they were never written):
{'foo': None, 'bar': None}
This means that if you have 100 variables and each adds, say, 5 custom fields, you will end up with 500 fields on every query. And while this is not a storage issue, the fact that the fields are associated with the measurement means the JSON object returned by the database keeps growing with every field ever written to the measurement, even if most fields are set to None.
If the schema were identical across all measurements, then there seems to be no difference between using a single data measurement (with different tags) vs. multiple data.<var> measurements.
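For reference, with an identical schema the single-measurement layout can serve the multi-variable query from the question with an OR over the tag, plus a GROUP BY to keep each variable as its own series (a sketch reusing the names from the question):
SELECT "value" FROM "db1"."autogen"."data.org1" WHERE time > now() - 24h AND ("device"='d--0000-0000-0000-0acf' OR "device"='d--0000-0000-0000-0ace') AND ("variable"='5051' OR "variable"='5052') GROUP BY "variable"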

Integrate multiple same structure datasets in one database

I have 8 different datasets with the same structure. I am using Neo4j and need to query all of them at different points on the website I am developing. What are the possible approaches to storing the datasets in one database?
One idea that comes to mind is to give each node an additional property that distinguishes nodes of one dataset from nodes of the others. But that seems too repetitive and wrong to me. The other idea is to create 8 databases and query them separately, but how could I do that? Running each one on its own port seems crazy.
Any suggestions would be greatly appreciated.
If your datasets are in a tree structure, you could add a different root node to each of them to use for reference, similar to GraphAware TimeTree. Another option (better than a property, I think) would be to differentiate each dataset by adding a specific label to nodes from that dataset (i.e. all nodes from "dataset A" get a :DataSetA label), as sketched below.
I imagine that the specific structure of your datasets may yield other options. For example, if you always begin traversals of a dataset from a few set locations, you only need to be able to determine which dataset the entry points are part of, because once entered, all traversals stay within the same dataset, if that makes sense.
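A minimal sketch of the label approach in Python with the official neo4j driver (the connection details, the shared Item label, and the DataSetA label are placeholder assumptions, not from the question):
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Give each node the label of the dataset it was loaded from,
    # plus a shared label for cross-dataset queries.
    session.run("CREATE (:DataSetA:Item {name: $name})", name="example")
    # Queries can then be scoped to one dataset via its label...
    a_count = session.run("MATCH (n:DataSetA) RETURN count(n) AS c").single()["c"]
    # ...or run across all datasets by matching the shared label only.
    total = session.run("MATCH (n:Item) RETURN count(n) AS c").single()["c"]
    print(a_count, total)

driver.close()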

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches to building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible insufficiencies of such a method, but for my specific use case: (1) it suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision; (2) I'm interested in the insights such a method will offer into the data.
So, here is the problem:
Consider a large dataset with M data points, where each data point may include any number of {key:value} features. I model the training dataset by collecting all the features observed in the data (the set of all unique keys) and setting that as the model's feature space. Each sample is then defined by its values for the keys it includes and None for the features it does not.
Given this training data set I want to determine which features reoccur in the data; and for such reoccurring features, do they mostly share a single value.
My question:
A simple solution would be to count everything: for each of the N features, calculate the distribution of values. However, as M and N are potentially large, I wonder if there is a more compact way to represent the data, or a more sophisticated method for making claims about feature frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such a task, that would be even better.
If I understand your question correctly: you need to go over all the data anyway, so why not use hashing?
Actually, two hash tables:
An inner hash table for the distribution of each feature's values.
An outer hash table for feature existence.
In this way, the size of the inner hash table will indicate how common the feature is in your data, and the actual values will indicate how they differ from one another. Another thing to note is that you go over your data only once, and the time complexity of (almost) every operation on hash tables (if you allocate enough space from the beginning) is O(1).
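A minimal sketch of that idea in Python (all names are mine, not from the answer): the outer dict maps each feature key to a Counter of its observed values, so feature frequency and value distribution both come out of a single pass over the data.
from collections import Counter, defaultdict

# Outer table: feature key -> inner table (Counter of observed values).
feature_stats = defaultdict(Counter)
n_samples = 0

def observe(sample):
    # Update both tables with one {key: value} sample; O(len(sample)).
    global n_samples
    n_samples += 1
    for key, value in sample.items():
        feature_stats[key][value] += 1

# Stream the samples once.
for s in [{"os": "linux", "port": 22}, {"os": "linux"}, {"os": "win", "port": 22}]:
    observe(s)

# A feature is common if it appears in the majority of samples;
# it is stable if a single value dominates its distribution.
for key, values in feature_stats.items():
    total = sum(values.values())
    top_value, top_count = values.most_common(1)[0]
    print(key, "in %d/%d samples," % (total, n_samples),
          "dominant value: %s (%d)" % (top_value, top_count))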
Hope this helps.

How do I programmatically merge cases from datasets with conflicting variable names?

I want to add cases from many SPSS datasets to one SPSS dataset.
Here's my code:
DATASET ACTIVATE DataSet1.
ADD FILES /FILE=*
/FILE='Path\to\dataset.sav'.
EXECUTE.
But I get this error: Mismatched variable types on the input files.
I want SPSS to ignore the conflicting columns and add cases only from the columns where there is no conflict.
How do I do this?
This occurs because variables of the same name in the two data sources have either different format types (STRING, NUMERIC, DATE, etc.) or are both STRINGS but of different lengths.
The latter, string variables of different lengths, can be solved like this:
DATA LIST FREE / V(A1).
BEGIN DATA.
a b c
END DATA.
DATASET NAME DS1.
DATA LIST FREE / V(A2).
BEGIN DATA.
1 2 3
END DATA.
DATASET NAME DS2.
STATS ADJUST WIDTHS VARIABLES=ALL WIDTH=MAX /FILES DS1 DS2.
DATASET ACTIVATE DS1.
ADD FILES FILE=* /FILE=DS2.
However, if you have a mismatch of different format types, that is a tad more complicated to solve due to the many possible permutations, so you would probably want to assess which variables are problematic and harmonize/delete them before merging the files. It is probably worth carrying out this exercise nonetheless, as the same variable name having different format types can be a sign of erroneous data.
If you know which variables conflict, you can use the KEEP subcommand to select the others, or use the RENAME subcommand to assign new names and adjust the results afterwards.
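For example, if the conflicting variable were called V (a placeholder name), renaming it in the second file and then dropping the copy keeps the merge to the non-conflicting columns:
ADD FILES /FILE=*
 /FILE='Path\to\dataset.sav'
 /RENAME=(V = V_file2)
 /DROP=V_file2.
EXECUTE.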
If you need to harmonize the names and the issue is something like differing string lengths for variables that should be the same, the STATS ADJUST WIDTHS extension command can equalize the widths before you merge.

Merging files in SPSS

Hi,
I have a problem merging files. Here's what I need to do: I have chosen 200 cases out of 7000 in ArcMap (a GIS program). In the process, I lost some of the cases' variable information.
Now I would like to get the variables back into my smaller dataset, so I used Data > Merge Files > Add Variables, with ID as the match variable ("Match cases on key variables in sorted files" > "Both files provide cases").
This gave a dataset with all 7000 cases; only the variables that already existed in the first table were not added to the merged dataset. I also tried all the other choices, but none of them gave me the result I wanted, which would be the 200 cases with the variables that were lost in the process added back.
So, in a nutshell: how do I merge/replace the info from dataset A's variables into dataset B without the extra cases from A (only the info for the selected 200 cases out of 7000)?
Offhand (a syntax sketch follows these steps):
Create a new variable in the reduced DataSet with the Value of 1.
Match the files.
Sort by the new variable.
Delete all cases that don't have the value 1 on this variable.
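A minimal syntax sketch of those steps (the dataset names Small200 and Full7000 are placeholders; SELECT IF stands in for the sort-and-delete step):
DATASET ACTIVATE Small200.
COMPUTE flag = 1.
EXECUTE.
SORT CASES BY ID.
DATASET ACTIVATE Full7000.
SORT CASES BY ID.
MATCH FILES /FILE=Small200 /FILE=Full7000 /BY ID.
SELECT IF (flag = 1).
EXECUTE.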
I don't see why you are choosing "both files provide cases". You want to use the 7000-case file as a keyed table, using ID as the key, and match it with the 200-case file, which provides all the cases. Assuming you select all the variables you want from the large file, this should give you the desired result.
