SPSS Independent Samples t Test with no grouping column & all of the data in one row - spss

This might be quite a basic question, but:
Let's say I have a study where reaction times are measured twice before drinking alcohol and twice after drinking a specific amount of alcohol, and the hypothesis is that alcohol increases the reaction time.
I have my data in SPSS in the following format:
id | name| time_a | time_b | time_mean | time_a_alcohol | time_b_alcohol | time_mean_alcohol|
1| john| 0.17| 0.21| 0.19| 0.20| 0.24| 0.22|
2| bob| 0.15| 0.25| 0.20| 0.20| 0.30| 0.25|
I would like to do an independent-samples t-test, which I believe I could do if the data were laid out as follows:
id | name| alcohol| time_a | time_b | time_mean|
1| john| 0| 0.17| 0.21| 0.19|
1| john| 1| 0.20| 0.24| 0.22|
2| bob| 0| 0.15| 0.25| 0.20|
2| bob| 1| 0.20| 0.30| 0.25|
Then I could use alcohol as the grouping variable. However, my data isn't in that format at the moment, as everything for a participant is in one row.
Is there an option in SPSS to compare "time_mean" and "time_mean_alcohol" from a single row, without having to put them on two different rows? If not, is there a simple script to split the data?

You could calculate those means in the same row (and then run the analysis on them) like this:
compute time_mean=mean(time_a, time_b).
compute time_mean_alcohol=mean(time_a_alcohol, time_b_alcohol).
On the other hand, you can reach the long format as you described using this code:
varstocases /make time_a from time_a time_a_alcohol/make time_b from time_b time_b_alcohol/index=ind(time_a).
compute alcohol=char.index(ind, "alcohol")>0.
compute time_mean=mean(time_a, time_b).
exe.
NOTE: this looks to me like a case for a paired-samples test rather than an independent-samples one, since each participant is measured both without and with alcohol.
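To make the paired-samples idea concrete, here is a minimal plain-Python sketch of the paired t statistic. It is not SPSS syntax (in SPSS the same test should be available through the T-TEST PAIRS command). The two lists reuse the mean reaction times from the long-format example in the question, and the helper name paired_t is made up for this illustration.

```python
import math
import statistics

# Mean reaction times per participant (john, bob), copied from the
# long-format example in the question.
time_mean = [0.19, 0.20]          # without alcohol
time_mean_alcohol = [0.22, 0.25]  # with alcohol


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for a paired-samples t-test."""
    # The test works on within-participant differences, not on two groups.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample SD / sqrt(n)
    return mean_diff / std_err, n - 1


t_stat, dof = paired_t(time_mean, time_mean_alcohol)
print(f"t = {t_stat:.3f}, df = {dof}")
```

With only two participants the result is purely illustrative; the point is that the wide format (one row per participant) is already what a paired test needs.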

Related

how to get higher quality or accuracy of own made Haar cascade

Could anyone please roughly explain what the minimal hit rate and the false alarm rate are, and how I should set the width and height for training? I have already read through the cv2 documentation and googled some of it, but it didn't help me much. I have already trained my first cascade, but it performed quite poorly. Please roughly tell me what happens if I change the values of these rates. I'm using the GUI Haar cascade trainer on Windows. Thanks in advance.
not an answer, but a hint:
e.g. if you have, after stage 0, this result:
NEG count : acceptanceRatio 40000 : 1
Precalculation time: 24.031
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 0.995179| 0.0838|
+----+---------+---------+
END>
Training until now has taken 0 days 0 hours 1 minutes 57 seconds.
and for stage 1 you get:
NEG count : acceptanceRatio 40000 : 0.124695
Precalculation time: 19.241
+----+---------+---------+
| N | HR | FA |
+----+---------+---------+
| 1| 1| 1|
+----+---------+---------+
| 2| 1| 1|
+----+---------+---------+
| 3| 0.999077| 0.142975|
+----+---------+---------+
END>
Training until now has taken 0 days 0 hours 4 minutes 9 seconds.
This means the classifier is still quite simple at the beginning. The stage-0 FA of 0.0838 versus the stage-1 acceptanceRatio of 0.124695 means that generalization is OK so far (0.0838 is close to 0.124695), but there is some gap, so the negative samples might be diverse enough. In stage 2, NEG count : acceptanceRatio 40000 : 0.034703 shows that generalization is still on a good path, although the stage-1 numbers alone would predict 0.124695 * 0.142975 = 0.01782826762.
From my experience, the acceptance ratio is one of the most important things to observe during training, to show you the quality of your training data.
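The comparison described above can be written down as a small calculation. This is only a sketch of the rule of thumb used in this answer (acceptance ratio at the start of a stage multiplied by that stage's final FA, compared with the acceptance ratio reported for the next stage). The numbers are copied from the logs above and the variable names are made up for the illustration.

```python
# (stage, acceptance ratio at the start of the stage, final FA of the stage),
# copied from the training output above.
stages = [
    (0, 1.0, 0.0838),
    (1, 0.124695, 0.142975),
]
# Acceptance ratio reported at the start of the following stage.
observed_next = [0.124695, 0.034703]

# If the stage generalised perfectly to fresh negatives, the next acceptance
# ratio would be roughly the current one times the stage's FA rate.
predicted_next = [acc * fa for _, acc, fa in stages]

for (stage, _, _), pred, obs in zip(stages, predicted_next, observed_next):
    print(f"after stage {stage}: predicted {pred:.6f}, "
          f"observed {obs:.6f}, observed/predicted {obs / pred:.2f}")
```

A ratio close to 1 suggests that the FA measured on the training negatives carries over to newly sampled negatives; a growing ratio is a hint to look at the negative set.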

Can logistic regression be used for variables containing lists?

I'm pretty new to Machine Learning and I was wondering if certain algorithms/models (e.g. logistic regression) can handle lists as values for their variables. Until now I've always used pretty standard datasets, where you have a couple of variables, associated values, and then a classification for that set of values (see example 1). However, I now have a similar dataset but with lists for some of the variables (see example 2). Is this something logistic regression models can handle, or would I have to do some kind of feature extraction to transform this dataset into a normal dataset like example 1?
Example 1 (normal):
+---+------+------+------+-----------------+
| | var1 | var2 | var3 | classification |
+---+------+------+------+-----------------+
| 1 | 5 | 2 | 526 | 0 |
| 2 | 6 | 1 | 686 | 0 |
| 3 | 1 | 9 | 121 | 1 |
| 4 | 3 | 11 | 99 | 0 |
+---+------+------+------+-----------------+
Example 2 (lists):
+-----+-------+--------+---------------------+-----------------+--------+
| | width | height | hlines | vlines | class |
+-----+-------+--------+---------------------+-----------------+--------+
| 1 | 115 | 280 | [125, 263, 699] | [125, 263, 699] | 1 |
| 2 | 563 | 390 | [11, 211] | [156, 253, 399] | 0 |
| 3 | 523 | 489 | [125, 255, 698] | [356] | 1 |
| 4 | 289 | 365 | [127, 698, 11, 136] | [458, 698] | 0 |
| ... | ... | ... | ... | ... | ... |
+-----+-------+--------+---------------------+-----------------+--------+
To provide some additional context on my specific problem: I'm attempting to represent drawings. Drawings have a width and height (regular variables), but they also have a set of horizontal and vertical lines (represented as a list of their coordinates on their respective axis). This is what you see in example 2. The actual dataset I'm using is even bigger, also containing variables which hold lists with the thickness of each line, lists with the extension of each line, lists with the colors of the spaces between the lines, etc. In the end I would like my logistic regression to pick up on what results in nice drawings. For example, if there are too many lines too close together, the drawing is not nice. The model should pick up by itself on these 'characteristics' of what makes a nice and a bad drawing.
I didn't include these because the way this data is set up is a bit confusing to explain, and if I can solve my question for the above dataset I feel like I can use the principle of this solution for the remaining dataset as well. However, if you need additional (full) details, feel free to ask!
Thanks in advance!
No, it cannot directly handle that kind of input structure. The input must be a homogeneous 2D array. What you can do is come up with new features that capture some of the relevant information contained in the lists. For instance, for the lists that contain the coordinates of the lines along an axis, new features (other than the actual values themselves) could be the spacing between lines, the total number of lines, or some statistics such as the mean location.
So the way to deal with this is through feature engineering. This is in fact, something that has to be dealt with in most cases. In many ML problems, you may not only have variables which describe a unique aspect or feature of each of the data samples, but also many of them might be aggregates from other features or sample groups, which might be the only way to go if you want to consider certain data sources.
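As a rough illustration of that kind of feature engineering, the sketch below turns each list of line coordinates into three fixed-size numbers (count, mean position, smallest gap) so that every drawing becomes one flat numeric row. The choice of these three features, the helper names, and the convention of using 0 as the gap when a list has fewer than two lines are all assumptions made for the example, not something prescribed by logistic regression. The data are the four rows of example 2.

```python
import statistics

drawings = [
    {"width": 115, "height": 280, "hlines": [125, 263, 699], "vlines": [125, 263, 699], "label": 1},
    {"width": 563, "height": 390, "hlines": [11, 211], "vlines": [156, 253, 399], "label": 0},
    {"width": 523, "height": 489, "hlines": [125, 255, 698], "vlines": [356], "label": 1},
    {"width": 289, "height": 365, "hlines": [127, 698, 11, 136], "vlines": [458, 698], "label": 0},
]


def line_features(positions):
    """Summarise a non-empty, variable-length list of coordinates."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return [
        len(ordered),              # how many lines
        statistics.mean(ordered),  # mean location
        min(gaps) if gaps else 0,  # smallest spacing (0 if only one line: an assumption)
    ]


def to_row(drawing):
    """Flatten one drawing into a fixed-length numeric feature row."""
    return ([drawing["width"], drawing["height"]]
            + line_features(drawing["hlines"])
            + line_features(drawing["vlines"]))


X = [to_row(d) for d in drawings]    # homogeneous 2D structure, 8 columns per row
y = [d["label"] for d in drawings]   # class labels

for row, label in zip(X, y):
    print(label, row)
```

X and y now have the same shape as example 1 and could be passed to any logistic regression implementation.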
Wow, great question. I had never considered this, but having seen the other responses, I concur 100%. Convert the lists into a data frame and run your code on that object.
import pandas as pd
data = [["col1", "col2", "col3"], [0, 1, 2],[3, 4, 5]]
column_names = data.pop(0)
df = pd.DataFrame(data, columns=column_names)
print(df)
Result:
col1 col2 col3
0 0 1 2
1 3 4 5
You can easily do any multi regression on the fields/features of the data frame and you'll get what you need. See the link below for some ideas of how to get started.
https://pythonfordatascience.org/logistic-regression-python/
Post back if you have additional questions related to this. Or, start a new post if you have similar, but unrelated, questions.

Which Starspace training mode to use for multi-level embeddings

I am using the StarSpace embedding framework for the first time and am unclear on the "modes" that it provides for training and the differences between them.
The options are:
wordspace
sentencespace
articlespace
tagspace
docspace
pagespace
entityrelationspace/graphspace
Let's say I have a dataset that looks like this:
| Author | City | Tweet_ID | Tweet_contents |
|:-------|:-------|:----------|:-----------------------------------|
| A | NYC | 1 | "This is usually a short sentence" |
| A | LONDON | 2 | "Another short sentence" |
| B | PARIS | 3 | "Check out this cool track" |
| B | BERLIN | 4 | "I like turtles" |
| C | PARIS | 5 | "It was a dark and stormy night" |
| ... | ... | ... | ... |
(In reality, my dataset is not language data and looks nothing like this, but this example demonstrates the point well enough.)
I would like to simultaneously create embeddings from scratch (not using pre-existing embeddings at any point) for each of the following:
Authors
Cities
Tweet/Sentences/Documents (EG. 1, 2, 3, 4, 5, etc.)
Words (EG. 'This', 'is', 'usually', ..., 'stormy', 'night', etc.)
Even after reading the documentation, it isn't clear to me which 'mode' of StarSpace training I should be using.
If anyone could help me understand how to interpret the modes to help select the appropriate one, that would be much appreciated.
I would also like to know if there are conditions under which the embeddings generated using one of the modes above would in some way be equivalent to the embeddings built using a different mode (ignoring the fact that the embeddings would differ because of the non-deterministic nature of the process).
Thank you

PySpark ML: Get KMeans cluster statistics

I have built a KMeansModel. My results are stored in a PySpark DataFrame called transformed.
(a) How do I interpret the contents of transformed?
(b) How do I create one or more Pandas DataFrame from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters?
from pyspark.ml.clustering import KMeans
# Trains a k-means model.
kmeans = KMeans().setK(14).setSeed(1)
model = kmeans.fit(X_spark_scaled) # Fits a model to the input dataset with optional parameters.
transformed = model.transform(X_spark_scaled).select("features", "prediction") # X_spark_scaled is my PySpark DataFrame consisting of 13 features
transformed.show(5, truncate = False)
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|features |prediction|
+------------------------------------------------------------------------------------------------------------------------------------+----------+
|(14,[4,5,7,8,9,13],[1.0,1.0,485014.0,0.25,2.0,1.0]) |12 |
|(14,[2,7,8,9,12,13],[1.0,2401233.0,1.0,1.0,1.0,1.0]) |2 |
|(14,[2,4,5,7,8,9,13],[0.3333333333333333,0.6666666666666666,0.6666666666666666,2429111.0,0.9166666666666666,1.3333333333333333,3.0])|2 |
|(14,[4,5,7,8,9,12,13],[1.0,1.0,2054748.0,0.15384615384615385,11.0,1.0,1.0]) |11 |
|(14,[2,7,8,9,13],[1.0,43921.0,1.0,1.0,1.0]) |1 |
+------------------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 5 rows
As an aside, I found from another SO post that I can map the features to their names like below. It would be nice to have summary statistics (mean, median, std, min, max) for each feature of each cluster in one or more Pandas dataframes.
from itertools import chain
attr_list = [attr for attr in chain(*transformed.schema['features'].metadata['ml_attr']['attrs'].values())]
attr_list
Per request in the comments, here is a snapshot consisting of 2 records of the data (don't want to provide too many records -- proprietary information here)
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
|device_type_robot_pct|device_type_smart_tv_pct|device_type_desktop_pct|device_type_tablet_pct|device_type_mobile_pct|device_type_mobile_persist_pct|visitors_seen_with_anonymiser_pct|ip_time_span| ip_weight|mean_ips_per_visitor|visitors_seen_with_multi_country_pct|international_visitors_pct|visitors_seen_with_multi_ua_pct|count_tuids_on_ip| features| scaledFeatures|
+---------------------+------------------------+-----------------------+----------------------+----------------------+------------------------------+---------------------------------+------------+-------------------+--------------------+------------------------------------+--------------------------+-------------------------------+-----------------+--------------------+--------------------+
| 0.0| 0.0| 0.0| 0.0| 1.0| 1.0| 0.0| 485014.0| 0.25| 2.0| 0.0| 0.0| 0.0| 1.0|(14,[4,5,7,8,9,13...|(14,[4,5,7,8,9,13...|
| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0| 0.0| 2401233.0| 1.0| 1.0| 0.0| 0.0| 1.0| 1.0|(14,[2,7,8,9,12,1...|(14,[2,7,8,9,12,1...|
As Anony-Mousse has commented, (Py)Spark ML is indeed much more limited than scikit-learn or other similar packages, and such functionality is not trivial; nevertheless, here is a way to get what you want (cluster statistics):
spark.version
# u'2.2.0'
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
# toy data - 5-d features including sparse vectors
df = spark.createDataFrame(
[(Vectors.sparse(5,[(0, 164.0),(1,520.0)]), 1.0),
(Vectors.dense([519.0,2723.0,0.0,3.0,4.0]), 1.0),
(Vectors.sparse(5,[(0, 2868.0), (1, 928.0)]), 1.0),
(Vectors.sparse(5,[(0, 57.0), (1, 2715.0)]), 0.0),
(Vectors.dense([1241.0,2104.0,0.0,0.0,2.0]), 1.0)],
["features", "target"])
df.show()
# +--------------------+------+
# | features|target|
# +--------------------+------+
# |(5,[0,1],[164.0,5...| 1.0|
# |[519.0,2723.0,0.0...| 1.0|
# |(5,[0,1],[2868.0,...| 1.0|
# |(5,[0,1],[57.0,27...| 0.0|
# |[1241.0,2104.0,0....| 1.0|
# +--------------------+------+
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(df.select('features'))
transformed = model.transform(df).select("features", "prediction")
transformed.show()
# +--------------------+----------+
# | features|prediction|
# +--------------------+----------+
# |(5,[0,1],[164.0,5...| 1|
# |[519.0,2723.0,0.0...| 2|
# |(5,[0,1],[2868.0,...| 0|
# |(5,[0,1],[57.0,27...| 2|
# |[1241.0,2104.0,0....| 2|
# +--------------------+----------+
Up to here, and regarding your first question:
How do I interpret the contents of transformed?
The features column is just a replication of the same column in your original data.
The prediction column is the cluster to which the respective data record belongs; in my example, with 5 data records and k=3 clusters, I end up with 1 record in cluster #0, 1 record in cluster #1, and 3 records in cluster #2.
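On reading the features column itself: Spark prints a sparse vector as (size, [indices], [values]), with every position not listed being 0.0. The plain-Python sketch below expands the first row shown in the question; the helper name sparse_to_dense is made up for the illustration (inside PySpark, calling toArray() on the vector should give the same dense values).

```python
def sparse_to_dense(size, indices, values):
    """Expand Spark's (size, [indices], [values]) notation into a dense list."""
    dense = [0.0] * size          # every position not listed is zero
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# First row of the question's output:
# (14,[4,5,7,8,9,13],[1.0,1.0,485014.0,0.25,2.0,1.0])
row = sparse_to_dense(14, [4, 5, 7, 8, 9, 13],
                      [1.0, 1.0, 485014.0, 0.25, 2.0, 1.0])
print(row)
```

The non-zero positions line up with the first record of the question's raw data snapshot: index 4 is device_type_mobile_pct, index 7 is ip_time_span, index 8 is ip_weight, and so on.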
Regarding your second question:
How do I create one or more Pandas DataFrame from transformed that would show summary statistics for each of the 13 features for each of the 14 clusters?
(Note: seems you have 14 features and not 13...)
This is a good example of a seemingly simple task for which, unfortunately, PySpark does not provide ready functionality - not least because all features are grouped in a single vector column, features; to do that, we must first "disassemble" features, effectively coming up with the inverse operation of VectorAssembler.
The only way I can presently think of is to revert temporarily to an RDD and perform a map operation [EDIT: this is not really necessary - see UPDATE below]; here is an example with my cluster #2 above, which contains both dense and sparse vectors:
# keep only cluster #2:
cl_2 = transformed.filter(transformed.prediction==2)
cl_2.show()
# +--------------------+----------+
# | features|prediction|
# +--------------------+----------+
# |[519.0,2723.0,0.0...| 2|
# |(5,[0,1],[57.0,27...| 2|
# |[1241.0,2104.0,0....| 2|
# +--------------------+----------+
# set the data dimensionality as a parameter:
dimensionality = 5
cluster_2 = cl_2.drop('prediction').rdd.map(lambda x: [float(x[0][i]) for i in range(dimensionality)]).toDF(schema=['x'+str(i) for i in range(dimensionality)])
cluster_2.show()
# +------+------+---+---+---+
# | x0| x1| x2| x3| x4|
# +------+------+---+---+---+
# | 519.0|2723.0|0.0|3.0|4.0|
# | 57.0|2715.0|0.0|0.0|0.0|
# |1241.0|2104.0|0.0|0.0|2.0|
# +------+------+---+---+---+
(If you have your initial data in a Spark dataframe initial_data, you can change the last part to toDF(schema=initial_data.columns), in order to keep the original feature names.)
From this point, you could either convert cluster_2 dataframe to a pandas one (if it fits in your memory), or use the describe() function of Spark dataframes to get your summary statistics:
cluster_2.describe().show()
# result:
+-------+-----------------+-----------------+---+------------------+---+
|summary| x0| x1| x2| x3| x4|
+-------+-----------------+-----------------+---+------------------+---+
| count| 3| 3| 3| 3| 3|
| mean|605.6666666666666| 2514.0|0.0| 1.0|2.0|
| stddev|596.7389155512932|355.0929455790413|0.0|1.7320508075688772|2.0|
| min| 57.0| 2104.0|0.0| 0.0|0.0|
| max| 1241.0| 2723.0|0.0| 3.0|4.0|
+-------+-----------------+-----------------+---+------------------+---+
Using the above code with dimensionality=14 in your case should do the job...
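As a sanity check of what these per-cluster statistics mean, here is a plain-Python sketch (no Spark required) that groups the five toy records by their predicted cluster and computes count, mean, sample standard deviation, min and max per feature. The records and cluster numbers are copied from the toy example above; the helper name summarize and the use of None for the standard deviation of single-record clusters are choices made for this illustration.

```python
import statistics
from collections import defaultdict

# (dense feature vector, predicted cluster), copied from the toy example above
rows = [
    ([164.0, 520.0, 0.0, 0.0, 0.0], 1),
    ([519.0, 2723.0, 0.0, 3.0, 4.0], 2),
    ([2868.0, 928.0, 0.0, 0.0, 0.0], 0),
    ([57.0, 2715.0, 0.0, 0.0, 0.0], 2),
    ([1241.0, 2104.0, 0.0, 0.0, 2.0], 2),
]

# Group the feature vectors by cluster number.
by_cluster = defaultdict(list)
for features, cluster in rows:
    by_cluster[cluster].append(features)


def summarize(vectors):
    """Column-wise summary statistics for the records of one cluster."""
    columns = list(zip(*vectors))  # transpose: one tuple per feature
    return {
        "count": len(vectors),
        "mean": [statistics.mean(c) for c in columns],
        # sample standard deviation needs at least two records
        "stddev": [statistics.stdev(c) if len(c) > 1 else None for c in columns],
        "min": [min(c) for c in columns],
        "max": [max(c) for c in columns],
    }


summary = {cluster: summarize(vs) for cluster, vs in sorted(by_cluster.items())}
for cluster, stats in summary.items():
    print(cluster, stats["count"], [round(m, 2) for m in stats["mean"]])
```

For cluster #2 this gives the same count, mean, stddev, min and max as the describe() output above.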
Annoyed with all these (arguably useless) significant digits in mean and stddev? As a bonus, here is a small utility function I came up with some time ago for a pretty summary:
def prettySummary(df):
    """ Neat summary statistics of a Spark dataframe
    Args:
        pyspark.sql.dataframe.DataFrame (df): input dataframe
    Returns:
        pandas.core.frame.DataFrame: a pandas dataframe with the summary statistics of df
    """
    import pandas as pd
    temp = df.describe().toPandas()
    # rows 1-2 hold mean and stddev: convert them from strings to numbers
    temp.iloc[1:3,1:] = temp.iloc[1:3,1:].apply(pd.to_numeric)
    pd.options.display.float_format = '{:,.2f}'.format
    return temp
stats_df = prettySummary(cluster_2)
stats_df
# result:
summary x0 x1 x2 x3 x4
0 count 3 3 3 3 3
1 mean 605.67 2,514.00 0.00 1.00 2.00
2 stddev 596.74 355.09 0.00 1.73 2.00
3 min 57.0 2104.0 0.0 0.0 0.0
4 max 1241.0 2723.0 0.0 3.0 4.0
UPDATE: Thinking of it again, and seeing your sample data, I came up with a more straightforward solution, without the need to invoke an intermediate RDD (an operation that one would arguably prefer to avoid, if possible)...
The key observation is the complete contents of transformed, i.e. without the select statements; keeping the same toy dataset as above, we get:
transformed = model.transform(df) # no 'select' statements
transformed.show()
# +--------------------+------+----------+
# | features|target|prediction|
# +--------------------+------+----------+
# |(5,[0,1],[164.0,5...| 1.0| 1|
# |[519.0,2723.0,0.0...| 1.0| 2|
# |(5,[0,1],[2868.0,...| 1.0| 0|
# |(5,[0,1],[57.0,27...| 0.0| 2|
# |[1241.0,2104.0,0....| 1.0| 2|
# +--------------------+------+----------+
As you can see, whatever other columns are present in the dataframe df to be transformed (just one in my case - target) simply pass through the transformation procedure and end up being present in the final outcome...
Hopefully you start getting the idea: if df contains your initial 14 features, each one in a separate column, plus a 15th column named features (roughly as shown in your sample data, but without the last column), then the following code:
kmeans = KMeans().setK(14)
model = kmeans.fit(df.select('features'))
transformed = model.transform(df).drop('features')
will leave you with a Spark dataframe transformed containing 15 columns, i.e. your initial 14 features plus a prediction column with the corresponding cluster number.
From this point, you can proceed as I have shown above to filter specific clusters from transformed and get your summary statistics, but you'll have avoided the (costly...) conversion to intermediate temporary RDDs, thus keeping all your operations in the more efficient context of Spark dataframes...

Description matching in record linkage using Machine learning Approach

We are working on record linkage project.
In simple terms, we are searching for a product in a database just by looking at the similarity of the descriptions. It is a very interesting problem to solve, but the machine learning approach we have adopted currently results in very low accuracy. If you can suggest a different, lateral approach, it will help our project a lot.
Input description
+-----+----------------------------------------------+
| ID | description |
+-----+----------------------------------------------+
| 1 |delta t17267-ss ara 17 series shower trim ss |
| 2 |delta t14438 chrome lahara tub shower trim on |
| 3 |delta t14459 trinsic tub/shower trim |
| 4 |delta t17497 cp cassidy tub/shower trim only |
| 5 |delta t14497-rblhp cassidy tub & shower trim |
| 6 |delta t17497-ss cassidy 17 series tub/shower |
+-----+----------------------------------------------+
Description in Database
+---+-----------------------------------------------------------------------------------------------------+
|ID | description |
+---+-----------------------------------------------------------------------------------------------------+
| 1 | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial |
| 2 | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential |
| 3 | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential |
| 4 | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential|
| 5 | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze |
| 6 | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential |
+---+-----------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are fundamentally very similar to each other, which causes a huge issue.
2. There are around 2 million records in the database, but the search space is reduced to a few hundred records when we search for a specific manufacturer.
3. The record in "Input description" with ID 1 is the same as the record in "Description in Database" with ID 1 (we know that from manual matching).
4. We use a random forest for training and prediction.
Current approach
1. We tokenize the description.
2. We remove stopwords.
3. We add abbreviation information.
4. For each record pair we calculate scores from different string metrics such as Jaccard, Sørensen-Dice and cosine; the average of all these scores is also calculated.
5. Then we calculate the score for the manufacturer ID using the Jaro-Winkler metric.
6. So if there are 5 records of a manufacturer in "Input description" and 10 records for that manufacturer in the database, there are 50 record pairs in total, i.e. 10 pairs per input record, and their scores are very close to each other. We keep the top 4 record pairs from each set of 10 pairs. Where more than one record pair has a similar score, we keep all of them.
7. We arrive at the following learning data set format.
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
| ISMatch | Description average score | manufacturer ID score | Jaccard score of description | sorensenDice | cosine(3) |
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
| 1       | 1:0.19                    | 2:0.88                | 3:0.12                       | 4:0.21       | 5:0.23    |
| 0       | 1:0.14                    | 2:0.66                | 3:0.08                       | 4:0.16       | 5:0.17    |
| 0       | 1:0.14                    | 2:0.68                | 3:0.08                       | 4:0.15       | 5:0.19    |
| 0       | 1:0.14                    | 2:0.58                | 3:0.08                       | 4:0.16       | 5:0.16    |
| 0       | 1:0.12                    | 2:0.55                | 3:0.08                       | 4:0.14       | 5:0.14    |
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
We train on the above dataset. When we predict in real time using the same approach, the accuracy is very low.
Please suggest any alternative approach.
We planned to use TF-IDF, but an initial investigation suggests that it may not improve the accuracy by much either.
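For readers who want to reproduce the string-metric step of the approach described above, here is a minimal sketch of token-set Jaccard, Sørensen-Dice and (binary) cosine similarity in plain Python. The tokenizer, the function names and the choice of set-based rather than count-based cosine are assumptions made for the illustration; the original pipeline's exact definitions are not given in the question. The two strings are record 3 from each table.

```python
import math
import re


def tokens(text):
    """Lower-case alphanumeric tokens as a set (symbols like ® and / are dropped)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    return len(a & b) / len(a | b)


def sorensen_dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def cosine(a, b):
    # Cosine similarity of binary bag-of-words vectors.
    return len(a & b) / math.sqrt(len(a) * len(b))


query = tokens("delta t14459 trinsic tub/shower trim")
candidate = tokens("delta monitor 14 trinsic® tub and shower trim 2 gpm 1 "
                   "handle chrome plated residential")

scores = {
    "jaccard": jaccard(query, candidate),
    "sorensen_dice": sorensen_dice(query, candidate),
    "cosine": cosine(query, candidate),
}
print(scores)
```

Note how the long database descriptions dilute all three scores even for a true match, which is consistent with the low, tightly clustered values in the learning data set above.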
