Which StarSpace training mode to use for multi-level embeddings

I am using the StarSpace embedding framework for the first time and am unclear on the "modes" that it provides for training and the differences between them.
The options are:
wordspace
sentencespace
articlespace
tagspace
docspace
pagespace
entityrelationspace/graphspace
Let's say I have a dataset that looks like this:
| Author | City | Tweet_ID | Tweet_contents |
|:-------|:-------|:----------|:-----------------------------------|
| A | NYC | 1 | "This is usually a short sentence" |
| A | LONDON | 2 | "Another short sentence" |
| B | PARIS | 3 | "Check out this cool track" |
| B | BERLIN | 4 | "I like turtles" |
| C | PARIS | 5 | "It was a dark and stormy night" |
| ... | ... | ... | ... |
(In reality, my dataset is not language data and looks nothing like this, but this example demonstrates the point well enough.)
I would like to simultaneously create embeddings from scratch (not using pre-existing embeddings at any point) for each of the following:
Authors
Cities
Tweets/Sentences/Documents (e.g. 1, 2, 3, 4, 5, etc.)
Words (e.g. 'This', 'is', 'usually', ..., 'stormy', 'night', etc.)
Even after reading the documentation, it isn't clear to me which 'mode' of StarSpace training I should be using.
If anyone could help me understand how to interpret the modes to help select the appropriate one, that would be much appreciated.
I would also like to know if there are conditions under which the embeddings generated using one of the modes above would in some way be equivalent to the embeddings built using a different mode (ignoring the fact that the embeddings would differ because of the non-deterministic nature of the process).
Thank you

Related

Can logistic regression be used for variables containing lists?

I'm pretty new to Machine Learning and I was wondering if certain algorithms/models (e.g. logistic regression) can handle lists as values for their variables. Until now I've always used pretty standard datasets, where you have a couple of variables, associated values and then a classification for that set of values (see example 1). However, I now have a similar dataset but with lists for some of the variables (see example 2). Is this something logistic regression models can handle, or would I have to do some kind of feature extraction to transform this dataset into a normal dataset like example 1?
Example 1 (normal):
+---+------+------+------+-----------------+
| | var1 | var2 | var3 | classification |
+---+------+------+------+-----------------+
| 1 | 5 | 2 | 526 | 0 |
| 2 | 6 | 1 | 686 | 0 |
| 3 | 1 | 9 | 121 | 1 |
| 4 | 3 | 11 | 99 | 0 |
+---+------+------+------+-----------------+
Example 2 (lists):
+-----+-------+--------+---------------------+-----------------+--------+
| | width | height | hlines | vlines | class |
+-----+-------+--------+---------------------+-----------------+--------+
| 1 | 115 | 280 | [125, 263, 699] | [125, 263, 699] | 1 |
| 2 | 563 | 390 | [11, 211] | [156, 253, 399] | 0 |
| 3 | 523 | 489 | [125, 255, 698] | [356] | 1 |
| 4 | 289 | 365 | [127, 698, 11, 136] | [458, 698] | 0 |
| ... | ... | ... | ... | ... | ... |
+-----+-------+--------+---------------------+-----------------+--------+
To provide some additional context on my specific problem: I'm attempting to represent drawings. Drawings have a width and height (regular variables), but drawings also have, for example, a set of horizontal and vertical lines (represented as a list of their coordinates on their respective axis). This is what you see in example 2. The actual dataset I'm using is even bigger, also containing variables which hold lists of the thicknesses of each line, lists of the extension of each line, lists of the colors of the spaces between the lines, etc. In the end I would like my logistic regression to pick up on what results in nice drawings. For example, if there are too many lines too close together, the drawing is not nice. The model should pick up by itself on these 'characteristics' of what makes a nice or a bad drawing.
I didn't include these as the way this data is set up is a bit confusing to explain, and if I can solve my question for the above dataset I feel like I can use the principle of this solution for the remaining dataset as well. However, if you need additional (full) details, feel free to ask!
Thanks in advance!
No, it cannot directly handle that kind of input structure. The input must be a homogeneous 2D array. What you can do is come up with new features that capture some of the relevant information contained in the lists. For instance, for the lists that contain the coordinates of the lines along an axis, candidate features (beyond the raw values themselves) could be the spacing between lines, the total number of lines, or summary statistics such as the mean location.
So the way to deal with this is through feature engineering. This is in fact, something that has to be dealt with in most cases. In many ML problems, you may not only have variables which describe a unique aspect or feature of each of the data samples, but also many of them might be aggregates from other features or sample groups, which might be the only way to go if you want to consider certain data sources.
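As a rough illustration of the feature engineering described above, here is a minimal sketch that turns each variable-length list of line coordinates into a few fixed-length summary features. The function names (`line_features`, `drawing_row`) and the particular features chosen (count, mean position, minimum and mean gap) are assumptions made for this example, not something the question prescribes; which summaries are actually useful depends on the data.

```python
from statistics import mean


def line_features(positions):
    """Summarise a variable-length list of line coordinates as fixed-length features."""
    ordered = sorted(positions)
    # Gaps between neighbouring lines; empty when there are fewer than two lines.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "count": len(ordered),
        "mean_position": mean(ordered) if ordered else 0.0,
        "min_gap": min(gaps) if gaps else 0.0,
        "mean_gap": mean(gaps) if gaps else 0.0,
    }


def drawing_row(width, height, hlines, vlines):
    """Build one flat feature row (hypothetical layout) for a single drawing."""
    row = {"width": width, "height": height}
    for prefix, lines in (("h", hlines), ("v", vlines)):
        for name, value in line_features(lines).items():
            row[f"{prefix}_{name}"] = value
    return row


# Row 2 of example 2 in the question.
row = drawing_row(563, 390, [11, 211], [156, 253, 399])
print(row)
```

Every drawing then yields a row with the same ten columns, which is the homogeneous 2D shape a logistic regression expects.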
Wow, great question. I had never considered this, but having seen other people's responses, I concur 100%. Convert the lists into a data frame and run your code on that object.
import pandas as pd
data = [["col1", "col2", "col3"], [0, 1, 2],[3, 4, 5]]
column_names = data.pop(0)
df = pd.DataFrame(data, columns=column_names)
print(df)
Result:
col1 col2 col3
0 0 1 2
1 3 4 5
You can then run a multiple regression on the fields/features of the data frame and you'll get what you need. See the link below for some ideas of how to get started.
https://pythonfordatascience.org/logistic-regression-python/
Post back if you have additional questions related to this. Or, start a new post if you have similar, but unrelated, questions.

Visualising the motion of multiple robots

I am trying to add multiple robot instances and visualise their motions. I tried a couple of ways to do this and ran into errors/issues. They are as follows:
I tried adding another model instance after the system is created.
parsers::urdf::AddModelInstanceFromUrdfFileToWorld(
FindResourceOrThrow("path/CompassGait.urdf"),
multibody::joints::kRollPitchYaw, tree.get());
parsers::urdf::AddModelInstanceFromUrdfFileToWorld(
FindResourceOrThrow("path/quadrotor.urdf"),
multibody::joints::kRollPitchYaw, tree.get());
As expected, two robots are visible in the visualiser and there are 26 output ports in the system. But I am unable to visualise the required motion of the quadrotor. It seems to be following the x, y, z and roll, pitch and yaw derivatives given as an input for the compass gait's output port. Is this expected behaviour?
I experience a similar thing when I add 2 compass gait models and try to make them follow the same path. Even though I give ports 14-27 the same inputs as I give to 0-13, the second robot is stuck near the origin while the first one moves as expected without any issues.
I need some help, or maybe some examples, to get a better idea of how to visualise the motion of multiple robots.
[Updated] Please see note at the bottom.
drake::systems::DrakeVisualizer (which I assume you're using to publish your visualization messages) was designed to be connected to the state_output_port of drake::systems::RigidBodyPlant. According to the documentation for RigidBodyPlant,
The state vector, x, consists of generalized positions followed by generalized velocities.
That is, the generalized positions for all model instances come before the generalized velocities. When working with RigidBodyPlant and DrakeVisualizer this ordering is handled automatically.
From your question, however, I gather that you have separate, specialized systems for your quadrotor and compass-gait models (as per their respective examples). Each of these systems outputs its state as [q, v], where q is the generalized position vector and v is the generalized velocity vector. In that case you will need to use drake::systems::Demultiplexer and drake::systems::Multiplexer to split the outputs of the quadrotor and compass-gait systems and reassemble them in the required order:
+---------------+   +-------------+  q      +-------------+
|               |   |             +-------->+             |
| Compass-gait  +-->+Demultiplexer|         |             |
|               |   |             +-----+ v |             |
+---------------+   +-------------+     |   |             |
                                     +----->+             |   +-----------------+
                                     |  |   |             |   |                 |
                                     |  |   | Multiplexer +-->+ DrakeVisualizer |
                                    q|  |   |             |   |                 |
                                     |  +-->+             |   +-----------------+
+---------------+   +-------------+  |      |             |
|               |   |             +--+      |             |
|   Quadrotor   +-->+Demultiplexer|         |             |
|               |   |             +-------->+             |
+---------------+   +-------------+  v      +-------------+
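The reordering that the demultiplexer/multiplexer wiring performs can be sketched in plain Python. This is only an illustration of the index shuffling, not Drake code: the function name `stack_positions_then_velocities` is hypothetical, and the state sizes used below are placeholders rather than the real dimensions of the compass-gait or quadrotor models.

```python
def stack_positions_then_velocities(states, num_positions):
    """Reorder per-model states [q_i, v_i] into [q_1, ..., q_n, v_1, ..., v_n].

    states: one state list per model, each laid out as positions then velocities.
    num_positions: number of generalized positions in each model's state.
    """
    positions, velocities = [], []
    for state, nq in zip(states, num_positions):
        positions.extend(state[:nq])   # the "q" output of this model's demultiplexer
        velocities.extend(state[nq:])  # the "v" output of this model's demultiplexer
    # The multiplexer concatenates all positions first, then all velocities.
    return positions + velocities


# Placeholder states; the labels make the resulting order easy to read.
model_a = ["a_q0", "a_q1", "a_v0", "a_v1"]
model_b = ["b_q0", "b_q1", "b_q2", "b_v0", "b_v1", "b_v2"]

combined = stack_positions_then_velocities([model_a, model_b], [2, 3])
print(combined)
```

Simply concatenating the two state vectors would instead give [q_a, v_a, q_b, v_b], so the visualizer would read the first model's velocities as the second model's positions.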
Note: RigidBodyPlant and associated classes are being replaced by drake::multibody::MultibodyPlant and drake::geometry::SceneGraph. See run_quadrotor_lqr.cc for an example of using these new tools with a specialized plant.

How can I concatenate mixed type input into multi layer network with deeplearning4j?

I have a dataset where some features are numerical, some categorical, and some are strings (e.g. description). To give an example, let's say I have three features:
| Number | Type | Comment |
---------------------------------------------------------
| 1.23 | 1 | Some comment, up to 10000 characters |
| 2.34 | 2 | Different comment many words |
...
Can I have all of them as input to a multi-layer network in dl4j, where numerical and categorical would be regular input features, but string comment feature will be processed first as word-series by a simple RNN (e.g. Embedding -> LSTM)? In other words, architecture should look something like this:
"Number" "Type" "Comment"
| | |
| | Embedding
| | |
| | LSTM
| | |
Main Multi-Layer Network
|
Dense
|
...
|
Output
I think in Keras this can be achieved by Concatenate layer. Is there something like this in DL4J?
DL4J has 99% Keras import coverage, and we have concatenate layers as well. Take a look at the various vertices. Whatever you can do in Keras should be doable in DL4J, save for very specific cases. More here: https://deeplearning4j.org/docs/latest/deeplearning4j-nn-computationgraph You want a MergeVertex.

Behave - Common features between applications, avoiding duplication

I have many applications which I want to test, which have a largely overlapping set of features. Here is an oversimplified example of a scenario I might have:
Given <name> is playing a game,
When they shoot at a <color> target
Then they should <event>
Examples:
| name | color | event |
| Alice | red | hit |
| Alice | blue | miss |
| Bob | red | miss |
| Bob | blue | hit |
| Bob | green | hit |
It's a silly example, but suppose really I have a lot of players with different hit/miss conditions, and I want to run just the scenarios for a given name? Say, I only want to run the tests for Alice. There's still advantage to having all the hit/miss tests in a single Scenario Outline (since, after all, they're all closely related).
One approach would be to just duplicate the test for every name and tag them, so something like:
@Alice
Given Alice is playing a game
When she shoots at a <color> target
Then she should <event>
Examples:
| color | event |
| red | hit |
| blue | miss |
This way I can run behave --tags @Alice. But then I'm repeating the same scenario for every user, and that's a lot of duplication. Is there a good way to still compress all the examples into one scenario, but only selectively run some of them? What's the right approach here?
Version 1.2.5 introduced better ways to distinguish scenario outlines. It is now possible to uniquely distinguish them and thus select a unique scenario generated from an outline with --name= at the command line. For instance, suppose the following feature file:
Feature: test
Scenario Outline: test
Given <name> is playing a game,
When they shoot at a <color> target
Then they should <event>
Examples:
| name | color | event |
| Alice | red | hit |
| Alice | blue | miss |
| Bob | red | miss |
| Bob | blue | hit |
| Bob | green | hit |
Let's say I want to run only the test for Bob, red, miss. It is in the first table, 3rd row. So:
behave --name="@1.3"
will select this test. In version 1.2.5 and subsequent versions, a generated scenario gets a name which includes "@<table number>.<row number>", where <table number> is the number of the table (starting from 1) and <row number> is the number of the row.
This won't easily allow you to select all scenarios that pertain to a single user. However, you can achieve it in another way. You can split your examples in two:
Examples: Alice
| name | color | event |
| Alice | red | hit |
| Alice | blue | miss |
Examples: Bob
| name | color | event |
| Bob | red | miss |
| Bob | blue | hit |
| Bob | green | hit |
The table names will appear in the generated scenario names and you could ask behave to run all the tests associated with one table:
behave --name="Alice"
I do not know of a way to access the example name in steps and thus get rid of the first column.
The full set of details is in the release notes for 1.2.5.
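To make the selection mechanism more concrete, here is a small sketch that mimics how names could be generated for outline rows and then filtered by a name pattern. It does not call behave. The schema string is my reading of behave's default outline annotation (outline name, then "@<table>.<row>", then the Examples table name), and treating --name as a regular-expression search over those names is likewise an assumption; `generated_names` and `select` are hypothetical helpers.

```python
import re

# Assumed shape of the generated name: "<outline> -- @<table>.<row> <examples name>".
SCHEMA = "{name} -- @{table}.{row} {examples}"


def generated_names(outline_name, tables):
    """tables: list of (examples_name, row_count) pairs, in file order."""
    names = []
    for table_index, (examples_name, row_count) in enumerate(tables, start=1):
        for row_index in range(1, row_count + 1):
            names.append(SCHEMA.format(name=outline_name, table=table_index,
                                       row=row_index, examples=examples_name))
    return names


def select(names, pattern):
    """Keep the names that the pattern matches anywhere (regex search)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


names = generated_names("test", [("Alice", 2), ("Bob", 3)])
print(select(names, "Alice"))  # every row of the "Alice" table
print(select(names, "@2.1"))   # first row of the second table
```

Under these assumptions, naming the Examples tables is what makes a per-user selection possible, because the table name becomes part of every generated scenario name.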

Specflow Feature-level Templates

I'm trying to execute an entire SpecFlow Feature using three different UserID/Password combinations. I'm struggling to find a way to do this in the .feature file without having to introduce any loops in the MSTest.
On the Scenario level I'm doing this:
Scenario Template: Verify the addition functionality
Given the value <x>
And the value <y>
When I add the values together
Then the result should be <z>
Examples:
|x|y|z|
|1|2|3|
|2|2|4|
|2|3|5|
Is there a way to do a similar table at the feature level that will cause the entire feature to be executed for each row in the table?
Is there other functionality available to do the same thing?
I don't think the snippet you have is working, is it? I've updated it below with the corrections I think you need (as Fresh also points out) and a couple of possible improvements.
With this snippet, you'll see that the scenario is run for each line in the table of examples. So, the first test will connect with 'Bob' and 'password', ask your tool to add 1 and 2 and check that the answer is 3.
I've also added an ID column - that is optional but I find it much easier to read the results with an ID number.
Scenario Outline: Verify the addition functionality
Given I am connecting with <username> and <password>
When I add <x> and <y> together
Then the result should be <total>
Examples:
| ID | username | password | x | y | total |
| 1 | Bob | password | 1 | 2 | 3 |
| 2 | Helen | Hello123 | 1 | 2 | 3 |
| 3 | Dave | pa£sword | 1 | 2 | 3 |
| 4 | Bob | password | 2 | 3 | 5 |
| 5 | Helen | Hello123 | 2 | 3 | 5 |
| 6 | Dave | pa£sword | 2 | 3 | 5 |
| 7 | Bob | password | 2 | 2 | 4 |
| 8 | Helen | Hello123 | 2 | 2 | 4 |
| 9 | Dave | pa£sword | 2 | 2 | 4 |
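The Examples table above is the cross product of three logins and three sums, so it could be generated rather than maintained by hand. The sketch below is one hypothetical way to do that in Python; the helper names (`examples_rows`, `format_table`) are made up for this illustration, and the output would still need to be pasted into the .feature file.

```python
from itertools import product

users = [("Bob", "password"), ("Helen", "Hello123"), ("Dave", "pa£sword")]
cases = [(1, 2, 3), (2, 3, 5), (2, 2, 4)]  # (x, y, total)


def examples_rows(users, cases):
    """Return the header plus one row per (case, user) combination."""
    rows = [["ID", "username", "password", "x", "y", "total"]]
    combos = product(cases, users)  # cases vary slowest, users fastest
    for row_id, ((x, y, total), (username, password)) in enumerate(combos, start=1):
        rows.append([str(row_id), username, password, str(x), str(y), str(total)])
    return rows


def format_table(rows):
    """Render rows as a pipe-delimited table with padded columns."""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "| " + " | ".join(cell.ljust(w) for cell, w in zip(row, widths)) + " |"
        for row in rows
    )


rows = examples_rows(users, cases)
print(format_table(rows))
```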
"Is there a way to do a similar table at the feature level that will
cause the entire feature to be executed for each row in the table?"
No, Specflow (and indeed the Gherkin language) doesn't have a concept of a "Feature Outline" i.e. a way of specifying a collection of features which should be run in their entirety.
You could possibly achieve what you are looking for by making use of SpecFlow tags to tag related scenarios. You could then use your test runner to trigger the testing of all the scenarios with that tag, e.g.
@related
Scenario: A
Given ...etc...
@related
Scenario: B
Given ...etc.
SpecFlow+ Runner (aka SpecRun, http://www.specflow.org/plus/) provides infrastructure (called test targets) for running the same test suite (or selected scenarios) with different settings. With this you can solve problems like the one you have mentioned. It can also be used to run the same web tests with different browsers, etc. Check this screencast for details: http://www.youtube.com/watch?v=RZYV4Dvhw3w
