How can I concatenate mixed type input into multi layer network with deeplearning4j? - machine-learning

I have a dataset where some features are numerical, some categorical, and some are strings (e.g. description). To give an example, lets say I have three features:
| Number | Type | Comment |
---------------------------------------------------------
| 1.23 | 1 | Some comment, up to 10000 characters |
| 2.34 | 2 | Different comment many words |
...
Can I have all of them as input to a multi-layer network in dl4j, where numerical and categorical would be regular input features, but string comment feature will be processed first as word-series by a simple RNN (e.g. Embedding -> LSTM)? In other words, architecture should look something like this:
"Number" "Type" "Comment"
| | |
| | Embedding
| | |
| | LSTM
| | |
Main Multi-Layer Network
|
Dense
|
...
|
Output
I think in Keras this can be achieved by Concatenate layer. Is there something like this in DL4J?

Dl4j has 99% keras import coverage. We have concatneate layers as well. Take a look at the various vertices. Whatever you can do in keras should be do able in dl4j, save for very specific cases. More here: https://deeplearning4j.org/docs/latest/deeplearning4j-nn-computationgraph You want a MergeVertex.

Related

machine learning model different inputs

i have dataset, the data set consists of : Date , ID ( the id of the event ), number_of_activities, running_sum ( the running_sum is the running sum of activities by id).
this is a part of my data :
date | id (id of the event) | number_of_activities | running_sum |
2017-01-06 | 156 | 1 | 1 |
2017-04-26 | 156 | 1 | 2 |
2017-07-04 | 156 | 2 | 4 |
2017-01-19 | 175 | 1 | 1 |
2017-03-17 | 175 | 3 | 4 |
2017-04-27 | 221 | 3 | 3 |
2017-05-05 | 221 | 7 | 10 |
2017-05-09 | 221 | 10 | 20 |
2017-05-19 | 221 | 1 | 21 |
2017-09-03 | 221 | 2 | 23 |
the goal for me is to predict the future number of activities in a given event, my question : can i train my model on all the dataset ( all the events) to predict the next one, if so how? because there are inequalities in the number of inputs ( number of rows for each event is different ), and is it possible to exploit the date data as well.
Sure you can. But alot of more information is needed, which you know yourself the best.
I guess we are talking about timeseries here as you want to predict the future.
You might want to have alook at recurrent-neural nets and LSTMs:
An Recurrent-layer takes a timeseries as input and outputs a vector, which contains the compressed information about the whole timeseries. So lets take event 156, which has 3 steps:
The event is your features, which has 3 timesteps. Each timestep has different numbers of activities (or features). To solve this, just use the maximum amount of features occuring and add a padding value (most often simply zero) so they all have the samel length. Then you have a shape, which is suitable for a recurrent neural Net (where LSTMS are currently a good choice)
Update
You said in the comments, that using padding is not option for you, let me try to convince you. LSTMs are good at situations, where the sequence length is different long. However, for this to work you also need to have longer sequences, what the model can learn its patterns from. What I want to say, when some of your sequences have only a few timesteps like 3, but you have other with 50 and more timesteps, the model might have its difficulties to predict these correct, as you have to specify, which timestep you want to use. So either, you prepare your data differently for a clear question, or you dig deeper into the topic using SequenceToSequence Learning, which is very good at computing sequences with different lenghts. For this you will need to set up a Encoder-Decoder network.
The Encoder squashs the whole sequence into one vector, whatever length it is. This one vector is compressed in a way, that it contains the information of the sequence only in one vector.
The Decoder then learns to use this vector for predicting the next outputs of the sequences. This is a known technique for machine-translation, but is suitable for any kind of sequence2sequence tasks. So I would recommend you to create such a Encoder-Decoder network, which for sure will improve your results. Have a look at this tutorial, which might help you further

Can logistic regression be used for variables containing lists?

I'm pretty new into Machine Learning and I was wondering if certain algorithms/models (ie. logistic regression) can handle lists as a value for their variables. Until now I've always used pretty standard datasets, where you have a couple of variables, associated values and then a classification for those set of values (view example 1). However, I now have a similar dataset but with lists for some of the variables (view example 2). Is this something logistic regression models can handle, or would I have to do some kind of feature extraction to transform this dataset into just a normal dataset like example 1?
Example 1 (normal):
+---+------+------+------+-----------------+
| | var1 | var2 | var3 | classification |
+---+------+------+------+-----------------+
| 1 | 5 | 2 | 526 | 0 |
| 2 | 6 | 1 | 686 | 0 |
| 3 | 1 | 9 | 121 | 1 |
| 4 | 3 | 11 | 99 | 0 |
+---+------+------+------+-----------------+
Example 2 (lists):
+-----+-------+--------+---------------------+-----------------+--------+
| | width | height | hlines | vlines | class |
+-----+-------+--------+---------------------+-----------------+--------+
| 1 | 115 | 280 | [125, 263, 699] | [125, 263, 699] | 1 |
| 2 | 563 | 390 | [11, 211] | [156, 253, 399] | 0 |
| 3 | 523 | 489 | [125, 255, 698] | [356] | 1 |
| 4 | 289 | 365 | [127, 698, 11, 136] | [458, 698] | 0 |
| ... | ... | ... | ... | ... | ... |
+-----+-------+--------+---------------------+-----------------+--------+
To provide some additional context on my specific problem. I'm attempting to represent drawings. Drawings have a width and height (regular variables) but drawings also have a set of horizontal and vertical lines for example (represented as a list of their coordinates on their respective axis). This is what you see in example 2. The actual dataset I'm using is even bigger, also containing variables which hold lists containing the thicknesses for each line, lists containing the extension for each line, lists containing the colors of the spaces between the lines, etc. In the end I would like to my logistic regression to pick up on what result in nice drawings. For example, if there are too many lines too close the drawing is not nice. The model should pick up itself on these 'characteristics' of what makes a nice and a bad drawing.
I didn't include these as the way this data is setup is a bit confusing to explain and if I can solve my question for the above dataset I feel like I can use the principe of this solution for the remaining dataset as well. However, if you need additional (full) details, feel free to ask!
Thanks in advance!
No, it cannot directly handle that kind of input structure. The input must be a homogeneous 2D array. What you can do, is come up with new features that capture some of the relevant information contained in the lists. For instance, for the lists that contain the coordinates of the lines along an axis (other than the actual values themselves), one could be the spacing between lines, or the total amount of lines or also some statistics such as the mean location etc.
So the way to deal with this is through feature engineering. This is in fact, something that has to be dealt with in most cases. In many ML problems, you may not only have variables which describe a unique aspect or feature of each of the data samples, but also many of them might be aggregates from other features or sample groups, which might be the only way to go if you want to consider certain data sources.
Wow, great question. I have never consider this, but when I saw other people's responses, I would have to concur, 100%. Convert the lists into a data frame and run your code on that object.
import pandas as pd
data = [["col1", "col2", "col3"], [0, 1, 2],[3, 4, 5]]
column_names = data.pop(0)
df = pd.DataFrame(data, columns=column_names)
print(df)
Result:
col1 col2 col3
0 0 1 2
1 3 4 5
You can easily do any multi regression on the fields/features of the data frame and you'll get what you need. See the link below for some ideas of how to get started.
https://pythonfordatascience.org/logistic-regression-python/
Post back if you have additional questions related to this. Or, start a new post if you have similar, but unrelated, questions.

Which Starspace training mode to use for multi-level embeddings

I am using the StarSpace embedding framework for the first time and am unclear on the "modes" that it provides for training and the differences between them.
The options are:
wordspace
sentencespace
articlespace
tagspace
docspace
pagespace
entityrelationspace/graphspace
Let's say I have a dataset that looks like this:
| Author | City | Tweet_ID | Tweet_contents |
|:-------|:-------|:----------|:-----------------------------------|
| A | NYC | 1 | "This is usually a short sentence" |
| A | LONDON | 2 | "Another short sentence" |
| B | PARIS | 3 | "Check out this cool track" |
| B | BERLIN | 4 | "I like turtles" |
| C | PARIS | 5 | "It was a dark and stormy night" |
| ... | ... | ... | ... |
(In reality, my dataset is not a language data and looks nothing like this, but this example demonstrates the point well enough.)
I would like to simultaneously create embeddings from scratch (not using pre-existing embeddings at any point) for each of the following:
Authors
Cities
Tweet/Sentences/Documents (EG. 1, 2, 3, 4, 5, etc.)
Words (EG. 'This', 'is', 'usually', ..., 'stormy', 'night', etc.)
Even after reading the coumentation, it doesn't seem clear which 'mode' of starspace training I should be using.
If anyone could help me understand how to interpret the modes to help select the appropriate one, that would be much appreciated.
I would also like to know if there are conditions under which the embeddings generated using one of the modes above, would in some way be equivalent to the embeddings built using a different mode (ignoring the fact that the embeddings would be different because of the non-determinstic nature of the process.)
Thank you

Mapping timeseries+static information into an ML model (XGBoost)

So lets say I have multiple probs, where one prob has two input DataFrames:
Input:
One constant stream of data (e.g. from a sensor) Second step: Multiple streams from multiple sensors
> df_prob1_stream1
timestamp | ident | measure1 | measure2 | total_amount |
----------------------------+--------+--------------+----------+--------------+
2019-09-16 20:00:10.053174 | A | 0.380 | 0.08 | 2952618 |
2019-09-16 20:00:00.080592 | A | 0.300 | 0.11 | 2982228 |
... (1 million more rows - until a pre-defined ts) ...
One static DataFrame of information, mapped to an unique identifier called ident, which needs to be assigned to the ident column in each df_probX_streamX in order to let the system recognize, that this data is related.
> df_global
ident | some1 | some2 | some3 |
--------+--------------+----------+--------------+
A | LARGE | 8137 | 1 |
B | SMALL | 1234 | 2 |
Output:
A binary classifier [0,1]
So how can I suitable train XGBoost to be able to make the best usage of one timeseries DataFrame in combination with one static DataFrame (containg additional context information) in one prob? Any help would be appreciated.

Visualising the motion of multiple robots

I am trying to add multiple robot instances and visualising their motions. I tried a couple of ways to do this and I ran into errors/issues. They are as follows:
I tried adding another model instance after the system is created.
parsers::urdf::AddModelInstanceFromUrdfFileToWorld(
FindResourceOrThrow("path/CompassGait.urdf"),
multibody::joints::kRollPitchYaw, tree.get());
parsers::urdf::AddModelInstanceFromUrdfFileToWorld(
FindResourceOrThrow("path/quadrotor.urdf"),
multibody::joints::kRollPitchYaw, tree.get());
As expected, there are two robots which are visible in the visualiser and there are 26 output ports in the system. But i am unable to visualise the required motion by the quadrotor. It seems to be following the x,y,z and roll pitch and yaw derivatives given as an input for the compass gait's output port. Is this an expected behaviour?
I experience a similar thing when I add 2 compass gait models and try to make them follow the same path. Even though I give the ports 14-27 the same inputs as i give to 0-13. The second robot is stuck near the origin while the first one moves fine as expected without any issues.
I needed some help or maybe some examples where I can get a better idea about visualising the motion for multiple robots.
[Updated] Please see note at the bottom.
drake::systems::DrakeVisualizer (which I assume you're using to publish your visualization messages) was designed to be connected to the state_output_port of drake::systems::RigidBodyPlant. According to the documentation for RigidBodyPlant,
The state vector, x, consists of generalized positions followed by generalized velocities.
That is, the generalized positions for all model instances come before the generalized velocities. When working with RigidBodyPlant and DrakeVisualizer this ordering is handled automatically.
From your question, however, I gather that you have separate, specialized systems for your quadrotor and compass-gait models (as per their respective examples). Each of these systems outputs its state as [q, v], where q is the generalized position vector and v is the generalized velocity vector. In that case you will need to use drake::systems::Demultiplexer and drake::systems::Multiplexer to split the outputs of the quadrotor and compass-gait systems and reassemble them in the required order:
+---------------+ +-------------+ q +-------------+
| | | +-------->+ |
| Compass-gait +-->+Demultiplexer| | |
| | | +-----+ v | |
+---------------+ +-------------+ | | |
+----->+ | +-----------------+
| | | | | |
| | | Multiplexer +-->+ DrakeVisualizer |
q| | | | | |
| +-->+ | +-----------------+
+---------------+ +-------------+ | | |
| | | +--+ | |
| Quadrotor +-->+Demultiplexer| | |
| | | +---+---->+ |
+---------------+ +-------------+ v +-------------+
Note: RigidBodyPlant and associated classes are being replaced by drake::multibody::MultibodyPlant and drake::geometry::SceneGraph. See run_quadrotor_lqr.cc for an example of using these new tools with a specialized plant.

Resources