Splitting complex PDF files using Watson Document Conversion Service - watson

We are implementing Question & Answering System using Watson Discovery Service(WDS). We required each answer unit available in single document. We have complex PDF files as corpus. The PDF files contains two column data, tables and images. Instead ingesting whole PDF files as corpus to WDS and using passage retrieval we are using Watson Document Conversion Service(WDC) to split each PDF file into answer units and later we are ingesting there answer units into WDS.
We are facing two issues with Watson Document Conversion service for complex PDF splitting.
We are expecting each heading as title and corresponding text as data(answer). However it is splitting each chapter as single answer unit. Is there any way to split the two column document based on the heading?
In case the input PDF file contains table the document conversion service reading structured data available in PDF file as simple text(missing table formatting). Is there any way to read structured data from PDF to answer unit?

I would recommend that you first convert your PDF to normalized HTML by using this setting:
"conversion_target": "normalized_html"
and inspect the generated HTML. Look for the places where headings (<h1>, <h2>, ..., <h6>) are detected. Those are the tags that will be used to split by answer units when you switch back to answer_units.
The reason you are currently seeing each chapter being split as an answer unit is because each chapter probably starts with a heading, but no headings are detected within each chapter.
In order to generate more answer units, you will need to tweak the PDF input configurations as described here, so that more headings are generated from the PDF to HTML conversion step and hence more answer units are generated.
For example, the following configuration will detect headings at 6 different levels, based on certain font characteristics for each level:
{
"conversion_target": "normalized_html",
"pdf": {
"heading": {
"fonts": [
{"level": 1, "min_size": 24},
{"level": 2, "min_size": 18, "max_size": 23, "bold": true},
{"level": 3, "min_size": 14, "max_size": 17, "italic": false},
{"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
{"level": 5, "min_size": 10, "max_size": 12, "bold": true},
{"level": 6, "min_size": 9, "max_size": 10, "bold": true}
]
}
}
}
You can start with a configuration like this and keep tweaking it until the produced normalized HTML contains the headings at the places that you expect the answer units to be. Then, take the tweaked configuration, switch to answer_units and put it all together:
{
"conversion_target": "answer_units",
"answer_units": {
"selector_tags": ["h1", "h2", "h3", "h4", "h5", "h6"]
},
"pdf": {
"heading": {
"fonts": [
{"level": 1, "min_size": 24},
{"level": 2, "min_size": 18, "max_size": 23, "bold": true},
{"level": 3, "min_size": 14, "max_size": 17, "italic": false},
{"level": 4, "min_size": 12, "max_size": 13, "name": "Times New Roman"},
{"level": 5, "min_size": 10, "max_size": 12, "bold": true},
{"level": 6, "min_size": 9, "max_size": 10, "bold": true}
]
}
}
}
Regarding your second question about tables, unfortunately there is no way to convert table content into answer units. As explained above, answer unit generation is based on heading detection. That being said, if there is a table between two detected headings, that table will be part of the answer unit as any other content between the two headings.

Related

highchartr: Synchronize point and line colors and legend

I'm using the highchartr package to create an interactive chart.
My chart has lines on it corresponding to different series. In addition, I would like to have shapes at certain points on the lines.
Its very easy to get the points in the right place. However, I would like to map the point color to the line it is associated with. And when the user clicks on the legend entry for the line, I'd like the associated points to be toggled as well.
The code looks like this:
highchart() %>%
hc_add_series(
type="line",
marker=list(enabled=F),
data=input_data,
mapping=hcaes(x=x, y=y, group=series_name)
) %>%
hc_add_series(
type="point",
data=input_data %>% filter(! is.na(marker)),
mapping=hcaes(x=x, y=y, color=series_name, fill=series_name, group=series_name, shape=marker)
)
The result gets the points in the right place. But the point color is on a different color mapping from the lines. Clicking on the entry for the line in the legend toggles only the line - the points show up as separate entries by series_name.
What
What can I do so:
- The points and lines share the same color mapping
- The points and lines can be toggled together by clicking on the line in the legend
- The points show up separately in the legend based on their shape rather than their color?
Thanks!
Generally, it can be achieved in at least few different ways. It all depends on your data which you haven't provided (I created a sample data).
Additionally, I will provide all the examples in jsFiddle (JavaScript) because it is faster to explain something that way with a quick online example.
The final answer will contain R code (maybe with some custom JavaScript if needed, but all will be reproducible in R.
First of all, your assumption that you need a separate series is wrong and causes problems. If you want markers on your line with the same color and you want to toggle them together on legend click, then you don't need separate series - one series with markers enabled on some points is enough, see this example: https://jsfiddle.net/BlackLabel/s24rk9x7/
In this case, the R data needs to be defined properly.
If you don't want to keep it simple as described above, you can keep lines and markers as separate series as in your original question.
In this case, you can use series.linkedTo property to connect your "point" series to line series (BTW in Highcharts there is no something like "point" series type, it is "scatter" series type. Another reason why your code is wrong and is not working and you got unvoted), but there is a problem with it in Highcharter - doesn't work, seems like a bug and should be reported on Highcharter GitHub repo.
This is a JavaScript version which works fine: https://jsfiddle.net/BlackLabel/3mtdfqLo/
In this example, if you want to keep markers and line series in the same color, you can define colors manually or you can write some custom code (like I did) that will change the color for you automatically.
And this is the same R version which should work, but is not:
library(highcharter)
highchart() %>%
hc_add_series(
data=list(4, 3, 5, 6, 2, 3)
) %>%
hc_add_series(
data=list(14, 13, 15, 16, 12, 13),
id="first"
) %>%
hc_add_series(
data=list(10, 8, 6, 2, 5, 12),
id="second"
) %>%
hc_add_series(
type="scatter",
linkedTo="first",
data=list(list(1, 3), list(2, 5))
) %>%
hc_add_series(
type="scatter",
linkedTo="second",
data=list(list(1, 13), list(2, 15), list(3, 16))
) %>%
hc_plotOptions(
line = list(marker=list(enabled=F))
)
There is probably something wrong with hc_add_series function.
As a workaround, you can write it all as a custom JavaScript code, which (again) works fine:
library(highcharter)
highchart() %>%
hc_plotOptions(
line = list(marker=list(enabled=F))
) %>%
hc_chart(
events = list(load = JS("function() {
this.addSeries({
data: [4, 3, 5, 6, 2, 3],
id: 'first'
});
this.addSeries({
data: [14, 13, 15, 16, 12, 13],
id: 'second'
});
this.addSeries({
data: [10, 8, 6, 2, 5, 12]
});
this.addSeries({
type: 'scatter',
linkedTo: 'first',
data: [[1, 3], [2, 5]]
});
this.addSeries({
type: 'scatter',
linkedTo: 'second',
data: [[1, 13], [2, 15], [3, 16]]
});
}")))
Of course, last examples don't contain functionality that changes colors - you can copy it from the jsFiddle above.

Same vs Different Target Values for each sample for Regression in Machine Learning

I am a newbie in machine learning and learning the basic concepts in regression. The confusion I have can be well explained by placing an example of input samples with the target values. So, For example (please notice that the example I am putting is the general case, I observed the performance and predicted values on a large custom dataset of images. Also, notice that the target values are not in floats.), I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
and
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
As you can notice that the ever three (two in the test set) samples have similar target values. Suppose I have a multi-layer perceptron network with one Flatten() and two Dense() layers. The network, after training, predicts the target values all same for test samples:
yPredicted = [40, 40, 40, 40]
Because the predicted values are all same, the correlations between ytest and yPredicted return null and give an error.
But when I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [332, 433, 456, 675, 234, 879, 242, 634, 789, 432, 897, 982]
And:
xtest = [13, 14, 15, 16]
ytest = [985, 341, 354, 326]
The predicted values are:
yPredicted = [987, 345, 435, 232]
which gives very good correlations.
My question is, what it the thing or process in a machine learning algorithm that makes the learning better when having different target values for each input? Why the network does not work when having repeated values for a large number of inputs?
Why the network does not work when having repeated values for a large number of inputs?
Most certainly, this is not the reason why your network does not perform well in the first dataset shown.
(You have not provided any code, so inevitably this will be a qualitative answer)
Looking closely at your first dataset:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
it's not difficult to conclude that we have a monotonic (increasing) function y(x) (it is not strictly monotonic, but it is monotonic nevertheless over the whole x range provided).
Given that, your model has absolutely no way of "knowing" that, for x > 12, the qualitative nature of the function changes significantly (and rather abruptly), as apparent from your test set:
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
and you should not expect it to know or "guess" it in any way (despite what many people may seem to believe, NN are not magic).
Looking closely to your second dataset, you will realize that this is not the case with it, hence the network is unsurprisingly able to perform better here; when doing such experiments, it is very important to be sure that we are comparing apples to apples, and not apples to oranges.
Another general issue with your attempts here and your question is the following: neural nets are not good at extrapolation, i.e. predicting such numerical functions outside the numeric domain on which they have been trained. For details, please see own answer at Is deep learning bad at fitting simple non linear functions outside training scope?
A last unusual thing here is your use of correlation; not sure why you choose to do this, but you may be interested to know that, in practice, we never assess model performance using a correlation measure between predicted outcomes and ground truth - we use measures such as the mean squared error (MSE) instead (for regression problems, such as yours here).

LSTM, pattern and noise gap

I want to find sequence patterns in a time series with random noise gap.
For example, this is the pattern I wan to find:
1, 2, 3, 4
But, my samples are:
*1*, 10, *2*, *3*, 11, 12, *4*
*1*, *2*, 10, 14, 15, *3*, 10, 13, *4*
10, *1*, 10, 10, 10, *2*, 11, 12, *3*, *4*
I don't know that the "good" elements are 1, 2, 3 and 4.
I started with a LSTM decoder, but "the noise" hide the good elements. For example, with the 3 samples, I get:
*1*, 10, 13, 10, ...
and 2, 3 and 4 are hidden
Have you an idea to find those patterns ?
Thanks.
Frédéric
As a starting point you can use a sequence-to-sequence (seq2seq) model. The linked repo has nice explanation how these models work and what type of problems they cover. The crucial point would be how to encode your sequence. Often they are encoded as one-hot vectors. So if you have a fixed upper bound on the number of distinct numbers/items in your sequence you can use it.
Instead of generating a new sequence without noise from the original one, you can also try to classify each point as noise or not and eliminate those classified as noise as your output. Something along the lines of:
seq = Input(shape=(timesteps, features))
hidden = LSTM(HIDDEN_UNITS, return_sequences=True)(seq)
out = TimeDistributed(Dense(1, activation='sigmoid'))(hidden)
You will have to know before hand whether each data point is noise or not.

Would Sklearn GridSearchCV go through all the possible default options of the estimator's parameters?

Algorithms in scikit-learn might have some parameters that have default range of options,
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
and the parameter has a default value "auto", with the following options: algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}
My question is, when using **GridSearchCV** to find the best set of values for the parameters of an algorithm, would GridSearchCV go though all the default options of a parameter even though I don't add it to the parameter_list?
For example, I want to use **GridSearchCV** to find the best parameter values for **kNN**, I need to examine the n_neighbors and algorithm parameters, is it possible that I just need to pass the values with no as below (because the algorithm parameter has default options),
parameter_list = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
or, I have to specify all the options that I want to examine?
parameter_list = {
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
Thanks.
No, You are misunderstanding about the parameter default and available option.
Looking at the documentation of KNeighborsClassifier, the parameter algorithm is an optional parameter (i.e. you may and may not specify it during constructor of KneighborsClassifier).
But if you decide to specify it, then it has options available: {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}. It means that you can give the value only from these given options for algorithm and cannot use any other string to specify for algorithm. The default option is 'auto', means that if you dont supply any value, then it will internally use 'auto'.
Case 1:- KNeighborsClassifier(n_neighbors=3)
Here since no value for algorithm has been specified, so it will by default use algorithm='auto'.
Case 2:- KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
Here as the algorithm has been specified, so it will use 'kd_tree'
Now, GridSearchCV will only pass those parameters to the estimator which are specified in the param_grid. So in your case when you use the first parameter_list from the question, then it will give only n_neighbors to the estimator and algorithm will have only default value ('auto').
If you use the second parameter_list, then both n_neighbors and algorithm will be passed on to the estimator.

Highcharts: Disconnect line when there's no data for position

in my project I'm using highcharts/stockcharts. See my JS-fiddle example for a simple graph with X-axis 1 to 10, but no data for "position 9". I want no line to be drawn between 8 and 10, since there is no data for 9.
I've been playing with the connectNulls option, but that only works when providing null as value for position 9. Since Highcharts figures out the intervals by itself I hoped it would recognise on its own that there's no data for that position. Is there any way to make Highcharts not draw a line between 8 and 10 without specifying null for position 9?
Thanks in advance for your replies.
http://jsfiddle.net/bnqhuqt3/
You can not cut a line in a serie. You need to define a new data for series
series: [{
data: [
[1, 98.87],
[2, 98.45],
[3, 98.52],
[4, 99.34],
[5, 98.56],
[6, 98.61],
[7, 98.12],
[8, 98.03],
]
}, {
data: [
[10, 0],
[11, 150]
]
}]
Fiddle Example

Resources