I have a dataset of GPS logs that also contains GPS speeds. Here is what the dataset looks like:
id | gpstime | lat | lon | speed
--------+------------+------------+------------+---------
157934 | 1530099776 | 41.1825026 | -8.5996864 | 3.40901
157934 | 1530099777 | 41.1825114 | -8.599722 | 3.43062
157934 | 1530099778 | 41.1825233 | -8.5997594 | 3.45739
157934 | 1530099779 | 41.1825374 | -8.5997959 | 3.40025
157934 | 1530099780 | 41.1825519 | -8.5998337 | 3.41673
(5 rows)
Now I want to compute the bearing change for each point, measured with respect to true north.
But I have two questions that I have not yet found answers to:
1. In my reading I came across this formula (as in this answer):
Bearing = atan(y, x)
where x and y are the quantities
y = sin(Blon - Alon) * cos(Blat)
x = cos(Alat) * sin(Blat) - sin(Alat) * cos(Blat) * cos(Blon - Alon)
for points A and B. Another source (the formula here) writes it as:
Bearing = atan2(y, x)
So I'm confused: which of the two formulas should I use?
2. lat and lon should be converted from degrees to radians before being passed to x and y. Since the lon values in my dataset are negative, should I take the absolute value of each?
I think the full formula would be overkill for GPS tracks. As long as the distance between two points is not too big (say, a few hundred meters), I assume this simplified calculation is sufficient.
The latitude/longitude differences are approximately
Δlat = 111km * (lat1 - lat2)
Δlon = 111km * cos(lat) * (lon1 - lon2)
So bearing would be
bearing = atan(Δlon / Δlat) * 180/π
bearing = atan(cos(lat) * (lon1 - lon2) / (lat1 - lat2)) * 180/ACOS(-1)
For lat, use either lat1 or lat2, or the midpoint if you like:
lat = (lat1 + lat2)/2 * π/180 = (lat1 + lat2)/2 * ACOS(-1)/180
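Keep in mind that Δlat or Δlon could be 0. When Δlat is 0 the division fails, so handle that case separately or use atan2(Δlon, Δlat) instead of atan.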
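To make the two formulas concrete, here is a minimal Python sketch of both the great-circle bearing and the flat-earth approximation. The function names are mine, not from the question or the linked answers, and the bearing is taken from A to B (the reverse sign convention of lat1 - lat2 above). It uses atan2, which takes y and x as separate arguments and so resolves the quadrant, and it passes negative longitudes through unchanged, since taking absolute values would mirror the track.

```python
import math

def initial_bearing(lat_a, lon_a, lat_b, lon_b):
    """Great-circle bearing from A to B, in degrees clockwise from true north."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    dlon = math.radians(lon_b - lon_a)  # signed difference, no abs()
    y = math.sin(dlon) * math.cos(phi_b)
    x = (math.cos(phi_a) * math.sin(phi_b)
         - math.sin(phi_a) * math.cos(phi_b) * math.cos(dlon))
    # atan2 returns an angle in (-180, 180]; map it onto [0, 360).
    return math.degrees(math.atan2(y, x)) % 360.0

def flat_bearing(lat_a, lon_a, lat_b, lon_b):
    """Flat-earth approximation, intended only for short distances."""
    mean_lat = math.radians((lat_a + lat_b) / 2)
    dlat = lat_b - lat_a
    dlon = math.cos(mean_lat) * (lon_b - lon_a)
    # atan2 also copes with dlat == 0, where atan(dlon / dlat) would fail.
    return math.degrees(math.atan2(dlon, dlat)) % 360.0

# First two rows of the sample data in the question.
a = (41.1825026, -8.5996864)
b = (41.1825114, -8.599722)
print(initial_bearing(*a, *b), flat_bearing(*a, *b))
```

The bearing change at a point is then the difference between consecutive bearings, wrapped into the range -180 to 180 degrees.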
I am using Spark ML's LinearSVC in a binary classification model. The transform method creates two columns, prediction and rawPrediction. Spark's docs don't provide any way of interpreting the rawPrediction column for this particular classifier. This question has been asked and answered for other classifiers, but not specifically for LinearSVC.
The relevant column from my predictions dataframe:
+------------------------------------------+
|rawPrediction |
+------------------------------------------+
|[0.8553257800650063,-0.8553257800650063] |
|[0.4230977574196645,-0.4230977574196645] |
|[0.49814263303537865,-0.49814263303537865]|
|[0.9506355050332026,-0.9506355050332026] |
|[0.5826887000450813,-0.5826887000450813] |
|[1.057222808292026,-1.057222808292026] |
|[0.5744214192446275,-0.5744214192446275] |
|[0.8738081933835614,-0.8738081933835614] |
|[1.418173816502859,-1.418173816502859] |
|[1.0854125533426737,-1.0854125533426737] |
+------------------------------------------+
Clearly this isn't simply the probability of belonging to each class. What is it?
Edit: Since the input code has been requested, here's a model built on a subset of features in the original dataset. Fitting any data with Spark's LinearSVC will produce this column.
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.feature.VectorAssembler
// Load the CSV with a header row and inferred column types.
var df = sqlContext
  .read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/full_frame_20180716.csv")
// Assemble the selected feature columns into a single vector column.
var assembler = new VectorAssembler()
  .setInputCols(Array("oy_length", "ah_length", "ey_length", "vay_length", "oh_length",
    "longest_word_length", "total_words", "repeated_exact_words",
    "repeated_bigrams", "repeated_lemmatized_words",
    "repeated_lemma_bigrams"))
  .setOutputCol("features")
df = assembler.transform(df)
// 80/20 train/test split with a fixed seed.
var Array(train, test) = df.randomSplit(Array(.8,.2), 42)
var supvec = new LinearSVC()
  .setLabelCol("written_before_2004")
  .setMaxIter(10)
  .setRegParam(0.001)
var supvecModel = supvec.fit(train)
var predictions = supvecModel.transform(test)
predictions.select("rawPrediction").show(20, false)
Output:
+----------------------------------------+
|rawPrediction |
+----------------------------------------+
|[1.1502868455791242,-1.1502868455791242]|
|[0.853488887006264,-0.853488887006264] |
|[0.8064994501574174,-0.8064994501574174]|
|[0.7919862003563363,-0.7919862003563363]|
|[0.847418035176922,-0.847418035176922] |
|[0.9157433788236442,-0.9157433788236442]|
|[1.6290888181913814,-1.6290888181913814]|
|[0.9402461917731906,-0.9402461917731906]|
|[0.9744052798627367,-0.9744052798627367]|
|[0.787542624053347,-0.787542624053347] |
|[0.8750602657901001,-0.8750602657901001]|
|[0.7949414037722276,-0.7949414037722276]|
|[0.9163545832998052,-0.9163545832998052]|
|[0.9875454213431247,-0.9875454213431247]|
|[0.9193015302646135,-0.9193015302646135]|
|[0.9828623328048487,-0.9828623328048487]|
|[0.9175976004208621,-0.9175976004208621]|
|[0.9608750388820302,-0.9608750388820302]|
|[1.029326217566756,-1.029326217566756] |
|[1.0190290910146256,-1.0190290910146256]|
+----------------------------------------+
only showing top 20 rows
It is (-margin, margin).
override protected def predictRaw(features: Vector): Vector = {
  val m = margin(features)
  Vectors.dense(-m, m)
}
As arpad mentioned, it is the margin.
And the margin is:
margin = coefficients * feature + intercept
or
y = w * x + b
If you divide the margin by the norm of the coefficients, you will get the distance to the hyperplane for each data point.
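As a rough illustration of the relationship above, here is a small plain-Python sketch (not Spark code) that rebuilds the rawPrediction pair and the signed distance from a weight vector and intercept. The coefficient values and function names are invented for the example. The thresholding in predict is my understanding of how the model turns the margin into a label with the default threshold of 0.0, so treat that part as an assumption.

```python
import math

def margin(coefficients, intercept, features):
    """w * x + b for one data point."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept

def raw_prediction(coefficients, intercept, features):
    """Mirror of predictRaw above: the pair (-margin, margin)."""
    m = margin(coefficients, intercept, features)
    return [-m, m]

def signed_distance(coefficients, intercept, features):
    """Margin divided by the norm of w: signed distance to the hyperplane."""
    norm = math.sqrt(sum(w * w for w in coefficients))
    return margin(coefficients, intercept, features) / norm

def predict(coefficients, intercept, features, threshold=0.0):
    """Assumed decision rule: label 1.0 when the margin exceeds the threshold."""
    return 1.0 if margin(coefficients, intercept, features) > threshold else 0.0

# Hypothetical model: w = (3, 4), b = -1.
w, b = [3.0, 4.0], -1.0
x = [1.0, 0.5]
print(raw_prediction(w, b, x), signed_distance(w, b, x), predict(w, b, x))
```

Read this way, a first component that is positive (as in every row shown in the question) means the margin is negative, so the point lies on the side of the hyperplane belonging to class 0.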
I would like to parse the Wikipedia power plant lists, which contain the {{Location map}} template. In my example I'm using the German version of the template, but this shouldn't change the basic process.
How can I extract the label=, lat=, long= and region= parameters from code like this?
This is probably not a job for an HTML parser like BeautifulSoup, but rather for something like awk?
{{ Positionskarte+
| Tadschikistan
| maptype = relief
| width = 600
| float = right
| caption =
| places =
{{ Positionskarte~
| Tadschikistan
| label = <small>[[Talsperre Baipasa|Baipasa]]</small>
| marktarget =
| mark = Blue pog.svg
| position = right
| lat = 38.267584
| long = 69.123906
| region = TJ
| background = #FEFEE9
}}
{{ Positionskarte~
| Tadschikistan
| label = <small>[[Kraftwerk Duschanbe|Duschanbe]]</small>
| marktarget =
| mark = Red pog.svg
| position = left
| lat = 38.5565
| long = 68.776
| region = TJ
| background = #FEFEE9
}}
...
}}
Thanks in advance!
Just extract the information with regular expressions.
For example, like this (PHP):
$k = "{{ Positionskarte+
| Tadschikistan
| maptype = relief
| width = 600
| float = right
| caption =
| places =
{{ Positionskarte~
| Tadschikistan
| label = <small>[[Talsperre Baipasa|Baipasa]]</small>
| marktarget =
| mark = Blue pog.svg
| position = right
| lat = 38.267584
| long = 69.123906
| region = TJ
| background = #FEFEE9
}}
{{ Positionskarte~
| Tadschikistan
| label = <small>[[Kraftwerk Duschanbe|Duschanbe]]</small>
| marktarget =
| mark = Red pog.svg
| position = left
| lat = 38.5565
| long = 68.776
| region = TJ
| background = #FEFEE9
}}
}}";
// Split on the inner template name so each chunk holds one location entry.
$items = explode("Positionskarte~", $k);
$result = [];
foreach ($items as $item) {
    $info = [];
    // Pull out each wanted parameter; the value runs to the end of its line.
    foreach (['label', 'lat', 'long', 'region'] as $key) {
        if (preg_match('/' . $key . '\s+=\s+(.+)/', $item, $matches)) {
            $info[$key] = trim($matches[1]);
        }
    }
    // The first chunk (the outer template) has none of these keys and is skipped.
    if (!empty($info)) {
        $result[] = $info;
    }
}
var_dump($result);
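Since the question mentions BeautifulSoup, the same idea might look like this in Python. This is only a sketch under the assumption that every parameter sits on its own line, as in the sample. The names WIKITEXT, PARAM and parse_locations are mine, and the sample text is a shortened copy of the markup above.

```python
import re

WIKITEXT = """{{ Positionskarte+
| Tadschikistan
| places =
{{ Positionskarte~
| Tadschikistan
| label = <small>[[Talsperre Baipasa|Baipasa]]</small>
| lat = 38.267584
| long = 69.123906
| region = TJ
}}
{{ Positionskarte~
| Tadschikistan
| label = <small>[[Kraftwerk Duschanbe|Duschanbe]]</small>
| lat = 38.5565
| long = 68.776
| region = TJ
}}
}}"""

# One "| key = value" parameter per line; only the four wanted keys match.
PARAM = re.compile(
    r"^\s*\|\s*(label|lat|long|region)\s*=[ \t]*(.*?)\s*$", re.MULTILINE
)

def parse_locations(wikitext, template="Positionskarte~"):
    """Return one dict of parameters per inner template occurrence."""
    entries = []
    # Chunk 0 is the outer template, so skip it.
    for chunk in wikitext.split(template)[1:]:
        params = dict(PARAM.findall(chunk))
        if params:
            entries.append(params)
    return entries

for entry in parse_locations(WIKITEXT):
    print(entry["label"], float(entry["lat"]), float(entry["long"]), entry["region"])
```

Anchoring the pattern to the start of a line keeps the `|` inside `[[Talsperre Baipasa|Baipasa]]` from being mistaken for a parameter separator. For deeply nested or single-line templates a dedicated wikitext parser would likely be more robust than regular expressions.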
I'm building a web app to help students learn maths.
The app needs to display maths content that comes from LaTeX files.
These LaTeX files render (beautifully) to PDF, which I can convert cleanly to SVG thanks to pdf2svg.
The image (SVG, PNG or whatever image format) looks something like this:
_______________________________________
| |
| 1. Word1 word2 word3 word4 |
| a. Word5 word6 word7 |
| |
| ///////////Graph1/////////// |
| |
| b. Word8 word9 word10 |
| |
| 2. Word11 word12 word13 word14 |
| |
|_______________________________________|
Real example:
The purpose of the web app is to manipulate this and add content to it, producing something like this:
_______________________________________
| |
| 1. Word1 word2 | <-- New line break
|_______________________________________|
| |
| -> NewContent1 |
|_______________________________________|
| |
| word3 word4 |
|_______________________________________|
| |
| -> NewContent2 |
|_______________________________________|
| |
| a. Word5 word6 word7 |
|_______________________________________|
| |
| ///////////Graph1/////////// |
|_______________________________________|
| |
| -> NewContent3 |
|_______________________________________|
| |
| b. Word8 word9 word10 |
|_______________________________________|
| |
| 2. Word11 word12 word13 word14 |
|_______________________________________|
Example:
A single large image does not give me the flexibility to do this kind of manipulation.
But if the image were broken down into smaller files, each holding a single word or a single graph, I could do these manipulations.
What I think I need to do is detect whitespace in the image and slice it into multiple sub-images, looking something like this:
_______________________________________
| | | | |
| 1. Word1 | word2 | word3 | word4 |
|__________|_______|_______|____________|
| | | |
| a. Word5 | word6 | word7 |
|_____________|_______|_________________|
| |
| ///////////Graph1/////////// |
|_______________________________________|
| | | |
| b. Word8 | word9 | word10 |
|_____________|_______|_________________|
| | | | |
| 2. Word11 | word12 | word13 | word14 |
|___________|________|________|_________|
I'm looking for a way to do this.
What do you think is the way to go?
Thank you for your help!
I would use horizontal and vertical projection to first segment the image into lines, and then each line into smaller slices (e.g. words).
Start by converting the image to grayscale, and then invert it, so that gaps contain zeros and any text/graphics are non-zero.
img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray
Calculate horizontal projection -- mean intensity per row, using cv2.reduce, and flatten it to a linear array.
row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
Now find the row ranges for all the contiguous gaps. You can use the function provided in this answer.
row_gaps = zero_runs(row_means)
Finally calculate the midpoints of the gaps, that we will use to cut the image up.
# Integer division keeps the cutpoints usable as array indices in Python 3.
row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2
You end up with something like this situation (gaps are pink, cutpoints red):
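To see what the gap detection produces, it can help to run it on a tiny hand-made projection first. The sketch below repeats the zero_runs helper and applies it to an invented 9-element array (the values are made up for illustration, not taken from the article image). It then derives gap sizes and midpoints the same way as in the steps of this answer.

```python
import numpy as np

def zero_runs(a):
    # 1 where a is 0, padded with a 0 at each end so edge runs are detected.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Each row is [start, end) of one run of zeros.
    return np.where(absdiff == 1)[0].reshape(-1, 2)

# Toy projection: zeros are whitespace, non-zeros are ink.
projection = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0], dtype=np.float32)

gaps = zero_runs(projection)                      # [start, end) per gap
gap_sizes = gaps[:, 1] - gaps[:, 0]               # width of each gap
cutpoints = (gaps[:, 0] + gaps[:, 1] - 1) // 2    # integer midpoint of each gap
wide_cutpoints = cutpoints[gap_sizes > 1]         # drop gaps only 1 wide

print(gaps.tolist(), gap_sizes.tolist(), cutpoints.tolist(), wide_cutpoints.tolist())
```

The size filter in the last step is the same trick used later to ignore the narrow gaps between letters while keeping the wider gaps between words.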
The next step is to process each identified line.
bounding_boxes = []
for n, (start, end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]
Calculate the vertical projection (average intensity per column), find the gaps and cutpoints. Additionally, calculate gap sizes, to allow filtering out the small gaps between individual letters.
    column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
    column_gaps = zero_runs(column_means)
    column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
    column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2
Filter the cutpoints, keeping only those from gaps wider than 5 pixels.
    filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]
And create a list of bounding boxes for each segment.
    for xstart, xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
        bounding_boxes.append(((int(xstart), int(start)), (int(xend), int(end))))
Now you end up with something like this (again gaps are pink, cutpoints red):
Now you can cut up the image. I'll just visualize the bounding boxes found:
The full script:
import cv2
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec

def plot_horizontal_projection(file_name, img, projection):
    # Image on the left, per-row mean intensity on the right.
    fig = plt.figure(1, figsize=(12,16))
    gs = gridspec.GridSpec(1, 2, width_ratios=[3,1])
    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)
    ax = plt.subplot(gs[1])
    ax.plot(projection, np.arange(img.shape[0]), 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([0.0, 255.0])
    plt.ylim([-0.5, img.shape[0] - 0.5])
    ax.invert_yaxis()
    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])
    fig.set_dpi(200)
    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf()

def plot_vertical_projection(file_name, img, projection):
    # Image on top, per-column mean intensity below.
    fig = plt.figure(2, figsize=(12, 4))
    gs = gridspec.GridSpec(2, 1, height_ratios=[1,5])
    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)
    ax = plt.subplot(gs[1])
    ax.plot(np.arange(img.shape[1]), projection, 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([-0.5, img.shape[1] - 0.5])
    plt.ylim([0.0, 255.0])
    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])
    fig.set_dpi(200)
    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf()

def visualize_hp(file_name, img, row_means, row_cutpoints):
    # Gaps in pink, cutpoints in red.
    row_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    row_highlight[row_means == 0, :, :] = [255,191,191]
    row_highlight[row_cutpoints, :, :] = [255,0,0]
    plot_horizontal_projection(file_name, row_highlight, row_means)

def visualize_vp(file_name, img, column_means, column_cutpoints):
    # Gaps in pink, cutpoints in red.
    col_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    col_highlight[:, column_means == 0, :] = [255,191,191]
    col_highlight[:, column_cutpoints, :] = [255,0,0]
    plot_vertical_projection(file_name, col_highlight, column_means)

# From https://stackoverflow.com/a/24892274/3962537
def zero_runs(a):
    # Create an array that is 1 where a is 0, and pad each end with an extra 0.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray
row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
row_gaps = zero_runs(row_means)
# Integer division keeps the cutpoints usable as array indices in Python 3.
row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2
visualize_hp("article_hp.png", img, row_means, row_cutpoints)
bounding_boxes = []
for n, (start, end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]
    column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
    column_gaps = zero_runs(column_means)
    column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
    column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2
    # Keep only cutpoints from gaps wider than 5 pixels (word gaps, not letter gaps).
    filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]
    for xstart, xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
        # Plain ints, since cv2.rectangle may reject NumPy scalar coordinates.
        bounding_boxes.append(((int(xstart), int(start)), (int(xend), int(end))))
    visualize_vp("article_vp_%02d.png" % n, line, column_means, filtered_cutpoints)
result = img.copy()
for bounding_box in bounding_boxes:
    cv2.rectangle(result, bounding_box[0], bounding_box[1], (255,0,0), 2)
cv2.imwrite("article_boxes.png", result)
The image is top quality: perfectly clean, not skewed, with well-separated characters. A dream!
First perform binarization and blob detection (standard in OpenCV).
Then cluster the characters by grouping those with an overlap in the ordinates (i.e. facing each other in a row). This will naturally isolate the individual lines.
Now in every row, sort the blobs left-to-right and cluster by proximity to isolate the words. This will be a delicate step, because the spacing of characters within a word is close to the spacing between distinct words. Don't expect perfect results. This should work better than a projection.
The situation is worse with italics, as the horizontal spacing is even narrower. You may also have to look at the "slanted distance", i.e. find the lines tangent to the characters in the direction of the italics. This can be achieved by applying a reverse shear transform.
Thanks to the grid, the graphs will appear as big blobs.
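The two clustering steps described above could be sketched roughly as follows. This assumes the blobs have already been reduced to bounding boxes (x, y, width, height), for instance from a connected-components or contour pass in OpenCV. The function names, the sample boxes and the max_gap threshold are all invented for illustration and would need tuning on real data.

```python
def group_into_lines(boxes):
    """Cluster (x, y, w, h) boxes whose vertical extents overlap."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):  # top to bottom
        x, y, w, h = box
        for line in lines:
            # Overlap in the ordinates with the line's current extent.
            if y < line["bottom"] and y + h > line["top"]:
                line["boxes"].append(box)
                line["top"] = min(line["top"], y)
                line["bottom"] = max(line["bottom"], y + h)
                break
        else:
            lines.append({"top": y, "bottom": y + h, "boxes": [box]})
    # Sort every line left to right.
    return [sorted(line["boxes"]) for line in lines]

def group_into_words(line_boxes, max_gap):
    """Split a left-to-right sorted line wherever the gap exceeds max_gap."""
    words = [[line_boxes[0]]]
    for prev, box in zip(line_boxes, line_boxes[1:]):
        gap = box[0] - (prev[0] + prev[2])
        if gap > max_gap:
            words.append([box])
        else:
            words[-1].append(box)
    return words

# Hypothetical character boxes: two text lines, the first with two words.
boxes = [(0, 0, 5, 10), (20, 0, 5, 10), (6, 1, 5, 9), (0, 20, 5, 10), (7, 22, 5, 8)]
lines = group_into_lines(boxes)
words = [group_into_words(line, max_gap=3) for line in lines]
print(words)
```

As the answer warns, a single fixed max_gap is the fragile part. Deriving it per line from the distribution of gaps may work better than a constant.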
Suppose I have a set of mutually exclusive hypotheses H = {h1, h2}, with prior probabilities P(h1) = 0.2 and P(h2) = 0.3.
Suppose we know also that
P(Y=0 | h1) = 0.2
P(Y=0 | h2) = 0.4
where Y is an attribute (target) that can have two values {1,0}.
Suppose finally that you observe the event Y = 0.
Which one is the MAP (maximum a posteriori) hypothesis?
The MAP is h1
The MAP is h2
There is not enough information to find the MAP
MAP h1 = MAP h2
None of the answers above
A question like this is better asked on (and will now probably be migrated to) math.stackexchange.com or stats.stackexchange.com.
Your question is a basic application of Bayes' theorem:
P(h1|Y=0) = P(Y=0|h1) P(h1) / P(Y=0) = (0.2 * 0.2) / P(Y=0) = 0.04 / P(Y=0)
P(h2|Y=0) = P(Y=0|h2) P(h2) / P(Y=0) = (0.4 * 0.3) / P(Y=0) = 0.12 / P(Y=0)
So h2 is the more probable (MAP) hypothesis: both posteriors share the same denominator P(Y=0) > 0, and 0.12 > 0.04.
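The comparison can be checked with a few lines of Python. Because P(Y=0) is the same positive constant in both posteriors, it is enough to compare the unnormalized products of likelihood and prior. The dictionary names are mine, and the prior for h2 is taken as 0.3, as used in the calculation above.

```python
# Priors P(h) and likelihoods P(Y=0 | h) from the question.
priors = {"h1": 0.2, "h2": 0.3}
likelihoods = {"h1": 0.2, "h2": 0.4}

# Unnormalized posteriors: P(Y=0 | h) * P(h). Dividing by P(Y=0) would not
# change which hypothesis has the largest value.
scores = {h: likelihoods[h] * priors[h] for h in priors}

map_hypothesis = max(scores, key=scores.get)
print(scores, map_hypothesis)
```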