I have a tiff file and the text on it, which has been OCR'd at an earlier stage. The words have their exact positions as information (upper left, lower right). I now need to read the text within a user-drawn rectangle.
Normal paragraphs are no problem, but I don't know how I should handle text columns. If there are two paragraphs next to each other, simply taking the row as a single line would make the result unusable.
Are there algorithms to help me put the words in the right order? I'm guessing that I have to examine the spaces between words to detect patterns that identify columns. I would like to avoid processing the image directly, although it should be possible (but no OCR).
I am also unsure about the influence of lists/tables, e.g. in orders & bills. A line-orientated approach would probably be better here.
I am developing in Delphi, but adaptable algorithms in other languages would also be appreciated.
edit: I will try to post sample data tomorrow, but basically I have an Array of Words, with their respective coordinates on the image (I could easily draw a rectangle around them, for example).
Suppose your original text is in two columns like this:
Aaaa bb ccc ddddd mmmm nn oooo pp
eee fff ggggg hh qqq rrrrrrrrr
i jjjj kkk lll sss tttt uu.
From your description, it sounds like your OCR has given you the individual words and their bounding rectangles. If the original page is scanned orthogonally, then all of the words on a given line should have the same (or very close) y values. If they're not exactly the same, you can do an integer division on the vertical positions with some fraction of a typical box height. That should cluster the y values. You can do similar processing on the x coordinates to ensure that words at the edge of a column also have identical x values.
To detect the separate columns, I'd try making a histogram of all the "left" values of all the words (or right edges if your text runs right-to-left). You should see a peak at the beginning of each column.
You can probably rule out false positives by checking that, on every line, there is a gap between the right edge of the last box before the candidate column start and the candidate start itself. The gap should probably be at least as wide as the narrowest word.
You can then partition your words into column groups by checking which horizontal range their left and right coordinates fall into. In our example, the words from Aaaa through lll would end up in the first partition and the words from mmmm through uu. would end up in the second partition.
Within each partition, you can then partition on line by sorting on the y coordinates. Finally, for each line, you sort on the x coordinate. (Whether you sort on ascending or descending depends on your coordinate system and the direction your text flows.)
The same basic idea could be applied to tables as well as columns of text, but you might need some tweaks to deal with things like right-justified cells.
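The steps above (snap the coordinates, histogram the left edges, reject candidates without a clear gap, then sort by column, line and x) can be sketched in Python as follows. This is only a minimal sketch under assumptions: the `Word` type, the `layout_line` helper, the fixed 10-pixel character width and the `snap`, `min_share` and `min_gap` thresholds are all made up for illustration and would need tuning on real OCR output.

```python
from collections import Counter
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    left: int
    top: int
    right: int
    bottom: int


def find_column_starts(words, line_height, snap=5, min_share=0.6, min_gap=30):
    """Histogram the snapped left edges and keep the bins that look like column starts."""
    n_lines = len({w.top // line_height for w in words})
    hist = Counter(w.left // snap for w in words)
    starts = []
    for bin_index, count in sorted(hist.items()):
        start = bin_index * snap
        if count < min_share * n_lines:
            continue  # not a peak: too few lines have a word starting here
        # Gap check: every word that begins left of the candidate must end
        # at least min_gap before it, otherwise the candidate is inside a column.
        if all(w.right <= start - min_gap for w in words if w.left < start):
            starts.append(start)
    return starts


def reading_order(words, line_height, **options):
    """Sort words by column, then by line, then by x position."""
    starts = find_column_starts(words, line_height, **options)

    def column_of(word):
        # Index of the last column start at or left of the word.
        return max(sum(1 for s in starts if s <= word.left) - 1, 0)

    return sorted(words, key=lambda w: (column_of(w), w.top // line_height, w.left))


def layout_line(text, x, y, char_width=10, line_height=20):
    """Hypothetical helper: fake OCR boxes for one line of monospaced text."""
    boxes = []
    for token in text.split():
        boxes.append(Word(token, x, y, x + char_width * len(token), y + line_height))
        x += char_width * (len(token) + 1)
    return boxes


page = []
left_column = ["Aaaa bb ccc ddddd", "eee fff ggggg hh", "i jjjj kkk lll"]
right_column = ["mmmm nn oooo pp", "qqq rrrrrrrrr", "sss tttt uu."]
for row, (left_text, right_text) in enumerate(zip(left_column, right_column)):
    page += layout_line(left_text, 0, row * 20) + layout_line(right_text, 300, row * 20)

print(" ".join(w.text for w in reading_order(page, line_height=20)))
```

Integer division is a crude way to snap coordinates: two values that straddle a bin boundary end up in different bins, so a real implementation might prefer clustering with a tolerance.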
I need to calculate the dimensions from cell values that are entered as strings in a single cell rather than in 3 separate cells, and I do not want to break the dimensions into Length (L) x Width (W) x Height (H) columns. Instead, I am hoping there is a relatively simple function that would allow me to calculate the total cubic dimensions from that single cell.
I am aware of this tutorial that can take a string and break it into 3 separate cells, but that defeats the point of what I am trying to do.
My data looks like this:
Dimensions   Cub/In.   CF
70x13x13     11830     6.85
24*18*13     5616      3.25
24x16x12
16x24x10
Right now the data is entered as either "LxWxH" or "L*W*H" in that text format, and the columns that have values like the 5616 above are me manually re-entering "=24*18*13", literally one character difference.
I did try a CONCATENATE to just prepend an "=" but got errors on all rows in Google Sheets and, for comparison, a literal string instead of a processed formula in Excel.
=CONCATENATE("=",B1)
I am looking for a simple way to do this calculation in a single column, so that I only have to enter the data once or can use the existing data. I don't mind doing a single bulk replace of "x" with "*" on the input column to standardize the source column, but I don't want to have to do a series of bulk replaces every time I want to run through the data.
Thoughts?
Use SUBSTITUTE to get them all to the same, then use SPLIT and wrap in PRODUCT:
=PRODUCT(SPLIT(SUBSTITUTE(A2,"*","x"),"x"))
Or the shorter version shown by @JvdV:
=PRODUCT(SPLIT(A2,"*x"))
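For anyone who wants to check the numbers outside the sheet, the same split-and-multiply idea can be sketched in Python. The function names are made up for this example, and treating the "CF" column as cubic feet (cubic inches divided by 1728) is an assumption inferred from the sample values.

```python
import math
import re


def cubic_inches(dimensions: str) -> int:
    """Split on either 'x' or '*' and multiply the parts, like PRODUCT(SPLIT(...))."""
    parts = re.split(r"[x*]", dimensions.strip().lower())
    return math.prod(int(part) for part in parts)


def cubic_feet(dimensions: str) -> float:
    """Assumes CF means cubic feet: 12 * 12 * 12 = 1728 cubic inches per cubic foot."""
    return round(cubic_inches(dimensions) / 1728, 2)


for cell in ["70x13x13", "24*18*13", "24x16x12", "16x24x10"]:
    print(cell, cubic_inches(cell), cubic_feet(cell))
```

The sketch only handles whole-number dimensions; fractional sizes would need `float` instead of `int`.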
I'm trying to count the number of empty cells that exist in a column between each non-empty cell but haven't been able to work out how.
Using this, I'm also trying to find the largest "empty distances" and locate the cell in the center of these distances.
The sheet I'm working with lists a set of marker colors and denotes the ones that are owned out of the full set of colors. I'm trying to find the largest ranges of missing colors and then find the colors in the middle of those ranges in order to find a handful of markers that would best help to fill out the spectrum.
Columns 1-6 are information; column 7 marks whether the color is owned:
I may have an answer that helps you.
I could only get it to work using a helper column, but someone may know how to eliminate that requirement.
The helper column creates an array, basically listing the row numbers of the rows that have an "x" in your column B.
The main formula then measures the gap between each of these listed row numbers. It also checks the gap before the first "x", and after the last "x". Note that I have the data starting on row 2, which complicates the formula, but makes the sample sheet clearer - this can easily be changed to row 1 if you prefer.
={F2-1;
query(ArrayFormula(if(isnumber(F3:F),F3:F-F2:F-1,"")),
"select Col1 where Col1 > 0",0);
counta(A2:A)-indirect("F"&COUNTA(F$2:F))}
See a sample sheet here:
https://docs.google.com/spreadsheets/d/19QUFGRqTT6BqOsBrEBpTIxQCeNdRa5mzXhxQpCZ8sV4/edit?usp=sharing
Then I used a second formula to calculate the max gap between "x"s, (or before the first or after the last x).
Note that calculating the midpoint of the gaps, and doing a lookup of the corresponding mid-point colour, is something that can be added to this answer, if you share a sample copy of your sheet and share it for editing.
Let me know if this helps. I'll add more explanation to describe what the formula is doing tomorrow.
And I'll provide a second tab with the formulas adjusted to work with data beginning on row 1.
You can also get the lengths of the gaps using Frequency:
=ArrayFormula(frequency(if((B1:B20<>"X")*(A1:A20<>""),row(B1:B20)),if((B1:B20="X")*(A1:A20<>""),row(B1:B20))))
but finding the centres of the gaps and allowing for equal-sized gaps is more difficult.
This should find the position of the "X" at the end of the longest gap:
=ArrayFormula(
sum(frequency(if((B1:B20<>"X")*(A1:A20<>""),row(B1:B20)),
if((B1:B20="X")*(A1:A20<>""),row(B1:B20)))*(sequence(countif(B1:B20,"X")+1,1)<=
match(max(frequency(if((B1:B20<>"X")*(A1:A20<>""),row(B1:B20)),
if((B1:B20="X")*(A1:A20<>""),row(B1:B20)))),frequency(if((B1:B20<>"X")*(A1:A20<>""),row(B1:B20)),
if((B1:B20="X")*(A1:A20<>""),row(B1:B20))),0)))+
countif(sequence(countif(B1:B20,"X")+1,1),"<="&
match(max(frequency(if((B1:B20<>"X")*(A1:A20<>""),row(B1:B20)),
if((B1:B20="X")*(A1:A20<>""),row(B1:B20)))),frequency(if((B1:B20<>"X")*(A1:A20<>""),row(B1:B20)),
if((B1:B20="X")*(A1:A20<>""),row(B1:B20))),0))
)
and then it should just be a case of working backwards from there to the centre of the longest gap. However the formula needs further refinement to deal with the cases
(1) Where the longest gap is after the last "X"
(2) Where there is a tie for the longest gap
(3) Where there is a need to list the longest, second longest, third longest gap etc.
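To make the target behaviour concrete, here is a minimal Python sketch of what the formulas are working towards, including the three cases listed above. It is not a Sheets solution; the function names and the string encoding of the column ("X" for owned, "." for missing) are assumptions for this example.

```python
def gaps(owned):
    """Return (start_index, length) for every run of missing entries."""
    runs = []
    start = None
    for index, flag in enumerate(owned):
        if not flag and start is None:
            start = index                      # a gap opens here
        elif flag and start is not None:
            runs.append((start, index - start))
            start = None
    if start is not None:                      # gap after the last "X"
        runs.append((start, len(owned) - start))
    return runs


def largest_gap_centres(owned, top=3):
    """Return (centre_index, length) for the `top` longest gaps, longest first."""
    ranked = sorted(gaps(owned), key=lambda gap: -gap[1])  # stable, so ties keep sheet order
    return [(start + (length - 1) // 2, length) for start, length in ranked[:top]]


column = [mark == "X" for mark in "X...X.X....."]
print(gaps(column))                 # [(1, 3), (5, 1), (7, 5)]
print(largest_gap_centres(column))  # [(9, 5), (2, 3), (5, 1)]
```

Indices are zero-based here, so add the number of the first data row to get the sheet row. For an even-length gap the sketch picks the upper of the two middle cells.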
I have a sheet with a line chart, now I'm trying to do something maybe very simple: I would like to add to this chart a vertical line using a value in a cell.
So I have this line chart
And a cell with the date 2016/01/01, I would like to have a vertical line through all the chart on the cell date
I can't figure out how to do it...
This is a copy of that sheet: https://docs.google.com/spreadsheets/d/1oeiwmeDT8pUVqBQvoE_cqk7mZxxvD5moZr41Vp4IN2I/edit?usp=sharing
I would like to show a vertical line using the "Purchase date"
I had the same problem and created a solution to overcome limitations of Google Sheets charts.
The main idea is to create an additional line in the chart, with only two points, both with the desired date. The value of the first point is 0 and the last has the maximum value of the Y axis. This way, the line always covers the entire height of the chart.
Screenshot of the Chart
Note that it is necessary to add two new values in the X axis (highlighted in blue on the sheet). Don't worry about the fact that they are repeated. Google Sheets handles it correctly.
These values can be placed at the beginning of the lists. This way, it is possible to add new values at the end of them.
This solution can be viewed in: "[GoogleSheets] Dinamic Vertical Line in a Chart"
To change position of red line, just select a different value in "Purchase date" (yellow cell).
I made a merge of my first solution with the one suggested by dimo414 and created a new solution with two variations.
In the previous version of the spreadsheet, there were only two points to draw the vertical line.
In the new version, a third point was inserted to show the intersection between the line and the real curve. A new column was also created, containing only a label for the new point.
The result is:
These changes can be seen with a green background in sheets 'Dashboard_v2' and 'Dashboard_v3' of the spreadsheet.
To determine coordinates of the new point, two approaches were used:
Search Purchase Date directly in the dataset (see sheet 'Dashboard_v2')
If the goal is to highlight only points of intersection that belong to the original dataset, it is just necessary to VLOOKUP() the date in the dataset.
Interpolate the two points immediately smaller and larger than the purchase date (see sheet 'Dashboard_v3')
Given the points [x1,y1], [x2,y2] and a value of x (where x1 <= x <= x2), it's possible to find an interpolation point [x,y] with the following formula:
y=(y2-y1)*(x-x1)/(x2-x1)+y1
Although this formula is easy to implement, finding the correct points to interpolate is more challenging and requires a bit of creativity.
At first, I thought of using a JS script to make things easier, but decided to use only builtin functions.
By the way, different approaches to find [x1,y1] and [x2,y2] are welcome.
To make things easier to understand, each point coordinate is determined in a different cell (see L2:M5) and the point of intersection is in L6:M7.
Of course, it's possible to join all of them in just one cell, but I thought it would be harder to understand.
To close, one more detail: according to the definition above, the interpolation formula is valid only if (x1 <= x <= x2). Thus, both cells C2 and M6 have protections to limit the value of 'x'.
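As an illustration of the interpolation step only, here is a small Python sketch that finds the two neighbouring points with a binary search and applies the formula y=(y2-y1)*(x-x1)/(x2-x1)+y1. It is not what the spreadsheet does internally; the function name and the sample numbers are made up, and dates are assumed to have been converted to plain numbers (for example day serials) and sorted in ascending order.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Linearly interpolate y at x, where xs is sorted ascending and xs[0] <= x <= xs[-1]."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the range covered by the dataset")
    i = bisect_left(xs, x)          # index of the first point with xs[i] >= x
    if xs[i] == x:
        return ys[i]                # x is already a point of the dataset
    x1, y1 = xs[i - 1], ys[i - 1]   # nearest point below x
    x2, y2 = xs[i], ys[i]           # nearest point above x
    return (y2 - y1) * (x - x1) / (x2 - x1) + y1


days = [1, 3, 5]
values = [10, 20, 40]
print(interpolate(days, values, 4))  # 30.0
```

The range check plays the same role as the protections on the cells: outside [x1, x2] the formula would extrapolate rather than interpolate.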
One way is to add a label to your x-axis.
For example, this is a chart that plots weight against date, with a label "Cheat Day" on 2021-07-21
For the data:
Date         Label       Weight (kg)   Weight Goal (kg)
2021-07-19               83.85         75
2021-07-20               84.55         75
2021-07-21   Cheat Day   83.8          75
2021-07-22               84.95         75
2021-07-23               83.75         75
Go to Edit the chart > Setup > Under X-axis > Click on ••• next to your "Date" column > Add labels > Select the column "Label" as your label.
Your Chart Editor > Setup should look like this:
You can have it like this, but unfortunately not programmatically. The only way is to insert a line via Drawing and position it manually where needed.
spreadsheet demo
As best I can tell there isn't a way to add a vertical marker line to a chart in Google Sheets. One option that may be "good enough" in many cases is to "Add notes to a data point" and then use "Format data point" to make the point more visible. Here's an example, from your spreadsheet:
Unfortunately one limitation with this approach is you can only label a data point in the data set the chart is displaying. In your case the date you wanted to mark with a line isn't in the data set, so this won't work directly. You might be able to introduce a separate data series consisting of just that date and then add a note to that data point, but I haven't fiddled with it enough to make it work.
How can I convert a range of arbitrary size to its string representation, to be used later with INDIRECT?
ADDRESS only works for a single cell.
No scripting please :)
Only formulas.
EDIT 1 ;
Also assume the range is computed through a complex formula. So no cell addresses are available.
EDIT 2 ;
=ArrayFormula(IF((OFFSET(INDIRECT(PayStaff),0,10)>PayrollStart)*(OFFSET(INDIRECT(PayStaff),0,10)<=PayrollEnd),(OFFSET(INDIRECT(PayStaff),0,10)-PayrollStart),0)+IF((OFFSET(INDIRECT(PayStaff),0,11)>=PayrollStart)*(OFFSET(INDIRECT(PayStaff),0,11)<PayrollEnd),(PayrollEnd-OFFSET(INDIRECT(PayStaff),0,11)),0))
No sample sheet as this is a hypothetical type question.
Above is an actual formula I'm using, one of many in my efforts at full automation. The range string in "PayStaff" can change at any time, and this, coupled with the repeated but identical OFFSET calls, soon makes the formula unreadable. One of my ideas to improve readability is to get the range string of (e.g. OFFSET(INDIRECT(PayStaff),0,10)) and reuse it, shortening the formula. Also note that the example does NOT increase the size of the range, which I require as well.
But let's suppose that PayStaff = "A1:A10", where the number of rows can vary. Considering that OFFSET has 4 parameters, how can I get the resultant range as a string? Is this possible?
I just used this to do something similar, hope this helps:
// <range here> = is the range you want to describe,
// can be output of some other formula but has to be a
// rectangle for this to work.
// Remove // comments and line breaks before pasting to sheet.
// concatenate top left corner with colon and bottom right corner.
=CONCATENATE(
// use address to get top left corner
ADDRESS(ROW(<range here>), COLUMN(<range here>)),
":",
// use address to get bottom right corner
ADDRESS(
ROW(<range here>)+ROWS(<range here>)-1,
COLUMN(<range here>)+COLUMNS(<range here>)-1
)
)
This would give you the start of the range as a string:
=cell("Address",OFFSET(INDIRECT(PayStaff),0,10))
The end of the range is more awkward - you would have to add the number of rows in the range (-1) to the row offset:
=cell("Address",OFFSET(INDIRECT(PayStaff),rows(indirect(PayStaff))-1,10))
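The arithmetic behind both answers is the same: the top-left corner is (row, column) and the bottom-right corner is (row + rows - 1, column + columns - 1). A minimal Python sketch of that calculation, with made-up function names, looks like this. Unlike ADDRESS, which returns absolute references such as $K$1 by default, the sketch produces relative A1 notation.

```python
def column_letters(col: int) -> str:
    """Convert a 1-based column number to letters: 1 -> A, 26 -> Z, 27 -> AA."""
    letters = ""
    while col > 0:
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def a1_range(row: int, col: int, n_rows: int, n_cols: int) -> str:
    """Build 'TopLeft:BottomRight' from the top-left cell and the range size."""
    top_left = f"{column_letters(col)}{row}"
    bottom_right = f"{column_letters(col + n_cols - 1)}{row + n_rows - 1}"
    return f"{top_left}:{bottom_right}"


# A1:A10 shifted 10 columns to the right, as in OFFSET(INDIRECT(PayStaff),0,10)
print(a1_range(row=1, col=1 + 10, n_rows=10, n_cols=1))  # K1:K10
```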
I have a 5x5 grid of tiles that are numbered like:
Numerical order Row number (e.g. 1 1 would be the first tile on the first row and 6 2 would be the first tile on the second row)
I need to get the blocks around a clicked tile (above, below, left and right), I thought about doing this by taking off numbers from the numerical order and row number. I wrote this:
local ab = tostring(tonumber(v.Name)-5)..tostring(tonumber(string.sub(v.Name,-1))-1)
local be = tostring(tonumber(v.Name)+5)..tostring(tonumber(string.sub(v.Name,-1))+1)
ab being the tile above and be being the tile below. I ran into a problem where I cannot get the first two characters of a tile whose numerical order is two digits using one line (I don't want to use if statements since I'm pretty sure there's a one-line solution).
I came up with a solution and that is to get all the characters before the whitespace (which separates the order from the row number) but I have no idea how to write it.
Just ask for all non-whitespace characters from the beginning of the string:
print(("test123 more456"):match("^(%S+)"))
This should print test123.
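For comparison, here is a sketch of the whole neighbour lookup in Python rather than Lua. It assumes tile names have the form "order row" separated by whitespace, as described in the question; the function names are made up, and `\S+` is the Python regex counterpart of the Lua pattern `%S+`.

```python
import re


def parse_tile(name: str):
    """Split 'order row' into two integers, e.g. '12 3' -> (12, 3)."""
    match = re.match(r"(\S+)\s+(\S+)", name)
    return int(match.group(1)), int(match.group(2))


def neighbours(name: str, size: int = 5):
    """Return the names of the tiles above, below, left and right that exist."""
    order, row = parse_tile(name)
    col = (order - 1) % size + 1    # 1-based column inside the row
    found = {}
    if row > 1:
        found["above"] = f"{order - size} {row - 1}"
    if row < size:
        found["below"] = f"{order + size} {row + 1}"
    if col > 1:
        found["left"] = f"{order - 1} {row}"
    if col < size:
        found["right"] = f"{order + 1} {row}"
    return found


print(neighbours("6 2"))  # {'above': '1 1', 'below': '11 3', 'right': '7 2'}
```

The bounds checks use if statements, which the question wanted to avoid for the string parsing; they are there so that edge tiles do not report neighbours that are off the grid.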