Ground-truth data collection and evaluation for computer vision

Currently I am starting to develop a computer vision application that involves tracking of humans. I want to build ground-truth metadata for videos that will be recorded in this project. The metadata will probably need to be hand labeled and will mainly consist of location of the humans in the image. I would like to use the metadata to evaluate the performance of my algorithms.
I could of course build a labeling tool using, e.g., Qt and/or OpenCV, but I was wondering whether there is some kind of de facto standard for this. I came across ViPER, but it seems dead and it doesn't work as easily as I had hoped. Other than that, I haven't found much.
Does anybody here have some recommendations as to which software / standard / method to use both for the labeling as well as the evaluation? My main preference is to go for something c++ oriented, but this is not a hard constraint.
Kind regards and thanks in advance!
Tom

I've had another look at vatic and got it to work. It is an online video annotation tool meant for crowdsourcing via a commercial service, and it runs on Linux. However, there is also an offline mode, in which the crowdsourcing service is not required and the software runs stand-alone.
The installation is described in quite some detail in the enclosed README file. It involves, among other things, setting up an Apache and a MySQL server, some Python packages, and FFmpeg. It is not that difficult if you follow the README. (I mentioned that I had some issues with my proxy, but this was not related to this software package.)
You can try the online demo. The default output is like this:
0 302 113 319 183 0 1 0 0 "person"
0 300 112 318 182 1 1 0 1 "person"
0 298 111 318 182 2 1 0 1 "person"
0 296 110 318 181 3 1 0 1 "person"
0 294 110 318 181 4 1 0 1 "person"
0 292 109 318 180 5 1 0 1 "person"
0 290 108 318 180 6 1 0 1 "person"
0 288 108 318 179 7 1 0 1 "person"
0 286 107 317 179 8 1 0 1 "person"
0 284 106 317 178 9 1 0 1 "person"
Each line contains 10+ columns, separated by spaces. The definitions of these columns are:
1 Track ID. All rows with the same ID belong to the same path.
2 xmin. The top left x-coordinate of the bounding box.
3 ymin. The top left y-coordinate of the bounding box.
4 xmax. The bottom right x-coordinate of the bounding box.
5 ymax. The bottom right y-coordinate of the bounding box.
6 frame. The frame that this annotation represents.
7 lost. If 1, the annotation is outside of the view screen.
8 occluded. If 1, the annotation is occluded.
9 generated. If 1, the annotation was automatically interpolated.
10 label. The label for this annotation, enclosed in quotation marks.
11+ attributes. Each column after this is an attribute.
It can also provide output in XML, JSON, Pickle, LabelMe, and PASCAL VOC formats.
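For downstream processing, the default space-separated format is straightforward to parse. Here is a minimal Python sketch following the column list above (the function name and the attribute handling are my own assumptions, not part of vatic):

from dataclasses import dataclass

@dataclass
class Annotation:
    track_id: int
    xmin: int
    ymin: int
    xmax: int
    ymax: int
    frame: int
    lost: bool
    occluded: bool
    generated: bool
    label: str
    attributes: list

def parse_vatic(path):
    # One annotation per line, fields separated by spaces.
    annotations = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 10:
                continue  # skip malformed lines
            annotations.append(Annotation(
                *(int(p) for p in parts[:6]),        # track id, box, frame
                *(p == "1" for p in parts[6:9]),     # lost, occluded, generated
                parts[9].strip('"'),                 # label
                [p.strip('"') for p in parts[10:]],  # extra attributes
            ))
    return annotations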
So, all in all, this does pretty much what I wanted, and it is also rather easy to use.
I am still interested in other options though!
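On the evaluation half of the original question: a common way to score a tracker or detector against such boxes is intersection-over-union per frame. A minimal sketch (the 0.5 threshold is a typical but arbitrary choice, not something vatic prescribes):

def iou(a, b):
    # Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax).
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

# A detection is typically counted as a true positive when it overlaps a
# ground-truth box in the same frame with IoU >= 0.5.
print(iou((302, 113, 319, 183), (300, 112, 318, 182)))  # ~0.82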

LabelMe is another open annotation tool. I think it is less suitable for my particular case, but it is still worth mentioning. It seems to be oriented toward blob labeling.

This is a problem that all practitioners of computer vision face. If you're serious about it, there is a company that does it for you by crowdsourcing. I don't know whether I should put a link to it on this site, though.

I've had the same problem while looking for an image annotation tool to build a ground-truth data set for training image analysis models.
LabelMe is a solid option if you need polygonal outlining for your annotations. I've worked with it before; it does the job well and has some additional cool features for 3D feature extraction. In addition to LabelMe, I also made an open-source tool called LabelD. If you're still looking for a tool to do your annotation, check it out!

Related

Can Amazon Comprehend extract and categorize data from classifieds

I have a large dataset from which I would like to extract and categorize specific elements. Below is a typical example:
I would like to know if this is possible using Amazon Comprehend, or whether there are better tools to do it. I am not a developer and am looking to hire someone to program this for me, but I would like to understand conceptually whether something like this is feasible before I hire someone.
Comprehend is capable of extracting and categorizing text from your document. You can use Comprehend’s Custom Entity Recognition.
For this, you will provide annotated training data as input. You can leverage Ground Truth in Amazon SageMaker to do the annotations, and directly provide Ground Truth output to Comprehend Entity Recognition Training job. You can also provide your own annotations file for the training job - https://docs.aws.amazon.com/comprehend/latest/dg/API_EntityRecognizerInputDataConfig.html.
The relevant APIs for Amazon Comprehend would be -
Training - https://docs.aws.amazon.com/comprehend/latest/dg/API_CreateEntityRecognizer.html
Async Inference - https://docs.aws.amazon.com/comprehend/latest/dg/API_StartEntitiesDetectionJob.html
OR
Sync Inference Over Custom Endpoint - https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectEntities.html
Here is a detailed example of how to train custom entity recognizers with Amazon Comprehend - https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html
Annotation file example for this use case:
File   Line   Begin Offset   End Offset   Type
doc1   3      0              2            Width
doc1   3      5              6            Ratio
doc1   3      9              10           Diameter
doc1   0      12             20           Brand
doc1   0      6              6            Quantity
doc1   6      8              10           Price
doc1   1      20             22           Condition
doc1   0      42             48           Season
doc2   0      45             48           Quantity
doc2   1      78             79           Price
The file doc1 should contain the text that you want to extract entities from.
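To tie the pieces together, here is a hedged boto3 sketch of the training call; the recognizer name, role ARN, and S3 URIs are placeholders, and only the entity types are taken from the table above:

import boto3

comprehend = boto3.client("comprehend")

# Start training a custom entity recognizer. Documents points at the raw
# text files (doc1, doc2, ...) and Annotations at the CSV shown above.
response = comprehend.create_entity_recognizer(
    RecognizerName="classifieds-recognizer",  # placeholder
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendRole",  # placeholder
    InputDataConfig={
        "EntityTypes": [{"Type": t} for t in [
            "Width", "Ratio", "Diameter", "Brand",
            "Quantity", "Price", "Condition", "Season"]],
        "Documents": {"S3Uri": "s3://my-bucket/docs/"},              # placeholder
        "Annotations": {"S3Uri": "s3://my-bucket/annotations.csv"},  # placeholder
    },
)
print(response["EntityRecognizerArn"])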

What is the function to just identify outliers in Google Sheets?

I know of the TRIMMEAN function to help automatically exclude outliers from means, but is there one that will just identify which data points are true outliers? I am working under the classical definition of outliers: values 3 SD away from the mean, falling in the bottom 25% or top 25% of the data.
I need to do this in order to verify that my R code is indeed removing true outliers as we define them in my lab for our research purposes. R can be awkward, with workarounds needed for identifying and removing outliers, and since our data is mixed (numerical data grouped by factor classes) it gets too tricky to be sure we are correctly identifying and removing outliers within those class groups. This is why we are turning to a spreadsheet program for a double-check instead of assuming that the code is doing it correctly automatically.
Is there a specific outlier identification function in Google Sheets?
Data looks like this:
group VariableOne VariableTwo VariableThree VariableFour
NAC 21 17 0.9 6.48
GAD 21 17 -5.9 0.17
UG 40 20 -0.4 6.8
SP 20 18 -6 -3
NAC 19 4 -8 8.48
UG 18 10 0.1 -1.07
NAC 23 24 -0.2 3.5
SP 21 17 1 3.1
UG 21 17 -5 5.19
As stated, each data point corresponds to a specific group code; that is to say, the data should be relatively similar within each group. My data as a whole generally shows this, but there are outliers within these groups which we want to exclude, and I want to ensure we are excluding the correct data.
If I can get even more specific with the function and see outliers within the groups, great; but as long as I can identify outliers in Google Sheets at all, that could suffice.
To get the outliers, you must:
1. Calculate the first quartile (Q1). In Sheets: =QUARTILE(dataset, 1)
2. Calculate the third quartile (Q3). Same as step 1, but with a different quartile number: =QUARTILE(dataset, 3)
3. Calculate the interquartile range: IQR = Q3 - Q1
4. Calculate the lower boundary: LB = Q1 - (1.5*IQR)
5. Calculate the upper boundary: UB = Q3 + (1.5*IQR)
By computing the lower and upper boundaries, we can easily determine which data points in the dataset are outliers.
Example: you can use conditional formatting to highlight the outliers. Click Format -> Conditional formatting, add a custom formula that flags values below LB or above UB, and click Done. (Screenshots of the conditional-formatting rule and the highlighted result omitted.)
Reference:
QUARTILE
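Since the goal is to double-check R output, the same Q1/Q3 bounds are also easy to replicate outside Sheets, per group. A minimal pandas sketch (the file name is hypothetical and column names follow the sample data; note that quantile interpolation can differ slightly between tools, so borderline values may disagree):

import pandas as pd

df = pd.read_csv("data.txt", sep=r"\s+")  # columns: group, VariableOne, ...

def iqr_outliers(s):
    # Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

value_cols = [c for c in df.columns if c != "group"]
flags = df.groupby("group")[value_cols].transform(iqr_outliers)
print(df[flags.any(axis=1)])  # rows that are outliers within their group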

Why is kubernetes source code an order of magnitude larger than other container orchestrators?

Considering other orchestration tools like Dokku, DC/OS, Deis, Flynn, Docker Swarm, etc., Kubernetes is nowhere near them in terms of lines of code; on average those tools are around 100k-200k lines.
Intuitively it feels strange that managing containers (checking health, scaling them up and down, killing them, restarting them, etc.) should take 2.4M+ lines of code, which is the scale of an entire operating system code base. I feel like there is something more to it.
What is different in Kubernetes compared to other orchestration solutions that makes it so big?
I don't have any experience maintaining more than 5-6 servers. Please explain why it is so big and which functionalities play a big part in that.
First and foremost: don't be misled by the number of lines in the code; most of it is dependencies in the vendor folder that do not account for the core logic (utilities, client libraries, gRPC, etcd, etc.).
Raw LoC Analysis with cloc
To put things into perspective, for Kubernetes:
$ cloc kubernetes --exclude-dir=vendor,_vendor,build,examples,docs,Godeps,translations
    7072 text files.
    6728 unique files.
    1710 files ignored.

github.com/AlDanial/cloc v 1.70  T=38.72 s (138.7 files/s, 39904.3 lines/s)
--------------------------------------------------------------------------------
Language                     files          blank        comment           code
--------------------------------------------------------------------------------
Go                            4485         115492         139041        1043546
JSON                            94              5              0         118729
HTML                             7            509              1          29358
Bourne Shell                   322           5887          10884          27492
YAML                           244            374            508          10434
JavaScript                      17           1550           2271           9910
Markdown                        75           1468              0           5111
Protocol Buffers                43           2715           8933           4346
CSS                              3              0              5           1402
make                            45            346            868            976
Python                          11            202            305            958
Bourne Again Shell              13            127            213            655
sed                              6              5             41            152
XML                              3              0              0             88
Groovy                           1              2              0             16
--------------------------------------------------------------------------------
SUM:                          5369         128682         163070        1253173
--------------------------------------------------------------------------------
For Docker (and not Swarm or Swarm mode, which include more features like volumes, networking, and plugins that are not part of this repository). We also do not include projects like Machine, Compose, or libnetwork, so in reality the whole Docker platform comprises many more LoC:
$ cloc docker --exclude-dir=vendor,_vendor,build,docs
    2165 text files.
    2144 unique files.
     255 files ignored.

github.com/AlDanial/cloc v 1.70  T=8.96 s (213.8 files/s, 30254.0 lines/s)
-----------------------------------------------------------------------------------
Language                        files          blank        comment           code
-----------------------------------------------------------------------------------
Go                               1618          33538          21691         178383
Markdown                          148           3167              0          11265
YAML                                6            216            117           7851
Bourne Again Shell                 66            838            611           5702
Bourne Shell                       46            768            612           3795
JSON                               10             24              0           1347
PowerShell                          2             87            120            292
make                                4             60             22            183
C                                   8             27             12            179
Windows Resource File               3             10              3             32
Windows Message File                1              7              0             32
vim script                          2              9              5             18
Assembly                            1              0              0              7
-----------------------------------------------------------------------------------
SUM:                             1915          38751          23193         209086
-----------------------------------------------------------------------------------
Please note that these are very rough estimates obtained with cloc; this might be worth a deeper analysis.
Roughly, the project itself accounts for about half of the LoC (~1250k) mentioned in the question (whether you count dependencies or not is subjective).
What is included in Kubernetes that makes it so big?
Most of the bloat comes from libraries supporting various cloud providers, to ease bootstrapping on their platforms or to support specific features (volumes, etc.) through plugins. It also ships a lot of examples that should be dismissed from the line count; a fair LoC estimate needs to exclude a lot of documentation and example directories.
It is also much more feature-rich than Docker Swarm, Nomad, or Dokku, to cite a few: it supports advanced networking scenarios, has built-in load balancing, and includes PetSets, Cluster Federation, volume plugins, and other features that the other projects do not support yet.
It supports multiple container engines, so it is not limited to running Docker containers; it could also run other engines (such as rkt).
A lot of the core logic involves interaction with other components: key-value stores, client libraries, plugins, etc., which extends far beyond simple scenarios.
Distributed systems are notoriously hard, and Kubernetes seems to support most of the tooling from key players in the container industry without compromise (where other solutions make such compromises). As a result, the project can look artificially bloated and too big for its core mission (deploying containers at scale). In reality, these statistics are not that surprising.
Key idea
Comparing Kubernetes to Docker or Dokku is not really appropriate. The scope of the project is far bigger, and it includes many more features, as it is not limited to the Docker family of tooling.
While Docker has a lot of its features scattered across multiple libraries, Kubernetes tends to keep everything under its core repository (which inflates the line count substantially but also explains the popularity of the project).
Considering this, the LoC statistic is not that surprising.
Aside from the reasons given by #abronan, the Kubernetes codebase contains a lot of duplication and many generated files, which artificially inflate the code size. The actual amount of code that does "real work" is much smaller.
For example, take a look at the staging directory. This directory is 500,000 LOC but nothing in there is original code; it is all copied from elsewhere in the Kubernetes repo and rearranged. This artificially inflates the total LOC.
There are also things like the Swagger API definitions: auto-generated files that describe the Kubernetes API in the OpenAPI format. Here are some places where I found these files:
kubernetes/api/
kubernetes/federation/apis/swagger-spec
kubernetes/federation/apis/openapi-spec
Together these files account for ~116,000 LOC and all they do is describe the Kubernetes API in OpenAPI format!
And these are just the OpenAPI definition files; the total number of LOC required to support OpenAPI is probably much higher. For instance, I've found a ~12,000 LOC file and a ~13,000 LOC file related to supporting Swagger/OpenAPI, and I'm sure there are plenty more files related to this feature.
The point is that the code that does the actual heavy lifting behind the scenes might be a small fraction of the supporting code that is required to make Kubernetes a maintainable and scalable project.
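To get a rough feel for how much of the tree is generated, one quick check (this assumes a local checkout; Kubernetes conventionally prefixes generated Go files with zz_generated):

import pathlib

# Count lines in conventionally named generated Go files.
total = sum(
    sum(1 for _ in p.open(errors="ignore"))
    for p in pathlib.Path("kubernetes").rglob("zz_generated*.go")
)
print(f"{total} lines in zz_generated*.go files")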

Genetic Algorithms - Crossover and Mutation operators for paths

I was wondering if anyone knew of any intuitive crossover and mutation operators for paths within a graph? Thanks!
The question is a bit old, but the problem doesn't seem to be outdated or solved, so I think my research might still be helpful for someone.
While mutation and crossover are fairly trivial in the TSP problem, where every mutation is valid (because the chromosome represents an order of visiting fixed nodes, so swapping the order always creates a valid result), in the case of Shortest Path or Optimal Path, where the chromosome is an exact route representation, this doesn't apply and is far less obvious. So here is how I approach the problem of solving Optimal Path using a GA.
For crossover, there are a few options:
For routes that have at least one common point (besides the start and end nodes): find all common points and swap the subroutes at the place of crossing (see the sketch after this list).
Parent 1: 51 33 41 7 12 91 60
Parent 2: 51 9 33 25 12 43 15 60
Potential crossing points are 33 and 12. We can get the following children: 51 9 33 41 7 12 43 15 60 and 51 33 25 12 91 60, which are the results of crossing using both of these points.
When two routes don't have a common point, randomly select a point from each parent and connect them (using, for example, random traversal, backtracking, or a heuristic search like A* or beam search). This new path can then be treated as the crossover path. For better understanding, see the picture of the two crossover methods below:
see http://i.imgur.com/0gDTNAq.png
Black and gray paths are the parents, pink and orange paths are the children, the green point is the crossover place, and the red points are the start and end nodes. The first graph shows the first type of crossover; the second graph is an example of the other one.
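Here is a minimal Python sketch of the first crossover method, using the node IDs from the example above (note that a child can revisit a node if the swapped subroutes happen to share nodes other than the chosen crossing point):

import random

def common_point_crossover(p1, p2):
    # Single-point crossover at a shared interior node (method 1 above).
    common = set(p1[1:-1]) & set(p2[1:-1])
    if not common:
        return None  # no common point: fall back to the second method
    c = random.choice(sorted(common))
    i, j = p1.index(c), p2.index(c)
    return p1[:i] + p2[j:], p2[:j] + p1[i:]

children = common_point_crossover(
    [51, 33, 41, 7, 12, 91, 60],
    [51, 9, 33, 25, 12, 43, 15, 60],
)
print(children)  # crossing at 33 gives ([51, 33, 25, 12, 43, 15, 60],
                 #                       [51, 9, 33, 41, 7, 12, 91, 60])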
For mutation, there are also a few options. Generally, dummy mutations like swapping the order of nodes or adding a random node are really ineffective for graphs of average density. So here are the approaches that guarantee valid mutations:
Take two random points from the path and replace the segment between them with a random path between those two nodes (sketched in code after the note below).
Chromosome: 51 33 41 7 12 91 60, random points: 33 and 12, random/shortest path between them: 33 29 71 12, mutated chromosome: 51 33 29 71 12 91 60
Find a random point on the path, remove it, and connect its neighbours (very similar to the first method).
Find a random point on the path and find a random path to one of its neighbours.
Try subtraversing the path from some randomly chosen point until reaching any point on the initial route (a slight modification of the first method).
see http://i.imgur.com/19mWPes.png
Each graph corresponds to the mutation methods in order. In the last example, the orange path is the one that would replace the original path between the mutation points (green nodes).
Note: these methods can obviously incur a performance penalty when the search for an alternative subroute (random or heuristic) gets stuck or finds a very long, useless subpath, so consider bounding the mutation's execution time or its number of trials.
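As an illustration of the first mutation method and the trial bound just mentioned, here is a Python sketch; neighbors is an assumed adjacency mapping {node: list of adjacent nodes}, not something from the original answer:

import random

def mutate_subpath(path, neighbors, max_trials=50):
    # Mutation method 1: replace the segment between two random points on
    # the path with a random alternative walk, bounding the number of trials.
    i, j = sorted(random.sample(range(len(path)), 2))
    start, goal = path[i], path[j]
    for _ in range(max_trials):
        walk, node = [start], start
        # Cap the walk length so a single trial cannot run forever.
        while node != goal and len(walk) < 4 * len(path):
            node = random.choice(neighbors[node])
            walk.append(node)
        if node == goal:
            return path[:i] + walk + path[j + 1:]
    return path  # all trials failed: keep the original chromosome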
For my case, which is finding an optimal path in terms of maximizing the sum of vertex weights while keeping the sum of node weights below a given bound, these methods are quite effective and give good results. Should you have any questions, feel free to ask. Also, sorry for my MS Paint skills ;)
Update
One big hint: I basically used this approach in my implementation, but there was one big drawback in the random path generation. I decided to switch to semi-random route generation using shortest-path traversal through randomly picked point(s); it is much more efficient (but obviously may not be applicable to all problems).
That is a very difficult question; people write dissertations on it and there is still no single right answer.
The general rule is "it all depends on your domain."
There are some generic GA libraries that will do some of the work for you, but for the best results it is recommended to implement your GA operations yourself, specifically for your domain.
You might have more luck with answers on Theoretical CS, but you should expand your question and add more details about your task and domain.
Update:
So you have a graph. In GA terms, a path through the graph represents an individual (the chromosome), and the nodes in the path are its genes.
In that case I would say a mutation can be represented as a deviation of the path somewhere from the original: one of the nodes is moved somewhere, and the path is adjusted so that the start and end nodes remain the same.
Mutation can lead to invalid individuals, and in that case you need to make a decision: allow invalid ones and hope they converge toward some unexplored solution, or kill them on the spot. When I was working with GAs, I allowed invalid solutions, adding an "unfitness" value alongside fitness; some researchers suggest this can help with broad exploration of the solution space.
Crossover can only happen between paths that cross each other: at the point of crossing, swap the remainders of the paths between the parents.
Bear in mind that there are various ways to do crossover: individuals can be crossed over at multiple points or just one. With graphs you can have multiple crossing points, and that can naturally lead to multiple child paths.
As I said before, there is no right or wrong way of doing this; you will only find the best way by experimenting.

How to fill in the 'holes' in an irregular spaced grid or array having missing data?

Does anyone have a straightforward Delphi example of filling in a grid using Delaunay triangles or kriging? Either method can fill a grid by 'interpolating.'
What do I want to do? I have a grid, similar to:
22 23 xx 17 19 18 05
21 xx xx xx 17 18 07
22 24 xx xx 18 21 20
30 22 25 xx 22 20 19
28 xx 23 24 22 20 18
22 23 xx 17 23 15 08
21 29 30 22 22 17 09
where the xx's represent grid cells with no data and the x,y coordinates of each cell are known. Both kriging and Delaunay triangles can supply the 'missing' points (which are of course fictitious, but reasonable, values).
Kriging is a statistical method to fill in 'missing' or unavailable data in a grid with 'reasonable' values. Why would you need it? Principally to 'contour' the data. Contouring algorithms (like CONREC for Delphi, http://local.wasp.uwa.edu.au/~pbourke/papers/conrec/index.html) can contour regularly spaced data. Google around for 'kriging' and 'Delphi' and you are eventually pointed to the GEOBLOCK project on SourceForge (http://geoblock.sourceforge.net/). Geoblock has numerous Delphi .pas units for kriging based on GSLIB (a Fortran statistical package developed at Stanford). However, all the kriging/Delaunay units depend on units referred to in the Delphi uses clause, and unfortunately these 'helper' units are not posted with the rest of the source code. It appears none of the kriging units can stand alone or work without helper units that are not posted or, in some cases, without undefined data types.
Delaunay triangulation is described at http://local.wasp.uwa.edu.au/~pbourke/papers/triangulate/index.html. Posted there is a Delphi example, pretty neat, that generates 'triangles.' Unfortunately, I haven't a clue how to use the unit with a static grid; the example 'generates' a data field on the fly.
Has anyone got either of these units to work to fill an irregular data grid? Any code or hints on how to use the existing code for kriging a simple grid, or on using Delaunay to fill in the holes, would be appreciated.
I'm writing this as an answer because it's too long to fit into a comment.
Assuming your grid really is irregular (you give no examples of a typical pattern of grid coordinates), then triangulation only partially helps. Once you have triangulated you would then use that triangulation to do an interpolation, and there are different choices that could be made.
But you've not said anything about how you want to interpolate, what you want to do with that interpolation.
It seems to me that you have asked for some code, but it's not clear that you know what algorithm you want. That's really the question you should have asked.
For example since you appear to have no criteria for how you should do the interpolation, why don't you choose the nearest neighbour for your missing values. Or why don't you use the overall mean for the missing values. Both of these choices meet all the criteria you have specified since you haven't specified any!
Really I think you need to spend some more time explaining what properties you want this interpolation to have, what you are going to do with it etc. I also think you should stop thinking about code for now and think about algorithms. Since you have mentioned statistics you should consider asking at https://stats.stackexchange.com/.
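Not Delphi, but to make the algorithm choice concrete: SciPy's griddata performs exactly the Delaunay-based linear interpolation discussed here, with the nearest-neighbour suggestion above as a fallback, and it maps directly onto filling the xx cells. A sketch (the file name is hypothetical; xx cells are assumed to be stored as nan):

import numpy as np
from scipy.interpolate import griddata

grid = np.loadtxt("grid.txt")  # hypothetical file; xx cells stored as nan
known = ~np.isnan(grid)
rows, cols = np.mgrid[0:grid.shape[0], 0:grid.shape[1]]
pts = np.column_stack([rows[known], cols[known]])
vals = grid[known]
holes = np.column_stack([rows[~known], cols[~known]])

filled = grid.copy()
# Linear interpolation over the Delaunay triangulation of the known cells.
filled[~known] = griddata(pts, vals, holes, method="linear")
# Holes outside the convex hull of the known points come back as nan;
# fill those with the nearest neighbour instead.
still = np.isnan(filled)
filled[still] = griddata(pts, vals, np.column_stack(np.nonzero(still)), method="nearest")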
Code posted by Richard Winston on the Embarcadero Developer Network Code Central, titled "Delaunay triangulation and contouring code" (ID: 29365), demonstrates routines for generating constrained Delaunay triangulations and for plotting contour lines based on data points at arbitrary locations. These algorithms do not manipulate a grid and fill in its holes; rather, they provide a method for contouring arbitrary data and do not require a grid without missing values.
I still have not found an acceptable kriging algorithm in Pascal that actually fills in the holes in a grid.
