I have a large dataset from which I would like to extract and categorize specific elements. Below is a typical example:
I would like to know if this is possible using Amazon Comprehend, or whether there are better tools for it. I am not a developer and am looking to hire someone to program this for me, but I would first like to understand conceptually whether something like this is feasible.
Comprehend is capable of extracting and categorizing text from your document. You can use Comprehend’s Custom Entity Recognition.
For this, you provide annotated training data as input. You can use Ground Truth in Amazon SageMaker to do the annotations and feed the Ground Truth output directly to the Comprehend entity recognizer training job. You can also provide your own annotations file for the training job - https://docs.aws.amazon.com/comprehend/latest/dg/API_EntityRecognizerInputDataConfig.html.
The relevant APIs for Amazon Comprehend would be -
Training - https://docs.aws.amazon.com/comprehend/latest/dg/API_CreateEntityRecognizer.html
Async Inference - https://docs.aws.amazon.com/comprehend/latest/dg/API_StartEntitiesDetectionJob.html
OR
Sync Inference Over Custom Endpoint - https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectEntities.html
Here is a detailed example of how to train custom entity recognizers with Amazon Comprehend - https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html
Annotation file example for this use case (the annotations file is a plain CSV):

File,Line,Begin Offset,End Offset,Type
doc1,3,0,2,Width
doc1,3,5,6,Ratio
doc1,3,9,10,Diameter
doc1,0,12,20,Brand
doc1,0,6,6,Quantity
doc1,6,8,10,Price
doc1,1,20,22,Condition
doc1,0,42,48,Season
doc2,0,45,48,Quantity
doc2,1,78,79,Price
The file doc1 should contain the text that you want to extract entities from.
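To see the file format concretely, the annotations file can be produced with Python's csv module. A minimal sketch, using a few rows abbreviated from the example above:

```python
import csv
import io

# A few of the annotation rows from the example above.
rows = [
    ("doc1", 3, 0, 2, "Width"),
    ("doc1", 3, 5, 6, "Ratio"),
    ("doc1", 3, 9, 10, "Diameter"),
    ("doc2", 1, 78, 79, "Price"),
]

# Write the header and rows in the CSV layout Comprehend expects.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
writer.writerows(rows)
annotations_csv = buf.getvalue()
print(annotations_csv)
```

In a real job you would write this to a file and upload it to S3 alongside the documents.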
I know of the TRIMMEAN function, which helps automatically exclude outliers from means, but is there one that will simply identify which data points are true outliers? I am working with the classical definition of outliers as points 3 SD away from the mean, in the bottom 25% or top 25% of the data.
I need to do this to verify that my R code is indeed removing true outliers as we define them in my lab for our research purposes. R can be awkward with the workarounds for identifying and removing outliers, and since our data is mixed (we have numerical data grouped by factor classes) it gets too tricky to be sure we are identifying and removing outliers within those class groups. This is why we are turning to a spreadsheet program to do a double-check instead of assuming that the code is doing it correctly automatically.
Is there a specific outlier identification function in Google Sheets?
Data looks like this:
group VariableOne VariableTwo VariableThree VariableFour
NAC 21 17 0.9 6.48
GAD 21 17 -5.9 0.17
UG 40 20 -0.4 6.8
SP 20 18 -6 -3
NAC 19 4 -8 8.48
UG 18 10 0.1 -1.07
NAC 23 24 -0.2 3.5
SP 21 17 1 3.1
UG 21 17 -5 5.19
As stated, each data point corresponds to a specific group code; data should be relatively similar within each group. My data as a whole generally shows this, but there are outliers within these groups which we want to exclude, and I want to ensure we are excluding the correct data.
If I can get even more specific with the function and see outliers within the groups, great; but as long as I can identify outliers in Google Sheets, that could suffice.
To get the outliers, you must:
1. Calculate the first quartile (Q1): in Sheets, =QUARTILE(dataset, 1)
2. Calculate the third quartile (Q3): same as step 1, but with a different quartile number: =QUARTILE(dataset, 3)
3. Calculate the interquartile range (IQR): =Q3-Q1
4. Calculate the lower boundary (LB): =Q1-(1.5*IQR)
5. Calculate the upper boundary (UB): =Q3+(1.5*IQR)
By getting the lower and upper boundary, we can easily determine which data in our datasets are outliers.
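The same fences can be checked outside Sheets as well. A quick Python sketch of the steps above, using only the standard library (`method="inclusive"` mirrors the interpolation used by Sheets' QUARTILE; the sample values are made up):

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # method="inclusive" matches the interpolation of Sheets' QUARTILE.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lb or v > ub]

print(iqr_outliers([1, 2, 3, 4, 100]))  # only the 100 falls outside the fences
```

To respect your group structure, run this once per group (one call per group code) rather than on the pooled column.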
Example:
You can use conditional formatting to highlight the outliers: click Format -> Conditional formatting and use a custom formula that flags values below LB or above UB, e.g. =OR(A1<LB, A1>UB), with LB and UB replaced by the cells holding your boundaries.
Click Done and the outliers will be highlighted.
Reference:
QUARTILE
My dataset looks like this:
ID  Time  Date        v1  v2    v3  v4
1   2300  21/01/2002  1   996   5   300
1   0200  22/01/2002  3   1000  6   100
1   0400  22/01/2002  5   930   3   100
1   0700  22/01/2002  1   945   4   200
I have 50+ cases and 15+ variables, both categorical and measurement (although SPSS will not allow me to set them as Ordinal and Scale; I only have the options Nominal and Ordinal?).
I am looking for trends and cannot find a way to get SPSS to recognise each case as a whole rather than as individual rows. I have used a pivot table in Excel, which gives me the means for each variable, but I am aware that this can skew the results because it removes extreme readings (which I ideally need to keep).
I have searched this query online multiple times but I have come up blank so far, any suggestions would be gratefully received!
I'm not sure I understand. If you are saying that each case has multiple records (that is multiple lines of data) - which is what it looks like in your example - then either
1) Your DATA LIST command needs to change to add RECORDS= (see the Help for the DATA LIST command); or
2) You will have to use CASESTOVARS (C2V) to put all the variables for a case in the same row of the Data Editor.
I may not be understanding, though.
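For the second route, the reshaping that CASESTOVARS performs can be pictured with a small Python sketch: number the readings within each case, then spread them into suffixed columns. The record layout here is invented for illustration:

```python
from collections import defaultdict

# Hypothetical long-format records: one dict per reading, several per case.
long_rows = [
    {"ID": 1, "Time": "2300", "v1": 1},
    {"ID": 1, "Time": "0200", "v1": 3},
    {"ID": 1, "Time": "0400", "v1": 5},
    {"ID": 2, "Time": "0100", "v1": 7},
]

# Mirror CASESTOVARS: index readings within each case, then spread
# them into suffixed columns (Time.1, Time.2, ..., v1.1, v1.2, ...).
wide = defaultdict(dict)
counts = defaultdict(int)
for row in long_rows:
    counts[row["ID"]] += 1
    i = counts[row["ID"]]
    for var in ("Time", "v1"):
        wide[row["ID"]][f"{var}.{i}"] = row[var]

print(dict(wide))
```

After this restructuring each case occupies a single row, which is what the trend analyses expect.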
I would like to use attribute selection for a numeric data-set.
My goal is to find the best attributes that I will later use in Linear Regression to predict numeric values.
For testing, I used the autoPrice.arff that I obtained from here (datasets-numeric.jar).
Using ReliefFAttributeEval I get the following outcome:
Ranked attributes:
**0.05793 8 engine-size**
**0.04976 5 width**
0.0456 7 curb-weight
0.04073 12 horsepower
0.03787 2 normalized-losses
0.03728 3 wheel-base
0.0323 10 stroke
0.03229 9 bore
0.02801 13 peak-rpm
0.02209 15 highway-mpg
0.01555 6 height
0.01488 4 length
0.01356 11 compression-ratio
0.01337 14 city-mpg
0.00739 1 symboling
while using InfoGainAttributeEval (after applying a numeric-to-nominal filter) leaves me with the following results:
Ranked attributes:
6.8914 7 curb-weight
5.2409 4 length
5.228 2 normalized-losses
5.0422 12 horsepower
4.7762 6 height
4.6694 3 wheel-base
4.4347 10 stroke
4.3891 9 bore
**4.3388 8 engine-size**
**4.2756 5 width**
4.1509 15 highway-mpg
3.9387 14 city-mpg
3.9011 11 compression-ratio
3.4599 13 peak-rpm
2.2038 1 symboling
My question is:
How can I explain the contradiction between the two results? If the two methods use different algorithms to achieve the same goal (revealing the relevance of an attribute to the class), why does one say that, e.g., engine-size is important while the other says it is not?
There is no reason to think that RELIEF and Information Gain (IG) should give identical results, since they measure different things.
IG looks at the difference in entropies between not having the attribute and conditioning on it; hence, highly informative attributes (with respect to the class variable) will be the most highly ranked.
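To make that concrete, here is a minimal sketch of the IG computation, IG(A) = H(C) - H(C|A), on toy data (not the autoPrice attributes):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(attr, labels):
    """IG(A) = H(C) - H(C | A): the entropy drop from conditioning on A."""
    n = len(labels)
    groups = {}
    for a, c in zip(attr, labels):
        groups.setdefault(a, []).append(c)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

# Toy case: the attribute perfectly predicts the class, so IG equals H(C).
attr   = ["small", "small", "large", "large"]
labels = ["cheap", "cheap", "pricey", "pricey"]
print(info_gain(attr, labels))  # 1.0: H(C) = 1 bit, H(C|A) = 0
```

RELIEF has no such closed form: its score depends on the sampled instances and the number of neighbours, which is exactly why its ranking can wobble while IG's cannot.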
RELIEF, however, looks at random data instances and measures how well the feature discriminates classes by comparing to "nearby" data instances.
Note that RELIEF is the more heuristic (i.e., more stochastic) method, and the values and ordering you get depend on several parameters, unlike with IG.
So we would not expect algorithms optimizing different quantities to give the same results, especially when one is parameter-dependent.
However, I'd say that actually your results are pretty similar: e.g. curb-weight and horsepower are pretty close to the top in both methods.
I have this problem.
I have an SPSS sheet that looks like this (it's an analogy, so don't ask me how I have measured it). This example is about tennis players.
Player   % of points won   % of points won
         (own service)     (opponent's service)
1        50                10
2        80                60
3        70                40
4        80                50
Now I want to know if there's a difference between your own service, and the opponent's service, in terms of points won. (As you see, there probably is. But is it significant?)
Link to boxplot
So the hypothesis: own service -(+)-> points won
Now, Kruskal-Wallis, the independent-samples t-test, and one-way ANOVA all require a grouping variable or something similar, but here the grouping is already implied. I could have chosen to structure the data set as:
Player Own service Won
1 1 0
2 1 1
3 0 1
For all games, group them by own service and see if there is a statistical difference between these groups. Perfectly doable.
The first data set contains the same information, just presented differently. I simply want to compare means between variables, based only on their own values. Can SPSS handle this type of data as well?
If the two variables are inter-related like points in a tennis match (per your example), and organized in different columns like that, a paired-samples t-test should work for you.
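As a sanity check on the numbers in the question: the paired t statistic is just the mean of the per-player differences divided by its standard error. A hand-rolled sketch using the four players' values from the table:

```python
from statistics import mean, stdev
from math import sqrt

# % of points won for the four players (data from the question).
own      = [50, 80, 70, 80]
opponent = [10, 60, 40, 50]

# A paired t-test operates on the per-player differences.
diffs = [a - b for a, b in zip(own, opponent)]
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / sqrt(n))
print(f"t = {t:.2f} with {n - 1} degrees of freedom")  # t = 7.35 with 3 df
```

With 3 degrees of freedom the two-tailed 5% critical value is about 3.18, so this difference would come out significant; SPSS reports the exact p-value for you.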
Currently I am starting to develop a computer vision application that involves tracking of humans. I want to build ground-truth metadata for videos that will be recorded in this project. The metadata will probably need to be hand labeled and will mainly consist of location of the humans in the image. I would like to use the metadata to evaluate the performance of my algorithms.
I could of course build a labeling tool using, e.g., Qt and/or OpenCV, but I was wondering whether there is some kind of de facto standard for this. I came across Viper, but it seems dead and doesn't work as easily as I had hoped. Other than that, I haven't found much.
Does anybody here have some recommendations as to which software / standard / method to use both for the labeling as well as the evaluation? My main preference is to go for something c++ oriented, but this is not a hard constraint.
Kind regards and thanks in advance!
Tom
I've had another look at vatic and got it to work. It is an online video annotation tool meant for crowdsourcing via a commercial service, and it runs on Linux. However, there is also an offline mode, in which the crowdsourcing service is not required and the software runs standalone.
The installation is described quite thoroughly in the enclosed README file. It involves, among other things, setting up an Apache and a MySQL server, some Python packages, and FFmpeg. It is not that difficult if you follow the README. (The issues I mentioned with my proxy were not related to this software package.)
You can try the online demo. The default output is like this:
0 302 113 319 183 0 1 0 0 "person"
0 300 112 318 182 1 1 0 1 "person"
0 298 111 318 182 2 1 0 1 "person"
0 296 110 318 181 3 1 0 1 "person"
0 294 110 318 181 4 1 0 1 "person"
0 292 109 318 180 5 1 0 1 "person"
0 290 108 318 180 6 1 0 1 "person"
0 288 108 318 179 7 1 0 1 "person"
0 286 107 317 179 8 1 0 1 "person"
0 284 106 317 178 9 1 0 1 "person"
Each line contains 10+ columns, separated by spaces. The definitions of these columns are:
1 Track ID. All rows with the same ID belong to the same path.
2 xmin. The top left x-coordinate of the bounding box.
3 ymin. The top left y-coordinate of the bounding box.
4 xmax. The bottom right x-coordinate of the bounding box.
5 ymax. The bottom right y-coordinate of the bounding box.
6 frame. The frame that this annotation represents.
7 lost. If 1, the annotation is outside of the view screen.
8 occluded. If 1, the annotation is occluded.
9 generated. If 1, the annotation was automatically interpolated.
10 label. The label for this annotation, enclosed in quotation marks.
11+ attributes. Each column after this is an attribute.
But it can also provide output in XML, JSON, Pickle, LabelMe, and Pascal VOC formats.
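The default space-separated output is easy to parse. A minimal sketch (the `Annotation` class is mine; the field names follow the column list above, and trailing attribute columns, if any, would be `parts[10:]`):

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    track_id: int
    xmin: int
    ymin: int
    xmax: int
    ymax: int
    frame: int
    lost: bool
    occluded: bool
    generated: bool
    label: str

def parse_line(line):
    """Parse one line of vatic's default text output."""
    parts = line.split()
    nums = [int(p) for p in parts[:9]]
    return Annotation(nums[0], nums[1], nums[2], nums[3], nums[4], nums[5],
                      bool(nums[6]), bool(nums[7]), bool(nums[8]),
                      parts[9].strip('"'))

ann = parse_line('0 302 113 319 183 0 1 0 0 "person"')
print(ann.label, ann.frame, ann.lost)  # person 0 True
```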
So, all in all, this does pretty much what I wanted, and it is also rather easy to use.
I am still interested in other options though!
LabelMe is another open annotation tool. I think it is less suitable for my particular case, but it is still worth mentioning. It seems to be oriented toward blob labeling.
This is a problem that all practitioners of computer vision face. If you're serious about it, there is a company that does it for you via crowdsourcing. I don't know whether I should post a link to it on this site, though.
I've had the same problem looking for a tool to use for image annotation to build a ground truth data set for training models for image analysis.
LabelMe is a solid option if you need polygonal outlining for your annotation. I've worked with it before and it does the job well and has some additional cool features when it comes to 3d feature extraction. In addition to LabelMe, I also made an open source tool called LabelD. If you're still looking for a tool to do your annotation, check it out!