What does this "classifiers": [] mean? - watson

Hi, I am currently trying to use Watson's Visual Recognition service and I am getting a really weird response. After reading the documentation I am guessing this photo doesn't meet the threshold value, but I am not actually sure. Here's a snippet of one of my responses:
{ "classifiers": [{
"classes": [ { "class": "classname", "score": 0.522029 } ],
"classifier_id": "normalLeft_329785087", "name": "normalLeft" } ],
"image": "Testing_Left.zip/80589N.jpg"
},
{
"classifiers": [],
"image": "Testing_Left.zip/81860Y.jpg"
},
Another issue related to this is that sometimes my zip files aren't recognized by Watson. Is there any particular reason why Watson would have difficulties with zip files?
Thanks for the help in advance.

After reading the documentation I am guessing this photo doesn't meet the threshold value, but I am not actually sure.
That's exactly it. It means none of the classes in the classifiers applied to the image Testing_Left.zip/81860Y.jpg returned a score above the threshold. By default for custom classifiers, the threshold is 0.5. You can set the threshold parameter to 0 if you would like to see every score per class per image.
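To make the threshold behaviour concrete, here is a minimal local sketch in Python. It does not call the Watson service. The function name apply_threshold and the raw scores are made up for illustration, and it assumes the service drops every class scoring below the threshold and then omits any classifier left with no classes.

```python
def apply_threshold(raw_scores, threshold=0.5):
    """Mimic the filtering described above (illustrative, not Watson's code).

    raw_scores maps classifier name -> {class name: score}.
    """
    classifiers = []
    for name, classes in raw_scores.items():
        kept = [
            {"class": cls, "score": score}
            for cls, score in classes.items()
            if score >= threshold  # assumption: scores at the threshold are kept
        ]
        if kept:  # a classifier with no surviving classes is omitted
            classifiers.append({"name": name, "classes": kept})
    return {"classifiers": classifiers}


# Hypothetical raw scores for two images.
confident = {"normalLeft": {"classname": 0.522029}}
unsure = {"normalLeft": {"classname": 0.31}}

print(apply_threshold(confident))
print(apply_threshold(unsure))               # {'classifiers': []}
print(apply_threshold(unsure, threshold=0))  # the low score becomes visible
```

With the default of 0.5 the second image yields an empty list, which matches the "classifiers": [] entry in the question. Lowering the threshold to 0 exposes the score that was hidden.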
Is there any particular reason why watson would have difficulties with zip files?
We have observed problems with some zip files with files or directories inside which have extended character sets, such as accented letters. Could that be the case for you?
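One way to check for this before uploading is to list the archive members whose names contain non-ASCII characters. The sketch below uses only Python's standard zipfile module. The helper name non_ascii_entries and the file names are invented for the demonstration, and it assumes the problem lies in the entry names rather than in the file contents.

```python
import io
import zipfile


def non_ascii_entries(zip_source):
    """Return the names of archive members that contain non-ASCII characters."""
    with zipfile.ZipFile(zip_source) as archive:
        return [name for name in archive.namelist() if not name.isascii()]


# Build a small in-memory archive to demonstrate (file names are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("images/80589N.jpg", b"fake image bytes")
    archive.writestr("images/café.jpg", b"fake image bytes")

buffer.seek(0)
print(non_ascii_entries(buffer))  # ['images/café.jpg']
```

Passing a path such as "Testing_Left.zip" instead of the in-memory buffer works the same way. Renaming any reported entries to plain ASCII and re-zipping is a cheap experiment to see whether the archive is then accepted.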

Avoid reshuffling when using state or timers?

I find myself in situations writing Beam pipelines where I want to use state or timers, where the data may already be sharded a certain way by a previous GroupByKey which I do not want to disturb. But the API says state or timers requires KV inputs to the PTransform. Perhaps stateful/timerful transforms do a GroupByKey internally.
Is there a way to use state/timers without resharding/reshuffling?
Here is a concrete use case: I am running into a performance problem when implementing my own metrics collection instead of using Beam's built-in Stackdriver metrics. The system lag of transforms involving shuffling starts shooting up and, in general, does not recover after some time.
Here is the relevant code where I funnel into one key the metric values from various places where the data is potentially sharded differently, simply because I needed to use timers.
metricsFromVariousPlaces
  .apply(Flatten.pCollections())
  .apply(WithKeys.of(null.asInstanceOf[Void]))
  .apply("write influx points",
    new InfluxSinkTransform(...))
InfluxSinkTransform requires timers in order to flush writes to InfluxDB in a timely fashion.
I understand this causes reshuffling because now the data is all under one shard. I expect this reshuffling to be expensive and hope to avoid it if possible.
I tried preserving the keys from the previous transform, but it looks like there is still shuffling:
"stage_id": "S68",
"stage_name": "F503",
"fused_steps": [
...
{
"name": "s29.org.apache.beam.sdk.values.PCollection.<init>:402#b70c45c110743c2b-write-streaming-shuffle430",
"originalTransform": "snapshot/MapElements/Map.out0",
"userName": "snapshot/MapElements/Map.out0/FromValue/WriteStream"
}
]
...
{
"stage_id": "S74",
"stage_name": "F509",
"fused_steps": [
{
"name": "s29.org.apache.beam.sdk.values.PCollection.<init>:402#b70c45c110743c2b-read-streaming-shuffle431",
"originalTransform": "snapshot/MapElements/Map.out0",
"userName": "snapshot/MapElements/Map.out0/FromValue/ReadStream"
},
{
"name": "s30",
"originalTransform": "snapshot/write influx points/ParDo(Do)",
"userName": "snapshot/write influx points/ParDo(Do)"
}
]
There is no way to use state or timers without inducing shuffling, even when the keys stay unchanged.

Training spaCy's NER model from scratch on CoNLL 2003 data got very weird results

I'm trying to train NER models from scratch using spaCy. I wanted to first try it out on CoNLL 2003 data, which is widely used as a baseline for NER systems.
The following are the commands I ran:
spacy convert -c ner train.txt valid.txt test.txt spacyConverted
cd spacyConverted
python -m spacy train en trained train.txt.json valid.txt.json --no-tagger --no-parser
mkdir displacy
python -m spacy evaluate trained/model-final test.txt.json --displacy-path displacy
However, the evaluation results on test data are very weird and totally off, as seen in the following displacy output.
The precision, recall and f1 scores are very low both during the training and the evaluation.
I do believe that the commands are correct and in accordance with the documentation. What could be the possible problem here? Could it be that I must supply some word vectors as well? If so, how do I supply those that come by default in spaCy? Or could it be that one cannot use --no-tagger --no-parser?
The converted .json files look like the following:
[
{
"id":0,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"orth":"-DOCSTART-",
"tag":"-X-",
"ner":"O"
}
]
},
{
"tokens":[
{
"orth":"EU",
"tag":"NNP",
"ner":"U-ORG"
},
{
"orth":"rejects",
"tag":"VBZ",
"ner":"O"
},
{
"orth":"German",
"tag":"JJ",
"ner":"U-MISC"
},
{
"orth":"call",
"tag":"NN",
"ner":"O"
},
{
"orth":"to",
"tag":"TO",
"ner":"O"
},
...
EDIT: It seemed that I actually needed to pass in the --gold-preproc flag for the training to properly work. But I'm not sure what it actually means in this context.
I think you have a problem in your pre-processing. Review the pre-processing step: you have to collect a list of sentences, where each line is a token and a blank line separates sentences from each other.
Also, pay attention to these tokens:
-DOCSTART-
They are just separators between documents. I had that problem as well, and my results were bad. If you want, have a look at how I pre-process it; I use it for other purposes, not for spaCy.
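As an illustration of that pre-processing (not the answerer's own code, which is not shown here), the following Python sketch groups CoNLL-style lines into sentences and skips the -DOCSTART- separators. The function name read_conll_sentences is hypothetical, and the sample lines are invented; they follow the usual CoNLL 2003 column layout of token, POS tag, chunk tag, NER tag.

```python
def read_conll_sentences(lines):
    """Group CoNLL-style lines into sentences of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Blank lines end a sentence; -DOCSTART- only marks a new document.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))  # token is first, NER tag last
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in the CoNLL 2003 column layout.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

for sentence in read_conll_sentences(sample):
    print(sentence)
```

The separator line never becomes a one-token sentence of its own, unlike the first "sentences" entry in the converted JSON shown in the question.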

What does '(Required)' mean exactly in get_survey_details?

https://developer.surveymonkey.com/mashery/get_survey_details
says (for example)
data.pages[].questions[].answers[_].is_answer
(Required)
What does (Required) mean there? Not all responses contain all (Required) fields so it does not mean that a field thus marked is always returned.
Just wondering,
Patrick
Edit 25-July: I am asking this in the context of an "Other (please specify)" option.
Here is an example
{"text": "Annet (vennligst spesifiser)", "visible": true, "is_answer": true, "apply_all_rows": false, "type": "other", "answer_id": "6886575992"}]
Can I be sure that the presence of an "is_answer" field means that this is a free text input to a multiple-choice single-response question? Optiontype 10 in the old RDD format. I think optiontype 10 was not counted as a response in Responses.xls while optiontype 11 was in both Responses.xls and ResponsesText.xls but my memory of that is fading. Anyway, that is past now so I just want to be sure I correctly identify this response type.
"required" means that a value will always be returned by the API. Other attributes marked as "optional" will only be returned for certain kinds of data. For example "data.pages[].questions[].answers[].items[].type" which is marked "optional" will only be returned for matrix question types as described in the description for that attribute to the right.

Add additional data to a Highcharts series for use in formatters

My question is exactly the same as the OP in this question:
Set Additional Data to highcharts series
But the accepted answer explains how to add additional data to the point, not the series, without saying if it's possible to do with the series or not.
I would like to be able to define a series like:
series: [
{"hasCustomFlag": true, "name": "s1", "data": [...]},
{"hasCustomFlag": false, "name": "s2", "data": [...]},
]
and be able to use point.series.hasCustomFlag inside of a formatting function. Is this possible?
I don't want to put the data on the point level, because that means I'd have to duplicate the data far too many times.
Yes, this is possible. The extra configuration properties are located under the options property (this.series refers to the series instance, not the configuration object). See the reference here and scroll down to the properties section.
So instead use this line in the formatter:
if (this.series.options.hasCustomFlag) { ... }
Full example on jsfiddle
This appears to have been revised with later iterations of HighCharts/HighStocks. The jsfiddle example no longer works. Using the "this.series.options.hasCustomFlag" syntax results in "undefined". The debugger shows the data I'm looking for is in "this.series.userOptions.data" - an unsorted very large array, but the entire series is there - not the specific record data you normally get with this.x or this.y.

Free City/State/Zipcode to Latitude/Longitude Database? [closed]

Is there a standard database of records mapping city/state/zip to lat/lng? I found this database, and the USPS has a simple (no lat/lng) API, but is there another recommended alternative?
Update: Found this too: http://www.geopostcodes.com/
Just a small note. Most of these 3rd party city/state/zip to lat/lng databases are based on the US Census Tiger data. Starting with Census 2000, the zip code data was replaced with ZCTAs, which are an approximation of ZIP Codes. Here's an explanation from the Census site (from 2011):
ZIP Code Tabulation Areas (ZCTAs™) are a new statistical entity developed by the U.S. Census Bureau for tabulating summary statistics from Census 2000. This new entity was developed to overcome the difficulties in precisely defining the land area covered by each ZIP Code®. Defining the extent of an area is necessary in order to accurately tabulate census data for that area.
ZCTAs are generalized area representations of U.S. Postal Service (USPS) ZIP Code service areas. Simply put, each one is built by aggregating the Census 2000 blocks, whose addresses use a given ZIP Code, into a ZCTA which gets that ZIP Code assigned as its ZCTA code. They represent the majority USPS five-digit ZIP Code found in a given area. For those areas where it is difficult to determine the prevailing five-digit ZIP Code, the higher-level three-digit ZIP Code is used for the ZCTA code. For more information, please refer to the ZCTA (FAQ) Frequently Asked Questions Web page.
The link below is an updated explanation (2013):
http://www.census.gov/geo/reference/zctas.html
The OpenGeoCode.Org team
ADDED 12/17/13: Our (FREE) state/city/zip dataset (CSV) can be found at the link below. It is derived from public domain "government" datasets:
http://www.opengeocode.org/download.php#cityzip
Google offers this as a lookup. Can you do ajax calls from your app?
It's called webservices.
http://code.google.com/apis/maps/documentation/webservices/index.html
You'd want to use the Google Geocoding api.
It's simple to use, make a call to this url:
http://maps.googleapis.com/maps/api/geocode/json?address=sydney&sensor=false
Change "address=" to whatever you need (i.e. the city, state, and zip code).
It can also reply in XML; just change json to xml in the URL.
http://code.google.com/apis/maps/documentation/geocoding/
Example Result
{
"status": "OK",
"results": [ {
"types": [ "locality", "political" ],
"formatted_address": "Sydney New South Wales, Australia",
"address_components": [ {
"long_name": "Sydney",
"short_name": "Sydney",
"types": [ "locality", "political" ]
}, {
"long_name": "New South Wales",
"short_name": "New South Wales",
"types": [ "administrative_area_level_1", "political" ]
}, {
"long_name": "Australia",
"short_name": "AU",
"types": [ "country", "political" ]
} ],
"geometry": {
"location": {
"lat": -33.8689009,
"lng": 151.2070914
},
"location_type": "APPROXIMATE",
"viewport": {
"southwest": {
"lat": -34.1648540,
"lng": 150.6948538
},
"northeast": {
"lat": -33.5719182,
"lng": 151.7193290
}
},
"bounds": {
"southwest": {
"lat": -34.1692489,
"lng": 150.5022290
},
"northeast": {
"lat": -33.4245980,
"lng": 151.3426361
}
}
}
} ]
}
Then all you need to do is read results[0].geometry.location.lat and results[0].geometry.location.lng.
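For example, pulling the coordinates out of such a response could look like the Python sketch below. It parses a trimmed copy of the example result above rather than calling the service, and the helper name first_location is made up for illustration.

```python
import json


def first_location(response_text):
    """Return (lat, lng) of the first result, or None if there is none."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Trimmed copy of the example result shown above.
sample = """
{
  "status": "OK",
  "results": [ {
    "formatted_address": "Sydney New South Wales, Australia",
    "geometry": {
      "location": { "lat": -33.8689009, "lng": 151.2070914 },
      "location_type": "APPROXIMATE"
    }
  } ]
}
"""

print(first_location(sample))  # (-33.8689009, 151.2070914)
```

Checking the status field and the length of results first avoids an IndexError when the lookup finds nothing.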
[EDIT 8/3/2015]
The free non-commercial ZIP Code database I mentioned below has moved to softwaretools.com. Note: greatdata.com still has the premium ZIP Code data for enterprises.
Just a small note. Most of these 3rd party city/state/zip to lat/lng databases are based on the US Census Tiger data. [Andrew]
I'm a developer for a commercial ZIP Code database company (GreatData). For low-end data, Andrew's recommendation is correct, and if you know your way around census data, it's pretty easy to get it. Just know it may initially take some hours to get it right. If you prefer not to do the work yourself, you can get our free/non-commercial version here. It's pretty much what Andrew is suggesting with minor enhancements, and it's updated every couple of months.
For a really good explanation of what is missing in it (and more importantly, what's missing in almost all low-end ZIP Code data that is based on census ZCTA data) versus commercial-grade data, see here.
PS: regarding suggestions to use Google's API, I see this suggested a lot, but unless you're displaying the results on a Google map, this violates Google's TOS. Specifically: "The Geocoding API may only be used in conjunction with a Google map; geocoding results without displaying them on a map is prohibited." You'll find Stack Overflow has several threads from people whose sites have been blocked.
Hope this is beneficial
This is an old thread, but I was looking for the same information recently and also came across this free database:
http://federalgovernmentzipcodes.us
Check out ZCTA. You can draw the geographic boundaries of a zip code using their data.
If you have a small number of US cities you can easily build up your own database from Google which gives you the co-ordinates straight on the search page without the need to follow any links, e.g. type:
chicago illinois longitude latitude
