Avro schema definition with two primitive types

Can I define an Avro schema with 2 primitive types? Like so:
{ "type": "record"
"name": "MyData",
"fields" : [
{"name":"my_number","type":["int","long"]}, <<<<< THIS
{"name":"my_string","type":["null","string"]}
]
}
Note: <<<<< THIS is just to highlight how I want to define it.
Or, since I could get long values, should I just use long?

Yes, this sort of type is known as a Union, see https://avro.apache.org/docs/current/spec.html#Unions
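For reference, here is the schema from the question written out as a valid union (the snippet above is only missing a comma):
{
  "type": "record",
  "name": "MyData",
  "fields": [
    {"name": "my_number", "type": ["int", "long"]},
    {"name": "my_string", "type": ["null", "string"]}
  ]
}
As for the second question: if the field may hold values outside the int range, simply declaring it as "long" is the simpler choice; the ["int","long"] union is only worth it if you need to preserve which of the two types each record was written with.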

Related

Swift $sort order for BSON RealmSwift document

TLDR:
How can I write a document with a definitive order for the key-value pairs to pass as the key to the $sort stage?
I stumbled across very strange behaviour in my aggregate query, which I am running in Swift (using the RealmSwift package) through:
db.collection(withName: "Example").aggregate(pipeline: aggregatePipeline) { ... }
This is the stage that is behaving very strangely:
let aggregatePipeline: [Document] = [
[
"$sort": [ "seen": 1 , "score": -1, "_id": -1 ]
]
]
The results I get are sorted by these three keys, but in a seemingly arbitrary order each time I run the query. Sometimes the "seen" sort is prioritised (as it should be), but sometimes it sorts first by "_id" or by "score". When I print the stage right after defining it by running:
print(aggregatePipeline[0])
I do see the probable reason for those odd results:
first run:
["$sort": Optional(RealmSwift.AnyBSON.document(["_id": Optional(RealmSwift.AnyBSON.int64(-1)), "seen": Optional(RealmSwift.AnyBSON.int64(1)), "score": Optional(RealmSwift.AnyBSON.int64(-1))]))]
second run:
["$sort": Optional(RealmSwift.AnyBSON.document(["seen": Optional(RealmSwift.AnyBSON.int64(1)), "score": Optional(RealmSwift.AnyBSON.int64(-1)), "_id": Optional(RealmSwift.AnyBSON.int64(-1))]))]
How can I write this query with a definitive sorting order? Or, more simply: how can I write a document with a definitive order for the key-value pairs to pass as the key to the $sort stage? I found nothing in the relevant docs, or especially this, where it should definitely be stated how this could be accomplished, since that would be the most sensible way of doing it, but obviously it is the wrong one.
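For what it's worth, the root cause is reproducible without Realm at all: a Swift [String: ...] dictionary literal makes no guarantee about key order, and iteration order can even change between runs of the same program, so any $sort document built from a dictionary literal may serialize its keys differently each time. A minimal sketch of just that effect (plain Swift, nothing here is RealmSwift API):
// Dictionary key order is not guaranteed, so this "sort spec"
// may print its keys in a different order on each run.
let sortSpec: [String: Int] = ["seen": 1, "score": -1, "_id": -1]
for (key, direction) in sortSpec {
    print(key, direction)
}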

Partitions not in metastore ERROR on Athena

I'm trying to partition data by a column. However, when I run the query MSCK REPAIR TABLE mytable, it returns the error:
Partitions not in metastore: city:countrycode=AFG city:countrycode=AGO city:countrycode=AIA city:countrycode=ALB city:countrycode=AND city:countrycode=ANT city:countrycode=ARE
I created the table from Avro by this query:
CREATE external table city (
ID int,
Name string,
District string,
Population int
)
PARTITIONED by (CountryCode string)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
"fields": [
{
"name": "ID",
"type": "int"
},
{
"name": "Name",
"type": "string"
},
{
"name": "countrycode",
"type": "string"
},
{
"name": "District",
"type": "string"
},
{
"name": "Population",
"type": "int"
}
],
"name": "value",
"namespace": "world.city",
"type": "record"
}
')
STORED AS AVRO
LOCATION "s3://mybucket/city"
My partitions look like s3://mybucket/city/countrycode=ABC
This is an old question, and Athena seems to have added a warning message for this, but in case anybody else misses it the first several times they try something similar...
Here is the message Athena gives when you create the table:
Query successful. If your table has partitions, you need to load these partitions to be able to query data. You can either load all partitions or load them individually. If you use the load all partitions (MSCK REPAIR TABLE) command, partitions must be in a format understood by Hive. Learn more.
It seems that the codes you are using to partition don't work with Hive (I was doing something similar, partitioning by a grouping code). So, instead of MSCK REPAIR TABLE, you need to run an ALTER TABLE for each partition (see: https://docs.aws.amazon.com/athena/latest/ug/partitions.html)
ALTER TABLE city ADD PARTITION (CountryCode='ABC') location 's3://mybucket/city/ABC/' ;
...and you'll have to run that each time you add a new country-code bucket.
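If there are many country codes, Athena also accepts several partitions in one ALTER TABLE statement; a sketch, assuming the countrycode=ABC key layout from the question (the DEF code is just a placeholder):
ALTER TABLE city ADD
  PARTITION (CountryCode='ABC') LOCATION 's3://mybucket/city/countrycode=ABC/'
  PARTITION (CountryCode='DEF') LOCATION 's3://mybucket/city/countrycode=DEF/';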
You definitely need a trailing slash in your location:
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
Maybe also try lowercase for the partition column PARTITIONED by (countrycode string).
Did you try to add the partitions manually in Glue Catalog or via Crawler? Did this work?

Is there a way to dynamically generate nodes from JSON with apoc.load.json procedure?

I would like to create a set of nodes and relationships from a JSON document. Here is sample JSON:
{"records": [{
"type": "bundle",
"id": "bundle--1",
"objects": [
{
"type": "evaluation",
"id": "evaluation--12345",
"name": "Eval 1",
"goals": [
"test"
]
},
{
"type": "subject",
"id": "subject--67890",
"name": "Eval 2",
"goals": [
"execute"
]
},
{
"type": "relationship",
"id": "relationship--45678",
"relationship_type": "participated-in",
"source_ref": "subject--67890",
"target_ref": "evaluation--12345"
}
]
}]
}
And I would like that JSON to be represented in Neo similar to the following:
(:evaluation {properties})<-[:RELATIONSHIP]-(:subject {properties})
Ultimately, I would like to have a model that represents the evaluation, subject, and relationship, generated via a few Cypher calls with as little outside manipulation as possible. Is it possible to use the apoc.create.* set of calls to create the necessary nodes and relationships from this JSON? I have tried something similar to the following to get this JSON to load, and I can get it to create nodes of an arbitrary type (in this case "object").
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value
UNWIND value.objects as object
MERGE (o:object {id: object.id, type: object.type, name: object.name})
RETURN count(*)
I have tried changing the JSONPath expression to filter different record types, but it is difficult to run a Goessner path like $.records..objects[?(#.type = 'subject')] thanks to the embedded quotes. This would also lead to multiple runs (I have 15 or so different types) against the real JSON, which could be very time-consuming. The Load JSON docs have a simple filter expression, and there is a blog post that shows how to parse Stack Overflow, but those JSON objects are keyed in a way that is easy to map in Cypher. Is there a Cypher trick or an APOC procedure I should be aware of that can help me solve this problem?
I would approach this as a two-pass method:
First pass: create the nodes for evaluation and subject. You could use apoc.do.case/when if helpful
Second pass: only scan for relationship and then do a MATCH to find the evaluation and subject nodes based on the source_ref and target_ref, and then MERGE or CREATE the relationship to connect them.
This way you're not impacted by situations such as the relationship coming before the nodes it connects, or by how many items you've got within objects (see the sketch below).
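A minimal sketch of that two-pass approach, assuming the sample file from the question and using apoc.merge.node so the label comes from object.type (run the two statements separately; the property names are illustrative):
// Pass 1: create evaluation and subject nodes, labelled by their type
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value
UNWIND value.objects AS object
WITH object WHERE object.type IN ["evaluation", "subject"]
CALL apoc.merge.node([object.type], {id: object.id}, {name: object.name}) YIELD node
RETURN count(node)

// Pass 2: connect the nodes referenced by each relationship object
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value
UNWIND value.objects AS object
WITH object WHERE object.type = "relationship"
MATCH (src {id: object.source_ref}), (tgt {id: object.target_ref})
MERGE (src)-[r:PARTICIPATED_IN]->(tgt)
RETURN count(r)
The relationship type is hard-coded in this sketch; apoc.merge.relationship could derive it from object.relationship_type instead.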
As Lju pointed out, the apoc.do.case function can be used to create a set of conditions to check, followed by a cypher statement. Combining that with another apoc call requires the returns from each apoc call to be handled properly. My answer ended up looking like the following:
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value as records
UNWIND records.objects as object
CALL apoc.do.case(
[object.type="evaluation", "MERGE (:Evaluation {id: object.id}) ON CREATE SET Evaluation.id = object.id, Evaluation.prop1 = object.prop1",
object.type="subject", "MERGE (:Subject {id: object.id}) ON CREATE SET Subject.id = object.id, Subject.prop1 = object.prop1",
....]
"<default case cypher query goes here>", {object:object}
)
YIELD value RETURN count(*)
Notice there are two apoc calls that YIELD; use aliases to help the parser differentiate between the yielded values. The documentation for apoc.do.case is a little sparse but describes the syntax for the statement. It looks like there are other ways to accomplish this task, but with smaller JSON files and a handful of cases, this works well enough.

Is it advisable to use MapReduce to 'flatten' irregular entities in CouchDB?

In a question on CouchDB I asked previously (Can you implement document joins using CouchDB 2.0 'Mango'?), the answer mentioned creating domain objects instead of storing relational data in Couch.
My use case, however, is not necessarily to store relational data in Couch but to flatten relational data. For example, I have the entity of Invoice that I collect from several suppliers. So I have two different schemas for that entity.
So I might end up with 2 docs in Couch that look like this:
{
"type": "Invoice",
"subType": "supplier B",
"total": 22.5,
"date": "10 Jan 2017",
"customerName": "me"
}
{
"type": "Invoice",
"subType": "supplier A",
"InvoiceTotal": 10.2,
"OrderDate": <some other date format>,
"customerName": "me"
}
I also have a doc like this:
{
"type": "Customer",
"name": "me",
"details": "etc..."
}
My intention then is to 'flatten' the Invoice entities, and then join on the reduce function. So, the map function looks like this:
function(doc) {
switch(doc.type) {
case 'Customer':
emit(doc.customerName, { doc information ..., type: "Customer" });
break;
case 'Invoice':
switch (doc.subType) {
case 'supplier B':
emit (doc.customerName, { total: doc.total, date: doc.date, type: "Invoice"});
break;
case 'supplier A':
emit (doc.customerName, { total: doc.InvoiceTotal, date: doc.OrderDate, type: "Invoice"});
break;
}
break;
}
}
Then I would use the reduce function to compare docs with the same customerName (i.e. a join).
Is this advisable using CouchDB? If not, why?
First of all, apologies for getting back to you late; I thought I'd look at it directly, but I haven't been on SO since we exchanged the other day.
Reduce functions should only be used to reduce scalar values, not to aggregate data. So you wouldn't use them to achieve things such as joins or removing duplicates, but you would, for example, use them to compute the number of invoices per customer (sketched below). The reason is that you can only make weak assumptions about the calls made to your reduce functions (the order in which records are passed, the rereduce parameter, etc.), so you can easily end up with serious performance problems.
But this is by design, since the intended usage of reduce functions is to reduce scalar values. An easy way to think about it is to say that no filtering should ever happen in a reduce function; filtering and things such as checking keys should be done in the map function.
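For the invoice-count example, a minimal sketch (the view layout is a placeholder; the Invoice shape is the one from the question):
// map function: one row per invoice, keyed by customer
function (doc) {
  if (doc.type === "Invoice") {
    emit(doc.customerName, null);
  }
}
// reduce function: the built-in _count
// Query the view with ?group=true to get one count per customerName.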
If you just want to compare docs with the same customer name, you do not need a reduce function at all; you can query your view with the following parameters:
startkey=["customerName"]
endkey=["customerName", {}]
Otherwise you may want to create a separate view to filter on customers first, and return their names and then use these names to query your view in a bulk manner using the keys view parameter. Startkey/endkey is good if you only want to filter one customer at a time, and/or need to match complex keys in a partial way.
If what you are after are the numbers, you may want to do:
if(doc.type == "Invoice") {
emit([doc.customerName, doc.supplierName, doc.date], doc.amount)
}
And then use the _stats built-in reduce function to get statistics on the amount (sum, min, max, etc.).
So, to get the amount spent with a supplier, you'd just need to make a reduce query to your view and use the parameter group_level=2 to aggregate by the first 2 elements of the key. You can combine this with startkey and endkey to filter specific values of this key:
startkey=["name1", "supplierA"]
endkey=["name1", "supplierA", {}]
You can then build from this example to do things such as:
if(doc.type == "Invoice") {
emit(["BY_DATE", doc.customerName, doc.date], doc.amount);
emit(["BY_SUPPLIER", doc.customerName, doc.supplierName], doc.amount);
emit(["BY_SUPPLIER_AND_DATE", doc.customerName, doc.supplierName, doc.date], doc.amount);
}
Hope this helps
It is totally OK to "normalize" your different schemas (or subTypes) via a view. You cannot create views based on those normalized schemas, though, and in the long run it might be hard to manage different schemas.
The better solution might be to normalize the documents before writing them to CouchDB. If you still need the documents in their original schema, you can add a sub-property original where you store your documents in their original form. This would make working on data much easier:
{
"type": "Invoice",
"total": 22.5,
"date": "2017-01-10T00:00:00.000Z",
"customerName": "me",
"original": {
"supplier": "supplier B",
"total": 22.5,
"date": "10 Jan 2017",
"customerName": "me"
}
},
{
"type": "Invoice",
"total": 10.2,
"date": "2017-01-12T00:00:00:00.000Z,
"customerName": "me",
"original": {
"subType": "supplier A",
"InvoiceTotal": 10.2,
"OrderDate": <some other date format>,
"customerName": "me"
}
}
I'd also convert the date to ISO format because it parses well with new Date(), sorts correctly, and is human-readable. You can easily emit invoices grouped by year, month, day, and whatever else with that (see the sketch below).
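A sketch of such a view, assuming the normalized documents above (with the built-in _sum reduce to total the amounts):
// map function: key each invoice by [year, month, day] so group_level can roll them up
function (doc) {
  if (doc.type === "Invoice") {
    var d = new Date(doc.date);
    emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()], doc.total);
  }
}
// reduce function: _sum
// Query with ?group_level=1 for yearly totals, =2 for monthly, =3 for daily.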
Use reduce preferably only with built-in functions, because reduces have to be re-executed on queries, and executing JavaScript on many documents is a complex and time-intensive operation, even if the database has not changed at all. You can find more information about the reduce process in the CouchDB documentation. It makes more sense to preprocess the documents as much as you can before storing them in CouchDB.

How to get the distance or radians between two points on the earth with lng and lat?

How can I get the distance or the radians between two points on the earth, given lng and lat?
You probably don't want mapReduce in this case but rather the aggregation framework. As the general first-stage query you can run $geoNear, which is more efficient for your purpose.
db.places.aggregate([
{ "$geoNear": {
"near": {
"type": "Point",
"coordinates": [ -88 , 30 ]
},
"distanceField": "dist"
}},
{ "$match": {
"loc": {
"$centerSphere": [ [ -88 , 30 ] , 0.1 ]
}
}}
])
Or frankly, because the initial $geoNear stage will "project" an additional field into the document containing the "distance" from the queried "point of origin", you can just "filter" on that element in a subsequent stage:
db.places.aggregate([
{ "$geoNear": {
"near": {
"type": "Point",
"coordinates": [ -88 , 30 ]
},
"distanceField": "dist"
}},
{ "$match": {
"dist": { "$lte": 0.1 }
}}
])
Since this is one option that can "produce/project" a value representing the distance in the result, it satisfies your first criterion. The "chaining" nature of the aggregation framework allows "additional filtering", or any other operation you need, to be performed after the filtering of the initial query.
So $geoWithin works just as well in the aggregation framework under a $match stage as it would in any standard query, since it is not "dependent" on a geospatial "index" being present. It performs better in an initial query with one, but it does not need it.
Since your requirement is the "distance" from the point of origin, the most logical thing to do is to perform an operation that returns exactly that information, as this does.
Would love to include all of the relevant links in this response, but as a new responder, two links are all I am allowed for now.
One more relevant note:
The measurement of "distance" or "radius" in any operation depends on how your data is stored. If it is in a "legacy" key/pair or plain-array format, the value will be expressed in "radians"; where the data is expressed in GeoJSON format on the location field, the "distance data" is expressed in "meters" instead.
That is an important consideration given the libraries implemented by the MongoDB service and how they interact with the data as you have it stored. There is of course documentation on this in the official resources, should you care to look at it properly. And again, I cannot add those links at this time, unless this response sees some much-needed love.
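As a concrete illustration of that unit difference (a sketch in the mongo shell, using the same collection and point of origin as above): $centerSphere always interprets its radius in radians, so you convert a distance by dividing by the Earth's radius in the same unit, roughly 6378.1 km:
// 100 km expressed in radians
var earthRadiusKm = 6378.1;
var radiusInRadians = 100 / earthRadiusKm;   // roughly 0.0157

// find places within ~100 km of the point of origin
db.places.find({
  "loc": { "$centerSphere": [ [ -88, 30 ], radiusInRadians ] }
})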
https://github.com/mongodb/mongo/blob/master/src/third_party/s2/s2latlng.cc#L37
GetDistance() returns an S1Angle; S1Angle::radians() will return the radians.
This belongs to the s2-geometry-library (Google Code will close; I exported it to my GitHub. Java).
