Partitions not in metastore ERROR on Athena - avro

I'm trying to partition data by a column. However, when I run the query MSCK REPAIR TABLE mytable, it returns the error
Partitions not in metastore: city:countrycode=AFG city:countrycode=AGO city:countrycode=AIA city:countrycode=ALB city:countrycode=AND city:countrycode=ANT city:countrycode=ARE
I created the table from Avro by this query:
CREATE EXTERNAL TABLE city (
  ID int,
  Name string,
  District string,
  Population int
)
PARTITIONED BY (CountryCode string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
  "fields": [
    { "name": "ID", "type": "int" },
    { "name": "Name", "type": "string" },
    { "name": "countrycode", "type": "string" },
    { "name": "District", "type": "string" },
    { "name": "Population", "type": "int" }
  ],
  "name": "value",
  "namespace": "world.city",
  "type": "record"
}
')
STORED AS AVRO
LOCATION "s3://mybucket/city"
My partitions look like s3://mybucket/city/countrycode=ABC

This is an old question, and Athena seems to have added a warning message about this, but in case anybody else misses it the first several times they try something similar...
Here is the message Athena gives when you create the table:
Query successful. If your table has partitions, you need to load these partitions to be able to query data. You can either load all partitions or load them individually. If you use the load all partitions (MSCK REPAIR TABLE) command, partitions must be in a format understood by Hive. Learn more.
It seems that the codes you are using to partition don't work with Hive (I was doing something similar, partitioning by a grouping code). So, instead of MSCK REPAIR TABLE, you need to run an ALTER TABLE for each partition (see https://docs.aws.amazon.com/athena/latest/ug/partitions.html):
ALTER TABLE city ADD PARTITION (CountryCode='ABC') location 's3://mybucket/city/ABC/' ;
...and you'll have to run that each time you add a new country code prefix.
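Athena also accepts several partitions in a single ALTER TABLE ADD statement, so something along these lines should work for a batch of codes (the S3 paths below just assume the countrycode=XYZ layout from the question; adjust them to your actual prefixes):
ALTER TABLE city ADD
  PARTITION (CountryCode='AFG') LOCATION 's3://mybucket/city/countrycode=AFG/'
  PARTITION (CountryCode='AGO') LOCATION 's3://mybucket/city/countrycode=AGO/'
  PARTITION (CountryCode='AIA') LOCATION 's3://mybucket/city/countrycode=AIA/';
That at least keeps the manual loading down to one statement per batch of new codes.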

You definitely need a trailing slash in your location:
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
Maybe also try lowercase for the partition column: PARTITIONED BY (countrycode string).
Did you try adding the partitions manually in the Glue Catalog or via a Crawler? Did that work?

Related

Avro schema definition with two primitive types

Can I define an Avro schema field with two primitive types? Like so:
{ "type": "record"
"name": "MyData",
"fields" : [
{"name":"my_number","type":["int","long"]}, <<<<< THIS
{"name":"my_string","type":["null","string"]}
]
}
Note: <<<<< THIS is just to highlight how I want to define it.
Or, since I could get long values, should I just use long?
Yes, this sort of type is known as a union; see https://avro.apache.org/docs/current/spec.html#Unions

Is it advisable to use MapReduce to 'flatten' irregular entities in CouchDB?

In a question on CouchDB I asked previously (Can you implement document joins using CouchDB 2.0 'Mango'?), the answer mentioned creating domain objects instead of storing relational data in Couch.
My use case, however, is not necessarily to store relational data in Couch but to flatten relational data. For example, I have the entity of Invoice that I collect from several suppliers. So I have two different schemas for that entity.
So I might end up with 2 docs in Couch that look like this:
{
"type": "Invoice",
"subType": "supplier B",
"total": 22.5,
"date": "10 Jan 2017",
"customerName": "me"
}
{
"type": "Invoice",
"subType": "supplier A",
"InvoiceTotal": 10.2,
"OrderDate": <some other date format>,
"customerName": "me"
}
I also have a doc like this:
{
  "type": "Customer",
  "name": "me",
  "details": "etc..."
}
My intention then is to 'flatten' the Invoice entities, and then join on the reduce function. So, the map function looks like this:
function(doc) {
  switch (doc.type) {
    case 'Customer':
      emit(doc.customerName, { doc information ..., type: "Customer" });
      break;
    case 'Invoice':
      switch (doc.subType) {
        case 'supplier B':
          emit(doc.customerName, { total: doc.total, date: doc.date, type: "Invoice" });
          break;
        case 'supplier A':
          emit(doc.customerName, { total: doc.InvoiceTotal, date: doc.OrderDate, type: "Invoice" });
          break;
      }
      break;
  }
}
Then I would use the reduce function to compare docs with the same customerName (i.e. a join).
Is this advisable using CouchDB? If not, why?
First of all, apologies for getting back to you late; I thought I'd look at it directly, but I haven't been on SO since we exchanged the other day.
Reduce functions should only be used to reduce scalar values, not to aggregate data. So you wouldn't use them for things such as doing joins or removing duplicates, but you would, for example, use them to compute the number of invoices per customer - you see the idea. The reason is that you can only make weak assumptions about the calls made to your reduce functions (the order in which records are passed, the rereduce parameter, etc.), so you can easily end up with serious performance problems.
But this is by design since the intended usage of reduce functions is to reduce scalar values. An easy way to think about it is to say that no filtering should ever happen in a reduce function, filtering and things such as checking keys should be done in map.
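For instance, counting invoices per customer needs nothing more than the built-in _count reduce (a minimal sketch; the design document and view names are made up):
// map function of a view such as _design/invoices/_view/count_by_customer
function(doc) {
  if (doc.type === 'Invoice') {
    emit(doc.customerName, null);   // one row per invoice, keyed by customer
  }
}
// set the view's reduce to the built-in "_count" and query with ?group=true
// to get one count per customerName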
If you just want to compare docs with the same customer name you do not need a reduce function at all; you can query your view with the following parameters:
startkey=["customerName"]
endkey=["customerName", {}]
Otherwise you may want to create a separate view to filter on customers first, and return their names and then use these names to query your view in a bulk manner using the keys view parameter. Startkey/endkey is good if you only want to filter one customer at a time, and/or need to match complex keys in a partial way.
If what you are after is the numbers, you may want to do:
if (doc.type == "Invoice") {
  emit([doc.customerName, doc.supplierName, doc.date], doc.amount);
}
And then use the _stats built-in reduce function to get statistics on the amount (sum, min, max, etc.).
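Wired into a design document, that could look roughly like this (the design document and view names are made up for illustration):
{
  "_id": "_design/invoices",
  "views": {
    "by_customer_supplier_date": {
      "map": "function(doc) { if (doc.type == 'Invoice') { emit([doc.customerName, doc.supplierName, doc.date], doc.amount); } }",
      "reduce": "_stats"
    }
  }
}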
So to get the amount spent with a supplier, you'd just need to make a reduce query against your view and use the parameter group_level=2 to aggregate by the first 2 elements of the key. You can combine this with startkey and endkey to filter on specific values of this key:
startkey=["name1", "supplierA"]
endkey=["name1", "supplierA", {}]
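As a concrete request against the hypothetical view above, that would look something like this (the key JSON has to be URL-encoded in practice):
GET /mydb/_design/invoices/_view/by_customer_supplier_date?group_level=2&startkey=["name1","supplierA"]&endkey=["name1","supplierA",{}]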
You can then build from this example to do things such as:
if (doc.type == "Invoice") {
  emit(["BY_DATE", doc.customerName, doc.date], doc.amount);
  emit(["BY_SUPPLIER", doc.customerName, doc.supplierName], doc.amount);
  emit(["BY_SUPPLIER_AND_DATE", doc.customerName, doc.supplierName, doc.date], doc.amount);
}
Hope this helps
It is totally OK to "normalize" your different schemas (or subTypes) via a view. You cannot create views based on those normalized schemas, though, and in the long run it might be hard to manage the different schemas.
The better solution might be to normalize the documents before writing them to CouchDB. If you still need the documents in their original schema, you can add a sub-property original where you store the documents in their original form. This would make working with the data much easier:
{
  "type": "Invoice",
  "total": 22.5,
  "date": "2017-01-10T00:00:00.000Z",
  "customerName": "me",
  "original": {
    "supplier": "supplier B",
    "total": 22.5,
    "date": "10 Jan 2017",
    "customerName": "me"
  }
},
{
  "type": "Invoice",
  "total": 10.2,
  "date": "2017-01-12T00:00:00.000Z",
  "customerName": "me",
  "original": {
    "subType": "supplier A",
    "InvoiceTotal": 10.2,
    "OrderDate": <some other date format>,
    "customerName": "me"
  }
}
I'd also convert the date to ISO format because it parses well with new Date(), sorts correctly, and is human-readable. You can easily emit invoices grouped by year, month, day, and so on with that.
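A map function along those lines could emit a [year, month, day] key derived from the ISO date, so the same view can be grouped per year, month, or day via group_level (just a sketch, assuming the normalized document shape above):
function(doc) {
  if (doc.type === 'Invoice') {
    var d = new Date(doc.date);   // ISO 8601 strings parse reliably
    emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()], doc.total);
  }
}
// query with ?group_level=1 (per year), 2 (per month) or 3 (per day)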
Preferably use reduce only with built-in functions, because reduces have to be re-executed on queries, and executing JavaScript over many documents is a complex and time-intensive operation, even if the database has not changed at all. You can find more information about the reduce process in the CouchDB documentation. It makes more sense to preprocess the documents as much as you can before storing them in CouchDB.

The graph section of the Cypher response remains blank

I noticed that for some queries the response populates the "graph" section as follows:
}
],
"graph": {
"nodes": [
{
"id": "68",
"labels": [
"ROOM"
],
"properties": {
"id": 15,
"name": "Sun and Snow",
but for other queries this "graph" section does not come back with nodes/relationships and their associated labels/properties, even though the "data" section returns valid output.
Does it convey anything about the quality of the Cypher query?
It depends on what you return from your query. If you return nodes and relationships, you'll get a graph. If you return scalars such as n.name or r.weight, you don't get a graph.
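For example, using the ROOM label from the snippet above, the first query below fills the graph section while the second only yields row data:
MATCH (r:ROOM) RETURN r        // returns nodes, so the graph section is populated
MATCH (r:ROOM) RETURN r.name   // returns a scalar, so there is no graph to build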
Are you talking about the HTTP requests from the web UI or requests that you are making yourself?
The graph key is controlled via the resultDataContents option when making a request. You can see the documentation for that here:
http://neo4j.com/docs/stable/rest-api-transactional.html#rest-api-return-results-in-graph-format
You can request multiple formats for the result ("row" and "REST" are other examples).
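For example, a request to the transactional endpoint that asks for both formats could look like this (host and Cypher statement are only placeholders; the statement has to return nodes/relationships for the graph section to be filled):
POST http://localhost:7474/db/data/transaction/commit
{
  "statements": [
    {
      "statement": "MATCH (r:ROOM)-[rel]->(n) RETURN r, rel, n",
      "resultDataContents": ["row", "graph"]
    }
  ]
}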

Rails: How to model objects with a mix of both fixed and dynamic data

I am building a site with a database of users. I am using arbor.js to build a graph for each user. The graph is a tree-like structure with edges and nodes that looks something like this (I had an image ready to go but apparently don't have enough reputation yet):
vehicle
/ \
/ \
car truck
/
/
sedan
and is represented by the following JSON:
{
  "nodes": {
    "vehicle": { "color": "black", "label": "vehicle" },
    "car": { "color": "orange", "label": "car" },
    "truck": { "color": "red", "label": "truck" },
    "sedan": { "color": "red", "label": "sedan" }
  },
  "edges": {
    "vehicle": {
      "car": { "weight": 5, "directed": true, "color": "orange" },
      "truck": { "weight": 5, "directed": true, "color": "red" }
    },
    "car": {
      "sedan": { "weight": 2, "directed": true, "color": "orange" }
    }
  }
}
Each graph will always have a nodes object and an edges object, with dynamic nodes and edges. Their respective attributes (color, label, weight, etc.) will be fixed.
I am trying to figure out how best to model this data for each user. I am using Rails with MongoDB (Mongoid), because I understand that MongoDB can save objects as documents in the database. I'm pretty sure each user will have a graph model which I can define, but beyond that I'm not sure how to handle the nodes and edges.
I'm guessing the solution will involve has_many, embeds_many, or possibly serialize, but I'm unclear on how to use these with a mix of fixed and dynamic data.
Also, it would be nice to retrieve the data exactly the way it looks above so I can easily create the graph when loading it from disk.
Any help would be appreciated!
If all you need is to perform graph operations per user, you can follow this model:
{
  "nodes": [
    { "type": "vehicle", "color": "black", "label": "vehicle" },
    { "type": "car", "color": "orange", "label": "car" },
    { "type": "truck", "color": "red", "label": "truck" },
    { "type": "sedan", "color": "red", "label": "sedan" }
  ],
  "edges": {
    "vehicle": [
      { "type": "car", "weight": 5, "color": "orange" },
      { "type": "truck", "weight": 5, "color": "red" }
    ],
    "car": [
      { "type": "sedan", "weight": 2, "color": "orange" }
    ],
    "sedan": [],
    "truck": []
  }
}
It is like storing a multimap for the edges. It also makes it self-evident whether an edge is bidirectional or not. For processing an individual user's graph independently, it is a pretty natural model to go with.
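To illustrate why the multimap shape is convenient for per-user graph operations, looking up a node's outgoing neighbours is a direct access into edges (a plain JavaScript sketch, independent of any driver or ODM):
// graph is a document shaped like the example above
function neighbors(graph, nodeType) {
  return (graph.edges[nodeType] || []).map(function(edge) { return edge.type; });
}
// neighbors(graph, "vehicle") -> ["car", "truck"]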
Tell me if it meets your requirements. Also, until you specify what kind of queries you want to perform over your collection, it's not possible to suggest a definitive model.
And if you are just starting your project, you could also explore graph databases such as Neo4j.

Fusion Table API bug, not able to handle WHERE clauses with equality on numeric fields?

I'm getting strange results from the FusionTable API. Specifically, it seems unable to handle a simple select statement with equality constraints on numeric values. Any query I try of the following form:
SELECT COUNT() FROM 1Nynh5pPrj1q8JqbalppAm-qzAsgKvL0ZRala7VI WHERE AGE=41
yields zero records:
{
  "kind": "fusiontables#sqlresponse",
  "columns": ["count()"],
  "rows": [["0"]]
}
By contrast, a range constraint works fine:
SELECT COUNT() FROM 1Nynh5pPrj1q8JqbalppAm-qzAsgKvL0ZRala7VI WHERE AGE>40.99 AND AGE<41.01
{
  "kind": "fusiontables#sqlresponse",
  "columns": ["count()"],
  "rows": [["362"]]
}
Maybe the numbers underneath aren't integers? SELECT AGE FROM 1Nynh5pPrj1q8JqbalppAm-qzAsgKvL0ZRala7VI WHERE AGE>40.99 AND AGE<41.01 returns
{
  "kind": "fusiontables#sqlresponse",
  "columns": ["AGE"],
  "rows": [
    ["41"],
    ["41"],
    ["41"],
    ...359 more...
  ]
}
Now, maybe there's some floating point representation error going on? I thought that small integers can be represented exactly as floats (even if some decimal fractions, e.g. 0.1, are repeating decimals in binary).
It seems unlikely that a bug in Fusion Table SQL would get by without being discovered by others, so perhaps there's something unique about how this particular Fusion Table is loaded?
UPDATE:
While the query appears to fail using the new Fusion Table API above, it succeeds using the old Fusion Table SQL API (recently deprecated):
www.google.com/fusiontables/api/query?sql=SELECT%20COUNT()%20FROM%204579147%20WHERE%20AGE%20LIKE%2041
which returns this JSON response:
count()
362
Also, the new FusionTable API appears confused by numeric values:
SELECT COUNT() FROM 4579147 WHERE AGE = 41 yields 0 (incorrect)
SELECT COUNT() FROM 4579147 WHERE AGE = "41" yields 0 (incorrect)
SELECT COUNT() FROM 4579147 WHERE AGE MATCHES 41 yields 362
SELECT COUNT() FROM 4579147 WHERE AGE LIKE 41 yields 362
SELECT COUNT() FROM 4579147 WHERE AGE LIKE "41" yields 362
SELECT COUNT() FROM 4579147 WHERE AGE LIKE "%41%" yields 362
This is a recently introduced bug that will be fixed shortly. As described, it only affects numeric equality queries with aggregation. Sorry for the inconvenience!
There is nothing wrong with AGE = 41 in that table:
https://www.google.com/fusiontables/DataSource?snapid=S580613IY6U
Something about the COUNT() is causing the query to fail.
