How to design HBase schema for Twitter data? - twitter

I have the following Twitter data and I want to design a schema for it. The queries I would need to run are:
get tweet volume for a time interval, tweets with the corresponding user info, tweets with the corresponding topic info, etc. Based on the data below, can anyone tell me whether this schema design is correct? (Make the row key id+timestamp, put the user fields in one column family, and group the others into a primary column family.) Any suggestions?
{
  "created_at":"Tue Feb 19 11:16:34 +0000 2013",
  "id":303825398179979265,
  "id_str":"303825398179979265",
  "text":"Unleashing Innovation Conference Kicks Off - Wall Street Journal (India) http:\/\/t.co\/3bkXJBz1",
  "source":"\u003ca href=\"http:\/\/dlvr.it\" rel=\"nofollow\"\u003edlvr.it\u003c\/a\u003e",
  "truncated":false,
  "in_reply_to_status_id":null,
  "in_reply_to_status_id_str":null,
  "in_reply_to_user_id":null,
  "in_reply_to_user_id_str":null,
  "in_reply_to_screen_name":null,
  "user":{
    "id":948385189,
    "id_str":"948385189",
    "name":"Innovation Plaza",
    "screen_name":"InnovationPlaza",
    "location":"",
    "url":"http:\/\/tinyurl.com\/ee4jiralp",
    "description":"All the latest breaking news about Innovation",
    "protected":false,
    "followers_count":136,
    "friends_count":1489,
    "listed_count":1,
    "created_at":"Wed Nov 14 19:49:18 +0000 2012",
    "favourites_count":0,
    "utc_offset":28800,
    "time_zone":"Beijing",
    "geo_enabled":false,
    "verified":false,
    "statuses_count":149,
    "lang":"en",
    "contributors_enabled":false,
    "is_translator":false,
    "profile_background_color":"131516",
    "profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/781710342\/17a75bf22d9fdad38eebc1c0cd441527.jpeg",
    "profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/781710342\/17a75bf22d9fdad38eebc1c0cd441527.jpeg",
    "profile_background_tile":true,
    "profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3205718892\/8126617ac6b7a0e80fe219327c573852_normal.jpeg",
    "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3205718892\/8126617ac6b7a0e80fe219327c573852_normal.jpeg",
    "profile_link_color":"009999",
    "profile_sidebar_border_color":"FFFFFF",
    "profile_sidebar_fill_color":"EFEFEF",
    "profile_text_color":"333333",
    "profile_use_background_image":true,
    "default_profile":false,
    "default_profile_image":false,
    "following":null,
    "follow_request_sent":null,
    "notifications":null
  },
  "geo":null,
  "coordinates":null,
  "place":null,
  "contributors":null,
  "retweet_count":0,
  "entities":{
    "hashtags":[],
    "urls":[
      {
        "url":"http:\/\/t.co\/3bkXJBz1",
        "expanded_url":"http:\/\/dlvr.it\/2yyG5C",
        "display_url":"dlvr.it\/2yyG5C",
        "indices":[
          73,
          93
        ]
      }
    ],
    "user_mentions":[]
  },
  "favorited":false,
  "retweeted":false,
  "possibly_sensitive":false
}

If you are 100% sure that the ID is unique, you could use it as your row key to store the bulk of the data:
303825398179979265 -> data_CF
Your column family data_CF would be defined along these lines:
"created_at":"Tue Feb 19 11:16:34 +0000 2013"
"id_str":"303825398179979265"
...
"user_id":948385189 { take note here I'm denormalizing your dictionary }
"user_name":"Innovation Plaza"
It gets a little trickier for the lists. The solution is to use a column qualifier that is made unique, prefixed by the category:
"entities_hashtags_":"\x00" { Here \x00 is a dummy value }
For the URLs, if the ordering is not important, you may prefix them with a UUID. That will guarantee uniqueness.
The advantage of this approach is that if you need to update a field in this data, the update is done atomically, since HBase guarantees row atomicity.
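As a rough illustration of that layout, here is a minimal sketch using happybase, a common Python client for HBase's Thrift gateway (the connection details and table name are assumptions; data_CF is the column family described above):

import happybase  # assumes an HBase Thrift server is reachable on localhost
import uuid

connection = happybase.Connection('localhost')
tweets = connection.table('tweets')  # hypothetical table with column family data_CF

# One tweet = one row keyed by the (assumed unique) tweet id; scalar fields and
# denormalized user fields all live in data_CF, so the write is a single atomic put.
tweets.put(b'303825398179979265', {
    b'data_CF:created_at': b'Tue Feb 19 11:16:34 +0000 2013',
    b'data_CF:text': b'Unleashing Innovation Conference Kicks Off ...',
    b'data_CF:user_id': b'948385189',
    b'data_CF:user_name': b'Innovation Plaza',
    # list entries get a category-prefixed qualifier made unique with a UUID
    b'data_CF:entities_urls_' + uuid.uuid4().hex.encode(): b'http://t.co/3bkXJBz1',
})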
For the second question, if you need instantaneous aggregated information, you will have to precompute it and store it, as you said, in another table. If you want this data to be generated through M/R, you may use timestamp + row id as the key if it is to be time based; by topic it would be something like topic + row id. This allows you to write prefix scans with start/stop rows that include only the time range or the topic you are interested in.
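For example, a hedged sketch of such a time-range scan with happybase (the index table name and the timestamp format in the row key are assumptions):

import happybase

connection = happybase.Connection('localhost')
index = connection.table('tweets_by_time')  # hypothetical index table keyed by <timestamp><tweet id>

# A start/stop row scan touches only the rows whose keys fall inside the requested window.
start, stop = b'20130219000000', b'20130220000000'
for row_key, columns in index.scan(row_start=start, row_stop=stop):
    pass  # each row points back to a tweet id in the main table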
Have fun!

Related

How to do grouping in elasticsearch with searchkick rails

I have article data indexed in Elasticsearch as follows.
{
  "id": 1011,
  "title": "abcd",
  "author": "author1",
  "status": "published"
}
Now I want to get all the article ids grouped by status.
The result should look something like this:
{
  "published": [1011, 1012, ....],
  "draft": [2011],
  "deleted": [3011]
}
NB: I tried normal aggs (Article.search('*', aggs: [:status], load: false).aggs), but it just gives me the count of items in each bucket; I want the ids in each bucket instead.
#Crazy Cat
You can transform your query in this way:
Sort your response from ES on the field "status" (ascending or descending).
Ask ES to return only the ID and status fields.
The effect of the sorting is that your response comes back grouped, like this: [the 1st N results with "deleted" status, then results N+1 to M with "draft", and then results M+1 to K with "published"].
Now the advantages of this technique:
You get the ids of every document, flagged by status, which you can process in your application.
Your query is lightweight compared to an aggs query.
You also get the metadata of every document, like its docId.
Now the disadvantages:
You always have to give a high upper bound for your page size, but you can play around with the count that comes back in the metadata.
It may use a bit more network bandwidth, since the redundant status is returned with every document.
I hope this redesign of your query is helpful to you.
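For reference, here is a rough sketch of the same idea in Python with the elasticsearch client (the index name, the 10000 page-size bound, and the status.keyword mapping are assumptions; the same body should be translatable to searchkick's body option):

from collections import defaultdict
from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.search(index="articles", body={        # "articles" is a hypothetical index name
    "size": 10000,                               # high upper bound on the page size
    "query": {"match_all": {}},
    "sort": [{"status.keyword": "asc"}],         # keeps documents with the same status together
    "_source": ["id", "status"],                 # return only the fields we need
})

grouped = defaultdict(list)
for hit in resp["hits"]["hits"]:
    grouped[hit["_source"]["status"]].append(hit["_source"]["id"])
# grouped -> {"published": [1011, ...], "draft": [...], "deleted": [...]}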

Rails - Create order and order-rows with REST api

I just have a question regarding how to implement some logic.
I'm building an API that allows the client to create orders.
This is solved by OrderController#create, so no problem!
Now, the issue is that an order can have many order rows. All the relations are set up correctly, but where should I create the order rows for the order?
Should the OrderController handle this, or should I have a separate controller that creates the order rows for the particular order?
The client's POST sends the following JSON data:
{
  "status": "paid",
  "total_sum": 20,
  "payment": "card",
  "order_rows": [
    {
      "id": 12
    },
    {
      "id": 13
    }
  ]
}
I ran into something similar with a project I'm working on now. The best (and long term simplest) solution was definitely to make a whole new model/controller.
Order
  status (should be an int or enum probably)
  total (should loop through all order rows and total)
  payment (should be an int or enum probably)
  has_many order_rows

OrderRow
  belongs_to Order
  item_sku
  item_name
  item_descr
  item_cost
  etc. etc.
This allows you to easily search not just for items, but for orders that include items with a given name or SKU, or orders that include items with a given description.
Your totals are dynamic.
You can retrieve total order numbers or movement numbers on a per item basis.
It is so much easier to create and update orders.
The benefits go on.
It can easily be scoped;
scope :this_orders_rows, -> (order_id) {where(order_id: order_id)}
And it saves you from having to parse through hashes and arrays every time.
To get technical about it, your order_controller should control ONLY your orders. If you start adding in a heap of other code to read through the arrays, it's going to get VERY cluttered. It's always better to move that to some other area.

Can I pull a list of info out of an email?

I get a daily email that lists upcoming appointments and their lengths. The number of appointments varies from day to day.
The emails go like this:
================
Today's Schedule
9:30 AM
3h
Brazilian Blowout
[Client #1 name]
12:30 PM
1h
Women's Cut
[Client 2 name]
6:00 PM
45m
Men's Cut
[Client #3 name]
Projected Revenue
===================
I want to create an event in a Google Calendar for each appointment, and it seems like zapier MIGHT be able to do this, but all the help resources I can find are very general in nature.
Is this do-able on Zapier? If so, any nudges in the right direction would be awesome.
Any thoughts greatly appreciated.
I had some time to kill and enjoy the odd challenge. So I have put together a solution that should do what you are looking for. I will break it down by steps.
TEMPLATE
Zapier Trigger - Step 1
Type: Trigger
Module: Gmail
Criteria: User Dependent
Comments: For the trigger zap you will want to use a Gmail-specific trigger, something to the effect of "execute trigger on emails titled 'xyz'", or "emails labeled 'xyz'" if you set up a filter in your inbox.
Input screenshot:
Output Screenshot:
Zapier Action - Step 2
Type: Action
Module: Code (Python 3)
Comments: The Code module offered by Zapier executes whatever (properly written) code you place in its container. It is especially handy as it allows you to incorporate data from previous steps through a dictionary variable titled 'input_data'. Zapier offers the Code module in two languages: JavaScript and Python. As I am most familiar with Python, my solution for this step was written in Python; I will append the code to the end of this answer. Using the data held in the body of the email (retrieved in step 1), we can apply some string manipulations and datetime conversions to break the email apart into its component parts and pass those on to the following Action Step: Create Calendar Event.
Input Screenshot:
Output Screenshot:
Zapier Action - Step 3
Type: Action
Module: Google Calendar - Create Event
Comments: Using the data output from the previous code step, we can fill out the required fields for creating a new appointment.
Input Screenshot:
Output Screenshot:
PYTHON CODE
from datetime import timedelta, date, datetime

'''
Goal: Extract individual appointment details from variable length email
Steps:
    Remove all extraneous and new line characters.
    Isolate each individual appointment and group its relevant details.
    Derive appointment start and end times using appointment time and duration.
    Return all appointments in a list.
'''

def format_appt_times(appt_dict):
    appt_start_str = appt_dict.get("appt_start")
    appt_dur_str = appt_dict.get("appt_length")

    # isolate hour and minutes from appointment time
    appt_s_hour = int(appt_start_str[:appt_start_str.find(":")])
    if ("pm" in appt_start_str.lower()):
        appt_s_hour = 12 if appt_s_hour + 12 >= 24 else appt_s_hour + 12
    appt_s_min = int(appt_start_str[appt_start_str.find(":") + 1 :
                                    appt_start_str.find(":") + 3])

    # isolate hour and minutes from duration time
    appt_d_hour = 0
    appt_d_min = 0
    if ("h" in appt_dur_str):
        appt_d_hour = int(appt_dur_str[:appt_dur_str.find("h")])
    if ("m" in appt_dur_str):
        appt_d_min = int(appt_dur_str[appt_dur_str.find("m") - 2 : appt_dur_str.find("m")])

    # NOTE: adjust timedelta hours depending on your relation to UTC
    # create datetime objects for appointment start and end times
    time_zone = timedelta(hours=0)
    tdy = date.today() - time_zone
    duration = timedelta(hours=appt_d_hour, minutes=appt_d_min)
    appt_start_dto = datetime(year=tdy.year,
                              month=tdy.month,
                              day=tdy.day,
                              hour=appt_s_hour,
                              minute=appt_s_min)
    appt_end_dto = appt_start_dto + duration

    # return properly formatted datetime as string for use in next step.
    return (appt_start_dto.strftime("%Y-%m-%dT%H:%M"),
            appt_end_dto.strftime("%Y-%m-%dT%H:%M"))

def partition_list(target, part_size):
    for data in range(0, len(target), part_size):
        yield target[data : data + part_size]

def main():
    # Remove all extraneous and new line characters.
    email_body = input_data.get("email_body")
    head, delin, *email_body, delin, foot = [text for text in email_body.splitlines() if text != ""]
    appointment_list = []

    # Isolate each individual appointment and group its relevant details.
    for text in partition_list(email_body, 4):
        template = {
            "appt_start" : text[0],
            "appt_end" : None,
            "appt_length" : text[1],
            "appt_title" : text[2],
            "appt_client" : text[3]
        }
        appointment_list.append(template)

    for appt in appointment_list:
        appt["appt_start"], appt["appt_end"] = format_appt_times(appt)
    return appointment_list

return main()
I am not sure of your familiarity with Python, or programming more generally, but the comments in the code explain what each section is doing. If you have any specific questions regarding aspects of the code let me know. Assuming your email template does not change this setup should work exactly as needed. Let me know if anything is unclear.
UPDATE
I thought it best to address your question in the original answer, in case anyone else has similar questions.
explaining how this code is removing the extra characters:
There is actually a fair bit going on in the first line, so I will do my best to break it down, and provide resources where necessary.
The code in question:
head,delin,*email_body,delin,foot = [text for text in email_body.splitlines() if text != ""]
The first step here was to break the text into manageable chunks. I did so with email_body.splitlines(), which breaks a string into a list at each newline character (if you need a different delimiter, str.split lets you specify one).
If we were to inspect the list at this moment its contents would be something of the following:
["================", "", "Today's Schedule", "", "9:30 AM", "", "3h", ..., "[Client #3 name]", "", "Projected Revenue", "", "==================="]
You will notice there is a fair amount of information in there that we really don't want.
First let's look at the "" elements. These are left over as a result of the blank lines between each line of text, which, even though they are blank, still have newline characters at the end of them. There are a number of ways you could address this within Python. We could simply write a for-loop to go through and copy all elements that are not "" to a new list.
To me this felt like additional work, and besides, Python offers list comprehension for just such a scenario. I won't go too deep into list comprehension as there is a lot that can be said about it, and in more insightful ways than I could muster, but it essentially allows you to provide logic against a set of 'data' to form a list. In this case, I specifically wanted to filter out the "" elements returned from the call to splitlines().
And so you will see I address this with the following line
[text for text in email_body.splitlines() if text != ""]
With that we have the list as above, less the "" elements. Now we must turn our attention towards the more 'dynamic' garbage strings. Again there are a number of ways to do this. A not particularly flexible option would be to simply store the strings we want to remove in variables, something to the effect of:
garb_1 = "==================="
garb_2 = "Projected Revenue"
garb_3 = ...
and once again filter the list with yet another for-loop. I instead chose to leverage Python's list unpacking idiom, which allows us to 'unpack' list objects (and tuples) into variables. As an example:
one, two, three = ["a", "b", "c"]
I'm sure you can guess what is happening above: as long as we provide the same number of variables as there are elements in the list, we can 'unpack' it in this fashion. But wait! In our case we don't know how long the list is going to be, as it is entirely dependent on the number of appointments you have on any given day. This is where star unpacking comes in. Using my code as the example:
head,delin,*email_body,delin,foot = [text for text in email_body.splitlines() if text != ""]
The *, in plain English, says "I don't know how many elements to expect, so just give me all of them in a list". Since we know there will always be two lines of garbage at the beginning and end of the email, we can assign them to throwaway variables and capture everything in between with the variable-length *email_body container.
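A minimal illustration of star unpacking on its own:

first, *middle, last = [1, 2, 3, 4, 5]
# first == 1, middle == [2, 3, 4], last == 5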
With all of this complete, we now have a list with only the data we are looking to capture. If, as you say, there are additional lines of garbage before or after the email_body, you can simply add additional throwaway variables to account for them.
Once again feel free to ask any follow up questions.
Michael
Resources
List Comprehension
Star Unpacking

Is it advisable to use MapReduce to 'flatten' irregular entities in CouchDB?

In a question on CouchDB I asked previously (Can you implement document joins using CouchDB 2.0 'Mango'?), the answer mentioned creating domain objects instead of storing relational data in Couch.
My use case, however, is not necessarily to store relational data in Couch but to flatten relational data. For example, I have the entity of Invoice that I collect from several suppliers. So I have two different schemas for that entity.
So I might end up with 2 docs in Couch that look like this:
{
  "type": "Invoice",
  "subType": "supplier B",
  "total": 22.5,
  "date": "10 Jan 2017",
  "customerName": "me"
}
{
  "type": "Invoice",
  "subType": "supplier A",
  "InvoiceTotal": 10.2,
  "OrderDate": <some other date format>,
  "customerName": "me"
}
I also have a doc like this:
{
  "type": "Customer",
  "name": "me",
  "details": "etc..."
}
My intention then is to 'flatten' the Invoice entities, and then join on the reduce function. So, the map function looks like this:
function(doc) {
  switch(doc.type) {
    case 'Customer':
      emit(doc.customerName, { doc information ..., type: "Customer" });
      break;
    case 'Invoice':
      switch (doc.subType) {
        case 'supplier B':
          emit(doc.customerName, { total: doc.total, date: doc.date, type: "Invoice" });
          break;
        case 'supplier A':
          emit(doc.customerName, { total: doc.InvoiceTotal, date: doc.OrderDate, type: "Invoice" });
          break;
      }
      break;
  }
}
Then I would use the reduce function to compare docs with the same customerName (i.e. a join).
Is this advisable using CouchDB? If not, why?
First of all, apologies for getting back to you late; I thought I'd look at it directly, but I haven't been on SO since we exchanged the other day.
Reduce functions should only be used to reduce scalar values, not to aggregate data. So you wouldn't use them to achieve things such as doing joins or removing duplicates, but you would, for example, use them to compute the number of invoices per customer - you see the idea. The reason is that you can only make weak assumptions with regard to the calls made to your reduce functions (the order in which records are passed, the rereduce parameter, etc...), so you can easily end up with serious performance problems.
But this is by design since the intended usage of reduce functions is to reduce scalar values. An easy way to think about it is to say that no filtering should ever happen in a reduce function, filtering and things such as checking keys should be done in map.
If you just want to compare docs with the same customer name, you do not need a reduce function at all; you can query your view with the following parameters:
startkey=["customerName"]
endkey=["customerName", {}]
Otherwise you may want to create a separate view to filter on customers first and return their names, and then use these names to query your view in bulk using the keys view parameter. Startkey/endkey is good if you only want to filter one customer at a time, and/or need to match complex keys in a partial way.
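A minimal sketch of both query styles over HTTP with Python's requests (the database, design document, and view names are hypothetical, and the view is assumed to emit array keys starting with the customer name, as in the start/end keys above):

import json
import requests

VIEW = "http://localhost:5984/mydb/_design/billing/_view/by_customer"  # hypothetical names

# One customer at a time: partial match on a compound key via start/end key.
rows = requests.get(VIEW, params={
    "startkey": json.dumps(["customerName"]),
    "endkey": json.dumps(["customerName", {}]),
}).json()["rows"]

# Several customers in one round trip: exact keys via the keys parameter.
rows = requests.post(VIEW, json={"keys": [["customer1"], ["customer2"]]}).json()["rows"]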
If what you are after are the numbers, you may want to do:
if (doc.type == "Invoice") {
  emit([doc.customerName, doc.supplierName, doc.date], doc.amount);
}
And then use the _stats built-in reduce function to get statistics on the amount (sum, min, max, etc.).
So to get the amount spent with a supplier, you just need to make a reduce query to your view and use the parameter group_level=2 to aggregate by the first 2 elements of the key. You can combine this with startkey and endkey to filter specific values of this key:
startkey=["name1", "supplierA"]
endkey=["name1", "supplierA", {}]
You can then build on the map function above to do things such as:
if (doc.type == "Invoice") {
  emit(["BY_DATE", doc.customerName, doc.date], doc.amount);
  emit(["BY_SUPPLIER", doc.customerName, doc.supplierName], doc.amount);
  emit(["BY_SUPPLIER_AND_DATE", doc.customerName, doc.supplierName, doc.date], doc.amount);
}
Hope this helps
It is totally ok to "normalize" your different schemas (or subTypes) via a view. You cannot create views based on those normalized schemas, though, and in the long run it might be hard to manage the different schemas.
The better solution might be to normalize the documents before writing them to CouchDB. If you still need the documents in their original schema, you can add a sub-property original where you store your documents in their original form. This would make working with the data much easier:
{
  "type": "Invoice",
  "total": 22.5,
  "date": "2017-01-10T00:00:00.000Z",
  "customerName": "me",
  "original": {
    "supplier": "supplier B",
    "total": 22.5,
    "date": "10 Jan 2017",
    "customerName": "me"
  }
},
{
  "type": "Invoice",
  "total": 10.2,
  "date": "2017-01-12T00:00:00.000Z",
  "customerName": "me",
  "original": {
    "subType": "supplier A",
    "InvoiceTotal": 10.2,
    "OrderDate": <some other date format>,
    "customerName": "me"
  }
}
I'd also convert the date to ISO format, because it parses well with new Date(), sorts correctly, and is human-readable. You can easily emit invoices grouped by year, month, day, and so on with that.
Use reduce preferably only with the built-in functions, because reduces have to be re-executed on queries, and executing JavaScript on many documents is a complex and time-intensive operation, even if the database has not changed at all. You can find more information about the reduce process in the CouchDB documentation. It makes more sense to preprocess the documents as much as you can before storing them in CouchDB.
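As a rough sketch of that preprocessing step for supplier B's format shown above (the function name is hypothetical, and supplier A would need its own date parsing):

from datetime import datetime

def normalize_supplier_b(raw):
    # Map supplier B's schema onto the common Invoice shape, keeping the raw doc around.
    return {
        "type": "Invoice",
        "total": raw["total"],
        "date": datetime.strptime(raw["date"], "%d %b %Y").isoformat() + ".000Z",  # "10 Jan 2017" -> ISO
        "customerName": raw["customerName"],
        "original": raw,
    }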

Merged Google Fusion Tables - Select Query with WHERE clause on Merge Key Error

I have merged two Fusion Tables together on the key "PID". Now I would like to do a SELECT query WHERE PID = 'value'. The error comes back that no column with the name PID exists in the table. A query for another column gives this result:
"kind": "fusiontables#sqlresponse",
"columns": [
"\ufeffPID",
"Address",
"City",
"Zoning"
],
"rows": [
[
"001-374-079",
"# LOT 15 MYSTERY BEACH RD",
"No_City_Value",
"R-1"
],
It appears that the column name has been changed from "PID" to "\ufeffPID", and no matter how I try to get the syntax of the GET URL right, I keep getting an error.
Is there any limitation on querying by the key of a merged table? Since I cannot seem to get the column name right, a workaround would be to use the column ID, but that does not seem to be an option either. Here is the URL:
https://www.googleapis.com/fusiontables/v1/query?sql=SELECT 'PID','Address','City','Zoning' FROM 1JanYNl3T45kFFxqAmGS0BRgkopj4AS207qnLVQI WHERE '\ufeffPID' = 001-493-078&key=myKey
Cheers
I have no explanation for the \ufeff in there; that's the Unicode character ZERO WIDTH NO-BREAK SPACE (the byte order mark), so it's conceivable that it's actually part of the column name, because it would be invisible in the UI. So, first off, I would recommend changing the name in the base tables and seeing if that works.
Column IDs for merge tables have a different form than for base tables. An easy way to get them is to add the filters of interest to one of your tabs (any type will do) and then do Tools > Publish. The top text ("Send a link in email or IM") has a query URL that has what you need. Run it through a URL decoder such as http://meyerweb.com/eric/tools/dencoder/ and you'll see the column ID for PID is col0>>0.
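If the BOM really is part of the column name, here is a hedged sketch of issuing the query from Python, letting the HTTP library take care of the URL encoding (the table ID and key placeholder are taken from the question; quoting the PID value as a string is an assumption):

import requests

params = {
    "sql": ("SELECT 'PID','Address','City','Zoning' "
            "FROM 1JanYNl3T45kFFxqAmGS0BRgkopj4AS207qnLVQI "
            "WHERE '\ufeffPID' = '001-493-078'"),  # note the BOM prefixed to the column name
    "key": "myKey",
}
resp = requests.get("https://www.googleapis.com/fusiontables/v1/query", params=params)
print(resp.json())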
Rod
