Closed 8 years ago. This question is opinion-based and is not currently accepting answers.
I am getting Log Data from various web applications in the following format:
Session  Timestamp  Event              Parameters
1        1          Started Session
1        2          Logged In          Username:"user1"
2        3          Started Session
1        3          Started Challenge  title:"Challenge 1", level:"2"
2        4          Logged In          Username:"user2"
Now, a person wants to carry out analytics on this log data (and would like to receive it as a JSON blob after appropriate transformations). For example, he may want to receive a JSON blob where the log data is grouped by Session, with TimeFromSessionStart and CountOfEvents added before the data is sent, so that he can carry out meaningful analysis. In that case I should return:
[
{
"session":1,"CountOfEvents":3,"Actions":[{"TimeFromSessionStart":0,"Event":"Session Started"}, {"TimeFromSessionStart":1, "Event":"Logged In", "Username":"user1"}, {"TimeFromSessionStart":2, "Event":"Startd Challenge", "title":"Challenge 1", "level":"2" }]
},
{
"session":2, "CountOfEvents":2,"Actions":[{"TimeFromSessionStart":0,"Event":"Session Started"}, {"TimeFromSessionStart":2, "Event":"Logged In", "Username":"user2"}]
}
]
Here, TimeFromSessionStart, CountOfEvents, etc. (let's call them synthetic additional data) will not be hard-coded; I will make a web interface that allows the person to decide what kind of synthetic data he requires in the JSON blob. I would like to give the person a good amount of flexibility in deciding what synthetic data he wants.
I expect the database to store around 1 million rows and to carry out the transformations in a reasonable amount of time.
My question is about the choice of database. What are the relative advantages and disadvantages of using an SQL database such as PostgreSQL versus a NoSQL database such as MongoDB? From what I have read so far, I think NoSQL may not provide enough flexibility for adding the synthetic data. On the other hand, I may face flexibility issues in data representation if I use an SQL database.
I think the storage requirements for MongoDB and PostgreSQL will be comparable, since I will (probably!) have to build similar indices in both cases to speed up querying.
If I use PostgreSQL, I can store the data in the following manner: Session and Event as strings, Timestamp as a date, and Parameters as hstore (key-value pairs, available in PostgreSQL). After that, I can use SQL queries to compute the synthetic (or additional) data, store it temporarily in variables in a Rails application (which will interact with the PostgreSQL database and act as the interface for the person who wants the JSON blob), and create the JSON blob from it.
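For illustration, here is a minimal sketch of that approach; the log_entries table, the LogEntry model, and the column names are assumptions, and merging the hstore parameters into each action is omitted for brevity:

class LogEntry < ActiveRecord::Base
  # maps to: log_entries(session integer, "timestamp" integer, event text, parameters hstore)
end

# Let PostgreSQL compute the per-session start time with a window function,
# then assemble the JSON blob in the Rails app.
rows = LogEntry.connection.select_all(<<-SQL)
  SELECT session, "timestamp", event, parameters,
         "timestamp" - MIN("timestamp") OVER (PARTITION BY session) AS time_from_session_start
  FROM log_entries
  ORDER BY session, "timestamp"
SQL

blob = rows.group_by { |r| r['session'] }.map do |session, entries|
  {
    'session'       => session.to_i,
    'CountOfEvents' => entries.size,
    'Actions'       => entries.map do |e|
      { 'TimeFromSessionStart' => e['time_from_session_start'].to_i, 'Event' => e['event'] }
    end
  }
end.to_json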
Another possible approach is to use MongoDB for storing the log data and Mongoid as the interface to the Rails application, if I can get enough flexibility for adding synthetic data for analytics and some performance/storage improvements over PostgreSQL. But in this case I am not clear on the best way to store log data in MongoDB. Also, I have read that MongoDB will be somewhat slower than PostgreSQL and is mainly meant to be run in the background.
Edit:
From what I have read in the past few days, Apache Hadoop seems to be a good choice as well because of its greater speed over MongoDB (being multi-threaded).
Edit:
I am not asking for opinions and would like to know the specific advantages or disadvantages of using a particular approach. Therefore, I don't think that the question is opinion-based.
You should check out Logstash / Kibana from Elasticsearch. The primary use case for that stack is collecting, storing, and analyzing log data.
http://www.elasticsearch.org/overview/logstash/
http://www.elasticsearch.org/videos/kibana-logstash/
Mongo is a good choice too if you are looking at building it all yourself, but I think you may find that the Elasticsearch products could very well meet your needs and allow the integration you need.
MongoDB is well suited to your task, and its document storage is more flexible than a rigid SQL table structure.
Below is a working test using Mongoid that demonstrates parsing of your log data input, easy storage as documents in a MongoDB collection, and analytics using MongoDB's aggregation framework.
I've chosen to put the parameters in a sub-document. This matches your sample input table more closely and simplifies the pipeline. The resulting JSON is slightly modified, but all of the specified calculations, data, and grouping are present.
I've also added a test for an index on the parameter Username, which demonstrates an index on a sub-document field. This is adequate for specific fields that you want to index, but a completely general index can't be built on keys; you would have to restructure the keys into values (see the sketch after the test output below).
I hope that this helps and that you like it.
test/unit/log_data_test.rb
require 'test_helper'
require 'json'
require 'pp'

class LogDataTest < ActiveSupport::TestCase
  def setup
    LogData.delete_all
    # Aggregation pipeline: group events by session, then project
    # TimeFromSessionStart as (event timestamp - first timestamp in the session).
    @log_data_analysis_pipeline = [
      {'$group' => {
        '_id' => '$session',
        'session' => {'$first' => '$session'},
        'CountOfEvents' => {'$sum' => 1},
        'timestamp0' => {'$first' => '$timestamp'},
        'Actions' => {
          '$push' => {
            'timestamp' => '$timestamp',
            'event' => '$event',
            'parameters' => '$parameters'}}}},
      {'$project' => {
        '_id' => 0,
        'session' => '$session',
        'CountOfEvents' => '$CountOfEvents',
        'Actions' => {
          '$map' => {'input' => '$Actions', 'as' => 'action',
            'in' => {
              'TimeFromSessionStart' => {
                '$subtract' => ['$$action.timestamp', '$timestamp0']},
              'event' => '$$action.event',
              'parameters' => '$$action.parameters'}}}}}
    ]
    @key_names = %w(session timestamp event parameters)
    # Sample log lines; fields are separated by two or more spaces
    # so that multi-word events like "Started Session" stay intact.
    @log_data = <<-EOT.gsub(/^\s+/, '').split(/\n/)
      1  1  Started Session
      1  2  Logged In  Username:"user1"
      2  3  Started Session
      1  3  Started Challenge  title:"Challenge 1", level:"2"
      2  4  Logged In  Username:"user2"
    EOT
    docs = @log_data.collect { |line| line_to_doc(line) }
    LogData.create(docs)
    assert_equal(docs.size, LogData.count)
    puts
  end

  # Converts one raw log line into a Hash suitable for LogData.create.
  def line_to_doc(line)
    doc = Hash[*@key_names.zip(line.split(/  +/)).flatten]
    doc['session'] = doc['session'].to_i
    doc['timestamp'] = doc['timestamp'].to_i
    # eval is used here for brevity to turn key:"value" pairs into a Hash.
    doc['parameters'] = eval("{#{doc['parameters']}}") if doc['parameters']
    doc
  end

  test "versions" do
    puts "Mongoid version: #{Mongoid::VERSION}\nMoped version: #{Moped::VERSION}"
    puts "MongoDB version: #{LogData.collection.database.command({:buildinfo => 1})['version']}"
  end

  test "log data analytics" do
    pp LogData.all.to_a
    result = LogData.collection.aggregate(@log_data_analysis_pipeline)
    # Expected shape of the aggregated output, kept here for reference.
    json = <<-EOT
      [
        {
          "session":1,"CountOfEvents":3,"Actions":[{"TimeFromSessionStart":0,"Event":"Session Started"}, {"TimeFromSessionStart":1, "Event":"Logged In", "Username":"user1"}, {"TimeFromSessionStart":2, "Event":"Started Challenge", "title":"Challenge 1", "level":"2" }]
        },
        {
          "session":2, "CountOfEvents":2,"Actions":[{"TimeFromSessionStart":0,"Event":"Session Started"}, {"TimeFromSessionStart":2, "Event":"Logged In", "Username":"user2"}]
        }
      ]
    EOT
    puts JSON.pretty_generate(result)
  end

  test "explain" do
    LogData.collection.indexes.create('parameters.Username' => 1)
    pp LogData.collection.find({'parameters.Username' => 'user2'}).to_a
    pp LogData.collection.find({'parameters.Username' => 'user2'}).explain['cursor']
  end
end
app/models/log_data.rb
class LogData
  include Mongoid::Document
  field :session, type: Integer
  field :timestamp, type: Integer
  field :event, type: String
  field :parameters, type: Hash
end
$ rake test
Run options:
# Running tests:
[1/3] LogDataTest#test_explain
[{"_id"=>"537258257f11ba8f03000005",
"session"=>2,
"timestamp"=>4,
"event"=>"Logged In",
"parameters"=>{"Username"=>"user2"}}]
"BtreeCursor parameters.Username_1"
[2/3] LogDataTest#test_log_data_analytics
[#<LogData _id: 537258257f11ba8f03000006, session: 1, timestamp: 1, event: "Started Session", parameters: nil>,
#<LogData _id: 537258257f11ba8f03000007, session: 1, timestamp: 2, event: "Logged In", parameters: {"Username"=>"user1"}>,
#<LogData _id: 537258257f11ba8f03000008, session: 2, timestamp: 3, event: "Started Session", parameters: nil>,
#<LogData _id: 537258257f11ba8f03000009, session: 1, timestamp: 3, event: "Started Challenge", parameters: {"title"=>"Challenge 1", "level"=>"2"}>,
#<LogData _id: 537258257f11ba8f0300000a, session: 2, timestamp: 4, event: "Logged In", parameters: {"Username"=>"user2"}>]
[
{
"session": 2,
"CountOfEvents": 2,
"Actions": [
{
"TimeFromSessionStart": 0,
"event": "Started Session",
"parameters": null
},
{
"TimeFromSessionStart": 1,
"event": "Logged In",
"parameters": {
"Username": "user2"
}
}
]
},
{
"session": 1,
"CountOfEvents": 3,
"Actions": [
{
"TimeFromSessionStart": 0,
"event": "Started Session",
"parameters": null
},
{
"TimeFromSessionStart": 1,
"event": "Logged In",
"parameters": {
"Username": "user1"
}
},
{
"TimeFromSessionStart": 2,
"event": "Started Challenge",
"parameters": {
"title": "Challenge 1",
"level": "2"
}
}
]
}
]
[3/3] LogDataTest#test_versions
Mongoid version: 3.1.6
Moped version: 1.5.2
MongoDB version: 2.6.1
Finished tests in 0.083465s, 35.9432 tests/s, 35.9432 assertions/s.
3 tests, 3 assertions, 0 failures, 0 errors, 0 skips
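As noted before the test, a completely general index over arbitrary parameter keys would require restructuring the keys into values. Here is a minimal sketch of that restructuring (the LogDataKV and LogParam model names are made up for illustration; it is not part of the test above):

class LogDataKV
  include Mongoid::Document
  field :session, type: Integer
  field :timestamp, type: Integer
  field :event, type: String
  embeds_many :params, class_name: 'LogParam'
  # one compound index now covers queries on any parameter name/value pair
  index({'params.k' => 1, 'params.v' => 1})
end

class LogParam
  include Mongoid::Document
  embedded_in :log_data_kv
  field :k, type: String  # parameter name, e.g. "Username"
  field :v, type: String  # parameter value, e.g. "user2"
end

# Example query that can use the compound index:
# LogDataKV.collection.find('params' => {'$elemMatch' => {'k' => 'Username', 'v' => 'user2'}})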
MongoDB is an ideal database for this.
Create a collection for your raw log data.
Use one of Mongo's powerful aggregation tools, and output the aggregated data to another collection (or multiple output collections, if you want different buckets or views of the raw data).
You can either do the aggregation offline, with a set of pre-determined possibilities that users can pull from, or do it on demand/ad hoc, if you can tolerate some latency in your response.
http://docs.mongodb.org/manual/aggregation/
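For example, here is a minimal sketch of pre-aggregating per-session summaries into a second collection with an $out stage; the session_summaries collection name is arbitrary, the pipeline reuses the grouping idea from the answer above, and $out requires MongoDB 2.6+:

pipeline = [
  {'$group' => {'_id' => '$session',
                'session' => {'$first' => '$session'},
                'CountOfEvents' => {'$sum' => 1}}},
  {'$out' => 'session_summaries'}  # write results to another collection
]
LogData.collection.aggregate(pipeline)

# The web app can then serve the pre-aggregated documents directly, e.g.:
# LogData.mongo_session['session_summaries'].find.to_a.to_json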
Related
First of all, I am very new to Ruby, so go easy on me! I have the parsed JSON in my sourceHash variable and I am trying to group the data by the "displayName" property. The JSON format is something like this (I've simplified it without changing the structure):
{
"results": [
{
"id": "12345",
"title": "my blog post",
"history": {
"createdOn": "2017-09-18 15:38:26",
"createdBy": {
"userName": "myUserName",
"displayName": "Michael W."
}
}
},
{ ... same stuff for some other blog post ... },
{ ... same stuff for some other blog post ... },
{ ... same stuff for some other blog post ... }
]
}
Basically, there are two things I want to do.
Imagine this list as a "list of blog posts including their author data". I want to:
1. Find the person who posted the most entries
2. Get the top 10 bloggers, ordered by their blog post count, descending
So the first would look something like this:
Michael W. (51 posts)
However, the second one would look like this:
Michael Wayne (51 posts)
Emilia Clarke (36 posts)
Charlize Theron (19 posts)
Scarlett Johansson (7 posts)
I've played around with these queries, trying to map my LINQ logic onto this, but I failed... (I'm a Ruby noob, so go easy on me!)
sourceHash = @mainData["results"]
hashSetPrimary = sourceHash.group_by{|h| h['history']['createdBy']['displayName']}
return hashSetPrimary
So, long story short, I am trying to write two separate queries that would group the data by those criteria. Any help is appreciated, as I can't find a proper way to do it.
Firstly, you need to look at your hash syntax. When you define a hash using h = { "foo": "bar" }, the key is not actually a string, but rather a symbol. Therefore accessing h["foo"] is not going to work (it will return nil); you have to access it as h[:foo].
So addressing that, this does what you need:
sourceHash = @mainData[:results]
hashSetPrimary = sourceHash.group_by { |h| h.dig(:history, :createdBy, :displayName) }
                           .map { |k, v| [k, v.count] }
                           .sort_by(&:last)
                           .reverse   # descending by post count
return hashSetPrimary
Hash#dig requires Ruby 2.3+. If you are running on a lower version, you can do something like this instead of dig:
h[:history] && h[:history][:createdBy] && h[:history][:createdBy][:displayName]
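Building on the code above, here is a small usage sketch for the two outputs you described (the variable names are just for illustration):

# hashSetPrimary now looks like [["Michael Wayne", 51], ["Emilia Clarke", 36], ...]
top_blogger = hashSetPrimary.first
puts "#{top_blogger[0]} (#{top_blogger[1]} posts)"

hashSetPrimary.first(10).each do |name, count|
  puts "#{name} (#{count} posts)"
end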
How do I connect my database to API.AI
Making every sentence into an intent and creating entities for each doesn't seem like a good idea. So what is the best possible way to go about it?
As far as I know it is not possible yet, but you can switch to raw mode and paste your entities in CSV or JSON format, OR import a JSON/CSV file containing all your entities.
The file should look like below (JSON format):
[
{
"value": "val1",
"synonyms": [
"syn1",
"syn2"
]
},
{
"value": "val2",
"synonyms": [
"syn21",
"syn22"
]
}
]
So you can imagine writing a small job that reads the entities from your DB and produces a JSON/CSV file in the wanted format.
Once the job is done, this process may dramatically facilitate the creation of your entities on api.ai.
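For instance, here is a small sketch of such a job, assuming a hypothetical Entity model with a value column and associated synonyms:

require 'json'

# Export all entities from the DB into the JSON format shown above.
entities = Entity.includes(:synonyms).map do |e|
  { 'value' => e.value, 'synonyms' => e.synonyms.map(&:name) }
end

File.write('entities.json', JSON.pretty_generate(entities))
# entities.json can then be imported on api.ai.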
If you use a webhook for an intent, you can pass params to your endpoint, where you can run all the queries against your DB.
I did a demo where I was querying news (cheating, as I was getting it from the web, but I could have plugged in a DB).
The demo was getting requests such as:
"What are the latest news about France"
latest and France would be params that I send through to the webhook endpoint.
You would get the following JSON sent to your endpoint by API.AI:
"result": {
"source": "agent",
"resolvedQuery": "latest news about France",
"action": "show.news",
"actionIncomplete": false,
"parameters": {
"adjective": "latest",
"subject": "France"
}
Then you can query all the news for France and order them by latest
In my understanding, the idea is to create entities that are "placeholders" for the values you need to query.
Then you teach the AI with a few examples by tagging, in the request, what the person asked for. Let's say someone asks:
"what is the oldest news about France?"
The AI may not know what oldest is, so you tell it that it is an adjective, and from then on you can get oldest as a param.
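To make that concrete, here is a hedged sketch of a webhook endpoint for the intent above; the News model, its columns, and the route are assumptions, and the response keys follow the speech/displayText style shown in API.AI's v1 docs:

class WebhookController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    p = params[:result][:parameters]  # e.g. {"adjective" => "latest", "subject" => "France"}
    news = News.where(subject: p[:subject])
    news = p[:adjective] == 'oldest' ? news.order(:published_at) : news.order(published_at: :desc)
    answer = news.first ? news.first.title : "No news found"

    render json: { speech: answer, displayText: answer }
  end
end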
I've been using RoR for some time. But after reading several JSON specifications, for example jsonapi.org and json-schema.org, I have the following question: what is the default JSON specification used in RoR?
Because when you render JSON in RoR you get, for example, this:
post: {
id: 1,
title: 'Stackoverflow rised 1 billion of alien money',
description: 'blablabla'
}
Is it good practice to use the default response from RoR when I'm creating an API?
One specific thing that may or may not be helpful...
One thing that bothers me about the default rendering of JSON w/ Rails is that it leaves the key names unquoted when serializing a Hash, which is (technically) not valid JSON. The way to fix this is to add
ActiveSupport::JSON.unquote_hash_key_identifiers = false
to a configuration file like environment.rb. Once you've done that, serializing
my_hash = { post: { id: 1, title: 'Stackoverflow rised 1 billion of alien money', description: 'blablabla' } }
to JSON would change to
post: {
"id": 1,
"title": 'Stackoverflow rised 1 billion of alien money',
"description": 'blablabla'
}
vs. what you have above without the quotes.
I have been able to retrieve event details using
@graph = Koala::Facebook::API.new(access_token)
@eventSummary = @graph.get_object(eventId)
and to get the list of invited users using
@eventSummary = @graph.get_connections(eventId, "invited")
I want to get the counts of all users invited, maybe, declined, and attending for the event, for which I'm using
@eventSummary = @graph.get_connections(eventId, "invited?summary=1")
which is again giving me the list of users only. When I use the Graph API directly, like
https://graph.facebook.com/***eventId***/invited?access_token=*****access_token****&summary=1
I'm getting the counts in the result:
{
"data": [
{
"name": "xyz",
"rsvp_status": "attending",
"id": "10000000123"
}
],
"paging": {
"next": "https://graph.facebook.com/***eventId***/invited?summary=1&access_token=***accesstoken***&limit=5000&offset=5000&__after_id=100004389574321"
},
"summary": {
"noreply_count": 0,
"maybe_count": 0,
"declined_count": 0,
"attending_count": 1,
"count": 1
}
}
Just to get things working for now, I'm getting the result using FQL, as:
@eventSummary = @graph.get_object("fql", :q => "SELECT all_members_count, attending_count, declined_count, not_replied_count, unsure_count FROM event WHERE eid = #{eventId}")
But this is not convenient to use.
Can anyone please help? What am I doing wrong in getting the event RSVP counts?
I'm using Rails v3.2 and the Koala gem for Facebook.
Thanks in advance.
I've seen it too, that when you request the event itself, those attendee count fields aren't included. You can use the second parameter to specifically ask for them though:
@eventSummary = @graph.get_object(eventId)
@eventSummary = @eventSummary.merge(@graph.get_object(eventId, fields: "attending_count,declined_count,interested_count"))
The first call gets your "standard" event details, something like this:
{"description"=>"Blah blah blah",
"name"=>"My Event Will Rock",
"place"=>{"name"=>"Venue Bar",
"location"=>{"city"=>"Citytown",
"country"=>"United States",
"latitude"=>43.05308,
"longitude"=>-87.89614,
"state"=>"WI",
"street"=>"1216 E Brady St",
"zip"=>"53202"},
"id"=>"260257960731155"},
"start_time"=>"2016-04-22T21:00:00-0500",
"id"=>"1018506428220311"}
The second call gets just those "additional" requested fields:
{"attending_count"=>3,
"declined_count"=>0,
"interested_count"=>12,
"id"=>"1018506428220311"}
You can store them in separate variables, or as I'm suggesting above, use the Hash#merge method. (There shouldn't be any problematic overlap between the keys of these two hashes.)
Alternatively, you can get all these details in one request by explicitly requesting everything you want.
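For example, everything could be fetched in one call along these lines (the exact field list is up to you):

@eventSummary = @graph.get_object(eventId,
  fields: "name,start_time,place,attending_count,declined_count,interested_count,noreply_count")

@eventSummary["attending_count"]  # => 3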
I'm working on having Rails 3 respond to a request with JSON, which will then let the app output the search results with the jQuery template plugin...
For the plugin to work, it needs this type of structure:
[
{ title: "The Red Violin", url: "/adadad/123/ads", desc: "blah yada" },
{ title: "Eyes Wide Shut", url: "/adadad/123/ads", desc: "blah yada" },
{ title: "The Inheritance", url: "/adadad/123/ads", desc: "blah yada" }
]
In my Rails 3 controller, I'm getting the search results, which come back as @searchresults and contain either 0, 1, or more objects from the class searched.
My question is how to convert that to the above structure (JSON)...
Thank you!
Update
Forgot to mention: the front-end search page will need to work with multiple models that have different DB columns. That's why I'd like to learn how to convert the results to the structure above, to normalize them and send them back to the user.
I am not really sure what the problem is here, since you can always call .to_json on any instance, collection of instances, hash, etc.
You can use .select to limit the number of fields you need, e.g.:
Object.select(:title, :url, :desc).to_json
I am guessing that @searchresults is an ActiveRecord::Relation, so you can probably use:
@searchresults.select(:title, :url, :desc).to_json
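Since the front end expects the same title/url/desc keys for every model, one option (a sketch only; the Movie/Book models, their attributes, and the path helpers are assumptions) is to map each result to that common shape in the controller before rendering:

# Normalize heterogeneous search results into the structure the jQuery template expects.
results = @searchresults.map do |r|
  case r
  when Movie then { title: r.title, url: movie_path(r), desc: r.synopsis }
  when Book  then { title: r.name,  url: book_path(r),  desc: r.summary }
  else            { title: r.to_s,  url: "#",           desc: "" }
  end
end

render json: results  # serializes to the array structure shown in the question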