Database to store & process client logs efficiently - ruby-on-rails

So the context is that I have a client application that generates logs and I want to occasionally upload this data to a backend. The backend will function as an analytics server, storing, processing and displaying this data - so as you can imagine there will be some querying involved.
In terms of data collection peak load, I expect to have about 5k clients, each generating about 50 - 100 lines per day, and I'd like the solution I'm tackling to be able to process that kind of data. If you do the math, thats upwards of 1 million log lines / month.
In terms of data analytics load, it will be fairly low - I expect a couple of us (admins) to run queries to harvest some info once a week or so from all the logs.
My application is currently running RoR + Postgres, though I'm open to using a different dB if it maps better to my needs. Current contenders in my head are MongoDB & Cassandra, but I don't really want to leave Postgres if it can scale to get the job done.

I'd recommend a purpose built tool like logstash for this:
http://logstash.net/
Another alternative would be Apache Flume:
http://flume.apache.org/

For my experiences, you will need an search engine to do troubleshooting and analysis when you have a lot of logs, instead of using database. (Search engine will more faster than database.)
For now, I am using logstash+Elasticsearch+Kibana total solution to build up my Log system.
Logstash is a tool can parse the logs and make it more human
readable.
Elasticsearch is a search engine to do indexing and
searching on your logs.
Kibana is a webUI that you can use it
to communicate with your Elasticsearch.
This is an Kibana Demo website. You can visit it. http://demo.kibana.org/ .
It provides the search interface and analysis tools such as Pie chart, Table, etc.
In my project, My application generates over 1.5 million logs per day. This Log system can handle all this logs.
Enjoy it.

If you are looking for a database solution that will grow with requests, then I would recommend looking beyond Postgres.
Cassandra is really well-suited for time-series data, though key-value stores are not suited for ad-hoc analytics. One idea could be to store your logs in Cassandra, and then roll them up into a different system later.
For straightforward storing-and-displaying of data, take a look at Graphite, a realtime graphing project.
You can create your own custom graphs with Graphite, and save them as dashboards.

Related

Lead time for changes

I am working on some project where i need to generate lead time for changes per application, per day..
Is there any prometheus metric that provides lead time for changes ? and How we integrate it into a grafana dashboard?
There is not going to be a metric or dashboard out of the box for this, the way I would approach this problem is:
You will need to instrument your deployment code with the prometheus client library of your choice. The deployment code will need to grab the commit time, assuming you are using git, you can use git log filtered to the folder that your application is in.
Now that you have the commit date, you can do a date diff between that and the current time (after the app has been deployed to PRD) to get the lead time of X seconds.
To get it into prometheus, use the node_exporter (or windows_exporter) and their textfile collectors to read textfiles that your deployment code writes and surface them for prometheus to scrape. Most of the client libraries have logic to help you write these files, and even if there is not, the format of the textfiles is pretty easy to use by writing the files directly.
You will want to surface this as a gauge metric, and have a label to indicate which application was deployed. The end result will be a single metric that you can query from grafana or set up alerts that will work for any application/folder that you deploy. To mimic the dashboard that you linked to, I am pretty sure you will want to use the over_time functions.
I also want to note that it might be easier for you to store the deployment/lead time in a sql database/something other than prometheus and use that as a data source into grafana. For applications that do not deploy frequently you would easily run into missing series when querying by using prometheus as a datastore, and the overhead of setting up the node_exporters and the logic to manage the textfiles might outweigh the benefits if you can just INSERT into a sql table.

Is there a solution like Apache ActiveMQ on top of HDFS?

I want to store webpages fetched by a web crawler. I don't have any random access. so whenever i want to read the stored data, i read from the start to the end.
We have tried solutions like HBase but one of the most good things about HBase is random access to records which we don't need at all. HBase has not proved to be stable for us after 1.5 years of test.
I want just a stack or queue on top of HDFS becuase the number of webpages is about 1 billion. I don't even want the queue behaviour of ActiveMQ i just want to be able to store the webpages so that i can read them all in case of a failure.
I don't want to use Files because i don't want to handle things like file rotations, file consistencies and ...
It is worth to mention that we need HDFS so we can run MapReduce jobs on the data when we want to send all the stored data to a solr cluster and to have good things like redundancy and availability by HDFS.
Is there a service on HDFS that just stores JMS records without any functionality for random access and without transparent view of records?

Amazon Web Services - What do i need?

Sorry for long explanation but I'm trying to find the right moves for days, any help would be much appreciated.
My IOS app will be used daily, An image and some data will be displayed to the user and will be saved not to connect again. So an user will use approximately 30kb per day.
Now, for testing, I'm using a basic hosting plan for MSSQL and Web Service. On SQL Server, I have 4 tables and an average of 5 columns (I mean It's not a complicated database)(and also I have a subquery). And I'm using .net web service for communicating from IOS app. And lastly, one image for one day is hosted.
I've tried to explain basically but It's expected to reach at least 1 million user after a short period of time according to my big clients.
So I want to start with AWS not to fail but really I don't know which products/settings do i need (from few users to millions) and how to start to AWS EC2. Also I want to specify that after AWS's documents and googling, I'm confused.
At least please show me the way. Thanks..
You want autoscaling both in the webtier and in database resources. You also likely want high availability (i.e. trans-AZ, trans-regional deployment). This answer might help point you in the right direction. Start with ElasticBeanstalk and RDS (if you can afford it). They both abstract out huge swathes of autoscaling.
Also pay close attention to the ElasticBeanstalk architectural overview. It'll help you distinguish between the web tier of your application, any application layers, and the database layer of your stack.

patterns to feed a remote MySQL with data

I would like to hear about from the community a nice pattern to the following problem.
I had a "do-everything" server, which were webserver, mysql, crawlers server. Since two or three weeks, using monitoring tools, i saw that always when my crawlers were running, my load average was going over 5 (a 4 core server, would be ok to have until 4.00 as load). So, i've got another server and i want to move my crawlers to there. My question is. As soon as i have the data crawled in my crawler server, i have to insert in my database. And i wouldn't like to open a remote connection and insert it in the database, since i prefer to use the Rails framework, btw i'm using rails, to keep easier to create all relationships, and etc.
problem to be solved:
server, has the crawled data (bunch of csv files) and i want to move it to a remote server and insert it in my db using rails.
restriction: I don't want to run mysql (slave + master) since it would require a deeper analysis to know where happens more write operations.
Ideas:
move the csvs from crawlers to remove server using (ssh, rsync) and importing it during the day
write an API in the crawler server that my remote server can pull (many times at day) and import the data
any other idea or good patterns around this theme?
With a slight variation to the second pattern you have noted you could have a API in your web-app-server/db-server. Which the crawler will use to report in his data. He could do this in batches, real-time or only in a specific window of time (day/night time...etc).
This pattern will let the crawler decide when to report in the data. rather than having the web-app do the 'polling' for data.

Are there any stable and production quality nosql datastores?

Are there are production quality nosql stores that I can use on a production system. I have looked at cassandra, tokyodb, couchdb etc but none of them seem to be ready for deployments on production like environments. I am talking thousands of requests per minute and lots of reads/writes/updates. My only concern is speed and service times. Does anybody know of production systems that use nosql stores effectively ? Does anybody know of a nosql store that is backed by a big enterprise like Google/Yahoo/ IBM ?
Cassandra handles thousands of requests (including write-mostly workloads) per second, per machine, and its scaling-by-adding-machines has been there since day 1.
Here is a thread about Cassandra use in production and in-production-soon at dozens of companies: http://n2.nabble.com/Cassandra-users-survey-td4040068.html#a4040068
We're also adding more docs all the time, like http://wiki.apache.org/cassandra/Operations.
I think the NoSQL systems are an excellent choice if I you 'only' care about speed and service time (and not or less about stuff like consistency and transactions). Facebook uses Cassandra.
"Cassandra is used in Facebook as an email search system containing 25TB and over 100m mailboxes." http://highscalability.com/product-facebooks-cassandra-massive-distributed-store
I think CouchDb isn't really speedy, maybe you can use MongoDB: http://www.mongodb.org/display/DOCS/Production+Deployments
Also worth consideration is using a traditional RDBMS like MySQL to store schema-less. This method gives you the stability of a proven database server like MySQL with the flexibility a NoSQL solution.
Check out this blog posting on how FriendFeed does this.
BerkeleyDB is backed by Oracle
Using the native C interface one can reach close to 1 million read requests per second.
By the way, when you say thousands requests per minute, any 'normal' DB should be able to handle that easily too.
Redis is worth giving a try as Github uses redis to manage a heavy queue of background jobs.
My first instinct would be BerkeleyDB, with each application node on a SAMBA network to facilitate ACID conformance & network use. It also sports a SQLite interface. Other poster cites MemcacheDB also having BDB inside.
Another unique option would be OrientDB, also has a SQL interface, lots of network & cluster features.

Resources