MongoDB schema for Rails app - ruby-on-rails

I am working on an internal app to do host/service discovery. The type of data I am storing looks like:
IP Address: 10.40.10.6
DNS Name: wiki-internal.domain.com
1st open port:
port 80 open|close
open port banner:
HTTP/1.1 200 OK
Date: Tue, 07 Jan 2014 08:58:45 GMT
Server: Apache/2.2.15 (CentOS)
X-Powered-By: PHP/5.3.3
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
And so on. My first thought was to put it all in one document, with a string identifying what each piece of data is, like "port", "80". After the initial data collection I realized there was a lot of duplication, because web server banners and the like often get reused; for example, out of 8400 machines with SSH there are only 6 distinct banners.
Is there a better way to design the database, using references, so that a given banner is only stored once? Performance is a big concern since the database size will double over the next year. If possible I would also like to keep historical banner information for trending.

MongoDB's flexible schema allows you to match the needs of your application. While we often talk about denormalizing for speed, you can certainly normalize to reduce redundancy and storage costs. From your initial analysis and your concern over database size, it seems clear that factoring out the redundancy fits your application: in this case, store the banners separately and reference them by small integer _ids.
So do what works for your application, and store your data in MongoDB in the form that matches those needs.
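For example, a minimal sketch with the official mongo Ruby driver (the collection and field names here are purely illustrative, not a prescribed schema):
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'discovery')

# Store each distinct banner once, keyed by a small integer _id.
client[:banners].insert_one(
  _id: 1,
  text: "HTTP/1.1 200 OK\r\nServer: Apache/2.2.15 (CentOS)\r\nX-Powered-By: PHP/5.3.3"
)

# Hosts reference the banner by _id instead of embedding the full text,
# and a timestamp per observation keeps history for trending.
client[:hosts].insert_one(
  ip: '10.40.10.6',
  dns: 'wiki-internal.domain.com',
  ports: [{ port: 80, state: 'open', banner_id: 1, seen_at: Time.now.utc }]
)

# All hosts that have shown banner 1:
client[:hosts].find('ports.banner_id' => 1).each { |doc| puts doc['ip'] }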

Related

What the consequences might be of opening and closing DB connections on every HTTP request

I have a huge project whose data access layer opens and closes connections on every HTTP request, or even more frequently. A lot of components depend on this behavior of the data access layer. At peak traffic we see up to 100 requests/second. The database is MySQL or PostgreSQL.
The question is whether anyone has run into real issues with this approach to communicating with the database.

Is there a default limit on request message size in NuSoap?

Is there a default limit on request message size in NuSoap? I am asking because when I send CSV data of about 194 KB from a NuSOAP client to a NuSOAP server, I get the following response from the server.
HTTP/1.1 100 Continue
HTTP/1.0 500 Internal Server Error
Date: Fri, 13 Apr 2012 04:36:36 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.6
Content-Length: 0
Connection: close
Content-Type: text/html
I have tried looking at the error log files for Apache and PHP, but nothing can be found there.
I have been fighting with this issue for a few hours and have tried searching around for an answer. Some posts recommended increasing the memory limit in php.ini; I did that with no luck. Your help is greatly appreciated.
--Abdul
I think the message size is limited by the PHP memory limit rather than by some hardcoded value. At least, I could send a 6.5 MB string without any problems. When I tried to send an 8 MB string I got an out-of-memory error inside nusoap.php (my server has a 64 MB limit for PHP).

What techniques to use for server side data reception of large-scale mobile app

Fellow StackOverflowers,
we are building an iOS application that will record data which will have to be sent back to our server at certain times. The server will not be sending back any data to the client, other than confirmation that the data has been received successfully. Processing load on the server may become an issue, so we want to design our server/client communication such that overhead is kept as low as possible.
1) Would it be wise to use PHP to write the received data to filesystem/database? It's easy and maintainable, but may be a lot less efficient than, for example, a Java application in Glassfish (or a 'hand-coded' server daemon in C if we choose the raw socket connection).
2) Would it be wise to write the received data directly to the MySQL database (running on the same server), or should we write the data to the filesystem first and parse it into the database asynchronously, i.e., at a time when the server has resources to spare?
3) Which seems wiser: to use a protocol such as HTTP or FTP, or to build our own server daemon and have the clients connect to a socket and push data over it like in this heavily simplified example:
SocketFD = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
connect(SocketFD, (struct sockaddr *)&serverAddr, sizeof serverAddr); /* serverAddr filled in beforehand */
write(SocketFD, theData, sizeOfTheData);
Or, as Krumelur points out, maybe this is a non-issue with regard to server load?
Thanks in advance!
The answer to all three of these questions depends on your budget and how serious the load will be.
I don't think PHP is a wise choice. If you have the time and skill to write something in C or C++, I'd recommend doing that, especially because it gives you thread control. If your budget doesn't stretch that far, Java, as you suggested, would be a good option, or maybe Ruby or Python.
I would suggest using SQLite for storing the data in the app. If only part of the data is sent and you can keep that part separate from the rest, consider putting it in a separate SQLite DB; you can then send that entire file. If you need just part of the data and are that concerned about server load, then I guess you have two options: either let the app create an SQLite file with all the data to transfer and send that file, or just send a serialized array.
On first thought I'd say you should use an SQLite DB on the server side too, to ease parsing the incoming data into the database. On second thought that's a bad idea, since SQLite copes poorly with concurrent writes, and if your load is going to be that huge, that's not what you want.
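For question 2, a rough sketch of the "write to disk now, import later" idea might look like the following (Sinatra and the spool path are arbitrary choices for illustration; the point is only that the request handler does no parsing):
require 'sinatra'
require 'securerandom'
require 'fileutils'

SPOOL_DIR = '/var/spool/app-uploads'   # hypothetical spool directory
FileUtils.mkdir_p(SPOOL_DIR)

# Accept the uploaded payload (e.g. the sqlite file mentioned above), write it
# straight to disk, and let a separate worker import it into MySQL later.
post '/upload' do
  path = File.join(SPOOL_DIR, "#{SecureRandom.uuid}.bin")
  File.open(path, 'wb') { |f| f.write(request.body.read) }
  status 201
  'received'
end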
Why not use WebSockets? There are daemons available in most languages. You could open a socket with every client that wishes to send data, then give a green light ("send it now") whenever a processing thread becomes available. Once the transfer is complete you close the connection. But this would only pay off if the number of requests is so huge that the server spends more CPU on rescheduling than it would by simply doing what Krumelur suggests.
I wonder what you're building, and how it will be producing such a massive server load!

Is there a way to tell server uptime from an HTTP response?

I'm looking for a way to find out how long a server has been up based only on the page it sends back. For example, if I went to www.google.com, is there some sort of response header that tells how long the server I connected to has been up? I doubt there is, but it never hurts to ask...
No, because HTTP, as a protocol, doesn't really care. In any case, 50 different requests to google.com may end up at 50 different servers.
If you want that information, you need to build it into the application, something like "http://google.com/uptime" which will deliver Google's view of how long it's been up - Google will probably do this as a static page showing the date of their inception :-).
Not from HTTP.
It is possible, however, to discover uptime on many OSs by interrogating the TCP packets received. Look to RFC 1323 for more information. I believe the TCP timestamp option is incremented by some value with every transaction and reset to zero on reboot.
Caveats: it doesn't work with all OSs, and you have to track servers over time to get accurate uptime data.
Netcraft does this: see here for a vague description:
The 'uptime' as presented in these reports is the "time since last reboot" of the front end computer or computers that are hosting a site. We can detect this by looking at the data that we record when we sample a site. We can detect how long the responding computer(s) hosting a web site has been running, and by recording these samples over a long period of time we can plot graphs that show this as a line. Note that this is not the same as the availability of a site.
Unfortunately there really isn't. You can check this for yourself by requesting the HTTP headers from the server in question. For example, from google.com you will get:
HTTP/1.0 200 OK
Cache-Control: private, max-age=0
Date: Mon, 08 Jun 2009 03:08:11 GMT
Expires: -1
Content-Type: text/html; charset=UTF-8
Server: gws
Online tool to check HTTP headers:
http://network-tools.com/default.asp?prog=httphead&host=www.google.com
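If you'd rather script the check than use an online tool, a quick sketch with Ruby's Net::HTTP does the same thing (the host is just an example):
require 'net/http'

# Print every response header; none of them reveal how long the server has been up.
Net::HTTP.start('www.google.com', 80) do |http|
  http.head('/').each_header { |name, value| puts "#{name}: #{value}" }
end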
Now if it's your own server, you can create a script that will report the uptime, but I don't think that's what you were looking for.
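Such a script can be tiny; here is a sketch that assumes a Linux host (reading /proc/uptime) and the sinatra gem, with an arbitrary route name:
require 'sinatra'

# /proc/uptime holds "seconds_since_boot seconds_idle".
get '/uptime' do
  seconds = File.read('/proc/uptime').split.first.to_f
  "up #{(seconds / 3600).round(1)} hours"
end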
To add to what Pax said, there are a number of third-party services that monitor site uptime by attempting to access server resources at specific intervals. They maintain some amount of history in their own databases, and then report back to you when those resources are inaccessible.
I use Uptime Party for monitoring a server.

Scaling with a cluster - best strategy

I am thinking about the best strategy to scale with a cluster of servers. I know there are no hard and fast rules, but I am curious what people think about these scenarios:
A cluster of combined app/DB servers, round-robin balanced (with failover) using DNS Made Easy. The DBs are kept in sync via replication. This has the advantage that capacity can be added easily by bringing another server into the cluster, and it is naturally failsafe.
A cluster of app servers, again round-robin load balanced (with failover) using DNS Made Easy, all talking to one big DB server in the back. App servers are easy to add, but the single DB server creates a single point of failure. A hot standby with replication could possibly be added.
A cluster of app servers (as above) using two databases, one handling reads only and one handling writes only.
Also, if you have additional ideas, please make suggestions. The data is mostly denormalized and non-relational, and the DBs see a 50/50 read/write split.
Take two physical machines and make them Xen servers:
A. Xen base alpha
B. Xen base beta
On each one, run three virtual machines:
a "web" server for static assets (CSS, JPG, JS, ...) plus a load-balancing proxy for dynamic requests (Apache + mod_proxy_balancer, or nginx + fair)
an "app" server (Mongrel, Thin, Passenger) for dynamic requests
a "db" server (MySQL, PostgreSQL, ...)
Then your distribution of functions can look like this:
A1 owns your public IP and forwards requests to A2 and B2
B1 pings A1 and takes over if the ping fails
A2 and B2 serve dynamic requests, querying A3 for data
A3 is your dedicated data server
B3 replicates A3 second by second and offers read-only access for making copies, backups, etc.
B3 pings A3 and becomes master if A3 becomes unreachable
Hope this helps in some way, or at least gives you some ideas.
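As a rough illustration of the "B1 pings A1 and takes over" step, a watchdog along these lines could run on B1 (the hostname, floating IP, and takeover command are entirely hypothetical and deployment-specific):
# Naive heartbeat: if A1 stops answering pings, claim the floating public IP.
A1_HOST  = 'a1.internal'                            # hypothetical internal hostname
TAKEOVER = 'ip addr add 203.0.113.10/24 dev eth0'   # assumed floating-IP takeover command

loop do
  alive = system("ping -c 1 -W 2 #{A1_HOST} > /dev/null 2>&1")
  unless alive
    system(TAKEOVER)   # in practice you would also send a gratuitous ARP and start services
    break
  end
  sleep 5
end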
It really depends on your application.
I've spent a bit of time trying various techniques for my company, and what we've settled on (for now) is to run a reverse proxy/load balancer in front of a cluster of web servers that all point to a single master DB. Ideally, we'd like a solution where the DB is set up in a master/slave configuration and we can promote the slave to master if there are any issues.
So option 2, but with a slave DB. Also, for high availability, two reverse proxies behind DNS round robin would be good. I recommend using a load balancer with a "fair" algorithm instead of simple round robin; you will get better throughput.
There are also solutions for load balancing your DB, but those can get somewhat complicated, and I would avoid them until you need them.
RightScale has some good documentation about this sort of thing available here: http://wiki.rightscale.com/
They provide these types of services for cloud hosting solutions.
Particularly useful, I think, are these two entries, whose pictures give a nice visual representation.
The "simple" setup:
http://wiki.rightscale.com/1._Tutorials/02-AWS/02-Website_Edition/2._Deployment_Setup
The "advanced" setup:
http://wiki.rightscale.com/1._Tutorials/02-AWS/02-Website_Edition/How_do_I_set_up_Autoscaling%3f
I'm only going to comment on the database side:
With a normal RDBMS, a 50/50 read/write load will make replication "expensive" in terms of overhead. In almost all cases a simple failover solution is less costly than a replicating active/active DB setup, both in administration/maintenance and in licensing cost (if applicable).
Since your data is "mostly denormalized and non relational" you could take a look at HBase, an OSS implementation of Google's Bigtable, a column-oriented key/value database. HBase in turn is built on top of Hadoop, whose HDFS is an OSS implementation of Google's GFS.
Which solution to go with depends on your expected capacity growth: Hadoop is meant to scale to potentially thousands of nodes, but it runs on far fewer as well.
I've managed active/active replicated DBs, single-write/many-read DBs and simple failover clusters. Going beyond a simple failover cluster opens up a new dimension of potential issues you'll never see in a failover setup.
If you are going with a traditional SQL RDBMS, I would suggest a relatively "big iron" server with lots of memory, set up as a failover cluster. If your write ratio shrinks, you could go with a failover write cluster and a farm of read-only servers.
The answer lies in the details: is your application CPU-bound or I/O-bound? Will you require terabytes of storage or only a few GB?
