I am setting greenplum for the first time. I am following the documentation. I want to setup connection from sql to greenplum database. Currently figuring out what's the best way to achieve this. I came across gpfdist and gpload.
How are the two different? Since both use external tables, both work on slaved nodes and are used for parallel loading. So Is there any advantage of using one over other?
Answering to your question for " I want to setup connection from sql to greenplum database"...
It's ambiguous for which SQL database you are referring to.
Also, there is no direct connectivity drivers available to connect non-greenplum database to greenplum database.
However if you want to migrate data from Oracle to Greenplum, then you can use Informatica's fastclone tool.
To answer your second part of question regarding gpfdist and gpload. GPFDIST is a file distributed process which runs on host system and it serves file parallely to many segments. While initialising external table to read/ write from file, you will need to specify which process will serve the file, In your case it will be GPFDIST. There are other processes too like FTP, GPHDFS, HTTP.
GPLOAD is a wrapper utility which makes your work easier by automatically creating gpfdist processes and external tables.
Also be aware that GPLOAD can only create readable external tables.
gpfdist n gpload or same. In gpfdist you do it manually while in gpload you can automate the activities via maiking entries in config(yaml file) file.
GPLOAD is a wrapper around GPFDIST. so when you load data via gpload it will internally use gpfdist only.
If you want to load/ migrate data from any other RDBMS to Greenplum and you are using any ETL or migration tool, it will use normal copy command and while loading/migrating if you enable gpload(now a days in the latest version of most of the ETL tool and migration tool support gpload feature when you migrate/load data to Greenplum) it will load data in parallel fashion via using gpfdist internally.
Related
While creating some basic workflow using KNIME and PSQL I have encountered problems with selecting proper node for fetching data from db.
In node repo we can find at least:
PostgreSQL Connector
Database Reader
Database Connector
Actually, we can do the same using 2) alone or connecting either 1) or 2) to node 3) input.
I assumed there are some hidden advantages like improved performance with complex queries or better overall stability but on the other hand we are using exactly the same database driver, anyway..
There is a big difference between the Connector Nodes and the Reader Node.
The Database Reader, reads data into KNIME, the data is then on the machine running the workflow. This can be a bad idea for big tables.
The Connector nodes do not. The data remains where it is (usually on a remote machine in your cluster). You can then connect Database nodes to the connector nodes. All data manipulation will then happen within the database, no data is loaded to your machine (unless you use the output port preview).
For the difference of the other two:
The PostgresSQL Connector is just a special case of the Database Connector, that has pre-set configuration. However you can make the same configuration with the Database Connector, which allows you to choose more detailed options for non standard databases.
One advantage of using 1 or 2 is that you only need to enter connection details once for a database in a workflow, and can then use multiple reader or writer nodes. I'm not sure if there is a performance benefit.
1 offers simpler connection details with the bundled postgres jdbc drivers than 2
I have streaming data coming into my consumer app that I ultimately want to show up in Hive/Impala. One way would be to use Hive based APIs to insert the updates in batches to the Hive Table.
The alternate approach is to write the data directly into HDFS as a avro/parquet file and let hive detect the new data and suck it in.
I tried both approaches in my dev environment and the 'only' drawback I noticed was high latency writing to hive and/or failure conditions I need to account for in my code.
Is there an architectural design pattern/best practices to follow?
I want to know which is the best architecture to adopt for this case :
I have many shops that connect to a web application developed using Ruby on Rails.
internet is not reachable all the time
The solution was to develop an offline system which requires installing a local copy of the distant database.
All this wad already developed.
Now what I want to do :
Work always on the local copy of the database.
Any change on the local database should be synchronized with distant database.
All the local copies should have the same data in other local copies.
To resolve this problem I thought about using a JMS like software eventually Rabbit MQ.
This consists on pushing any sql request into a JMS queue that will be executed on the distant instance of the application which will insert into the distant DB and push the insert or SQL statement into another queue that will be read by all the local instances. This seems complicated and should slow down the application.
Is there a design or recommendation that I must apply to resolve this kind of problem ?
You can do that but essentially you are developing your own replication engine. Those things can be a bit tricky to get right (what happens if m1 and m3 are executed on replica r1, but m2 isn't?) I wouldn't want to develop something like that unless you are sure you have the resources to make it work.
I would look into existing off-the shelf replication solution. If you are already using a SQL DB it probably has some support for it. Look here for more details if you are using MySQL
Alternatively, if you are willing to explore other backends, I heard that CouchDB has great support for replication. I also heard of people using git libraries to do that sort of thing.
Update: After your comment, I realize you already use MySql replication and are looking for solution for re-syncing the databases after being offline.
Even in that case RabbitMQ doesn't help you at all since it requires constant connection to work, so you are back to square one. Easiest solution would be to just write all the changes (SQL commands) into a text file at a remote location, then when you get connection back copy that file (scp, ftp, emaill or whatever) to master server, run all the commands there and then just resync all the replicas.
Depending on your specific project you may also need to make sure there are no conflicts when running commands from different remote location but there is no general technical solution to this. Again, depending on the project, you may want to cancel one of the transactions, notify the users that it happened and so on.
I would recommend taking a look at CouchDB. It's a non-SQL database that does exactly what you are describing automatically. It's used especially in phone applications that often don't have internet or data connectivity. The idea is that you have a local copy of a CouchDB database and one or more remote CouchDB databases. The CouchDB server then takes care of teh replication of the distributed systems and you always work off your local database. This approach is nice because you don't have to build your own distributed replication engine. For more details I would take a look at the 'Distributed Updates and Replication' section of their documentation.
What I want to do: My application has a full connection to a Derby DB, and I want to poke around in the DB (read-only) in parallel (using a different tool).
I'm not sure how Derby actually works internally, but I understand that I can have only 1 active connection to a Derby DB.
However, since the DB is only consisting of files on my HDD, shouldn't I be able to open additional connections to it, in read-only mode?
Are there any tools to do just that?
There are two possibilities how to run Apache Derby DB.
Embedded: You run DB within your application → only one connection possible
Client: You start DB as server in separate process → classic DB with many connections
You can recognize the type upon driver size. If the driver has more then 2MB that you use embedded version.
Update
When you startup the derby engine (server or embedded) it gets exclusive access to database files.
If you need to access a single database from more than one Java Virtual Machine (JVM), you will need to put a server solution in place. You can allow applications from multiple JVMs that need to access that database to connect to the server.
For details see Double-booting system behavior.
I realize this is an old question, but I thought I might add a little more detail on a solution since links in the currently accepted answer are broken.
It is possible to run the Derby Network Server within a JVM that is using the embedded database already. The code that is using the embedded Derby database doesn't need to change anything and can keep using the DB as is, but with the Derby Network Server started, other programs can connect to derby and access the database.
All you need to do is ensure that derbynet.jar is on the classpath
And then you can do one of the following
Include the following line in the derby.properties file: derby.drda.startNetworkServer=true
Specify the property as a system property at java start
java -Dderby.drda.startNetworkServer=true
You can use the NetworkServerControl API to start the Network Server from a separate thread within a Java application:
NetworkServerControl server = new NetworkServerControl();
server.start (new PrintWriter(System.out));
More details here: http://db.apache.org/derby/docs/10.9/adminguide/tadminconfig814963.html
Keep in mind that doing this does not enable any security on this connection, so it is not a good idea to do this on a production system. It is possible to add security though and that is documented here: http://db.apache.org/derby/docs/10.9/adminguide/cadminnetservsecurity.html
Two other ideas:
In your application, shut down the database and close the connection when the database is not actively in use. Then your application won't interfere with another tool which is trying to open the database.
Make a copy of your database, by taking a backup (you can do this while the database is open by your application), then restore that backup to a separate place on your disk. Then you can use another tool to access the copied database at your ease.
If you can afford the memory and do not need up-to-date data, then you can access read-only databases from multiple JVMs by creating in-memory copies:
ij> connect 'jdbc:derby:memory:memdb;restoreFrom=mydb';
In the application I am designing, I have to communicate with a device and store a history of data readings in a database. The device is essentially a sensor that spits out numbers via the serial port. The user end of the application is a RubyOnRails interface that allows the user to view this data and configure the device.
I am wondering what kind of connection between the database and the device you could recommend for this kind of a setup.
Up to this point, I had a custom application running on a host computer (a computer with the device connected directly through a serial port) that would serve as a bridge to a MySQL database. The application would connect directly to the MySQL database and execute queries. It works fairly well, but I am not sure if this is the best solution.
The only other alternative I see is to have an intermediate application that my custom application could connect to, instead of directly going to the database. This could be a part of the main application, or something separate. Would this be a better solution?
Would you recommend another approach?
Thank you,
I have a similar structure, although I fetch my data from a Web Service. The way I organize is:
Create classes in lib/imports, eg DailyDataImport, DailyDataSummarize (you can organize the hierarchy and names as per your wish or willingness).
Create a rake task under a new namespace, say import and add it to your cron job depending frequency. Take a look at Cron in Ruby. Its helpful.
This allows me to have a better control over what goes in my database.
Some questions to consider:
What schedule does the Device follow
to populate the data?
Do you need the data as-is or you
want a little control over it or you
need to process it, like summarizing
and aggregating etc.
MS SQL Server 2008 has great data synchronisation support.
SQL Server 2008 Express is free and can act as a replication subscriber (but not publisher) for clients.
Microsoft Sync Framework