Apache Drill in Docker container: java.net.BindException: Address already in use - docker

I'm using Apache Drill to convert csv data to parquet.
I want to do this in a distributed manner, so I spin up a Docker container and run code similar to the example below to convert the CSV to Parquet.
When I run one instance at a time, this works well. But when I spin up several containers simultaneously, the operation often fails with this stack trace:
Error: Failure in starting embedded Drillbit: java.net.BindException: Address already in use (state=,code=0)
java.sql.SQLException: Failure in starting embedded Drillbit: java.net.BindException: Address already in use
at org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:131)
at org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72)...
I don't know much about Drill - I haven't used it for anything before this.
I get the big idea that multiple instances of Drill cannot be running simultaneously, but these Docker containers shouldn't know about each other.
The one thing they have in common is that they write to a common (shared) output folder. But each file name is unique.
Can anyone shed some light on this?
Are there configuration settings I should look at?
The code I'm running is similar to this:
alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`/fp9gr34f/parquet_tmp_output` AS
SELECT
CASE when columns[0]='source_file' or columns[0]='' then CAST(NULL AS VARCHAR(100)) else CAST(columns[0] as VARCHAR(100)) end as `source_file`,
CASE when columns[1]='column1' or columns[1]='' then CAST(NULL AS INT) else CAST(columns[1] as INT) end as `msg_command`,
CASE when columns[2]='column2' or columns[2]='' then CAST(NULL AS INT) else CAST(columns[2] as INT) end as `msg_length`
FROM dfs.`/path/to/my/file.csv`
OFFSET 1
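For readers unfamiliar with the error itself: "Address already in use" is what the operating system reports when a process tries to bind a TCP port that another socket in the same network namespace already holds. The sketch below reproduces that condition in Go on a loopback port. It does not involve Drill, and the helper names bindTwice and isAddrInUse are hypothetical. It also assumes a Unix-like host where the error surfaces as EADDRINUSE.

```go
package main

import (
	"errors"
	"net"
	"syscall"
)

// bindTwice listens on a free loopback port, then tries to bind the same
// address a second time and returns the error from that second attempt.
func bindTwice() error {
	// Port 0 asks the OS to pick any free port.
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer first.Close()

	// The first listener is still open, so this bind is expected to fail.
	second, err := net.Listen("tcp", first.Addr().String())
	if err == nil {
		second.Close()
		return nil
	}
	return err
}

// isAddrInUse reports whether err wraps the OS-level
// "address already in use" error (assumption: Unix-like host).
func isAddrInUse(err error) bool {
	return errors.Is(err, syscall.EADDRINUSE)
}
```

If containers really are isolated from each other, each has its own network namespace and this collision should not occur between them, so it may be worth checking whether the containers share the host network.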

Related

Uptodate list of running docker containers stated in an exported golang variable

I am trying to use the Golang SDK of Docker in order to maintain a slice variable with currently running containers on the local Docker instance. This slice is exported from a package and I want to use it to feed a web page.
I am not really used to goroutines and channels and that's why I am wondering if I have spotted a good solution for my problem.
I have a docker package as follows.
https://play.golang.org/p/eMmqkMezXZn
It has a Running variable containing the current state of running containers.
var Running []types.Container
I use a reload function to load the running containers in the Running variable.
// Reload the list of running containers
func reload() error {
...
Running, err = cli.ContainerList(context.Background(), types.ContainerListOptions{
All: false,
})
...
}
And then I start a goroutine from the init function to listen to Docker events and trigger the reload function accordingly.
func init() {
...
// Listen for docker events
go listen()
...
}
// Listen for docker events
func listen() {
filter := filters.NewArgs()
filter.Add("type", "container")
filter.Add("event", "start")
filter.Add("event", "die")
msg, errChan := cli.Events(context.Background(), types.EventsOptions{
Filters: filter,
})
for {
select {
case err := <-errChan:
panic(err)
case <-msg:
fmt.Println("reloading")
reload()
}
}
}
My question is, is it proper to update a variable from inside a goroutine (in terms of sync)? Maybe there is a cleaner way to achieve what I am trying to build?
Update
My concern here is not really about caching. It is more about hiding the "complexity" of the process of listening and update from the Docker SDK. I wanted to provide something like an index to easily let the end user loop and display currently running containers.
I was aware of data-race problems in threaded programs, but I did not realize I was actually in a concurrent context here (I had never written concurrent programs in Go before).
I effectively need to re-think the solution to be more idiomatic. As far as I can see, I have two options here: either protecting the variable with a mutex or re-thinking the design to integrate channels.
What matters most to me is to hide or encapsulate the synchronization method so that package users need not be concerned with how the shared state is protected.
Would you have any recommendations?
Thanks a lot for your help,
Loric
No, it is not idiomatic Go to share the Running variable between two goroutines. Here it is shared between the goroutine that runs your main function and the listen function, which is started with go and therefore runs in another goroutine. Reading and writing the variable from both without synchronization is also a data race.
The reason it is not idiomatic is that it breaks with the Go proverb:
Do not communicate by sharing memory; instead, share memory by communicating. ¹
So the design of the API needs to change in order to be idiomatic; you need to remove the Running variable and replace it with what? It depends on what you are trying to achieve. If you are trying to cache the cli.ContainerList because you need to call it often, and it might be expensive, you should implement a cache which is invalidated on each cli.Events.
What is your motivation?
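As a minimal sketch of the mutex option mentioned in the question's update, the type below keeps the slice unexported and hands out copies, so package users never see the lock. Container is a hypothetical stand-in for the SDK's types.Container, and Store, Set and Running are illustrative names, not part of the Docker SDK.

```go
package main

import "sync"

// Container is a hypothetical stand-in for the Docker SDK's types.Container.
type Container struct {
	ID   string
	Name string
}

// Store hides the synchronization: callers never touch the slice or the lock.
type Store struct {
	mu      sync.RWMutex
	running []Container
}

// Set replaces the cached list. The goroutine that listens for Docker
// events would call this after each reload.
func (s *Store) Set(list []Container) {
	cp := make([]Container, len(list))
	copy(cp, list)

	s.mu.Lock()
	s.running = cp
	s.mu.Unlock()
}

// Running returns a copy of the current list, so callers can range over
// it without holding the lock and without affecting the cached state.
func (s *Store) Running() []Container {
	s.mu.RLock()
	defer s.mu.RUnlock()

	cp := make([]Container, len(s.running))
	copy(cp, s.running)
	return cp
}
```

A web handler would then call store.Running() and loop over the result. Returning a copy costs an allocation per call, which seems acceptable for a list of this size.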

Can't connect to CFS node

I removed (or decommissioned, I can't remember which) a DSE analytics node (with IP 10.14.5.50) a couple of months ago. When I now try to execute a dse shark query (CREATE TABLE ccc AS SELECT ...), I receive:
15/01/22 13:23:17 ERROR parse.SharkSemanticAnalyzer: org.apache.hadoop.hive.ql.parse.SemanticException: 0:0 Error creating temporary folder on: cfs://10.14.5.50/user/hive/warehouse/mykeyspace.db. Error encountered near token 'TOK_TMP_FILE'
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1256)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1053)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8342)
at shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:105)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
at shark.SharkDriver.compile(SharkDriver.scala:215)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:347)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at shark.SharkCliDriver$.main(SharkCliDriver.scala:240)
at shark.SharkCliDriver.main(SharkCliDriver.scala)
Caused by: java.lang.RuntimeException: java.io.IOException: Error connecting to node 10.14.5.50:9160 with strategy STICKY.
at org.apache.hadoop.hive.ql.Context.getScratchDir(Context.java:216)
at org.apache.hadoop.hive.ql.Context.getExternalScratchDir(Context.java:270)
at org.apache.hadoop.hive.ql.Context.getExternalTmpFileURI(Context.java:363)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1253)
... 12 more
I guess the above error is due to my keyspace referring to the old node:
shark> DESCRIBE DATABASE mykeyspace;
OK
mykeyspace cfs://10.14.5.50/user/hive/warehouse/mykeyspace.db
Time taken: 0.997 seconds
Is there any way for me to fix this incorrect database path?
Tried (but failed) workaround to recreate the database: In cqlsh I created a keyspace thekeyspace and added a table thetable. I then opened dse hive (and noticed that DESCRIBE DATABASE thekeyspace gives me a correct cfs path). However, I am unable to drop the database using DROP DATABASE thekeyspace.
Additional information:
I have no external tables in my keyspace.
Making the SELECT against the tables works.
Setting -hiveconf cassandra.host=WORKING_NODE_IP does not help.
The following commands return proper IPs (i.e. not X.X.X.50):
dsetool listjt
dsetool jobtracker
dsetool sparkmaster
I am getting the same error when I execute the query using dse hive.
No Shark variable is referring to X.X.X.50 when I execute set; in its REPL.
I am running DSE 4.5.
I stumbled across a page that says you need to TRUNCATE "HiveMetaStore"."MetaStore" (in cqlsh) after removing Hive nodes. That did the trick.

SSIS foreach loop takes wrong file

I'm developing an SSIS package that copies the contents of specific files to a database. In this package I make heavy use of the Foreach Loop container. Today I came across some strange behavior and have no clue what's wrong. In one of the containers I filter for "VBFA*.txt". But for some reason the container also gets triggered for a file called "VBAP.D2014211.T204008397.R000564.txt". When I change any part of that filename, it no longer triggers the container. Additionally, there are plenty of other files that start with "VBAP" and don't trigger the container. What could be the reason for this behavior?
Here is the enumerator's implementation:
<DTS:ForEachEnumerator>
<DTS:Property DTS:Name="ObjectName">{6E07E755-700D-4D7D-9550-E08DA5B81264}
</DTS:Property>
<DTS:Property DTS:Name="DTSID">
{f0ceed84-f95c-404c-8794-2eec0155d1a6}</DTS:Property>
<DTS:Property DTS:Name="Description"></DTS:Property>
<DTS:Property DTS:Name="CreationName">DTS.ForEachFileEnumerator.2</DTS:Property>
<DTS:ObjectData>
<ForEachFileEnumeratorProperties>
<FEFEProperty Folder="\\desoswi0204vs\etldata\transfers\out\DP"/>
<FEFEProperty FileSpec="VBFA*.txt"/>
<FEFEProperty FileNameRetrievalType="0"/>
<FEFEProperty Recurse="0"/>
</ForEachFileEnumeratorProperties>
</DTS:ObjectData>
</DTS:ForEachEnumerator>
I've checked the path's contents with dir /x, and the short name of my file is wrong. For the file "VBAP.D2014211.T204008397.R000564.txt" the short name is "VBFA08~1.TXT". The full result is:
01.08.2014 11:02 1.067.169 VBFA08~1.TXT VBAP.D2014211.T204008397.R000564.txt
I have absolutely no clue what is happening here or how to stop it. This violates every rule I've found regarding short filename creation. Because the file spec is matched against the short name as well as the long name, the "VBFA*.txt" filter picks this file up. I'm leaving this as the answer for anyone else who comes across this behavior, which also affects C#'s Directory.GetFiles.
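One possible workaround, sketched here in Go rather than SSIS or C#, is to re-check each enumerated file's long name against the pattern before processing it, so a match that came only from the short name is discarded. The function name matchLongNames is hypothetical, and the sketch assumes the names are passed in without directory components.

```go
package main

import "path/filepath"

// matchLongNames keeps only the names whose full (long) file name matches
// the pattern. Anything the enumerator returned solely because of its 8.3
// short name is dropped, since the short name is never consulted here.
func matchLongNames(pattern string, names []string) ([]string, error) {
	var out []string
	for _, name := range names {
		ok, err := filepath.Match(pattern, name)
		if err != nil {
			// filepath.Match only fails on a malformed pattern.
			return nil, err
		}
		if ok {
			out = append(out, name)
		}
	}
	return out, nil
}
```

In an SSIS package the equivalent check could presumably be made in an expression or script task on the variable that receives the file name.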

Neo4j: Java API IndexHits<Node>.size() is 0

I'm trying to use the Java API for Neo4j but I seem to be stuck at IndexHits. If I query the DB with Cypher using
START n=node:types(type="Process") RETURN n;
I get all 2087 nodes of type "Process".
In my application I have the following lines
Index<Node> nodeIndex = db.index().forNodes("types");
IndexHits<Node> hits = nodeIndex.get("type", "Process");
System.out.println("Node index size: " + hits.size());
which leads my console to spit out a value of 0. Here, db is of course an instance of GraphDatabaseService.
I expected an object that included all 2087 nodes. What am I doing wrong?
The .size() question is just the prelude to my iterator
for(Node process : hits) { ... }
but that does not do much when hits.size() == 0. According to http://api.neo4j.org/1.9.2/org/neo4j/graphdb/index/IndexHits.html this should be possible, provided there is something in hits.
Thanks in advance for your help.
I figured it out. Man, I feel so embarrassed...
It so happens that I had set up the DB_PATH to my default data folder, whereas the default storage folder is the default data folder plus graph.db. When I tried to run the code from that corrected DB_PATH I got an error saying that a lock file was in place because the Neo4j server was running. After shutting it down it worked perfectly.
So, if you happen to see the following error, just stop the server and run the code again:
Caused by: org.neo4j.kernel.StoreLockException: Could not create lock file
at org.neo4j.kernel.StoreLocker.checkLock(StoreLocker.java:74)
at org.neo4j.kernel.StoreLockerLifecycleAdapter.start(StoreLockerLifecycleAdapter.java:40)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:491)
I found on several forums that you cannot run the Neo4j server and use the Java API to query it at the same time.

Tailing a binary file in Erlang adds mysterious bit-string

I want to run tail on a named pipe to facilitate some binary logfile processing. The problem is that mysterious data is being added to the beginning of the stream. I run my tests by starting the erlang process with the opened port (open_port) and then I use another shell to cat the bin into the named pipe.
Here is a simple function for getting data from the port:
bin_from_tail() ->
open_port({spawn,"/usr/bin/tail -F named_pipe"},
[binary,in,eof]),
receive
{_,{data,<<Data/binary>>}} -> Data
end.
So here are two ways for me to grab the same data...
Create the named pipe
mkfifo named_pipe
This command blocks until you run "cat log.bin > named_pipe" from another shell
TailBin = bin_from_tail().
Read the entire file into memory using the Erlang file library
{ok,FileBin} = file:read_file("log.bin").
But TailBin and FileBin are not the same! TailBin has a mysterious 120-bit (15-byte) string at the beginning:
<<40,6,161,69,172,216,56,14,100,0,80,6,0,0,0>>
Thanks for the idea about the endlessly looping cat/restarting a dead port. It appears that named pipes buffer just a little bit, so if the port opens up fast enough the writer process (another program) won't crash! Definitely risky stuff, but as far as hacks go... it works.
Because all the mailing list posts just said do this, do that without examples, I'm going to post how mine works! If anyone wants to offer up improvements, please feel free to do so. My solution:
read() ->
Port = open_port({spawn,"/bin/cat /path/to/pipe"},
[binary,in,eof]),
do_read(Port).
do_read(Port) ->
receive
{Port,{data,<<Data/binary>>}} ->
case do_something:with(Data) of
ok ->
io:format("G"); % Good
Any ->
io:format("B") % Bad
end;
{Port,eof} ->
read();
Any ->
io:format("No match fifo_client:do_read/1, ~p~n",[Any])
end,
do_read(Port).
I found the same thing happened outside erlang. The problem is that tail is trying to show you the end of the file, not the whole file. If you use it on a normal file, anything written would be new, and picked up by -f, but in this case it looks like tail is waiting until the end of the file (the eof that comes through the pipe) and then showing the last 10 lines (treating the binary as text).
tail -F -c 9999999
(assuming your log is 9999999 bytes or less) would probably work.
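To make the behavior described above concrete, here is a minimal Go sketch of what a "last N lines" tail returns once it has seen the end of its input. It is an illustration, not tail's actual implementation, and lastLines is a hypothetical helper. On binary data every 0x0A byte counts as a line break, so the result is simply some suffix of the file and will not start where the file starts.

```go
package main

// lastLines returns the final n newline-delimited lines of data.
// If data contains fewer than n lines, all of it is returned.
func lastLines(data []byte, n int) []byte {
	if n <= 0 || len(data) == 0 {
		return nil
	}
	end := len(data)
	// A trailing newline terminates the last line; it does not start a new one.
	if data[end-1] == '\n' {
		end--
	}
	count := 0
	// Walk backwards until n line breaks have been seen.
	for i := end - 1; i >= 0; i-- {
		if data[i] == '\n' {
			count++
			if count == n {
				return data[i+1:]
			}
		}
	}
	return data
}
```

With n set to 10, any input that contains more than ten 0x0A-delimited chunks comes back truncated at the front, which is consistent with TailBin and FileBin starting with different bytes.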
Maybe try using cat instead of tail -F, that seemed to work for me. Then you just need to avoid the fact that cat exits upon eof, which I assume you were trying to avoid by using tail.
So a shell script which loops cat endlessly, maybe?
Or get Erlang to close and recreate the port when it dies, since you're getting the eof signal anyway. Or use the exit_status flag to open_port to be signalled when the process exits, in case you need to distinguish eof from process exit. (If you use both exit_status and eof, the eof never comes, as a brief test with cat < /dev/null indicates.)
