How to manage SCDF stream definition and properties? - spring-cloud-dataflow

Is there a best practice for storing a stream's DSL, together with its associated app and deployer properties files, in an SCM such as Git?
What should a Git project for an SCDF stream look like?
And how can we manage versions of a stream?

In my project, which deploys streams and applications to GKE, we maintain 3 separate kinds of data, all stored in native format.
Application Definitions - These are stored in a properties file as you would load into the dashboard. One file has all definitions. Sample line:
source.http=docker:springcloudstream/http-source-kafka:3.1.0
Stream Definitions - These are stored in the native JSON format as you would import/export from the stream dashboard. One file has all streams, with DSLs and descriptions. Sample:
{
  "streams": [
    {
      "name": "http-router",
      "dslText": "http | inbound-router: router",
      "originalDslText": "http | inbound-router: router",
      "description": "Accept incoming requests and forwards for routing"
    }
  ]
}
Stream Properties - Stored as a single file per stream. The contents format matches what you may cut and paste into the dashboard when deploying the stream. Sample:
deployer.*.kubernetes.imagePullPolicy=Always
deployer.*.kubernetes.configMapRefs=ssil
app.http.path-pattern=/api
app.http.server.port=20000
app.inbound-router.router.expression=headers['destination'].toLowerCase()?:'unroutable'
The layout that works best for us is:
spring-cloud-dataflow
|- kafka-apps-docker.properties
|- spring-cloud-streams.json
|- stream-properties
   |- stream1.properties
   |- stream2.properties
Our project has all applications in one repository, so naturally the spring-cloud-dataflow pieces belong there, in their own folder.
Creation and deployment of these applications and streams are delegated to a script that calls the spring-cloud-dataflow REST API, which minimizes use of the dashboard.
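A minimal Python sketch of what such a script might do with the three kinds of files described above. It only builds the list of REST calls and does not send them. The endpoint paths (/apps/{type}/{name}, /streams/definitions, /streams/deployments/{name}) are assumptions based on the Data Flow REST API documentation and should be checked against your server version; the function names are made up for illustration.

```python
import json


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        props[key.strip()] = value.strip()
    return props


def plan_requests(apps_text, streams_json, stream_props):
    """Return (method, path, payload) tuples in the order they would be sent.

    stream_props maps a stream name to the text of its properties file.
    The paths are assumptions; verify them against your SCDF version.
    """
    plan = []
    # 1. Register applications: keys look like "<type>.<name>".
    for key, uri in parse_properties(apps_text).items():
        app_type, _, app_name = key.partition(".")
        plan.append(("POST", f"/apps/{app_type}/{app_name}", {"uri": uri}))
    for stream in json.loads(streams_json)["streams"]:
        # 2. Create the stream definition from the exported JSON.
        plan.append(("POST", "/streams/definitions", {
            "name": stream["name"],
            "definition": stream["dslText"],
            "description": stream.get("description", ""),
        }))
        # 3. Deploy the stream with its own properties file, if one exists.
        props = parse_properties(stream_props.get(stream["name"], ""))
        plan.append(("POST", f"/streams/deployments/{stream['name']}", props))
    return plan
```

Keeping the plan separate from the HTTP client makes it easy to review in a dry run before anything is sent to the server.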

Related

How can I pass a pointer to a file in helm upgrade command?

I have a truststore file (a binary file) that I need to provide during helm upgrade. This file is different for each target environment (dev, qa, staging or prod), so I can only provide it at deployment time. helm upgrade --set-file does not accept a binary file. This seems to be the issue described here: https://github.com/helm/helm/issues/3276. The truststore files are stored in the Jenkins credential store.
The help text describes the flag as follows:
--set-file stringArray set values from respective files specified via the command line (can specify multiple or separate values with commas: key1=path1,key2=path2)
It is also important to know the section "The Format and Limitations of --set" in the Helm documentation.
The error you see: Error: failed parsing --set-file data... means that the file you are trying to use does not meet the requirements. See the example below:
--set-file key=filepath is another variant of --set. It reads the
file and use its content as a value. An example use case of it is to
inject a multi-line text into values without dealing with indentation
in YAML. Say you want to create a brigade project with certain value
containing 5 lines JavaScript code, you might write a values.yaml
like:
defaultScript: |
  const { events, Job } = require("brigadier")
  function run(e, project) {
    console.log("hello default script")
  }
  events.on("run", run)
Being embedded in a YAML, this makes it harder for you to use IDE
features and testing framework and so on that supports writing code.
Instead, you can use --set-file defaultScript=brigade.js with
brigade.js containing:
const { events, Job } = require("brigadier")
function run(e, project) {
  console.log("hello default script")
}
events.on("run", run)
I hope it helps.
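One workaround often used for binary files is to base64-encode the file into plain text first and pass that text file to --set-file, then decode it again in the chart (or place the value directly in a Secret's data field, which expects base64). A minimal Python sketch of the encoding step only; the function name and file names are placeholders, not taken from the question:

```python
import base64
from pathlib import Path


def encode_for_set_file(binary_path, text_path):
    """Write a base64 text copy of a binary file and return the text path."""
    data = Path(binary_path).read_bytes()
    # Base64 output is plain ASCII, so it can be read as an ordinary text value.
    Path(text_path).write_text(base64.b64encode(data).decode("ascii"))
    return text_path
```

The resulting file could then be passed as, for example, --set-file truststore=truststore.b64, where the key truststore is a hypothetical value name that your chart would have to define.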

Resumable upload with new file with special characters in name

I'm following the documentation to create a new upload session for a resumable file upload.
My request looks like:
/v1.0/me/drive/items/:folderId/children/:fileName/createUploadSession
This works when :fileName is something like test.txt or even test 2.txt. But putting special characters in there, like test".txt or test%22.txt, causes the request to fail.
There are no examples in the documentation on how to deal with special characters in this case, so is this supported?
Files stored in OneDrive have similar naming conventions and restrictions to files stored locally. Since OneDrive can sync to your local file system, it makes sense that this is the case.
In general, you should assume you cannot use any of these characters in your file names:
~ " # % & * : < > ? / \ { | }.
You can find the complete list at Invalid file names and file types in OneDrive, OneDrive for Business, and SharePoint.
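As a rough illustration, a client could check a file name against the characters listed above before calling createUploadSession. This is only a sketch: the function names are made up, the replacement character is an arbitrary choice, and the character set is the short list from this answer rather than the complete list in the linked article.

```python
# Characters the answer above says to avoid in OneDrive file names.
INVALID_CHARS = set('~"#%&*:<>?/\\{|}')


def invalid_characters(file_name):
    """Return the disallowed characters found in file_name, in order of appearance."""
    return [ch for ch in file_name if ch in INVALID_CHARS]


def sanitize(file_name, replacement="_"):
    """Replace each disallowed character with the given replacement."""
    return "".join(replacement if ch in INVALID_CHARS else ch for ch in file_name)
```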

Parsing NYC Transit/MTA historical GTFS data (not realtime)

I've been puzzling on this on and off for months and can't find a solution.
The MTA claims to provide historical data in form of daily dumps in GTFS format here:
http://web.mta.info/developers/MTA-Subway-Time-historical-data.html
See for yourself by downloading the example they provide, in this case for September 17, 2014:
https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31
My problem? The file is gobbledygook. It does not follow GTFS specifications, has no extension, and when I open it using a text editor it looks like 7800 lines of this:
n
^C1.0^X �枪�^Eʞ>`
^C1.0^R^K
^A1^R^F^P����^E^R^K
^A2^R^F^P����^E^R^K
^A3^R^F^P����^E^R^K
^A4^R^F^P����^E^R^K
^A5^R^F^P����^E^R^K
^A6^R^F^P����^E^R^K
^AS^R^F^P����^E^R[
^F000001^ZQ
6
^N050400_1..S02R^Z^H20140917*^A1�>^V
^P01 0824 242/SFY^P^A^X^C^R^W^R^F^Pɚ��^E"^D140Sʚ>^F
^AA^R^AA^RR
^F000002"H
6
Per the MTA site (appears untrue)
All data is formatted in GTFS-realtime
Any idea on the steps necessary to transform this mystery file into usable GTFS data? Is there some encoding I am missing? I have looked for 10+ and been unable to come up with a solution.
Also, not to be a stickler but I am NOT referring to the MTA's realtime data feed, which is correctly formatted and usable. I am specifically referring to the historical data dumps I reference above (have received many "solutions" referring only to realtime data feed)
The file you link to is in GTFS-realtime format, not GTFS, and the page you linked to does a very bad job of explaining which format their data is actually in (though it is mentioned in your quote).
GTFS is used to store schedule data, like routes and scheduled arrival times.
GTFS-realtime is generally used to transfer actual transit performance data in real time, like vehicle locations and expected or actual arrival times. It is a protobuf (Protocol Buffers, a binary serialization format published by Google), which means you can't usefully read it in a text editor; you have to load it programmatically using the Google protobuf tools. It can be used as a historical data format in the way the MTA does here, by making daily dumps of the GTFS-rt feed publicly available. It's called GTFS-realtime because various data fields in the realtime feed, like route_id, trip_id, and stop_id, are designed to link to the published GTFS schedules.
I confirmed the validity of the data you linked to by decompiling it using the gtfs-realtime.proto specification and the Google protobuf tools for Python. It begins:
header {
  gtfs_realtime_version: "1.0"
  timestamp: 1410960621
}
entity {
  id: "000001"
  trip_update {
    trip {
      trip_id: "050400_1..S02R"
      start_date: "20140917"
      route_id: "1"
    }
    stop_time_update {
      arrival {
        time: 1410960713
      }
      stop_id: "140S"
    }
  }
}
...
and continues in that vein for a total of 55833 lines (in the default string output format).
EDIT: the Python script used to convert the protobuf into string representation is very simple:
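import gtfs_realtime_pb2 as gtfs_rt
# Read the raw protobuf bytes downloaded from the MTA
with open('gtfs-rt.pb', 'rb') as f:
    raw_bytes = f.read()
# Parse the bytes into a FeedMessage and print its text representation
msg = gtfs_rt.FeedMessage()
msg.ParseFromString(raw_bytes)
print(msg)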
This requires gtfs-realtime.proto to have been compiled into gtfs_realtime_pb2.py using protoc (following the instructions in the Python protobuf documentation under "Compiling Your Protocol Buffers") and placed in the same directory as the Python script. Furthermore, the binary protobuf downloaded from the MTA needs to be named gtfs-rt.pb and located in the same directory as the Python script.
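The timestamp and time values in the decoded output are POSIX times (seconds since 1970-01-01 UTC). A small sketch for turning them into readable datetimes, using the two values shown in the dump above; the helper name is made up for illustration:

```python
from datetime import datetime, timezone


def posix_to_utc(seconds):
    """Convert a GTFS-realtime POSIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


header_time = posix_to_utc(1410960621)   # header.timestamp from the dump above
arrival_time = posix_to_utc(1410960713)  # first stop_time_update arrival
print(header_time.isoformat())                       # 2014-09-17T13:30:21+00:00
print((arrival_time - header_time).total_seconds())  # 92.0
```

13:30:21 UTC is 09:30:21 in New York (EDT), which is consistent with the 09-31 in the file name.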

How to recursively download FTP folder in parallel in Ruby?

I need to cache an FTP folder locally in Ruby. Right now I'm using ftp_sync to download the FTP folder, but it's painfully slow. Do you know of any library that can download the folder's files in parallel?
Thanks!
The syncftp gem may help you:
http://rubydoc.info/gems/syncftp/0.0.3/frames
Ruby has a decent built-in FTP library in case you want to roll your own:
http://www.ruby-doc.org/stdlib-1.9.3/libdoc/net/ftp/rdoc/Net/FTP.html
To download files in parallel, you can use multiple threads with timeouts:
Ruby Net::FTP Timeout Threads
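The threaded approach is not specific to Ruby. As a rough illustration of the pattern (one connection per download, a bounded number of workers, a timeout per connection), here is a sketch in Python using only the standard library. The function names are made up for illustration, anonymous login is an assumption, and listing the remote directory recursively is left out:

```python
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP


def fetch_via_ftp(host, remote_path, timeout=30):
    """Download one file over its own FTP connection and return its bytes."""
    chunks = []
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # anonymous login; pass credentials here if needed
        ftp.retrbinary(f"RETR {remote_path}", chunks.append)
    return b"".join(chunks)


def download_all(host, remote_paths, fetch=fetch_via_ftp, workers=4):
    """Fetch every path with a bounded pool of threads; returns {path: bytes}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(fetch, host, path) for path in remote_paths}
        return {path: future.result() for path, future in futures.items()}
```

Each worker opens its own connection because a single FTP control connection handles one transfer at a time. The fetch parameter is injectable so the pooling logic can be exercised without a server.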
A great way to get parallel work done is Celluloid, the concurrent framework:
https://github.com/celluloid/celluloid
All that said, if the download speed is limited to your overall network bandwidth, then none of these approaches will help much.
To speed up the transfers in this case, be sure you're only downloading the information that's changed: new files and changed sections of existing files.
Segmented downloading can give massive speedups in some cases, such as downloading big log files where only a small percentage of the file has changed and all the changes are appends at the end of the file.
You can also consider shelling out to the command line. There are many tools that can help you with this. A good general-purpose one is "curl", which supports simple ranges for FTP files as well, for example you can get the first 100 bytes of a document using FTP like this:
curl -r 0-99 ftp://www.get.this/README
Are you open to other protocols besides FTP? Take a look at the "rsync" command, which is excellent for download synchronization. The rsync command has many optimizations to transfer just the changed data. For example rsync can sync a remote directory to a local directory like this:
rsync -auvC me@my.com:/remote/foo/ /local/foo/
Take a look at Curb. It's a wrapper around Curl, and can do multiple connections in parallel.
This is a modified version of one of their examples:
require 'curb'
urls = %w[
  http://ftp.ruby-lang.org/pub/ruby/1.9/ruby-1.9.3-p286.tar.bz2
  http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2
]
responses = {}
m = Curl::Multi.new
# add a few easy handles
urls.each do |url|
  responses[url] = Curl::Easy.new(url)
  puts "Queuing #{ url }..."
  m.add(responses[url])
end
spinner_counter = 0
# the backslash must be escaped inside %w[], otherwise it escapes the following space
spinner = %w[ | / - \\ ]
m.perform do
  print 'Performing downloads ', spinner[spinner_counter], "\r"
  spinner_counter = (spinner_counter + 1) % spinner.size
end
puts
urls.each do |url|
  print "[#{ url } #{ responses[url].total_time } seconds] Saving #{ responses[url].body_str.size } bytes..."
  File.open(File.basename(url), 'wb') { |fo| fo.write(responses[url].body_str) }
  puts 'done.'
end
That'll pull in both the Ruby and Python source (which are pretty big so they'll take about a minute, depending on your internet connection and host). You won't see any files appear until the last block, where they get written out.

Jenkins Continuous Integration with Amazon S3 - Everything is uploading to the root?

I'm running Jenkins and I have it successfully working with my GitHub account, but I can't get it working correctly with Amazon S3.
I installed the S3 plugin and when I run a build it successfully uploads to the S3 bucket I specify, but all of the uploaded files end up in the root of the bucket. I have a bunch of folders (such as /css, /js and so on), but all of the files in those folders from GitHub end up in the root of my S3 bucket.
Is it possible to get the S3 plugin to upload and retain the folder structure?
It doesn't look like this is possible. Instead, I'm using s3cmd to do this. You must first install it on your server, and then in one of the bash scripts within a Jenkins job you can use:
s3cmd sync -r -P $WORKSPACE/ s3://YOUR_BUCKET_NAME
That will copy all of the files to your S3 account maintaining the folder structure. The -P keeps read permissions for everyone (needed if you're using your bucket as a web server). This is a great solution using the sync feature, because it compares all your local files against the S3 bucket and only copies files that have changed (by comparing file sizes and checksums).
I have never worked with the S3 plugin for Jenkins (but now that I know it exists, I might give it a try), though, looking at the code, it seems you can only do what you want using a workaround.
Here's what the actual plugin code does (taken from GitHub); I removed the parts of the code that are not relevant, for the sake of readability:
class hudson.plugins.s3.S3Profile, method upload:
final Destination dest = new Destination(bucketName,filePath.getName());
getClient().putObject(dest.bucketName, dest.objectName, filePath.read(), metadata);
Now if you take a look into hudson.FilePath.getName()'s JavaDoc:
Gets just the file name portion without directories.
Now, take a look into the hudson.plugins.s3.Destination's constructor:
public Destination(final String userBucketName, final String fileName) {
    if (userBucketName == null || fileName == null)
        throw new IllegalArgumentException("Not defined for null parameters: "+userBucketName+","+fileName);
    final String[] bucketNameArray = userBucketName.split("/", 2);
    bucketName = bucketNameArray[0];
    if (bucketNameArray.length > 1) {
        objectName = bucketNameArray[1] + "/" + fileName;
    } else {
        objectName = fileName;
    }
}
The Destination class JavaDoc says:
The convention implemented here is that a / in a bucket name is used to construct a structure in the object name. That is, a put of file.txt to bucket name of "mybucket/v1" will cause the object "v1/file.txt" to be created in the mybucket.
Conclusion: the filePath.getName() call strips off any prefix you add to the file (S3 has no directories, only key prefixes). If you really need to put your files into a "folder" (i.e. give them a specific prefix that contains a slash (/)), I suggest adding this prefix to the end of your bucket name, as described in the Destination class JavaDoc.
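To make the convention concrete, here is the same splitting logic re-expressed as a short Python sketch. It mirrors the Destination constructor quoted above and is not part of the plugin; the function name is made up for illustration.

```python
def destination(user_bucket_name, file_name):
    """Mirror the Destination constructor: split "bucket/prefix" at the first slash."""
    if user_bucket_name is None or file_name is None:
        raise ValueError("Not defined for null parameters")
    # Equivalent of Java's split("/", 2): at most two parts.
    parts = user_bucket_name.split("/", 1)
    bucket = parts[0]
    if len(parts) > 1:
        object_name = parts[1] + "/" + file_name
    else:
        object_name = file_name
    return bucket, object_name


print(destination("mybucket/v1", "file.txt"))  # ('mybucket', 'v1/file.txt')
print(destination("mybucket", "file.txt"))     # ('mybucket', 'file.txt')
```

So a "Destination bucket" of mybucket/css would place style.css at the key css/style.css, but every file in that upload entry gets the same prefix.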
Yes this is possible.
It looks like for each folder destination, you'll need a separate instance of the S3 plugin however.
"Source" is the file you're uploading.
"Destination bucket" is where you place your path.
Using Jenkins 1.532.2 and S3 Publisher Plug-In 0.5, the UI configure Job screen rejects additional S3 publish entries. There would also be a significant maintenance benefit to us if the plugin recreated the workspace directory structure as we'll have many directories to create.
Set up your Git plugin.
Set up your Bash script.
Everything in your folder matched by "*" will go to the bucket.

Resources