Google cloud dataflow reading from compressed files - google-cloud-dataflow

I need to parse json data from files in GCS which is compressed, since the files extension is .gz so it should be reorganized and handled properly by dataflow, however the job log printed out unreadable characters and data not processed. when I process uncompressed data it worked fine. I used the following method to map/parse json:
ObjectMapper mapper = new ObjectMapper();
Map<String, String> eventDetails = mapper.readValue(c.element(),
new TypeReference<Map<String, String>>() {
});
any idea what could be the cause?
===================================
To add more details about how to read from input files:
to create a pipeline:
Poptions pOptions = PipelineOptionsFactory.fromArgs(args).withValidation().as(Poptions.class);
Pipeline p = Pipeline.create(pOptions);
p.apply(TextIO.Read.named("ReadLines").from(pOptions.getInput()))
.apply(new Pimpression())
.apply(BigQueryIO.Write
.to(pOptions.getOutput())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run();
configuration at run time:
PROJECT="myProjectId"
DATASET="myDataSetId"
INPUT="gs://foldername/input/*"
STAGING1="gs://foldername/staging"
TABLE1="myTableName"
mvn exec:java -pl example \
-Dexec.mainClass=com.google.cloud.dataflow.examples.Example1 \
-Dexec.args="--project=${PROJECT} --output=${PROJECT}:${DATASET}.${TABLE1} --input=${INPUT} --stagingLocation=${STAGING1} --runner=BlockingDataflowPipelineRunner"
input file name example: file.gz, and the output of command gsutil ls -L gs://bucket/input/file.gz | grep Content- is:
Content-Length: 483100
Content-Type: application/octet-stream

After following up privately, we determined that this issue was due to using an older version of the Dataflow SDK (pre-gzip support). Since Dataflow is in alpha and the SDK is being continually updated, ensure that the SDK version you are using is up to date (either from Maven central or GitHub).

Related

Install Custom Dependency for KFP Op

I'm trying to setup a simple KubeFlow pipeline, and I'm having trouble packaging up dependencies in a way that works for KubeFlow.
The code simply downloads a config file and parses it, then passes back the parsed configuration.
However, in order to parse the config file, it needs to have access to another internal python package.
I have a .tar.gz archive of the package hosted on a bucket in the same project, and added the URL of the package as a dependency, but I get an error message saying tarfile.ReadError: not a gzip file.
I know the file is good, so it's some intermediate issue with hosting on a bucket or the way kubeflow installs dependencies.
Here is a minimal example:
from kfp import compiler
from kfp import dsl
from kfp.components import func_to_container_op
from google.protobuf import text_format
from google.cloud import storage
import training_reader
def get_training_config(working_bucket: str,
working_directoy: str,
config_file: str) -> training_reader.TrainEvalPipelineConfig:
download_file(working_bucket, os.path.join(working_directoy, config_file), "ssd.config")
pipeline_config = training_reader.TrainEvalPipelineConfig()
with open("ssd.config", 'r') as f:
text_format.Merge(f.read(), pipeline_config)
return pipeline_config
config_op_packages = ["https://storage.cloud.google.com/my_bucket/packages/training-reader-0.1.tar.gz",
"google-cloud-storage",
"protobuf"
]
training_config_op = func_to_container_op(get_training_config,
base_image="tensorflow/tensorflow:1.15.2-py3",
packages_to_install=config_op_packages)
def output_config(config: training_reader.TrainEvalPipelineConfig) -> None:
print(config)
output_config_op = func_to_container_op(output_config)
#dsl.pipeline(
name='Post Training Processing',
description='Building the post-processing pipeline'
)
def ssd_postprocessing_pipeline(
working_bucket: str,
working_directory: str,
config_file:str):
config = training_config_op(working_bucket, working_directory, config_file)
output_config_op(config.output)
pipeline_name = ssd_postprocessing_pipeline.__name__ + '.zip'
compiler.Compiler().compile(ssd_postprocessing_pipeline, pipeline_name)
The https://storage.cloud.google.com/my_bucket/packages/training-reader-0.1.tar.gz IRL requires authentication. Try to download it in Incognito mode and you'll see the login page instead of file.
Changing the URL to https://storage.googleapis.com/my_bucket/packages/training-reader-0.1.tar.gz works for public objects, but your object is not public.
The only thing you can do (if you cannot make the package public) is to use google.cloud.storage library or gsutil program to download the file from the bucket and then manually install it suing subprocess.run([sys.executable, '-m', 'pip', 'install', ...])
Where are you downloading the data from?
What's the purpose of
pipeline_config = training_reader.TrainEvalPipelineConfig()
with open("ssd.config", 'r') as f:
text_format.Merge(f.read(), pipeline_config)
return pipeline_config
Why not just do the following:
def get_training_config(
working_bucket: str,
working_directory: str,
config_file: str,
output_config_path: OutputFile('TrainEvalPipelineConfig'),
):
download_file(working_bucket, os.path.join(working_directoy, config_file), output_config_path)
the way kubeflow installs dependencies.
Export your component to loadable component.yaml and you'll see how KFP Lighweight components install dependencies:
training_config_op = func_to_container_op(
get_training_config,
base_image="tensorflow/tensorflow:1.15.2-py3",
packages_to_install=config_op_packages,
output_component_file='component.yaml',
)
P.S. Some small pieces of info:
#dsl.pipeline(
Not required unless you want to use the dsl-compile command-line program
pipeline_name = ssd_postprocessing_pipeline.name + '.zip'
compiler.Compiler().compile(ssd_postprocessing_pipeline, pipeline_name)
Did you know that you can just kfp.Client(host=...).create_run_from_pipeline_func(ssd_postprocessing_pipeline, arguments={}) to run the pipeline right away?

How to return the exact names of the new artifacts added to Jfrog artifactory using URL Trigger plugin in Jenkins

I need to poll the artifactory URL every night and find out which file got added, and use that name of the new artifact as a parameter to trigger another job in Jenkins. But the URLTrigger plugin doesn't return the name of the new artifacts? Is there any way to derive that?
I use groovy to run a curl command to extract and parse the metadata.xml to work out the jar name.
Assuming Artifactory has metadata content that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<groupId>path.to.application</groupId>
<artifactId>jarName</artifactId>
<versioning>
<latest>6.1.12-SNAPSHOT</latest>
<release>6.1.11</release>
<versions>
<version>6.1.11</version>
<version>6.1.12-SNAPSHOT</version>
</versions>
<lastUpdated>20181122121509</lastUpdated>
</versioning>
</metadata>
Thus the build information I want want is 'jarName-6.1.12-SNAPSHOT.jar'
import org.xml.sax.SAXParseException;
//Assumed artifactory path to application.jar
def metaDataPath = 'https://your.artifactory.server/artifactory/path/to/application/jarName/maven-metadata.xml'
//Get the file using curl (you might need to use a proxy), with an api token for authentication
def metadataContent = 'curl -x<your-proxy:80> -H "X-JFrog-Art-Api:<your token>" -XGET ' + metaDataPath
metadataContent = metadataContent.execute().text
//Parse it to get the 'latest' element
def parsedXml = (new XmlParser()).parseText(metadataContent)
println parsedXml.versioning.latest.text() //6.1.12-SNAPSHOT
If your snapshot builds include a timestamp in their name, then you would need to use the returned 6.1.12-SNAPSHOT to build a new metadata path:
https://your.artifactory.server/artifactory/path/to/application/jarName/6.1.12-SNAPSHOT/maven-metadata.xml
To then repeat the extract and parse process to get the timestamped name from the child metadata.xml

How to use the "recovery tool" to repair a perfino h2 database?

Our Perfino server crashed recently, logging since then the ERROR shown below. (There are some clues hinting to an OutOfMemory resulting in a corrupt db.)
It is suggested: 'Possible solution: use the recovery tool'. But neither the official perfino documentation nor the logs offer more instructions on how to proceed.
So here the question: how to use the recovery tool?
Stacktrace:
ERROR [collector] server: could not load transaction data
org.h2.jdbc.JdbcSQLException: File corrupted while reading record: "[495834] stream data key:64898 pos:11 remaining:0". Possible solution: use the recovery tool; SQL statement:
SELECT value FROM transaction_names WHERE id=? [90030-176]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:344)
at org.h2.message.DbException.get(DbException.java:178)
at org.h2.message.DbException.get(DbException.java:154)
at org.h2.index.PageDataIndex.getPage(PageDataIndex.java:242)
at org.h2.index.PageDataNode.getNextPage(PageDataNode.java:233)
at org.h2.index.PageDataLeaf.getNextPage(PageDataLeaf.java:400)
at org.h2.index.PageDataCursor.nextRow(PageDataCursor.java:95)
at org.h2.index.PageDataCursor.next(PageDataCursor.java:53)
at org.h2.index.IndexCursor.next(IndexCursor.java:278)
at org.h2.table.TableFilter.next(TableFilter.java:361)
at org.h2.command.dml.Select.queryFlat(Select.java:533)
at org.h2.command.dml.Select.queryWithoutCache(Select.java:646)
at org.h2.command.dml.Query.query(Query.java:323)
at org.h2.command.dml.Query.query(Query.java:291)
at org.h2.command.dml.Query.query(Query.java:37)
at org.h2.command.CommandContainer.query(CommandContainer.java:91)
at org.h2.command.Command.executeQuery(Command.java:197)
at org.h2.jdbc.JdbcPreparedStatement.executeQuery(JdbcPreparedStatement.java:109)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:353)
at com.perfino.a.f.b.a.a(ejt:70)
at com.perfino.a.f.o.a(ejt:880)
at com.perfino.a.f.o.a(ejt:928)
at com.perfino.a.f.o.a(ejt:60)
at com.perfino.a.f.aa.a(ejt:783)
at com.perfino.a.f.o.a(ejt:847)
at com.perfino.a.f.o.a(ejt:792)
at com.perfino.a.f.o.a(ejt:787)
at com.perfino.a.f.o.a(ejt:60)
at com.perfino.a.f.ac.a(ejt:1011)
at com.perfino.b.a.b(ejt:68)
at com.perfino.b.a.c(ejt:82)
at com.perfino.a.f.o.a(ejt:1006)
at com.perfino.a.i.b.d.a(ejt:168)
at com.perfino.a.i.b.d.b(ejt:155)
at com.perfino.a.i.b.d.b(ejt:52)
at com.perfino.a.i.b.d.a(ejt:45)
at com.perfino.a.i.a.b.a(ejt:94)
at com.perfino.a.c.a.b(ejt:105)
at com.perfino.a.c.a.a(ejt:37)
at com.perfino.a.c.c.run(ejt:57)
at java.lang.Thread.run(Thread.java:745)
Notice: I couldn't recover my database with the procedure described below. I'm still keeping this post as reference, as the probability of a successful recovery will depend on how broken the database is, and there is no evidence that this procedure is invalid.
Perfino uses by default the H2 Database Engine as its persistence storage. H2 has a recovery tool and a run script tool to import sql statements:
# 1. Create a dump of the current database using the tool [1]
# This tool creates a 'config.h2.sql' and a 'perfino.h2.sql' db dump
cd ${PERFINO_DATA_DIR}
java -cp ${PATH_TO_H2_LIB}/h2*.jar org.h2.tools.Recover
# 2. Rename the corrupt database file to e.g. *bkp
mv perfino.h2.db perfino.h2.db.bkp
# 3. Import the dump from step 1, ignoring errors
java -cp ${PATH_TO_H2_LIB}/h2*.jar \
org.h2.tools.RunScript \
-url jdbc:h2:${PERFINO_DATA_DIR}/db/perfino \
-script perfino.h2.sql -checkResults
[1]: Perfino includes a version of the h2.jar under ${PERFINO_INSTALL_DIR}/lib/common/h2.jar. You could of course download the official jar and try with it, but in my case, I could only restore the database with the jar supplied with perfino.
This failed for me with a
Exception in thread "main" org.h2.jdbc.JdbcSQLException: Feature not supported: "Restore page store recovery SQL script can only be restored to a PageStore file".
If this happens to you, try:
# 1. Delete database and mv files
cd ${PERFINO_DATA_DIR}
rm perfino.h2.db perfino.mv.db
# 2. Create a PageStore database manually
touch perfino.h2.db
# 3. try with MV_STORE=FALSE on the url [2]
java -cp ${PATH_TO_H2_LIB}/h2*.jar \
org.h2.tools.RunScript \
-url jdbc:h2:${PERFINO_DATA_DIR}/db/perfino;MV_STORE=FALSE \
-checkResults \
-continueOnError
[2]: force h2 to recreate a pagestore db instead of the new storage engine (See this thread in metabase)
I found this article trying to repair a Confluence internal h2 database and this worked for me. Here's a shell script as a gist on my GitHub with what I did - you'll have to adjust for your environment.

What is the easiest way to get an HTTP response from command-line Dart?

I am writing a command-line script in Dart. What's the easiest way to access (and GET) an HTTP resource?
Use the http package for easy command-line access to HTTP resources. While the core dart:io library has the primitives for HTTP clients (see HttpClient), the http package makes it much easier to GET, POST, etc.
First, add http to your pubspec's dependencies:
name: sample_app
description: My sample app.
dependencies:
http: any
Install the package. Run this on the command line or via Dart Editor:
pub install
Import the package:
// inside your app
import 'package:http/http.dart' as http;
Make a GET request. The get() function returns a Future.
http.get('http://example.com/hugs').then((response) => print(response.body));
It's best practice to return the Future from the function that uses get():
Future getAndParse(String uri) {
return http.get('http://example.com/hugs')
.then((response) => JSON.parse(response.body));
}
Unfortunately, I couldn't find any formal docs. So I had to look through the code (which does have good comments): https://code.google.com/p/dart/source/browse/trunk/dart/pkg/http/lib/http.dart
this is the shortest code i could find
curl -sL -w "%{http_code} %{url_effective}\\n" "URL" -o /dev/null
Here, -s silences curl's progress output, -L follows all redirects as before, -w prints the report using a custom format, and -o redirects curl's HTML output to /dev/null.
Here are the other special variables available in case you want to customize the output some more:
url_effective
http_code
http_connect
time_total
time_namelookup
time_connect
time_pretransfer
time_redirect
time_starttransfer
size_download
size_upload
size_header
size_request
speed_download
speed_upload
content_type
num_connects
num_redirects
ftp_entry_path

Process only changed files

What:
With jenkins I want to process periodically only changed files from SVN and commit the output of the processing back to SVN.
Why:
We are committing binary files into SVN (we are working with Oracle Forms and are committing fmb-Files). I created a script which exports the fmb's to xml (with the original Fmb2XML tool from Oracle) and than I convert the XML to plain source which we also want to commit. This allows us greping, viewing the changes, ....
Problem:
At the moment I am only able to checkout everything, convert the whole directory and committing the whole directory back to SVN. But since all plain text files are newly generated they appear changed in SVN. I want to commit only the changed ones.
Can anyone help me with this?
I installed the Groovy plugin, configured the Groovy language and created a script which I execute as "system Groovy script". The scripts looks like:
import java.lang.ProcessBuilder.Redirect
import hudson.model.*
import hudson.util.*
import hudson.scm.*
import hudson.scm.SubversionChangeLogSet.LogEntry
// uncomment one of the following def build = ... lines
// work with current build
def build = Thread.currentThread()?.executable
// for testing, use last build or specific build number
//def item = hudson.model.Hudson.instance.getItem("Update_SRC_Branch")
//def build = item.getLastBuild()
//def build = item.getBuildByNumber(35)
// get ChangesSets with all changed items
def changeSet= build.getChangeSet()
List<LogEntry> items = changeSet.getItems()
def affectedFiles = items.collect { it.paths }
// get filtered file names (only fmb) without path
def fileNames = affectedFiles.flatten().findResults {
if (it.path.substring(it.path.lastIndexOf(".") + 1) != "fmb") return null
it.path.substring(it.path.lastIndexOf("/") + 1)
}.sort().unique()
// setup log files
def stdOutFile = "${build.rootDir}\\stdout.txt"
def stdErrFile = "${build.rootDir}\\stderr.txt"
// now execute the external transforming
fileNames.each {
def params = [...]
def processBuilder = new ProcessBuilder(params)
// redirect stdout and stderr to log files
processBuilder.redirectOutput(new File(stdOutFile))
processBuilder.redirectError(new File(stdErrFile))
def process = processBuilder.start()
process.waitFor()
// print log files
println new File(stdOutFile).readLines()
System.err.println new File(stdErrFile).readLines()
}
Afterwards I use command line with "svn commit" to commit the updated files.
Preliminary note: getting files from repo in SVN-jargon is "checkout", saving to repo - "commit". Don't mix CVS and SVN terms, it can lead to misinterpretation
In order to get list of changed files in revision (or revset) you can use
Easy way - svn log with options -q -v. For single revision you also add -c REVNO, for revision range: -r REVSTART:REVEND. Probably additional --xml will produce more suitable output, than plain-text
You must to post-process output of log in order to get pure list, because: log contain some useless for you data, in case of log for range you can have the same file included in more than one revision
z:\>svn log -q -v -r 1190 https://subversion.assembla.com/svn/customlocations-greylink/
------------------------------------------------------------------------
r1190 | lazybadger | 2012-09-20 13:19:45 +0600 (Чт, 20 сен 2012)
Changed paths:
M /trunk/Abrikos.ini
M /trunk/ER-Telecom.ini
M /trunk/GorNet.ini
M /trunk/KrosLine.ini
M /trunk/Rostelecom.ini
M /trunk/Vladlink.ini
------------------------------------------------------------------------
example of single revision: you have to log | grep trunk | sort -u, add repo-base to filenames
Harder way: with additional SCM (namely - Mercurial) and hgsubversion you'll get slightly more (maybe) log with hg log --template "{files}\n" - only slightly because you'll get only filelist, but filesets in different revisions are newline-separated, filenames inside revision are space-separated

Resources