Run cvb in Mahout 0.8

The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) implementation for topic modeling and has removed the Latent Dirichlet Allocation (lda) approach, because cvb can be parallelized much better. Unfortunately, there is only documentation for lda on how to run an example and generate meaningful output.
Thus, I want to:
preprocess some texts correctly
run the cvb0_local version of cvb
inspect the results by looking at the top n words in each of the generated topics

So here is the sequence of Mahout commands I had to call in a Linux shell to do it.
$MAHOUT_HOME points to my mahout/bin folder.
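For completeness, a minimal setup sketch (the install path is a placeholder for wherever Mahout lives on your machine):
export MAHOUT_HOME=/path/to/mahout/bin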
$MAHOUT_HOME/mahout seqdirectory \
-i path/to/directory/with/texts \
-o out/sequenced
$MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
-o out/sparseVectors \
--namedVector \
-wt tf
$MAHOUT_HOME/mahout rowid \
-i out/sparseVectors/tf-vectors/ \
-o out/matrix
$MAHOUT_HOME/mahout cvb0_local \
-i out/matrix/matrix \
-d out/sparseVectors/dictionary.file-0 \
-a 0.5 \
-top 4 -do out/cvb/do_out \
-to out/cvb/to_out
Inspect the output by showing the top 10 words of each topic:
$MAHOUT_HOME/mahout vectordump \
-i out/cvb/to_out \
--dictionary out/sparseVectors/dictionary.file-0 \
--dictionaryType sequencefile \
--vectorSize 10 \
-sort out/cvb/to_out

Thanks to JoKnopp for the detailed commands.
If you get:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
you need to add the command line option "maxIterations":
--maxIterations (-m) maxIterations
I use -m 20 and it works.
refer to:
https://issues.apache.org/jira/browse/MAHOUT-1141
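For reference, a sketch of the complete cvb0_local call with the iteration count added (same paths and parameters as above):
$MAHOUT_HOME/mahout cvb0_local \
-i out/matrix/matrix \
-d out/sparseVectors/dictionary.file-0 \
-a 0.5 \
-top 4 \
-m 20 \
-do out/cvb/do_out \
-to out/cvb/to_out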

Related

Exporting encrypted SNMPv3 traps to JSON with TShark

I have a pcap file with recordings of encrypted SNMPv3 traps from Wireshark (version 3.2.2). For analyzing the traps, I want to export the protocol data to JSON using tshark.
tshark.exe -T ek -Y "snmp" -P -V -x -r input.pcap > output.json
Currently, I supply the information needed to decrypt the packets via the "snmp_users" file in C:\Users\developer\AppData\Roaming\Wireshark.
# This file is automatically generated, DO NOT MODIFY.
,"snmp_user","SHA1","xxxxxx","AES","yyyyyyy"
Is it possible to supply the options via commandline?
I have tried:
tshark.exe -T ek -Y "snmp" -P -V -x -o "snmp.users_table.username:snmp_user" ...
But that causes an error:
tshark: -o flag "snmp.users_table.username:snmp_user" specifies unknown preference
Update 16.09.2020:
Option -Y is used instead of -J:
-Y|--display-filter
Cause the specified filter (which uses the syntax of read/display
filters, rather than that of capture filters) to be applied before
printing a decoded form of packets or writing packets to a file.
You need to specify the option as a User Access Table or uat, with the specific table being the name of the file, namely snmp_users. So, for example:
On Windows:
tshark.exe -o "uat:snmp_users:\"\",\"snmp_user\",\"SHA1\",\"xxxxxx\",\"AES\",\"yyyyyyy\"" -T ek -J "snmp" -P -V -x -r input.pcap > output.json
And on *nix:
tshark -o 'uat:snmp_users:"","snmp_user","SHA1","xxxxxx","AES","yyyyyyy"' -T ek -J "snmp" -P -V -x -r input.pcap > output.json
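To keep the quoting manageable, here is a small wrapper sketch (bash assumed; the user name and passwords are the same placeholders as in the examples above):
#!/bin/bash
# Build the uat option from variables; all credential values are placeholders.
SNMP_USER="snmp_user"
AUTH_PW="xxxxxx"
PRIV_PW="yyyyyyy"
tshark \
-o "uat:snmp_users:\"\",\"$SNMP_USER\",\"SHA1\",\"$AUTH_PW\",\"AES\",\"$PRIV_PW\"" \
-T ek -J "snmp" -P -V -x -r input.pcap > output.json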
Unfortunately, the Wireshark documentation is currently lacking in its description of the uat option. There is a Google Summer of Code project underway, though, in which Wireshark is participating, so perhaps the documentation will be improved here.

Use xmlstarlet to convert XML containing repeated and missing fields to tab delimited

I have a large complex XML file containing a pattern like the following
<?xml version="1.0" encoding="UTF-8"?>
<records>
  <record>
    <field1>field1</field1>
    <field2>field2</field2>
    <field2>field2</field2>
    <field3>field3</field3>
    <field4>field4</field4>
    <field4>field4</field4>
  </record>
  <record>
    <field1>field1</field1>
    <field1>field1</field1>
    <field3>field3</field3>
    <field4>field4</field4>
    <field4>field4</field4>
  </record>
</records>
I would like to use xmlstarlet to convert it to tab delimited with repeated fields subdelimited with a semicolon, e.g.
field1\tfield2;field2\tfield3\tfield4;field4
field1;field1\t\tfield3\tfield4;field4
I can do what I need by collapsing repeated fields with a string processing routine before feeding the file to xmlstarlet but that feels hacky. Is there an elegant way to do it all in xmlstarlet?
It's been a while since you asked. Nevertheless...
Using xmlstarlet version 1.6.1 to extract and sort field names
to determine field order, and then produce tab-separated values:
xmlstarlet sel \
-N set="http://exslt.org/sets" -N str="http://exslt.org/strings" \
-T -t \
--var fldsep="'$(printf "\t")'" \
--var subsep="';'" \
--var allfld \
-m '*/record/*' \
-v 'name()' --nl \
-b \
-b \
--var uniqfld='set:distinct(str:split(normalize-space($allfld)))' \
-m '*/record' \
--var rec='.' \
-m '$uniqfld' \
--sort 'A:T:-' '.' \
-v 'substring($fldsep,1,position()!=1)' \
-m '$rec/*[local-name() = current()]' \
-v 'substring($subsep,1,position()!=1)' \
-v '.' \
-b \
-b \
--nl < file.xml
EDIT: --sort moved from $allfld to -m $uniqfld.
where:
all field names in input are collected in the $allfld variable
the exslt functions set:distinct
and str:split
are used to create a nodeset of unique field names from $allfld
the $uniqfld nodeset determines field output order
repeated fields are output in document order here but -m '$rec/*[…]'
accepts a --sort clause
the substring expression emits a separator when position() != 1
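As a standalone toy demonstration of that separator idiom (not part of the solution above): the boolean position()!=1 converts to 0 for the first matched node and to 1 afterwards, so substring returns the empty string once and the separator thereafter:
echo '<r><x>a</x><x>b</x><x>c</x></r>' |
xmlstarlet sel -T -t -m '/r/x' -v 'substring(";",1,position()!=1)' -v '.' -b --nl -
which prints a;b;c.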
Given the following input, which is different from OP's,
<?xml version="1.0" encoding="UTF-8"?>
<records>
  <record>
    <field2>r1-f2A</field2>
    <field2>r1-f2B</field2>
    <field3>r1-f3</field3>
    <field4>r1-f4A</field4>
    <field4>r1-f4B</field4>
    <field6>r1-f6</field6>
  </record>
  <record>
    <field1>r2-f1A</field1>
    <field1>r2-f1B</field1>
    <field3/>
    <field4>r2-f4A</field4>
    <field4>r2-f4B</field4>
    <field5>r2-f5</field5>
    <field5>r2-f5</field5>
  </record>
  <record>
    <field6>r3-f6</field6>
    <field4>r3-f4</field4>
    <field2>r3-f2B</field2>
    <field2>r3-f2A</field2>
  </record>
</records>
output becomes:
r1-f2A;r1-f2B r1-f3 r1-f4A;r1-f4B r1-f6
r2-f1A;r2-f1B r2-f4A;r2-f4B r2-f5;r2-f5
r3-f2B;r3-f2A r3-f4 r3-f6
or, piped through sed -n l to show non-printables,
\tr1-f2A;r1-f2B\tr1-f3\tr1-f4A;r1-f4B\t\tr1-f6$
r2-f1A;r2-f1B\t\t\tr2-f4A;r2-f4B\tr2-f5;r2-f5\t$
\tr3-f2B;r3-f2A\t\tr3-f4\t\tr3-f6$
Using a custom field output order, things can be reduced to:
xmlstarlet sel -T -t \
--var fldsep="'$(printf "\t")'" \
--var subsep="';'" \
-m '*/record' \
--var rec='.' \
-m 'str:split("field6 field4 field2 field1 field3 field5")' \
-v 'substring($fldsep,1,position()!=1)' \
-m '$rec/*[local-name() = current()]' \
-v 'substring($subsep,1,position()!=1)' \
-v '.' \
-b \
-b \
--nl < file.xml
again emitting repeated fields in document order in the absence of --sort.
Note that using an EXSLT function in a --var clause requires the
corresponding namespace to be declared explicitly with -N to avoid
a runtime error (why?).
To list the generated XSLT 1.0 / EXSLT code, add -C before the -t option.
To make one-liners of the formatted xmlstarlet commands above (stripping line continuation chars and leading whitespace), pipe them through this sed command:
sed -e ':1' -e 's/^[[:blank:]]*//;/\\$/!b;$b;N;s/\\\n[[:blank:]]*//;b1'
To list the elements present in the input file, xmlstarlet's el command can be useful (-u for unique):
xmlstarlet el -u file.xml
Output:
records
records/record
records/record/field1
records/record/field2
records/record/field3
records/record/field4
records/record/field5
records/record/field6
You can use the following xmlstarlet command:
xmlstarlet sel -t -m "/records/record" -m "*[starts-with(local-name(),field)]" -i "position()=1" -v "." --else -i "local-name() = local-name(preceding-sibling::*[1])" -v "concat(';',.)" --else -v "concat('\t',.)" -b -b -b -n input.xml
In pseudo-code, it represents something like this
For-each /records/record
  For-each element whose name starts with "field"
    If it's the first element, output the item
    Else
      If the current element name equals the preceding one
        output ;Item
      Else
        output \tItem
-b means "break" out of the enclosing loop or if clause
-n outputs a newline
Its output is
field1\tfield2;field2\tfield3\tfield4;field4
field1;field1\tfield3\tfield4;field4
The above code differentiates on the basis of element names. If you want to differentiate based on element values instead, use the following command:
xmlstarlet sel -t -m "/records/record" -m "*[starts-with(local-name(),field)]" -i "position()=1" -v "." --else -i ". = preceding-sibling::*[1]" -v "concat(';',.)" --else -v "concat('\t',.)" -b -b -b -n input.xml
With your example XML the output is the same (because field names and field values are identical).

Lex compilation error of iOS example of Open CASCADE library in Xcode

I'm trying to build an iOS example of the Open CASCADE library on macOS. Xcode versions used: 10.2, 10.3, 11.1. Right now I'm getting the following types of errors:
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
(the previous line repeats eight times in the log)
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:62: name defined twice
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:63: bad character: {
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:65: bad character: }
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:66: premature EOF
flex: error deleting output file ../project.build/DerivedSources/BRepFeat_MakeCylindricalHole.yy.cxx
Command ../XcodeDefault.xctoolchain/usr/bin/lex failed with exit code 1
Possible reasons, in my opinion:
1) I don't have all of the files in the project (I've checked, so this shouldn't be the reason)
2) Xcode doesn't treat .lxx files properly.
Within OCCT file name conventions, .lxx is the extension for inline C++ header files, which are included by co-named .hxx header files. The BRepFeat package has no .yacc/.lex files, so BRepFeat_MakeCylindricalHole.yy.cxx should not exist at all.
It looks like the issue lies somewhere within the build routine (the CMake or Tcl script) that generates the Xcode project / Makefile. It is unclear from the question whether the issue happens while building OCCT itself (and which steps have been taken) or while building the iOS sample (is it the one coming with OCCT, or written from scratch?).
The CMake build for OCCT can be configured via the following cross-compilation toolchain and pseudo bash script:
https://github.com/leetal/ios-cmake
export IPHONEOS_DEPLOYMENT_TARGET=8.0
aFreeType=$HOME/freetype-2.7.1-ios
cmake -G "Unix Makefiles" \
-D CMAKE_TOOLCHAIN_FILE:FILEPATH="$HOME/ios-cmake.git/ios.toolchain.cmake" \
-D PLATFORM:STRING="OS64" \
-D ARCHS:STRING="arm64" \
-D IOS_DEPLOYMENT_TARGET:STRING="$IPHONEOS_DEPLOYMENT_TARGET" \
-D ENABLE_VISIBILITY:BOOL="TRUE" \
-D CMAKE_C_USE_RESPONSE_FILE_FOR_OBJECTS:BOOL="OFF" \
-D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS:BOOL="OFF" \
-D CMAKE_BUILD_TYPE:STRING="Release" \
-D BUILD_LIBRARY_TYPE:STRING="Static" \
-D INSTALL_DIR:PATH="work/occt-ios-install" \
-D INSTALL_DIR_INCLUDE:STRING="inc" \
-D INSTALL_DIR_LIB:STRING="lib" \
-D INSTALL_DIR_RESOURCE:STRING="src" \
-D INSTALL_NAME_DIR:STRING="#executable_path/../Frameworks" \
-D 3RDPARTY_FREETYPE_DIR:PATH="$aFreeType" \
-D 3RDPARTY_FREETYPE_INCLUDE_DIR_freetype2:FILEPATH="$aFreeType/include" \
-D 3RDPARTY_FREETYPE_INCLUDE_DIR_ft2build:FILEPATH="$aFreeType/include" \
-D 3RDPARTY_FREETYPE_LIBRARY_DIR:PATH="$aFreeType/lib" \
-D USE_FREEIMAGE:BOOL="OFF" \
-D BUILD_MODULE_Draw:BOOL="OFF" \
"/Path/To/OcctSourceCode"
aNbJobs="$(getconf _NPROCESSORS_ONLN)"
make -j $aNbJobs
make install

Build fails due to timeout

I have a project, written in Rust, that is a wrapper for the OpenCV library.
In order to be able to test it I have to build OpenCV itself. I cache the result, but a cold build takes more than 50 minutes and the job gets killed.
How could this timeout be increased? For example, I have a 50 min per-job timeout, but I'd like to have 500 minutes per 10 jobs, so I could run my first cold-start build for, say, 90 minutes and then run fast 10-minute builds after that.
I don't know if it's possible, so I'm looking for any workaround. Here is the script that takes most of the time:
#!/bin/bash
set -eux -o pipefail
OPENCV_VERSION=${OPENCV_VERSION:-3.4.0}
URL=https://github.com/opencv/opencv/archive/${OPENCV_VERSION}.zip
URL_CONTRIB=https://github.com/opencv/opencv_contrib/archive/${OPENCV_VERSION}.zip
INSTALL_DIR="$HOME/usr/installed-${OPENCV_VERSION}"
if [[ ! -e "$INSTALL_DIR" ]]; then
TMP=$(mktemp -d)
OPENCV_DIR="$(pwd)/opencv-${OPENCV_VERSION}"
OPENCV_CONTRIB_DIR="$(pwd)/opencv_contrib-${OPENCV_VERSION}"
if [[ ! -d "${OPENCV_DIR}/build" ]]; then
curl -sL ${URL} > ${TMP}/opencv.zip
unzip -q ${TMP}/opencv.zip
rm ${TMP}/opencv.zip
curl -sL ${URL_CONTRIB} > ${TMP}/opencv_contrib.zip
unzip -q ${TMP}/opencv_contrib.zip
rm ${TMP}/opencv_contrib.zip
mkdir $OPENCV_DIR/build
fi
pushd $OPENCV_DIR/build
cmake \
-D WITH_CUDA=ON \
-D BUILD_EXAMPLES=OFF \
-D BUILD_TESTS=OFF \
-D BUILD_PERF_TESTS=OFF \
-D BUILD_opencv_java=OFF \
-D BUILD_opencv_python=OFF \
-D BUILD_opencv_python2=OFF \
-D BUILD_opencv_python3=OFF \
-D CMAKE_INSTALL_PREFIX=$HOME/usr \
-D CMAKE_BUILD_TYPE=Release \
-D OPENCV_EXTRA_MODULES_PATH=$OPENCV_CONTRIB_DIR/modules \
-D CUDA_ARCH_BIN=5.2 \
-D CUDA_ARCH_PTX="" \
..
make -j4
make install && touch "$INSTALL_DIR"
popd
touch $HOME/fresh-cache
fi
sudo cp -r $HOME/usr/include/* /usr/local/include/
sudo cp -r $HOME/usr/lib/* /usr/local/lib/
How could this timeout be increased?
According to the Travis docs it's not possible: the timeout is fixed at 50 min on travis-ci.org and 120 min on travis-ci.com.
You could consider upgrading your Travis plan. The real problem, though, is not the timeout but the necessity of building a huge library before each build. Even though caching improves the situation a bit, it's still bad.
There are some ways to reduce the build time (per build); what fits best for you depends on your situation, of course.
A. PPA
If you are lucky and there's a PPA shipping a version of OpenCV, you can use that one. Travis runs Ubuntu 14.04 Trusty.
B. Pre-built binaries
You can always build OpenCV yourself and upload pre-built binaries to e.g. a server or a different Git repo. Travis can then download and install them.
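A sketch of that approach, reusing the install-copy steps from the script above (the download URL is a placeholder for wherever you host the archive):
# Fetch a pre-built OpenCV archive and unpack it into $HOME/usr (URL is hypothetical)
curl -sL https://example.com/opencv-3.4.0-prebuilt.tar.gz | tar -xz -C "$HOME/usr"
sudo cp -r $HOME/usr/include/* /usr/local/include/
sudo cp -r $HOME/usr/lib/* /usr/local/lib/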
C. Docker
Docker is imo the best approach to this. Either create a custom Docker image or use existing ones (there are enough around). Good places to start looking are Docker Hub and GitHub. In addition, this way lets you pack any further dependencies, compilers, and so on: simply everything you need.
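A minimal sketch of the Docker route (the image name is a placeholder for any image shipping OpenCV plus the Rust toolchain):
# Run the test suite inside a container that already has OpenCV installed
docker pull someuser/rust-opencv:3.4.0
docker run --rm -v "$PWD":/src -w /src someuser/rust-opencv:3.4.0 cargo test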
D. Contact Travis
You can always open an issue with Travis and ask for an updated version of OpenCV.

Creates package but no export

My job completes with no errors. The logs show "accuracy", "auc", and other statistical measures of my model. ML Engine creates a package subdirectory, and a tar under that, as expected. But there's no export directory, checkpoint, eval, graph, or any other artifact that I'm accustomed to seeing when I train locally. Am I missing something simple in the command I'm using to call the service?
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.0 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
-- \
--model_type wide \
--train_data $TRAIN_DATA \
--test_data $TEST_DATA \
--train_steps 1000 \
--verbose-logging true
The logs show this: model directory = /tmp/tmpS7Z2bq
But I was expecting my model to go to the GCS bucket I defined in $OUTPUT_PATH.
I'm following the steps under "Run a single-instance trainer in the cloud" from the getting started docs.
Maybe you could show where and how you declare $OUTPUT_PATH?
Also, the model directory might be a directory within $OUTPUT_PATH where you can find the model from that specific job.
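For reference, a sketch of how the getting-started guide declares these variables before submitting the job (bucket and job names are placeholders):
# Placeholders; substitute your own bucket, job name, and region
BUCKET_NAME=your-bucket-name
JOB_NAME=my_training_job_1
OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
REGION=us-central1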
