Can't read Mahout generated sequence files with hadoop streaming - mahout

I am trying to stream a sequence file generated by one of the Mahout examples to see its contents:
hadoop jar hadoop-streaming-0.20.2-cdh3u0.jar \
-input /tmp/mahout-work-me/20news-bydate/bayes-test-input-output/ \
-output /tmp/me/mm \
-mapper "cat" \
-reducer "wc -l" \
-inputformat SequenceFileAsTextInputFormat
The job starts successfully and eventually dies with:
11/11/30 21:08:39 INFO streaming.StreamJob: map 0% reduce 0%
11/11/30 21:09:17 INFO streaming.StreamJob: map 100% reduce 100%
java.lang.RuntimeException: java.io.IOException: WritableName can't load class: org.apache.mahout.common.StringTuple
I wonder if something is wrong with my streaming jar file, if I I need to point explicitly to the Mahout jar that has this class (tried setting HADOOP_CLASSPATH to the location of mahout-core-0.5-cdh3u2.jar but did not work), or maybe even something else?
Any help is appreciated. Thanks.

Add this option:
-libjars mahout-core-0.5-cdh3u2.jar

Related

Is there a guide on porting edk2 to a new ARM64 platform?

I am new to EDK2.
For porting ekd2 firmware to a new ARM64 platform, it would be good to first get a minimum edk2 port which can run UEFI Shell at least, improvements can be added gradually based on that.
It seems that the first step is rather steep, e.g., how to determine a minimal set of "items" in .dsc and .fdf file for a platform? In my case, I would like to build the .fd for my platform and treat it as BL33 of TF-A, effectively I would like to build an edk2 firmware to replace u-boot.
It seems that such a guide is hard to find on the web. I found a old version of edk2 which contains some instructions, but apparently they are obsolete (not exist in latest master branch, while can be found in UDK branches such as UDK2014), and I am not sure why those documents are removed from master branch.
Currently I can build .fd for FVP (edk2-platforms/Platform/ARM/VExpressPkg/ArmVExpress-FVP-AArch64.dsc), and it seems that the build output FVP_AARCH64_EFI.fd is supposed to be treated as BL33. Theoretically this could be a prototype for my new ARM64 platform, but to me it's too complex to start with: the firmware is about 2.5MiB in size (as compare to 500K of u-boot), so I guess it's far from a "minimum" version. but it's hard to figure out what features to be removed (and how).
I am wondering if there is a detailed guide on such topic...
After 1 month of trial and error, today I managed to bring my ARM64 platform into a UEFI Shell environment. I treat it as my 1st milestone on the EDK2 journey. Below I will try to summarize the steps I took so far, as a tentative answer to my question above. Guidance/corrections/comments are welcomed.
Get familiar with UEFI/PI spec and EDK2 implementation by reading books/specs/articles. Well, UEFI/PI specs are thousands of pages long...how to start? My main reading list is:
"Beyond Bios--Developing with the Unified Extensible Firmware Interface", 3rd ed, by Vincent Zimmer, et al. As the authors explained, the book is a kind of high level summary of the thousands-paged specs. And I find that the book is well organized for a new comer to get familiar with various UEFI related concepts. The main purpose of the 1st read (before playing with edk2 code base) is to get familiar with concepts and architectural ideas, not the details yet. Related sections need to be consulted later when reading EDK2 implementations.
EDK2 specs, including:
EDKII User Manual
EDKII Build Specification
EDKII DSC/FDF/DEC/INF File Specification
Various articles on the web...
Get a reference platform which can correctly boot a FD image built from latest EDK2 source, and play with the boot manager and Shell environment a bit. In my case, I chose RPi4B. For me, this is very important, as the reference platform serves as a handrail during the whole process, that whenever I encounter bugs or have doubts, I check the source/log of the reference platform. This solves most of the problems I encountered. Btw, always generating "build log" and "build report" for both reference platform and the target platform, as the two files contains very detailed information for comparison and check. Consult the EDK2 build spec on how to generate these two files during build.
I use the following script to build for RPi4B platform:
#!/bin/bash
# https://github.com/tianocore/edk2-platforms#how-to-build-linux-environment
export WORKSPACE=/home/bruin/work/tianocore
export PACKAGES_PATH=$WORKSPACE/edk2:$WORKSPACE/edk2-platforms:$WORKSPACE/edk2-non-osi
pushd $WORKSPACE
rm -rf ./Build/RPi4
source edk2/edksetup.sh
echo "Building BaseTools..."
make -C edk2/BaseTools all
#sudo apt install acpica-tools # iasl
# pip install antlr4-python3-runtime # -Y EXECUTION_ORDER
echo "Building firmware for Pi4B..."
GCC5_AARCH64_PREFIX=aarch64-none-linux-gnu- build \
-n 4 \
-a AARCH64 \
-p Platform/RaspberryPi/RPi4/RPi4.dsc \
-t GCC5 \
-b NOOPT \
-v -d 9 -j RPi4-build.log \
-y RPi4-build-report.txt \
-Y PCD \
-Y LIBRARY \
-Y DEPEX \
-Y HASH \
-Y BUILD_FLAGS \
-Y FLASH \
-Y FIXED_ADDRESS \
-Y EXECUTION_ORDER \
all
How to use the build result RPI_EFI.fd on RPi4B, consult the following:
edk2-platforms/Platform/RaspberryPi/RPi4/Readme.md
readme.md inside https://github.com/pftf/RPi4/releases/download/v1.17/RPi4_UEFI_Firmware_v1.32.zip. btw, I need to replace the original start4.elf and fixup4.dat with the ones in the zip file, otherwise, the boot of RPi4 will fail, complaining something like below:
RpiFirmwareGetClockRate: Get Clock Rate return: ClockRate=0 ClockId=C
ASSERT [ArasanMMCHost] /home/bruin/work/tianocore/edk2-platforms/Platform/RaspberryPi/
Drivers/ArasanMmcHostDxe/ArasanMmcHostDxe.c(263): BaseFrequency != 0
It's worth to analysis the RPI_EFI.fd content to some extend, by using some UEFI utilities. I mainly use the GUI version UEFITool of sudo apt install uefitool uefitool-cli. Other tools are also available. The anotomy of RPI_EFI.fd is of help when reading EDK2 build specs for checking understanding of the concepts.
One special aspect of RPI_EFI.fd is that the 1st 128K is bl31.bin binary from ATF. I guess this is due to the special booting connfiguration methods for RPi. For my platform, I don't need such kind of packaging, I only need to build the UEFI image MY.fd, which is treated as BL33 image and packaged into fip.bin togehter with BL2 and BL31 images by ATF build script.
Another aspect to notice is the "reset vector" in the begining of the .fd file. This related to the entry point of UEFI image (and entry point of each EDK2 modules), as well as interpreting the BL instruction for AArch64. Basically, it can be summarized as below:
The first [Components] in RPI_EFI.fd is ArmPlatformPkg/PrePi/PeiUniCore.inf, which is of MODULE_TYPE = SEC.
What's this component: this is the first (and only) SEC (Security) module in RPi4. What the name PrePi and Pei implies?
... the PI spec is not tied to edk2 PEIMs, and I don't see where EDKII PEI modules are currently the only "acknowledged" silicon init environment. The edk2 tree itself seems to contain platforms that don't use the edk2 PEI module set at all, but (IIRC) jump from SEC to DXE. I believe "ArmPlatformPkg/PrePi" and "ArmVirtPkg/PrePi" are related to this.
--- https://listman.redhat.com/archives/edk2-devel-archive/2020-November/msg00021.html
Its entry point: all UEFI components have the same entry point (_ModuleEntryPoint).
By "component", it means either a UEFI driver and UEFI app, both are PE32 executables, usually with suffix .efi.
The .efis are converted from ELF executables (.dll) by GenFw tool: modifying the file headers.
To verify that "all components' entry point is _ModuleEntryPoint":
Check the .dll generating command line in build report (build -y <BUILD_REPORT_FILE>), we have two flags "aarch64-none-linux-gnu-gcc" -o xxx.dll -u _ModuleEntryPoint -Wl,-e,_ModuleEntryPoint ...:
-u: gcc --help -v|grep "undefined SYMBOL" gives -u SYMBOL --undefined SYMBOL: star with undefined reference to SYMBOL.
Wl,-e: ld --help|grep "entry" gives -e ADDRESS, --entry ADDRESS Set start address.
Check all .dll files that Entry point address == _ModuleEntryPoint: find . -type f -name "*.dll" -exec sh -c "readelf -a {} |grep -E 'Entry point address|_ModuleEntryPoint'" \;
Its entry point is the entry point of whole UEFI FD image (i.e., from bl33_base_addr jump to this _ModuleEntryPoint):
Topology of the UEFI Firmware File
A UEFI Firmware File (actually a UEFI Firmware Device - FD file) is a collection of UEFI binaries encapsulated into a single image. The format of this image is defined by the Platform Initialization Specification Volume 3. A Vector Table is located at the base of this file. A 'BL' branch instruction at the base of the firwmare (location of the Reset Entry into the Vector Table) will jump to the first 'SEC' module of the UEFI Firmware Image.
--- https://github.com/lzeng14/tianocore/wiki/ArmPkg-Debugging
To verify the statements above:
Disassember the reset vector (i.e., the 1st word) of generated .FD (we got offset=0x360):
$ xxd -l 4 -e TEST.fd <== dump 4 bytes in little endian
00000000: 140000d8 <== BL {PC}+(0xd8<<2); offset=0x360
Check the Entry point in .dll (we got offset=0x240):
$ aarch64-none-elf-objdump -t ArmPlatformPrePiUniCore.dll|grep _ModuleEntryPoint
0000000000000240 g F .text 0000000000000000 _ModuleEntryPoint
$ readelf -h ArmPlatformPrePiUniCore.dll|grep Entry
Entry point address: 0x240
Compare contents of two files at different offset (we got identicial content):
$ xxd -s 0x360 -l 64 TEST.fd <== skip 0x360 bytes, dump 64 bytes
00000360: 901e 0094 050a 0094 ea03 00aa a1cd 0a58 ...............X
00000370: 0200 e0d2 2200 c0f2 0240 a0f2 0200 80f2 ...."....#......
00000380: c303 a0d2 e3ff 9ff2 6304 00d1 6300 028b ........c...c...
00000390: 0400 a1d2 0400 80f2 2000 03eb 8400 0054 ........ ......T
$ xxd -s 0x240 -l 64 ArmPlatformPrePiUniCore.dll <== skip 0x240 bytes
00000240: 901e 0094 050a 0094 ea03 00aa a1cd 0a58 ...............X
00000250: 0200 e0d2 2200 c0f2 0240 a0f2 0200 80f2 ...."....#......
00000260: c303 a0d2 e3ff 9ff2 6304 00d1 6300 028b ........c...c...
00000270: 0400 a1d2 0400 80f2 2000 03eb 8400 0054 ........ ......T
Prepare an empty pkg, and make it build ok. The main purpuse is to do some exercise with EDK2 build system, and use the empty pkg as the start point for the new platform.
Make a copy of RaspberryPi.dec, change all gRaspberry to gMyPlatform.
Make a copy of RPi4.dsc and RPi4.fdf, and comment out all stuff in DSC and FDF file.
Replace all GUIDs in DSC/FDF/DEC files, generating new ones using online guid generator.
Note that PCD are declared in DEC files, and DEC files are refered by modules (INF files). As the empty package contains no module, no PCD definition will be available in FDF. So for a success build of the empty package, we need to comment out all PCD reference in FDF.
The NOOPT build command for MyPlatform is as below:
#!/bin/bash
export WORKSPACE=/home/bruin/work/tianocore
export PACKAGES_PATH=$WORKSPACE/edk2:$WORKSPACE/edk2-platforms:$WORKSPACE/edk2-non-osi
pushd $WORKSPACE
source edk2/edksetup.sh
echo "Building BaseTools..."
make -C edk2/BaseTools all
echo "Building UEFI firmware for MyPlatform..."
GCC5_AARCH64_PREFIX=aarch64-none-linux-gnu- build \
-n 4 \
-a AARCH64 \
-p Platform/MyCorp/MyPlatform/MyPlatform.dsc \
-t GCC5 \
-b NOOPT \
-v -d 9 -j MyPlatform-build.log \
-y MyPlatform-build-report.txt \
-Y EXECUTION_ORDER \
-Y PCD \
-Y LIBRARY \
-Y DEPEX \
-Y HASH \
-Y BUILD_FLAGS \
-Y FLASH \
-Y FIXED_ADDRESS \
all
popd
Add the 1st component ArmPlatformPrePiUniCore. This component is to prepare the HOBs for DXE phae. The main purpose is to get serial port working and memory config correct. Another purpose of this step is to familiar with steps for adding a component/module/lib. Below is a brief summary of the steps:
Uncomment the module's INF into both DSC ([Components] section), and FDF ([FV.FVMAIN_COMPACT]).
Rebuild the pkg, and resolve all Instance of library class [xxxLib] is not found errors reported, by updating [LibraryClasses] sections of DSC.
This step is a repeating process for dozens of times.
Some lib-class has multiple lib-instances, making sure choose the appropriate lib-instance (ref the build-report of RPi4).
if encounter ModuleEntryPoint.iiii:31: Error: immediate out of range: enable gArmTokenSpaceGuid.PcdFdBaseAddress and gArmTokenSpaceGuid.PcdFdSize in FDF.
if encounter undefined reference to _gPcd_BinaryPatch_PcdSerialClockRate: set PcdSerialClockRate in [PcdsPatchableInModule] section in DSC. FIXME: why? ref.
Check the PCDs listed in build log: inspect any abnormal PCD values, and supply correct values.
Customize platform-specific drivers or libraries.
SerialPortLib: locate the lib-class header file (MdePkg/Include/Library/SerialPortLib.h) by find edk2 -type f -name "*.dec" -exec grep -Hn SerialPortLib. The following functions are required:
SerialPortInitialize()
SerialPortWrite()
SerialPortRead()
SerialPortPoll()
SerialPortSetControl(): RETURN_UNSUPPORTED
SerialPortGetControl(): RETURN_UNSUPPORTED
SerialPortSetAttributes(): RETURN_UNSUPPORTED
ArmPlatformLib: interface header at Include/Library/ArmPlatformLib.h. The following functions are required:
ArmPlatformGetCorePosition(): return cpu idx in the cluster given the MPIDR value. this function is used in _ModuleEntryPoint for setting stack for secondary cores. Assuming one cluster for now.
ArmPlatformIsPrimaryCore()
ArmPlatformGetPrimaryCoreMpId()
ArmPlatformGetBootMode()
ArmPlatformPeiBootAction()
ArmPlatformInitialize()
ArmPlatformGetVirtualMemoryMap()
ArmPlatformGetPlatformPpiList()
etc...
Uncomment more modules in DSC/FDF, module by module...For driver/libs which are RPi platform specific, we can:
either search the edk2/edk2-platform for similiar driver or lib instances, or
copy the RPi4 implementation and comment out most of the content, make the pkg build success first, and then bug fixing.
Debugging: my current main debugging method is through adding "printf()", i.e., the edk2 macro DEBUG((DEBUG_INFO,)). One needs to set gEfiMdePkgTokenSpaceGuid.PcdDebugPrintErrorLevel to an appropriate value to see more debug info.

Fortify, how to start analysis through command

How we can generate FortiFy report using command ??? on linux.
In command, how we can include only some folders or files for analyzing and how we can give the location to store the report. etc.
Please help....
Thanks,
Karthik
1. Step#1 (clean cache)
you need to plan scan structure before starting:
scanid = 9999 (can be anything you like)
ProjectRoot = /local/proj/9999/
WorkingDirectory = /local/proj/9999/working
(this dir is huge, you need to "rm -rf ./working && mkdir ./working" before every scan, or byte code piles underneath this dir and consume your harddisk fast)
log = /local/proj/9999/working/sca.log
source='/local/proj/9999/source/src/**.*'
classpath='local/proj/9999/source/WEB-INF/lib/*.jar; /local/proj/9999/source/jars/**.*; /local/proj/9999/source/classes/**.*'
./sourceanalyzer -b 9999 -Dcom.fortify.sca.ProjectRoot=/local/proj/9999/ -Dcom.fortify.WorkingDirectory=/local/proj/9999/working -logfile /local/proj/working/9999/working/sca.log -clean
It is important to specify ProjectRoot, if not overwrite this system default, it will put under your /home/user.fortify
sca.log location is very important, if fortify does not find this file, it cannot find byte code to scan.
You can alter the ProjectRoot and Working Directory once for all if your are the only user: FORTIFY_HOME/Core/config/fortify_sca.properties).
In such case, your command line would be ./sourceanalyzer -b 9999 -clean
2. Step#2 (translate source code to byte code)
nohup ./sourceanalyzer -b 9999 -verbose -64 -Xmx8000M -Xss24M -XX:MaxPermSize=128M -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParallelGC -Dcom.fortify.sca.ProjectRoot=/local/proj/9999/ -Dcom.fortify.WorkingDirectory=/local/proj/9999/working -logfile /local/proj/9999/sca.log -source 1.5 -classpath '/local/proj/9999/source/WEB-INF/lib/*.jar:/local/proj/9999/source/jars/**/*.jar:/local/proj/9999/source/classes/**/*.class' -extdirs '/local/proj/9999/source/wars/*.war' '/local/proj/9999/source/src/**/*' &
always unix background job (&) in case your session to server is timeout, it will keep working.
cp : put all your known classpath here for fortify to resolve the functiodfn calls. If function not found, fortify will skip the source code translation, so this part will not be scanned later. You will get a poor scan quality but FPR looks good (low issue reported). It is important to have all dependency jars in place.
-extdir: put all directories/files you don't want to be scanned here.
the last section, files between ' ' are your source.
-64 is to use 64-bit java, if not specified, 32-bit will be used and the max heap should be <1.3 GB (-Xmx1200M is safe).
-XX: are the same meaning as in launch application server. only use these to control the class heap and garbage collection. This is to tweak performance.
-source is java version (1.5 to 1.8)
3. Step#3 (scan with rulepack, custom rules, filters, etc)
nohup ./sourceanalyzer -b 9999 -64 -Xmx8000M -Dcom.fortify.sca.ProjectRoot=/local/proj/9999 -Dcom.fortify.WorkingDirectory=/local/proj/9999/working -logfile /local/ssap/proj/9999/working/sca.log **-scan** -filter '/local/other/filter.txt' -rules '/local/other/custom/*.xml -f '/local/proj/9999.fpr' &
-filter: file name must be filter.txt, any ruleguid in this file will not be reported.
rules: this is the custom rule you wrote. the HP rulepack is in FORTIFY_HOME/Core/config/rules directory
-scan : keyword to tell fortify engine to scan existing scanid. You can skip step#2 and only do step#3 if you did notchange code, just want to play with different filter/custom rules
4. Step#4 Generate PDF from the FPR file (if required)
./ReportGenerator -format pdf -f '/local/proj/9999.pdf' -source '/local/proj/9999.fpr'

extract a file from xz file

I have a huge file file.tar.xz containing many smaller text files with a similar structure. I want to quickly examine a file out of the compressed file and have a glimpse of files content structure. I don't have information about names of the files within the compressed file. Is there anyway to extract a single file out given the above the above scenario?
Thank you.
EDIT: I don't want to tar -xvf file.tar.xz.
Based on the discussion in the comments, I tried the following which worked for me. It might not be the most optimal solution, the regex might need some improvement, but you'll get the idea.
I first created a demo archive:
cd /tmp
mkdir demo
for i in {1..100}; do echo $i > "demo/$i.txt"; done
cd demo && tar cfJ ../demo.tar.xz * && cd ..
demo.tar.xz now contains 100 txt files.
The following lists the contents of the archive, selects the first file and stores the path within the archive into the variable firstfile:
firstfile=`tar -tvf demo.tar.xz | grep -Po -m1 "(?<=:[0-9]{2} ).*$"`
echo $firstfile will output 1.txt.
You can now extract this single file from the archive:
tar xf demo.tar.xz $firstfile

What is the truly correct usage of -S parameter on weka classifier A1DE?

So I'm using weka 3.7.11 in a Windows machine (and runnings bash scripts with cygwin), and I found an inconsistency regarding the AODE classifier (which in this version of weka, comes from an add-on package).
Using Averaged N-Dependencies Estimators from the GUI, I get the following configuration (from an example that worked alright in the Weka Explorer):
weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" -W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
So I modified this to get the following command in my bash script:
java -Xmx60G -cp "C:\work\weka-3.7.jar;C:\Users\Oracle\wekafiles\packages\AnDE\AnDE.jar" weka.classifiers.meta.FilteredClassifier \
-t train_2.arff -T train_1.arff \
-classifications "weka.classifiers.evaluation.output.prediction.CSV -distribution -p 1 -file predictions_final_multi.csv -suppress" \
-threshold-file umbral_multi.csv \
-F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" \
-W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
But this gives me the error:
Weka exception: No value given for -S option.
Which is weird, since this was not a problem with the GUI. In the GUI, the Information box says that -S it's just a flag ("Subsumption Resolution can be achieved by using -S option"), so it shouldn't expect any number at all, which is consistent with what I got using the Explorer.
So then, what's the deal with the -S option when using the command line? Looking at the error text given by weka, I found this:
Options specific to classifier weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE:
-D
Output debugging information
-F <int>
Impose a frequency limit for superParents (default is 1)
-M <double>
Specify a weight to use with m-estimate (default is 1)
-S <int>
Specify a critical value for specialization-generalilzation SR (default is 100)
-W
Specify if to use weighted AODE
So it seems that this class works in two different ways, depending on which method I use (GUI vs. Command Line).
The solution I found, at least for the meantime, was to write -S 100 on my script. Is this really the same as just putting -S in the GUI?
Thanks in advance.
JM
I've had a play with this Classifier, and can confirm that what you are experiencing on your end is consistent with what I have here. From the GUI, the -S Option (subsumption Resolution) requires no parameters while the Command Prompt does (specialization-generalization SR).
They don't sound like the same parameter, so you may need to raise this issue with the developer of the third party package if you would like to know more information on these parameters. You can find this information from the Tools -> Package Manager -> AnDE, which will point you to the contacts for the library.

ffmpeg - Continuously stream webcam to single .jpg file (overwrite)

I have installed ffmpeg and mjpeg-streamer. The latter reads a .jpg file from /tmp/stream and outputs it via http onto a website, so I can stream whatever is in that folder through a web browser.
I wrote a bash script that continuously captures a frame from the webcam and puts it in /tmp/stream:
while true
do
ffmpeg -f video4linux2 -i /dev/v4l/by-id/usb-Microsoft_Microsoft_LifeCam_VX-5000-video-index0 -vframes 1 /tmp/stream/pic.jpg
done
This works great, but is very slow (~1 fps). In the hopes of speeding it up, I want to use a single ffmpeg command which continuously updates the .jpg at, let's say 10 fps. What I tried was the following:
ffmpeg -f video4linux2 -r 10 -i /dev/v4l/by-id/usb-Microsoft_Microsoft_LifeCam_VX-5000-video-index0 /tmp/stream/pic.jpg
However this - understandably - results in the error message:
[image2 # 0x1f6c0c0] Could not get frame filename number 2 from pattern '/tmp/stream/pic.jpg'
av_interleaved_write_frame(): Input/output error
...because the output pattern is bad for a continuous stream of images.
Is it possible to stream to just one jpg with ffmpeg?
Thanks...
You can use the -update option:
ffmpeg -y -f v4l2 -i /dev/video0 -update 1 -r 1 output.jpg
From the image2 file muxer documentation:
-update number
If number is nonzero, the filename will always be interpreted as just a
filename, not a pattern, and this file will be continuously overwritten
with new images.
It is possible to achieve what I wanted by using:
./mjpg_streamer -i "input_uvc.so -r 1280×1024 -d /dev/video0 -y" -o "output_http.so -p 8080 -w ./www"
...from within the mjpg_streamer's directory. It will do all the nasty work for you by displaying the stream in the browser when using the address:
http://{IP-OF-THE-SERVER}:8080/
It's also light-weight enough to run on a Raspberry Pi.
Here is a good tutorial for setting it up.
Thanks for the help!

Resources