How to write my own parser in the Apache Nutch 1.6 binary distribution to extract HTML pages? - html-parsing

I am trying to write my own custom parser in Apache Nutch 1.6 binary distribution, but I can't find any help anywhere.

Related

Reverse engineering an ELF binary containing Lua bytecode

I have to reverse engineer an ELF binary that also contains Lua bytecode. What would be the best approach for extracting the Lua bytecode so that I can decompile it with luadec or similar tools?
So far I have loaded the binary in Ghidra and mostly understood its functionality and how the Lua code is loaded, but I am not very experienced with this kind of work.
The binary uses luaL_readbuffer() to load the scripts, which seem to be embedded in the binary as variables.
Thanks!
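One possible starting point, as a minimal sketch rather than a definitive method: precompiled Lua chunks begin with the four bytes ESC "Lua" (0x1B 0x4C 0x75 0x61), followed by a version byte (0x51 for Lua 5.1, 0x52 for 5.2, 0x53 for 5.3), so scanning the ELF file for that signature can locate candidate chunks. This assumes the scripts are stored as unencrypted, uncompressed bytecode; if they are stored as plain Lua source text, the signature will not appear. The function names below are hypothetical.

```python
from typing import List, Tuple

# Header of a precompiled Lua chunk: ESC followed by "Lua".
LUA_SIGNATURE = b"\x1bLua"


def find_lua_chunks(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, version_byte) for each Lua bytecode signature in data."""
    hits = []
    pos = data.find(LUA_SIGNATURE)
    while pos != -1:
        version_index = pos + len(LUA_SIGNATURE)
        # Skip a signature that sits at the very end with no version byte.
        if version_index < len(data):
            hits.append((pos, data[version_index]))
        pos = data.find(LUA_SIGNATURE, pos + 1)
    return hits


def scan_file(path: str) -> List[Tuple[int, int]]:
    """Read a binary file (hypothetical path) and list candidate chunks."""
    with open(path, "rb") as f:
        return find_lua_chunks(f.read())
```

The chunk header does not record the total chunk length, so the scan only gives start offsets. If the loader takes a buffer pointer and a size, as the standard luaL_loadbuffer does, the arguments visible at its call sites in Ghidra should give the exact bounds to carve out for luadec.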

DocBook Saxon toolchain extension JAR file: cannot find it

To set up the DocBook toolchain with Saxon, I understand we need a JAR file with some extensions on our classpath.
I can't find it on the internet.
Here is the relevant passage from Bob Stayton's wonderful book "DocBook XSL: The Complete Guide":
The DocBook stylesheets have some custom extension functions written specifically for the Saxon processor. These functions are contained in a saxon653.jar file that is included with the DocBook distribution in its extensions subdirectory. There may be several saxon jar files there, labeled by the version number of Saxon. Use the one closest to your Saxon version number. See the section “DocBook Saxon and Xalan extensions” for a more complete description of the DocBook Saxon extensions.
I had all this set up on the Computer Science server at the university where I teach. Unfortunately, that server was lost, and I am trying to recreate the toolchain. I use DocBook to create the class notes for two of my courses, and I need this to set up my classes for the Spring 2017 semester.
I located the file in the SourceForge DocBook project, under Files, in docbook-xsl-saxon-1.00, NOT in the docbook-xsl-doc-1.79.1.zip that the web page points to.

Parse Apache Thrift file

I have a huge Apache Thrift file that I need to parse so I can store the information my application needs.
I could do this manually, reading line by line.
But that is error-prone. Is there an API or similar that I can use to parse the file quickly and efficiently?
If not, any other suggestions?
Facebook's Swift tool has a Thrift IDL parser implemented in Java if that fits into your project: https://github.com/facebook/swift/tree/master/swift-idl-parser. If your application is .NET, you might still be able to use this library if you can translate the parser JAR using IKVM.NET. There is an ANTLR grammar in there someplace too if you want to develop your own parser.
Alternatively, I noticed that the thrift trunk now has a JSON generator that outputs the IDL as a JSON data structure, which should be easy enough to parse in any language. You'll probably need to compile from source in order to use that generator, but Thrift picks up new features so fast that you might want to do that anyways if you are not already.
The thrift CLI can help: generate JSON from the Thrift file, then parse the JSON to get the structure of the Thrift file:
thrift --gen json example.thrift
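Once the JSON has been generated, reading it back is straightforward. The following is a minimal sketch, assuming the generated document has a top-level "structs" list whose entries carry a "name" and a "fields" list; those key names and the helper names are assumptions here, so check them against the file your Thrift version actually produces.

```python
import json
from typing import Dict, List


def struct_fields(idl: dict) -> Dict[str, List[str]]:
    """Map each struct name to its field names, tolerating missing keys.

    The "structs", "fields" and "name" keys are assumed, not guaranteed.
    """
    result = {}
    for struct in idl.get("structs", []):
        fields = struct.get("fields", [])
        result[struct.get("name", "")] = [f.get("name", "") for f in fields]
    return result


def load_idl(path: str) -> dict:
    """Load the generator's JSON output from a (hypothetical) path."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling struct_fields(load_idl(path)) on the generated file would then give a dictionary to feed into the application's own storage, without any hand-written line-by-line parsing of the IDL.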

Xtext parser usage during runtime configuration

I want to use the runtime configuration for running an Xtext parser. In an example Xtext project I get both the standalone and the runtime configuration for using the parser.
Can somebody please indicate the steps needed to use the parser at runtime in another Eclipse plug-in project? I have no experience with the plugin.xml file, and I know I need to create some extension points there.
The Xtext sample project also contains a UI project that uses the generated parser at runtime. Still, I was not able to understand which parts of that configuration I need and which I do not.
Help is highly appreciated.

Any tools for clojure to parse java source code? [duplicate]

I'm trying to analyze Java source files with Clojure, but I couldn't find a way to do that.
First, I thought of using the Eclipse AST plugin (by copying the necessary JARs to my Clojure project), but I gave up after seeing the Eclipse AST API (a visitor-based walker).
Then I tried creating a Java parser with ANTLR. I can only find one Java 1.6 grammar for ANTLR ( http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g ), and it doesn't compile with the latest ANTLR (here are the errors I'm getting).
Now I have no idea how to do that. At worst I'll try to go with Eclipse AST.
Does anyone know a better way to parse Java files with Clojure?
Thanks.
Edit: To clarify my point:
I need to find some specific method calls in Java projects and inspect their parameters (we have multiple definitions of the method, with different parameter types). Right now I have a simple solution written in Java (Eclipse AST), but I want to use Clojure in this project as much as possible.
... and it doesn't compile with latest ANTLR ...
I could not reproduce that.
Using ANTLR v3.2, I got some warnings, but no errors. Using both ANTLR v3.3 and v3.4 (latest version), I have no problems generating a parser.
You didn't mention how you're (trying) to generate a lexer/parser, but here's how it works for me:
java -cp antlr-3.4.jar org.antlr.Tool Java.g
EDIT 1
Here's my output when running the commands:
ls
wget http://www.antlr.org/download/antlr-3.4-complete.jar
wget http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
java -cp antlr-3.4-complete.jar org.antlr.Tool Java.g
ls
As you can see, the .java files of the lexer and parser are properly created.
EDIT 2
Instead of generating a parser yourself (from a grammar), you could use an existing parser like this one (Java 1.5 only AFAIK) and call it from your Clojure code.
It depends a bit on what you want to do - what are you hoping to get from the analysis?
If you want to actually compile Java or at least build an AST, then you probably need to go the ANTLR or Eclipse AST route. Java isn't that bad a language to parse, but you still probably don't want to reinvent too many wheels, so you might as well build on the Eclipse and OpenJDK work.
If, however, you are just interested in parsing the basic syntax and analysing certain features, it might be easier to use a simpler general-purpose parser combinator library. Options to explore:
fnparse (Clojure, not sure how well maintained)
jparsec (Java, but can probably be used quite easily from Clojure)
