Tika configuration for ZIP parser - apache-tika

Is it possible to tell Tika or the parser that a ZIP can only contain files with a certain MimeType or file extension?
What iam currently use is the recursive parser to get all the information for every file.
final ParseContext context = new ParseContext();
final ContentHandlerFactory contentHandlerFactory = new BasicContentHandlerFactory( BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1 );
final RecursiveParserWrapperHandler recursiveParserWrapperHandler = new RecursiveParserWrapperHandler( contentHandlerFactory );
final RecursiveParserWrapper parser = new RecursiveParserWrapper( autoDetectParser );
context.set( Parser.class, parser );
parser.parse( tikaInputStream, recursiveParserWrapperHandler, metadata, context );
I am looking for a solution that the zip can only contain one file type and cannot contain any other zip/container. Currently I'm doing this by hand, but maybe there's a better solution. Especially with regard to zip bombing, another solution makes more sense.
final String contentType = metadata1.get( Metadata.CONTENT_TYPE );
final MediaType mediaType = MediaType.parse( contentType );
final MediaType expectedMediaType = MediaType.text( "turtle" );
final String depth = metadata1.get( TikaCoreProperties.EMBEDDED_DEPTH );
if ( MediaType.APPLICATION_ZIP.equals( mediaType ) ) {
if ( Integer.parseInt( depth ) > 0 ) {
throw new RuntimeException( "Not allowed depth path" );
}
return;
}
if ( !expectedMediaType.equals( mediaType ) ) {
throw new RuntimeException( "Not allowed media type" );
}

I added a own RecursiveParserWrapperHandler. Here is an example when the maximum embedded count is reached an exception is thrown.
public class ZipHandler extends RecursiveParserWrapperHandler {
private static final int MAX_EMBEDDED_ENTRIES = 10000;
public ZipHandler( final ContentHandlerFactory contentHandlerFactory ) {
super( contentHandlerFactory, MAX_EMBEDDED_ENTRIES );
}
#Override
public void endDocument( final ContentHandler contentHandler, final Metadata metadata ) throws SAXException {
if ( hasHitMaximumEmbeddedResources() ) {
throw new SAXException( "Max embedded entries reached: " + MAX_EMBEDDED_ENTRIES );
}
super.endDocument( contentHandler, metadata );
}
}

Related

Error executing: Java.Lang.Class[] parameterType = new Java.Lang.Class[] { Java.Lang.Class.FromType(typeof(sbyte[])) }; in Xamarin.Android

I'm trying to compile an android vector image in Xamarin Android based on a Java Sample but The system shows the following error:
{System.ArgumentException: type Parameter name: Type is not derived from a java type. at Java.Lang.Class.FromType (System.Type type) [0x00012] in the following part of the code:
Java.Lang.Class[] parameterType = new Java.Lang.Class[] { Java.Lang.Class.FromType(typeof(sbyte[])) };
Basically
private Drawable GetVectorDrawable(Context context, byte[] binXml)
{
try
{
// Get the binary XML parser (XmlBlock.Parser) and use it to create the drawable
// This is the equivalent of what AssetManager#getXml() does
// get the Class instance using forName method
var xmlBlock = Java.Lang.Class.ForName("android.content.res.XmlBlock");
System.Diagnostics.Debug.WriteLine(typeof(sbyte[]));
Java.Lang.Class[] parameterType = new Java.Lang.Class[] { Java.Lang.Class.FromType(typeof(sbyte[])) };
var xmlBlockConstr = xmlBlock.GetConstructor(parameterType);
//var xmlBlockConstr = xmlBlock.GetConstructor(Class.FromType(typeof(sbyte[])));
var xmlParserNew = xmlBlock.GetDeclaredMethod("newParser");
xmlBlockConstr.Accessible = true;
xmlParserNew.Accessible = true;
var parser = xmlParserNew.Invoke(xmlBlockConstr.NewInstance(binXml)) as XmlPullParser;
//System.Xml.XmlReader reader = parser.ToString();
//if (Build.VERSION.SdkInt >= (BuildVersionCodes)24)
//{
//ByteArray
//Drawable.CreateFromXml(context.Resources, reader);
//}
}
catch (System.Exception e)
{
}
return null;
}
This is the constructor I try to get:
public XmlBlock(byte[] data) {
mAssets = null;
mNative = nativeCreate(data, 0, data.length);
mStrings = new StringBlock(nativeGetStringBlock(mNative), false);
}
And this is the Java code I tried to traduce:
/**
* Create a vector drawable from a binary XML byte array.
* #param context Any context.
* #param binXml Byte array containing the binary XML.
* #return The vector drawable or null it couldn't be created.
*/
public static Drawable getVectorDrawable(#NonNull Context context, #NonNull byte[] binXml) {
try {
// Get the binary XML parser (XmlBlock.Parser) and use it to create the drawable
// This is the equivalent of what AssetManager#getXml() does
#SuppressLint("PrivateApi")
Class<?> xmlBlock = Class.forName("android.content.res.XmlBlock");
Constructor xmlBlockConstr = xmlBlock.getConstructor(byte[].class);
Method xmlParserNew = xmlBlock.getDeclaredMethod("newParser");
xmlBlockConstr.setAccessible(true);
xmlParserNew.setAccessible(true);
XmlPullParser parser = (XmlPullParser) xmlParserNew.invoke(
xmlBlockConstr.newInstance((Object) binXml));
if (Build.VERSION.SDK_INT >= 24) {
return Drawable.createFromXml(context.getResources(), parser);
} else {
// Before API 24, vector drawables aren't rendered correctly without compat lib
final AttributeSet attrs = Xml.asAttributeSet(parser);
int type = parser.next();
while (type != XmlPullParser.START_TAG) {
type = parser.next();
}
return VectorDrawableCompat.createFromXmlInner(context.getResources(), parser, attrs, null);
}
} catch (Exception e) {
Log.e(TAG, "Vector creation failed", e);
}
return null;
}
from this url: Given byte-array of VectorDrawable, how can I create an instance of VectorDrawable from it?
Can you help me?

Update Grid with a fresh set of data, in Vaadin 7.4 app

In the new Vaadin 7.4 release, the new Grid widget debuted as an alternative to the venerable Table.
After getting a Grid displayed, I later want to replace the entire set of data with fresh data. Rather than update the individual rows, I want to simply replace them.
I happen to be using a BeanItemContainer for easy read-only display of some objects with JavaBeans-style getter methods.
I considered two approaches:
Two step process of replacing bean items.
(1) First remove all BeanItem objects with Container::removeAllItems method.
(2) Then add replacement BeanItem objects with the BeanItemContainer::addAll method.
Replace entire BeanItemContainer.
Call Grid::setContainerDataSource and pass a new instance of BeanItemContainer constructed with fresh data.
Below is a sample application (Vaadin 7.4.2) showing both approaches. A pair of identical Grid widgets appear. Each has a button that updates data with either approach.
Results
The first approach (removing items and adding items) works. The fresh data immediately appears.
The second approach (replacing container rather than items) seems like it should work, with nothing contrary suggested in the scant documentation. But nothing happens. No exceptions or errors occur, yet no fresh data appears. I opened Ticket # 17268 on Vaadin trac for this issue.
Perhaps there are other better ways. Please post or comment with any alternatives.
Example App
Three classes are displayed below. You should be able to copy-paste into a new Vaadin 7.4.x app.
One class is the usual "MyUI" created in every new Vaadin app.
Another is simple JavaBeans-style class, "Astronomer", providing data for the rows in our Grid. That Astronomer class includes a convenient static method for generating a List of instances. Each new Astronomer gets a random number of popularity votes, to show fresh data values.
The meaty part of the example is in the "AstronomersLayout" class which creates the pair of Grids with their assigned buttons.
I use Java 8 Lambda syntax and the new java.time classes. So you may need to change your project's settings to use Java 8. In NetBeans 8 that means Project > Properties > Sources > Source/Binary Format (popup menu) > 1.8.
MyUI.java
Get your Vaadin app going.
package com.example.vaadingridexample;
import javax.servlet.annotation.WebServlet;
import com.vaadin.annotations.Theme;
import com.vaadin.annotations.VaadinServletConfiguration;
import com.vaadin.annotations.Widgetset;
import com.vaadin.server.VaadinRequest;
import com.vaadin.server.VaadinServlet;
import com.vaadin.ui.UI;
/**
* Example app in Vaadin 7.4.2 experimenting with two ways to replace data in a
* displayed Grid.
*
* #author Basil Bourque
*/
#Theme ( "mytheme" )
#Widgetset ( "com.example.vaadingridexample.MyAppWidgetset" )
public class MyUI extends UI
{
#Override
protected void init ( VaadinRequest vaadinRequest )
{
this.setContent( new AstronomersLayout() );
}
#WebServlet ( urlPatterns = "/*" , name = "MyUIServlet" , asyncSupported = true )
#VaadinServletConfiguration ( ui = MyUI.class , productionMode = false )
public static class MyUIServlet extends VaadinServlet
{
}
}
AstronomersLayout.java
The main part of the example.
package com.example.vaadingridexample;
import com.vaadin.data.util.BeanItemContainer;
import com.vaadin.shared.ui.grid.HeightMode;
import com.vaadin.ui.Button;
import com.vaadin.ui.Button.ClickEvent;
import com.vaadin.ui.Grid;
import com.vaadin.ui.VerticalLayout;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
/**
* Layout displays a pair of Grids, each with a Button to replace its contents
* with fresh data in either of two ways: (a) Replace all the items within the
* Container, or (b) Replace container itself.
*
* #author Basil Bourque
*/
#SuppressWarnings ( "serial" )
public class AstronomersLayout extends VerticalLayout
{
// -----| Member vars |--------------------------
Grid grid_ReplaceItems;
String gridCaption_ReplaceItems = "Astronomers - Replacing Items";
Button button_ReplaceItems;
Grid grid_ReplaceContainer;
String gridCaption_ReplaceContainer = "Astronomers - Replacing Container";
Button button_ReplaceContainer;
// -----| Constructor |--------------------------
public AstronomersLayout ()
{
this.prepareWidgets();
this.composeLayout();
}
// -----| Helper Methods |--------------------------
private void prepareWidgets ()
{
// Show updating a Grid by replacing the bean items within a container.
// Grid
List<Astronomer> listA = Astronomer.makeList();
BeanItemContainer<Astronomer> containerA = new BeanItemContainer<>( Astronomer.class , listA );
this.grid_ReplaceItems = new Grid( this.gridCaption_ReplaceItems , containerA );
//this.grid_ReplaceItems.setColumnOrder( "votes" , "givenName" , "surName" , "birthYear" );
this.grid_ReplaceItems.setColumnOrder( Astronomer.FIELD.VOTES.getName() , Astronomer.FIELD.GIVENNAME.getName() , Astronomer.FIELD.SURNAME.getName() , Astronomer.FIELD.BIRTHYEAR.getName() ); // Enum is a safer way of doing this: this.grid_ReplaceItems.setColumnOrder( "votes" , "givenName" , "surName" , "birthYear" );
this.grid_ReplaceItems.setHeightMode( HeightMode.ROW ); // Show all rows of data for this grid.
this.updateCaptionAndSize( this.grid_ReplaceItems , this.gridCaption_ReplaceItems );
// Button
this.button_ReplaceItems = new Button( "Replace Items" );
this.button_ReplaceItems.addClickListener( ( ClickEvent event ) -> {
#SuppressWarnings ( "unchecked" )
BeanItemContainer<Astronomer> bic = ( BeanItemContainer<Astronomer> ) this.grid_ReplaceItems.getContainerDataSource(); // Access existing container. Cast as need be.
bic.removeAllItems(); // Remove existing items.
bic.addAll( Astronomer.makeList() ); // Add fresh bean items to existing container.
this.updateCaptionAndSize( this.grid_ReplaceItems , this.gridCaption_ReplaceItems );
} );
// Show updating a Grid by replacing the container rather than its contents.
// Grid
List<Astronomer> listB = Astronomer.makeList();
BeanItemContainer<Astronomer> containerB = new BeanItemContainer<>( Astronomer.class , listB );
this.grid_ReplaceContainer = new Grid( this.gridCaption_ReplaceContainer , containerB );
this.grid_ReplaceContainer.setColumnOrder( Astronomer.FIELD.VOTES.getName() , Astronomer.FIELD.GIVENNAME.getName() , Astronomer.FIELD.SURNAME.getName() , Astronomer.FIELD.BIRTHYEAR.getName() );
this.grid_ReplaceContainer.setHeightMode( HeightMode.ROW ); // Show all rows of data for this grid.
this.updateCaptionAndSize( this.grid_ReplaceContainer , this.gridCaption_ReplaceContainer );
// Button
this.button_ReplaceContainer = new Button( "Replace Container" );
this.button_ReplaceContainer.addClickListener( ( ClickEvent event ) -> {
#SuppressWarnings ( "unchecked" )
BeanItemContainer<Astronomer> bic = new BeanItemContainer<>( Astronomer.class , listB ); // Create replacement container.
this.grid_ReplaceContainer.setContainerDataSource( bic );
this.updateCaptionAndSize( this.grid_ReplaceContainer , this.gridCaption_ReplaceContainer );
} );
}
private void updateCaptionAndSize ( final Grid grid , final String caption )
{
// Caption
grid.setCaption( caption + " ( updated " + this.now() + " )" ); // Update caption of Grid to indicate fresh data.
// Show all rows.
double h = grid.getContainerDataSource().size() > 0 ? grid.getContainerDataSource().size() : 3; // Cannot set height to zero rows. So if no data, set height to some arbitrary number of (empty) rows.
grid.setHeightByRows( h );
}
private void composeLayout ()
{
// Initialize this layout.
this.setMargin( true );
this.setSpacing( true );
// Content
this.addComponent( this.button_ReplaceItems );
this.addComponent( this.grid_ReplaceItems );
this.addComponent( this.button_ReplaceContainer );
this.addComponent( this.grid_ReplaceContainer );
}
// Helper method.
private String now ()
{
// Get current time in UTC. Truncate fractional seconds. Append a 'Z' to indicate UTC time zone.
return ZonedDateTime.now( ZoneOffset.UTC ).format( DateTimeFormatter.ISO_LOCAL_TIME ).substring( 0 , 8 ).concat( "Z" );
}
}
Astronomer.java
The data, the bean items, stored in a BeanItemContainer for display in a Grid.
A nested Enum provides a safer way to refer to the field names in the other class, AstronomersLayout for call to setColumnOrder.
package com.example.vaadingridexample;
import java.util.ArrayList;
import java.util.List;
/**
* Provides the beans to appear as rows in a BeanItemContainer backing a Grid.
*
* Note the static convenience method for generating a List of instances.
*
* #author Basil Bourque
*/
public class Astronomer
{
public enum FIELD
{
SURNAME( "surname" ),
GIVENNAME( "givenName" ),
BIRTHYEAR( "birthYear" ),
VOTES( "votes" );
private String name;
private FIELD ( String s )
{
this.name = s;
}
public String getName ()
{
return this.name;
}
}
// Members
private String surname;
private String givenName;
private Integer birthYear;
private Integer votes;
public Astronomer ( final String givenName , final String surName , final Integer birthYear )
{
this.surname = surName;
this.givenName = givenName;
this.birthYear = birthYear;
this.votes = this.random();
}
public static List<Astronomer> makeList ()
{
List<Astronomer> list = new ArrayList<>( 7 );
list.add( new Astronomer( "Hypatia" , "of Alexandria" , -370 ) );
list.add( new Astronomer( "Nicolaus" , "Copernicus" , 1473 ) );
list.add( new Astronomer( "Tycho" , "Brahe" , 1546 ) );
list.add( new Astronomer( "Giordano" , "Bruno" , 1548 ) );
list.add( new Astronomer( "Galileo" , "Galilei" , 1564 ) );
list.add( new Astronomer( "Johannes" , "Kepler" , 1571 ) );
list.add( new Astronomer( "Isaac" , "Newton" , 1643 ) );
list.add( new Astronomer( "Caroline" , "Herschel" , 1750 ) );
return list;
}
// ----| Helper Methods |----------------------------------
private Integer random ()
{
return ( int ) ( java.lang.Math.random() * 100 );
}
// ----| Bean Getters |----------------------------------
public String getSurname ()
{
return this.surname;
}
public String getGivenName ()
{
return this.givenName;
}
public Integer getBirthYear ()
{
return this.birthYear;
}
public Integer getVotes ()
{
return this.votes;
}
// ----| Object Superclass |----------------------------------
#Override
public String toString ()
{
return "Astronomer{ " + "surName=" + surname + " | givenName=" + givenName + " | birthYear=" + birthYear + " | votes=" + votes + " }";
}
}
You can simply get the record you have removed from button through clickListener with .getSelectedRow() .
After this you can remove your item from your grid with .removeItem().
IE:
Grid yourGrid = new Grid();
yourGrid.setContainerDataSource(yourData);
Button removeItem = new Button("Remove item");
removeItem.addClickListener(l -> {
Item selectedItem = (Item) yourGrid.getSelectedRow();
yourGrid.getContainerDataSource().removeItem(selectedItem);
});
Bye!

handleURI for http://AAA.BBB.CCC.DDD:8080/myapp/ uri: '' returns ambigious result (Vaadin 6)

In my Vaadin 6 application I sometimes get the following error:
SEVERE: Terminal error:
java.lang.RuntimeException: handleURI for http://AAA.BBB.CCC.DDD:8080/myapp/ uri: '' returns ambigious result.
at com.vaadin.ui.Window.handleURI(Window.java:432)
at com.vaadin.terminal.gwt.server.AbstractCommunicationManager.handleURI(AbstractCommunicationManager.java:2291)
at com.vaadin.terminal.gwt.server.CommunicationManager.handleURI(CommunicationManager.java:370)
at com.vaadin.terminal.gwt.server.AbstractApplicationServlet.handleURI(AbstractApplicationServlet.java:1099)
at com.vaadin.terminal.gwt.server.AbstractApplicationServlet.service(AbstractApplicationServlet.java:535)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
Accrording to Vaadin source it occurs in the following method:
public DownloadStream handleURI(URL context, String relativeUri) {
DownloadStream result = null;
if (uriHandlerList != null) {
Object[] handlers;
synchronized (uriHandlerList) {
handlers = uriHandlerList.toArray();
}
for (int i = 0; i < handlers.length; i++) {
final DownloadStream ds = ((URIHandler) handlers[i]).handleURI(
context, relativeUri);
if (ds != null) {
if (result != null) {
throw new RuntimeException("handleURI for " + context
+ " uri: '" + relativeUri
+ "' returns ambigious result.");
}
result = ds;
}
}
}
return result;
}
I actually create a DownloadStream in a column generator (in order to display images in a table):
public class ImageColumnGenerator implements Table.ColumnGenerator {
private static final Logger LOGGER = LoggerFactory.getLogger(ImageColumnGenerator.class);
public final static String IMAGE_FIELD = "image";
public Object generateCell(final Table aTable, final Object aItemId, final Object aColumnId) {
if (!IMAGE_FIELD.equals(aColumnId)) {
return null;
}
final BeanItem<UserProductImageBean> beanItem = (BeanItem<UserProductImageBean>)
aTable.getItem(aItemId);
final StreamResource streamResource = new StreamResource(new StreamResource.StreamSource() {
public InputStream getStream() {
return new ByteArrayInputStream(beanItem.getBean().getImageData());
}
},
beanItem.getBean().getFileName(),
MyApplication.getInstance());
LOGGER.debug("imageResource: " + streamResource);
final Embedded embedded = new Embedded("", streamResource);
return embedded;
}
}
beanItem.getBean().getImageData() is a byte array (byte[]) with image data, which I get from a web service.
MyApplication.getInstance() is defined as follows:
public class MyApplication extends Application implements ApplicationContext.TransactionListener
{
private static ThreadLocal<MyApplication> currentApplication =
new ThreadLocal<MyApplication> ();
public static MyApplication getInstance()
{
return currentApplication.get ();
}
}
What can I do in order to fix the aforementioned (severe) error?
As soon as nobody answer. I'm not at all expert in what hell it is above, but - try to find out on what kind of urls this error arise on, and do with them something before feed them to DownloadStream

cannot parse pdf using Tika1.3 (+lucene4.2)

im trying to parse a pdf file and get its metadata and text.I still don't get the wanted results. I am sure it is a silly mistake, but i cant see it.The file d.pdf exists and it is located in the project's root folder.The imports are also correct.
public class MultiParse {
public static void main(final String[] args) throws IOException,
SAXException, TikaException {
Parser parser = new AutoDetectParser();
File f = new File("d.pdf");
System.out.println("------------ Parsing a PDF:");
extractFromFile(parser, f);
}
private static void extractFromFile(final Parser parser,
final File f ) throws IOException, SAXException,
TikaException {
BodyContentHandler handler = new BodyContentHandler(10000000);
Metadata metadata = new Metadata();
InputStream is = TikaInputStream.get(f);
parser.parse(is, handler, metadata, new ParseContext());
for (String name : metadata.names()) {
System.out.println(name + ":\t" + metadata.get(name));
}
}
}
OUTPUT:No errors, but ..not much either:(
------------ Parsing a PDF:
Content-Type: application/pdf

Understanding mahout classification output

I have trained mahout model for three categories Category_A,Category_B,Category_C using 20newsGroupExample , Now i want to classify my documents using this model. Can somebody help me to understand output i am getting from this model.
Here is my output
{0:-2813549.8786637094,1:-2651723.736745838,2:-2710651.7525975127}
According to output category of document is 1, But expected category is 2. Am i going right or something is missing in my code ?
public class NaiveBayesClassifierExample {
public static void loadClassifier(String strModelPath, Vector v)
throws IOException {
Configuration conf = new Configuration();
NaiveBayesModel model = NaiveBayesModel.materialize(new Path(strModelPath), conf);
AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
Vector st = classifier.classifyFull(v);
System.out.println(st.asFormatString());
System.out.println(st.maxValueIndex());
st.asFormatString();
}
public static Vector createVect() throws IOException {
FeatureVectorEncoder encoder = new StaticWordValueEncoder("text");
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
String inputData=readData();
StringReader in = new StringReader(inputData);
TokenStream ts = analyzer.tokenStream("body", in);
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
Vector v1 = new RandomAccessSparseVector(100000);
while (ts.incrementToken()) {
char[] termBuffer = termAtt.buffer();
int termLen = termAtt.length();
String w = new String(termBuffer, 0, termLen);
encoder.addToVector(w, 1.0, v1);
}
v1.normalize();
return v1;
}
private static String readData() {
// TODO Auto-generated method stub
BufferedReader reader=null;
String line, results = "";
try{
reader = new BufferedReader(new FileReader("c:\\inputFile.txt"));
while( ( line = reader.readLine() ) != null)
{
results += line;
}
reader.close();
}
catch(Exception ex)
{
ex.printStackTrace();
}
return results;
}
public static void main(String[] args) throws IOException {
Vector v = createVect();
String mp = "E:\\Final_Model\\model";
loadClassifier(mp, v);
}
}

Resources