How to extract images from a file using Apache Tika? - apache-tika

I have a PDF (or another type of file, such as .doc or .ppt) that contains text as well as images. How can I extract the images from such files using Tika?
Can I also run OCR on the extracted images using Tess4J or another library?
This is how I call Tika:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(writeLimit);
Metadata metadata = new Metadata();
InputStream stream = new FileInputStream("file.pdf");
parser.parse(stream, handler, metadata);
p.s. I have tika-app.jar.

The way to do this:
InputStream stream = new FileInputStream(inputFile);
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
// Inline images in PDFs are skipped by default; enable them so they reach the OCR parser.
pdfConfig.setExtractInlineImages(true);
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
// Register the parser itself so that embedded documents (such as images) are parsed recursively.
parseContext.set(Parser.class, parser);
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata, parseContext);
String text = handler.toString().trim();
1) Make sure Tesseract is installed. On Windows, use 'tesseract-ocr-setup-3.05.00dev.exe' from:
https://sourceforge.net/projects/tesseract-ocr-alt/files/
and add its installation directory (under Program Files on Windows) to the PATH environment variable. Restart Windows if needed.
Pass any (yes, any!) file and it will extract the text.
2) Download tess4j-3.0.0.jar from:
https://sourceforge.net/projects/tess4j/?source=typ_redirect
and reference this jar using:
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>3.0.0</version>
</dependency>
then, these:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.13</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.13</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.5</version>
</dependency>
<dependency>
<groupId>com.github.jai-imageio</groupId>
<artifactId>jai-imageio-core</artifactId>
<version>1.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/net.java.dev.jna/jna -->
<dependency>
<groupId>net.java.dev.jna</groupId>
<artifactId>jna</artifactId>
<version>4.2.2</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.11</version>
</dependency>
On Ubuntu, install Tesseract with apt-get instead (the package is tesseract-ocr).
It will work.
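Because the OCR step depends on the tesseract executable being reachable through PATH, a quick check before parsing can save some debugging. The following is a minimal sketch using only the JDK; the class name PathLookup and the method findExecutable are hypothetical helpers, not part of Tika or Tess4J.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper: looks for an executable in a PATH-style string. */
final class PathLookup {

    /**
     * Returns the first regular file named exeName or exeName + ".exe"
     * found in the directories listed in pathValue.
     */
    static Optional<Path> findExecutable(String exeName, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] {exeName, exeName + ".exe"}) {
                try {
                    Path p = Paths.get(dir, candidate);
                    if (Files.isRegularFile(p)) {
                        return Optional.of(p);
                    }
                } catch (InvalidPathException ignored) {
                    // Malformed PATH entry (for example stray quotes); skip it.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> hit = findExecutable("tesseract", System.getenv("PATH"));
        System.out.println(hit.map(p -> "tesseract found at " + p)
                .orElse("tesseract not found on PATH"));
    }
}
```

If the lookup comes back empty, fixing PATH (or reinstalling Tesseract) is worth trying before changing any Tika configuration.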

Related

vaadin 14+ - charts - change drillup button text

How do I change the text of the drillup button in a chart?
E.g. to translate it?
Example: How to change the Text of "Back to hauptserie"
The text can be set via the Lang object using Lang#setDrillUpText.
Lang lang = new Lang();
lang.setDrillUpText("Zurück zur Hauptserie");
ChartOptions.get().setLang(lang);
https://vaadin.com/api/platform/23.3.5/com/vaadin/flow/component/charts/model/Lang.html#setDrillUpText(java.lang.String)
However, Vaadin 14 does not support the Lang object; this improvement came after Vaadin 14. Charts version 21.0.9 is known to work with Vaadin 14, though, and does support the Lang object.
<dependency>
<groupId>com.vaadin</groupId>
<artifactId>vaadin-charts-flow</artifactId>
<version>21.0.9</version>
</dependency>

What are Google Play Integrity constants AES_KEY_SIZE_BYTES, AES_KEY_TYPE and EC_KEY_TYPE

When decrypting and verifying the Google Play Integrity verdict as per the official docs (https://developer.android.com/google/play/integrity/verdict), the code samples use these constants: AES_KEY_SIZE_BYTES, AES_KEY_TYPE and EC_KEY_TYPE.
But their values are never mentioned. Can someone please tell me what those values are?
After hours of searching the internet, I came across a YouTube video (Obtaining and Decoding the Integrity Verdict | Step 3 of Migrating to Play Integrity API) (obviously not from Google) that gave me the required answer. Here are the values of those constants:
AES_KEY_SIZE_BYTES: decryptionKeyBytes.length
AES_KEY_TYPE: AES
EC_KEY_TYPE: EC
So your final code should look something like this:
package com.example.sample;
...
...
import java.security.KeyFactory;
import java.security.PublicKey;
import java.security.spec.X509EncodedKeySpec;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;
import org.jose4j.jwe.JsonWebEncryption;
import org.jose4j.jws.JsonWebSignature;
import org.jose4j.jwx.JsonWebStructure;
import org.jose4j.lang.JoseException;
...
...
// base64OfEncodedDecryptionKey is provided through Play Console.
// With commons-codec, use decodeBase64; decode(String, int) with Base64.DEFAULT
// belongs to android.util.Base64 instead.
byte[] decryptionKeyBytes = Base64.decodeBase64(base64OfEncodedDecryptionKey);
// Deserialized encryption (symmetric) key.
SecretKey decryptionKey =
    new SecretKeySpec(
        decryptionKeyBytes,
        /* offset= */ 0,
        decryptionKeyBytes.length, // AES_KEY_SIZE_BYTES
        "AES");                    // AES_KEY_TYPE
// base64OfEncodedVerificationKey is provided through Play Console.
byte[] encodedVerificationKey = Base64.decodeBase64(base64OfEncodedVerificationKey);
// Deserialized verification (public) key.
PublicKey verificationKey =
    KeyFactory.getInstance("EC") // EC_KEY_TYPE
        .generatePublic(new X509EncodedKeySpec(encodedVerificationKey));
If you are using Maven, make sure you have added these dependencies:
<dependency>
<groupId>com.google.apis</groupId>
<artifactId>google-api-services-playintegrity</artifactId>
<version>v1-rev20220904-2.0.0</version>
</dependency>
<dependency>
<groupId>org.bitbucket.b_c</groupId>
<artifactId>jose4j</artifactId>
<version>0.8.0</version>
</dependency>
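To try the constant values without real Play Console keys, the key deserialization can be exercised on its own. This is a minimal sketch, not the official sample: it swaps commons-codec for java.util.Base64, assumes the keys are standard (not URL-safe) Base64, and uses locally generated stand-in keys. The class IntegrityKeys and its method names are hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.KeyFactory;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.spec.X509EncodedKeySpec;
import java.util.Base64;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper showing the three constants in use. */
final class IntegrityKeys {
    static final String AES_KEY_TYPE = "AES";
    static final String EC_KEY_TYPE = "EC";

    /** Builds the symmetric key; AES_KEY_SIZE_BYTES is simply the decoded length. */
    static SecretKey decryptionKey(String base64Key) {
        byte[] raw = Base64.getDecoder().decode(base64Key);
        return new SecretKeySpec(raw, 0, raw.length, AES_KEY_TYPE);
    }

    /** Builds the EC public key from its X.509 (SubjectPublicKeyInfo) encoding. */
    static PublicKey verificationKey(String base64Key) {
        byte[] encoded = Base64.getDecoder().decode(base64Key);
        try {
            return KeyFactory.getInstance(EC_KEY_TYPE)
                    .generatePublic(new X509EncodedKeySpec(encoded));
        } catch (GeneralSecurityException e) {
            throw new IllegalArgumentException("Not a valid EC public key", e);
        }
    }

    /** Generates a throwaway EC public key as a stand-in for the Play Console key. */
    static String sampleEcPublicKeyBase64() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance(EC_KEY_TYPE);
            gen.initialize(256);
            byte[] encoded = gen.generateKeyPair().getPublic().getEncoded();
            return Base64.getEncoder().encodeToString(encoded);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("EC is not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in 32-byte AES key; the real one comes from Play Console.
        String aes = Base64.getEncoder().encodeToString(new byte[32]);
        System.out.println(decryptionKey(aes).getAlgorithm());
        System.out.println(verificationKey(sampleEcPublicKeyBase64()).getAlgorithm());
    }
}
```

The decrypted and verified payload still has to be obtained with jose4j as in the official sample; this sketch only covers the key setup.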

Get a list of all files during multifileupload in the same batch

I'm using this upload addon:
<dependency>
<groupId>com.wcs.wcslib</groupId>
<artifactId>wcslib-vaadin-widget-multifileupload</artifactId>
<version>4.0</version>
</dependency>
I'm uploading multiple files and each file upload is processed in:
void handleFile(InputStream stream, String fileName, String mimeType, long length, int filesLeftInQueue);
but it only gives me information about the file currently being processed.
I need a list of all files to check whether two files with the same name but different extensions are uploaded. I checked some components related to this upload, but the methods that could be useful are private.
How do I get a list of all files that are uploaded in the same batch?
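One possible workaround is to collect the names yourself as handleFile is called and run the check once the batch is complete. The sketch below covers only the collection and the duplicate check in plain Java; the class UploadBatch is hypothetical, and treating filesLeftInQueue == 0 as the end of a batch is an assumption about the addon that is not confirmed here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical collector for the file names seen in one upload batch. */
final class UploadBatch {
    // Base name (without extension) -> distinct extensions seen for it.
    private final Map<String, Set<String>> extensionsByBaseName = new LinkedHashMap<>();

    /** Call this from handleFile(...) with the reported fileName. */
    void add(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // A leading dot (".profile") or no dot at all means there is no extension.
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        String ext = dot > 0 ? fileName.substring(dot + 1).toLowerCase() : "";
        extensionsByBaseName.computeIfAbsent(base, k -> new LinkedHashSet<>()).add(ext);
    }

    /** Base names that were uploaded with more than one distinct extension. */
    List<String> conflictingBaseNames() {
        List<String> conflicts = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : extensionsByBaseName.entrySet()) {
            if (e.getValue().size() > 1) {
                conflicts.add(e.getKey());
            }
        }
        return conflicts;
    }

    /** Reset before the next batch starts. */
    void clear() {
        extensionsByBaseName.clear();
    }

    public static void main(String[] args) {
        UploadBatch batch = new UploadBatch();
        batch.add("report.pdf");
        batch.add("report.docx");
        batch.add("photo.png");
        // Assumption: in handleFile, run this check when filesLeftInQueue == 0.
        System.out.println(batch.conflictingBaseNames()); // prints [report]
    }
}
```

Inside handleFile, this would amount to calling add(fileName) for every file and, under the assumption above, calling conflictingBaseNames() followed by clear() when the queue is empty.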

Swedish Characters Not showing properly after sending xml message on a topic (Java-8)

There is a tag in which I must set the string "Vänligen Klicka på länken".
But it shows different characters in place of every 'ä' and 'å'. Everything else works fine; only this XML tag's output is different.
String linkMsg = "Vänligen Klicka på länken";
byte[] bytes = linkMsg.getBytes(StandardCharsets.UTF_8);
String newString = new String(bytes, StandardCharsets.UTF_8);
LinkField linkField = new LinkField();
linkField.setStringValue(newString);
The output I get is "Vänligen Klicka på länken".
Part of the Maven pom:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<start-class>se.link.Application</start-class>
<java.version>1.8</java.version>
</properties>
I am using Spring Boot.
linkMsg.getBytes() gives you the string encoded in the platform default character set - which is probably not UTF-8. You are then treating this as though it was UTF-8.
Use linkMsg.getBytes(StandardCharsets.UTF_8) to get the string bytes encoded in UTF-8.
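The effect of a charset mismatch can be reproduced with the JDK alone. The sketch below (class and method names are hypothetical) encodes the message as UTF-8 and decodes it once with UTF-8 and once with ISO-8859-1, which stands in for whatever non-UTF-8 charset the receiving side might be using.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of a matching and a mismatched encode/decode pair. */
final class EncodingDemo {

    /** Encodes text as UTF-8, then decodes the bytes with the given charset. */
    static String encodeUtf8ThenDecode(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the literal independent of the source file encoding.
        String msg = "V\u00e4nligen Klicka p\u00e5 l\u00e4nken";

        String same = encodeUtf8ThenDecode(msg, StandardCharsets.UTF_8);
        String garbled = encodeUtf8ThenDecode(msg, StandardCharsets.ISO_8859_1);

        System.out.println("round trip intact: " + msg.equals(same));
        System.out.println("mismatch garbles text: " + !msg.equals(garbled));
        // Each non-ASCII character becomes two characters after the wrong decode.
        System.out.println("length " + msg.length() + " -> " + garbled.length());
    }
}
```

Both sides of the exchange (the code that writes the XML and whatever reads or displays it) need to agree on UTF-8 for the characters to survive.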

@Resource injection in a jar in 'lib' of an ear; why doesn't that work?

I have a simple ear (GF 4.0, JDK 7; sticking with EE6 for now)
The ear contains:
EJBJar
WAR
lib/Shared.jar
Shared has a @Qualifier (@UserDS) in it (it also has META-INF/beans.xml).
I have a producer like this:
package fhw.producers;
import fhw.qaulifiers.ListingDS;
import fhw.qaulifiers.UserDS;
import javax.annotation.Resource;
import javax.sql.DataSource;
import javax.enterprise.inject.Default;
import javax.enterprise.inject.Produces;
@Default
public class DataSourceProducer
{
    @Resource(lookup = "Member")
    private DataSource userDS;
    public DataSourceProducer()
    {
        System.err.println("DataSourceProducer.DataSourceProducer -- CONSTRUCTION");
    }
    @Produces
    @UserDS
    public DataSource getUserDataSource()
    {
        System.err.println("******DataSourceProducer.getUserDataSource; am I null? " + (null == userDS));
        return userDS;
    }
}
I have a simple EJB (it has a beans.xml) that uses it via:
@Inject
@UserDS
private DataSource userDS;
QUESTION: When I put DataSourceProducer in the EJBJar and deploy, my print statements come out, my @Resource resolves, and everything is fine. When I put DataSourceProducer in Shared.jar, the print statements still come out, but the @Resource is not injected and the EJB throws an NPE on the null DataSource returned by the producer method. In both tests the qualifier stayed in Shared.jar. I have no deployment descriptors anywhere (well, a web.xml for the war; all else is implicit).
Part of me thinks this makes a bit of sense: @Resource is sort of EE oriented (or no?) and should mostly make sense within an EE deployable.
OTOH, why can't I have a handful of qualifiers and some producers in a shared JAR in the lib dir of an EAR that all EJBJars and WARs (in the EAR) can use?
Is there a way to make this work?
If you really want -- you can see an entire example here: https://github.com/fwelland/ResJect
I got the same issue on GF3, but the solution seems the same.
Remove the dependency from the lib directory and add it to the root of the ear.
Then add the following to the application.xml
<module><ejb>Shared.jar</ejb></module>
Tip: using maven-ear-plugin, you can automatically add dependencies as modules to your ear:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-ear-plugin</artifactId>
<configuration>
<displayName>...</displayName>
<application-name>...</application-name>
<defaultJavaBundleDir>lib</defaultJavaBundleDir>
<!-- do not generate application.xml; we include it ourselves -->
<generateApplicationXml>false</generateApplicationXml>
<modules>
<ejbModule>
<groupId>...</groupId>
<artifactId>Shared</artifactId>
<bundleFileName>Shared.jar</bundleFileName>
</ejbModule>
</modules>
</configuration>
</plugin>
Note that if you're using GlassFish 4, you're using Java EE 7, not Java EE 6. In order for your situation to work, you need to register your shared jar as a module in the application.xml, so that it knows to scan it.
