I'm playing with Apache Tika (1.13) and noticed that the language tag is not included for any of the documents I run through tika-app --metadata.
What is the proper way to include/force language detection for all documents? Is it possible to do through configuration, or do I have to add a new parser that adds this metadata, or override an existing parser in the chain?
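For reference, I can get a language programmatically by running the 1.x detector over the extracted text myself; a minimal sketch (the file name is just an example):

import java.io.File;
import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;

public class DetectLang {
    public static void main(String[] args) throws Exception {
        // Extract the text first, then run Tika's 1.x language identifier over it.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("sample.docx")); // example file
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println(identifier.getLanguage()
                + (identifier.isReasonablyCertain() ? "" : " (uncertain)"));
    }
}

But what I'm really after is having tika-app emit this for every document without custom code.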
Thanks!
Related
Is it possible to use Apache Tika to extract text, modify the extraction, and then inject it back into the original document?
I imagine it could be possible by modifying the parser code so it could be run again and reinsert text instead of extracting it, but is there an existing feature to do this?
This would be useful for document translation.
I am using the Apache Tika command line tool to extract text from doc and docx files. I can get the whole text, but I am unable to get it in the form of pages so that I can store each page separately. Is there any way to achieve that?
Tika uses Apache POI to process Word files (both the old binary-based and the newer XML-based flavors).
Since POI fundamentally cannot read out page numbers, and Tika is not meant to be a document renderer either, the answer is quite simply: no, this is not possible.
For a little more insight into why your requirement (from a technical standpoint) does not make much sense, see my answer here.
I'm trying to join a bunch of pages that are in different languages into a single page with multiple alternative page languages.
This way, instead of having 3 Home pages, each one with its own language, I have 1 with several alternative page languages. So I'll have one page but the content is in different languages, depending on the language record it uses.
The issue is that TYPO3 extensions should behave differently depending on the language, e.g. form fields should be translated.
For that I was thinking of having a local storage folder for each page language record in order to hold the extension configurations. The Chinese language would have a separate storage folder from the English version, and the extension running for the Chinese version would use the correct storage folder.
But how can I specify which storage folder the extension in the Chinese language record should use if I don't use a new page?
Because if I use a language record to differentiate Chinese from English, I can't have different TypoScript configurations. The language record properties page doesn't have a TSconfig field, and as such I can't tell the extension to use a different storage folder (different pid) for this language.
For each language you can add a cObj plugin and thus edit the plugin configuration per language. You can also use a condition and getText to assign a new pid to the plugin, for example plugin.test_pi1.sysfolder = 666.
As for the first option: if you look in the MySQL language overlay table, you have a separate record (CType plugin) for your plugin in each language, and thus you can edit the plugin configuration there.
How do I read the contents of a local XML file in a BlackBerry application and get the whole contents of the XML file into a string?
To create a string from a local file, see this BlackBerry forum entry: Open txt file from mediacard
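The gist of that entry is to open a FileConnection and read the stream into a string; a minimal sketch, assuming a UTF-8 encoded file (the path in the usage line is just an example):

import java.io.IOException;
import java.io.InputStream;
import javax.microedition.io.Connector;
import javax.microedition.io.file.FileConnection;

public final class FileToString {
    // Reads the whole file into a String; assumes UTF-8 content.
    public static String read(String url) throws IOException {
        FileConnection fc = (FileConnection) Connector.open(url, Connector.READ);
        try {
            InputStream in = fc.openInputStream();
            byte[] buf = new byte[(int) fc.fileSize()];
            int off = 0;
            while (off < buf.length) {
                int n = in.read(buf, off, buf.length - off);
                if (n < 0) break; // end of stream reached early
                off += n;
            }
            in.close();
            return new String(buf, 0, off, "UTF-8");
        } finally {
            fc.close();
        }
    }
}

Usage would be something like String xml = FileToString.read("file:///SDCard/data.xml");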
Assuming you want to use the data within the XML, I would recommend using an XML parser rather than string manipulation. The following links should get you going with XML parsers and explain some of the trade-offs:
Blackberry How To - Use the XML Parser
Parsing XML in J2ME
Add XML parsing to your J2ME applications
If, however, you have any say about the format used, JSON might be a good alternative. JSON is easy for machines to parse (and thus uses fewer resources), and it's human-readable.
I have found that using a SAXParser and subclassing DefaultHandler works well. It allows you to go element by element.
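A minimal sketch of that approach, assuming the JSR-172 SAX classes available on the platform (the item element name is just a placeholder):

import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Collects the text content of every <item> element, one element at a time.
public final class ItemHandler extends DefaultHandler {
    private final StringBuffer current = new StringBuffer();
    private boolean inItem = false;

    public void startElement(String uri, String localName, String qName, Attributes attrs)
            throws SAXException {
        if ("item".equals(qName)) {
            inItem = true;
            current.setLength(0); // reset the buffer for the new element
        }
    }

    public void characters(char[] ch, int start, int length) {
        if (inItem) current.append(ch, start, length);
    }

    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("item".equals(qName)) {
            inItem = false;
            System.out.println("item: " + current.toString()); // handle one element here
        }
    }

    public static void parse(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new ItemHandler());
    }
}

Since SAX is event-driven, the whole document never has to sit in memory at once, which matters on a memory-constrained device.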
What is the best way of implementing language translation for a multi-language site?
I have two methods in mind:
1) Put all the static text content of the pages into a database, i.e. a database-driven programming style:
$translation('register_form');
<form>
    <input name="search">
</form>
...some sort of translation lookup like that.
2) The WordPress-style method using .mo files.
I want to know more about that kind of translation and how it works with .mo-style files,
or about any other, better way to build a translation module.
For .mo file support, check out the gettext functions, and look at the original GNU gettext documentation.
In a nutshell, any time you use natural language, you wrap it in a call to _() (an alias for gettext):
echo _("Hello world");
Once all your sources are doing this, run the GNU xgettext tool (e.g. xgettext -o messages.pot *.php) to build a .pot file, which forms the "template" for your language translations. You'd then merge this .pot file into an individual .po file for each desired language. The merge operation, provided by msgmerge from the same GNU gettext suite, ensures you update each file with new strings without losing any existing translations.
These .po files get sent to your translators, who might use a tool like Poedit to provide the translations. When a .po file comes back, you can generate the binary .mo file with msgfmt (e.g. msgfmt -o messages.mo de.po); the .mo file is what gettext reads at runtime.
There are also classes in the Zend Framework which provide further facilities beyond this, e.g. more file formats and detection/selection of the user's preferred language from HTTP headers.
Look into gettext - it's not the perfect solution to everyone's problem, but it might be for yours.