Pdf Parsing Challenge

I have the following problem: I have a lot of papers in PDF format, and I have to extract information from the first page of each one and save it into a database.
I just need to extract the title, the abstract, the keywords, the author list, the university list, and the emails. I want to write a script that produces a string for each of those fields, for each paper.
How can I do that? Has anyone already done this? Which languages and tools do you recommend?
Also, does a paper repository exist that already feeds such a database?
The PDFs may use different encodings, so I have to deal with that problem too. Any help with this would be great.
An example of a paper is here.
Greetings!

http://pdfbox.apache.org/
You have to check the PDF's security settings, and that it really contains text and not just an image. Check whether the PDFBox command-line application manages to extract the text; if it does, you can use the jar together with http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/ExtractTextByArea.html
Hope it helps....
By the way, it's Java...
Edit:
I have not used http://www.qoppa.com/pdftext/ as a jar library, but I tried its example application and it works. In the end I decided to go with PDFBox...
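
Once a tool such as PDFBox has produced plain text for the first page, the fields still have to be separated. The following is only a minimal sketch in Python of how that could look, not a tested solution for real papers. It assumes the title is the first non-empty line, that the abstract and keywords are introduced by the words "Abstract" and "Keywords", and that keywords sit on a single line. The function name extract_fields and the sample text in the usage note are made up for illustration. Authors and universities are left out because they usually need layout-specific rules.

```python
import re

# Simple pattern for email addresses; it will not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_fields(first_page_text):
    """Split plain first-page text into rough fields using simple heuristics."""
    lines = [ln.strip() for ln in first_page_text.splitlines() if ln.strip()]

    # Assumption: the title is the first non-empty line of the page.
    title = lines[0] if lines else ""

    emails = EMAIL_RE.findall(first_page_text)

    # Assumption: the abstract starts at an "Abstract" marker and runs until
    # a "Keywords" / "Index Terms" marker, or to the end of the page.
    abstract_match = re.search(
        r"abstract[\s:.\-]*(.*?)(?=keywords?\b|index terms\b|\Z)",
        first_page_text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    abstract = " ".join(abstract_match.group(1).split()) if abstract_match else ""

    # Assumption: keywords are on one line, separated by commas or semicolons.
    keywords_match = re.search(
        r"keywords?[\s:.\-]*([^\n]*)", first_page_text, flags=re.IGNORECASE
    )
    keywords = []
    if keywords_match:
        keywords = [
            k.strip() for k in re.split(r"[;,]", keywords_match.group(1)) if k.strip()
        ]

    return {
        "title": title,
        "abstract": abstract,
        "keywords": keywords,
        "emails": emails,
    }
```

Calling extract_fields(text) on the extracted first page returns a dictionary with one entry per field, which can then be written to the database. Papers that do not follow these conventions will need additional rules, so the result should be checked by hand at first.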

You need an API to read your PDF.
Seems fine (I have never tried it, though)
You can probably find others with this link :-)

Related

Custom Report Printing in Oracle Apex

What is the best way to create a custom report template for printing in Oracle APEX? I saw some posts that have already been answered, but since they concerned APEX 5.1, I was wondering whether they are still up to date or whether there are easier ways now (I am using APEX 21.1). Also, the "Printing" attribute of reports does not give me the possibility to do these specific things:
I would like users to be able to print an Interactive Report that displays the company logo, the export date, and, obviously, the data. Is it possible to set custom margins so the list takes up more space on the page, and to set a custom width for a column, in case a column contains long text?
Thanks in advance,
Thomas

Welcome to one of the weakest points of Oracle APEX, printing.
Honestly, the best option is APEX Office Print (AOP), but it is a paid plugin.
It supports many different kinds of printing, is quite easy to grasp, and I am quite satisfied with it.
Other options I have seen are:
Generate an Excel sheet dynamically from within the database (you can also expand fields and colour them, and you can probably put an image in there, but I haven't tried that).
I once decided to torture myself and tried printing through HTML: I created an HTML document with the data I wanted (an invoice). That approach has many problems, chief among them page breaks.
Another option that was recommended to me, but that I have not yet tried, is setting up Apache FOP: the Oracle database generates XML, sends it there, and gets back a nice-looking PDF (http://www.apex-reports.com/videos.html).
I hope you get something working, and if you try this Apache FOP approach please let me know how it goes.

Clean URL links from table of contents in MS Word

I don't know how appropriate it is to ask a question about MS Word here, but I'll do so anyway.
I have a Word document that I am exporting to HTML. It has a table of contents with links to the appropriate headings. The issue is that when I click on a link, it gives me something like #Toc_81682617 in the URL. Is it possible to have #Summary or #1.1Summary instead?
Thanks in advance to any enlightened souls out there.

This should be a comment, but I am not allowed to comment because my account is too new.
This would be better posted in the Word Answers forum hosted by Microsoft. This forum is about programming.
That said, as far as I know, this is the nature of automatic bookmarks generated by Word. I know of no way of automatically modifying them.
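
If changing the bookmarks inside Word is not an option, one possible workaround is to post-process the exported HTML. The Python sketch below is only an illustration under several assumptions: the table-of-contents entries are links of the form href="#_Toc81682617" (or #Toc_81682617), the headings carry matching name="..." anchors, and the link text contains only the heading text, with no page numbers. Real Word exports are messier, so treat the function names rename_toc_anchors and slugify as hypothetical starting points.

```python
import html
import re

# Assumption: TOC entries look like <a href="#_Toc81682617">1.1 Summary</a>.
TOC_LINK_RE = re.compile(
    r'<a\s+href="#(_?Toc_?\d+)"[^>]*>(.*?)</a>', flags=re.IGNORECASE | re.DOTALL
)


def slugify(text):
    """Turn link text such as '1.1 Summary' into '1.1Summary'."""
    text = html.unescape(re.sub(r"<[^>]+>", "", text))  # drop nested tags
    return re.sub(r"[^A-Za-z0-9._-]+", "", text)


def rename_toc_anchors(document):
    """Replace generated Toc anchors with names derived from the link text."""
    mapping = {}
    for old_id, link_text in TOC_LINK_RE.findall(document):
        slug = slugify(link_text)
        # Skip empty slugs and slugs already taken, so anchors stay unique;
        # such entries simply keep their generated name.
        if slug and slug not in mapping.values():
            mapping.setdefault(old_id, slug)
    for old_id, slug in mapping.items():
        # Rewrite link targets (href="#...") and anchors (name="...") alike.
        pattern = r'(["#])' + re.escape(old_id) + r'(?=")'
        document = re.sub(pattern, r"\g<1>" + slug, document)
    return document
```

Running rename_toc_anchors on the exported file's contents and writing the result back gives fragment URLs such as #1.1Summary. The export has to be post-processed again whenever the document is re-exported, because Word regenerates its own anchor names.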

Printing a ticket in xpages

In an XPages application I need to build a label with a certain layout, similar to the layout of a ticket. From my research, the most common practice seems to be to design the ODT template in OpenOffice and to use the JODReports library from Java. Would you advise following this approach, or do you have any other suggestions?

I would concur with Marcus. The way forward is PDF output. There are a couple of ways to do this, depending on your constraints.
When users must be able to design every aspect of the ticket, OpenOffice is a suitable approach; however, you need a headless OpenOffice install for the rendering.
If everything can be done in code, then PDFBox is a good way to go. Wrap your code in a managed bean.
The middle path would be XSL:FO and Apache FOP. It allows alteration of the layout by providing a different style sheet. I wrote an article series outlining that approach.
Let us know what works for you!

There is also the POI4XPages plugin. You could design your form with Word and then use placeholders to populate the document and output as a pdf.
See https://poi4xpages.openntf.org/main.nsf/project.xsp?r=project/POI%204%20XPages/releases/E80C4FC9FB07E1E4852580E3006E02C7
Download the latest version (1.4) at http://p2.openntf.org/repository.nsf/home.xsp/poi4xpages/snapshots
Howard

I was able to solve my problem, because I discovered that our company has the ABCpdf software. Through a web service that uses the APIs of this software, I pass the HTML code of the ticket, and the web service returns the PDF document as an array of bytes. I created a managed Java bean to consume the web service and display the PDF in the browser.
Thanks to all who have contributed in some way with suggestions.

Resume parser in Ruby/(Rails Plugin/Gem)

Is there a Ruby gem or Rails plugin available for parsing a résumé and importing the information into an object or form?

I may be wrong, but I don't think you'll find anything completely automated to do this, because a résumé (or CV) can be structured in so many different ways and can contain very different types of data. Any completely automated solution is likely to have accuracy problems, since it is technically a difficult problem to solve.
You may find this answer useful.
Here are some other suggestions that might help:
Require a user to enter their details into a form on your website instead of uploading a Word document. You'll then be able to explicitly ask for the data you want and you'll be able to store the data in a structure that suits you. However, this may be too much of a barrier to entry for your users.
Allow a user to submit the URL of their résumé published using the hResume microformat. Sites like LinkedIn already publish résumés in this format. There is a Ruby gem mofo which can parse microformats including hResumes. However, not all users will have an on-line résumé like this.

How to get rid of stupid "pad" labels produced by RTML functions?

I am unlucky to be in charge of maintaining some old Yahoo! Store built using their RTML-based platform.
Recently I've noticed that HTML code generated by some RTML functions is sprinkled all over with "padding images" (or whatever is the conventional name for those 1x1 pixel images used to enforce layout). I have nothing against using such images, but... all those images are supplied with an ALT attribute like this:
<img src="http://.../image1x1.gif" alt="pad">
With all due respect to the original authors of RTML, they must have been smoking something when they came up with this "accessibility enhancement"... :-(
Anyway, here are my questions:
Does anybody know a list of all RTML functions that generate HTML with all these "pad" images?
Is there any way to get rid of all those alt="pad" attributes without rewriting a lot of RTML code?
NB: This may sound a little cynical, but improved accessibility is not the main goal here. The main goal is to stop exposing those moronic alt="pad" attributes to Google and other smart search engines. So client-side scripting is not going to help, as far as I know.
Thank you!
P.S. Probably, most of you are really lucky and have never heard of RTML. Because if somebody established a prize for software products based on the ratio of commercial success to usability, this RTML-based "platform" would probably win first place.
P.P.S. Apparently someone from Yahoo! finally listened, because I can no longer find those silly "pad" tags in the RTML generated for our store. Nevertheless, one of the ideas offered in response to my original question does provide a very practical solution - not just to the original problem but to any similar problem with RTML platform. See the winning answer - it's really good.

The only way I see is to have your own website front-end that will filter whatever you want from the RTML site....
For example, if your RTML site is at http://rtmlusglysite.yahoo.com/store/XYZ01134 , you could host a simple PHP front-end at http://www.example.com that acts as a "filtering" HTTP proxy, so that http://rtmlusglysite.yahoo.com/store/XYZ01134/item1234.rtml would be accessed as http://www.example.com/item1234.html
It's not an ideal solution, but it should work, and you could do some more fancy stuff.
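
To make the idea more concrete, here is a rough sketch of such a filtering front-end, written in Python rather than PHP and using only the standard library. It is an illustration, not production code: the upstream URL is the example one from above, the names strip_pad_alt, upstream_url, FilteringProxy and serve are made up, the pages are assumed to decode as UTF-8, and there is no error handling, caching, or rewriting of links inside the page.

```python
import re
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Example upstream store URL from the text above; replace with the real one.
UPSTREAM = "http://rtmlusglysite.yahoo.com/store/XYZ01134"

PAD_ALT_RE = re.compile(r'alt="pad"', flags=re.IGNORECASE)


def strip_pad_alt(page):
    """Replace every alt="pad" with an empty alt attribute."""
    return PAD_ALT_RE.sub('alt=""', page)


def upstream_url(path):
    """Map /item1234.html on the front-end to .../item1234.rtml upstream."""
    name = path.lstrip("/")
    if name.endswith(".html"):
        name = name[: -len(".html")] + ".rtml"
    return UPSTREAM + "/" + name


class FilteringProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Fetch the original page, filter it, and send it on to the client.
        with urllib.request.urlopen(upstream_url(self.path)) as response:
            # Assumption: the store pages are UTF-8 (or close enough).
            body = response.read().decode("utf-8", errors="replace")
        filtered = strip_pad_alt(body).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(filtered)))
        self.end_headers()
        self.wfile.write(filtered)


def serve(port=8080):
    """Run the filtering front-end until interrupted."""
    HTTPServer(("", port), FilteringProxy).serve_forever()
```

Calling serve() starts the front-end on port 8080. The sketch writes an empty alt attribute instead of deleting the attribute, since an empty alt is the usual way to mark purely decorative images; changing the replacement string to an empty string would drop the text altogether.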

Nice try from the other posters, but there is a very simple RTML command that will do it. . .
TEXT PAT-SUBST s GRAB
MULTI
HEAD
BODY
TEXT #var-with-alt-tag-equals-pad-in-it
frompat "alt=\"pad\""
topat ""
The above RTML will find all instances of alt="pad" and replace it with nothing.

Well you're right on RTML being relatively untraveled :)
Do you have a way to add your own attributes to these images tags? If so, would it be possible to override the alt attribute? If you specify alt="", I would think that would override Yahoo's... Otherwise consider putting a useful alt tag in there for the blind and dialup types.

It's the first time I'm hearing about this platform, but here is an idea: if you can add javascript to the pages, you could write a function that will run after the page has loaded and remove all the alt="pad" attributes from the page.
Unfortunately, this solution works only in browsers that support scripting, so lynx and some other text-based browsers might not support it.

I have shared a link to the official RTML guide from Yahoo. I hope it helps. Thanks!
List of available RTML books and resources
