pdfminer has no attribute 'get_pages' and no doc.set_parser(parser) and doc.initialize('') - parsing

I'm trying to read a PDF file into structured pdfminer elements. I'm using the code below, but it fails because get_pages, doc.set_parser(parser) and doc.initialize('') are not recognized (I've imported all the relevant elements from the pdfminer library, and I'm using Python 3.8 in Spyder).
fp = open(r"C:\Users\My files\Python\myfile.pdf", 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
doc.set_parser(parser) # Not compatible with current version of PDFMiner
doc.initialize('') # Not compatible with current version of PDFMiner
parser.set_document(doc)
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in doc.get_pages(): #'PDFDocument' object has no attribute 'get_pages'
    interpreter.process_page(page)
    layout = device.get_result()
    for lt_obj in layout:
        if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
            print(lt_obj.get_text())
fp.close()
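For comparison, here is a minimal sketch of how the same loop might look, assuming the installed package is the pdfminer.six fork. There, PDFDocument(parser) takes the parser (and an optional password) directly, so set_parser and initialize are not needed, and pages come from PDFPage.create_pages(doc) rather than doc.get_pages(). The function name iter_text_blocks is hypothetical and not part of the library.

```python
def iter_text_blocks(pdf_path):
    """Yield the text of every text box / text line found in a PDF.

    Sketch only: assumes pdfminer.six is installed. The imports are done
    lazily inside the function so this module still loads without it.
    """
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, "rb") as fp:
        parser = PDFParser(fp)
        # The constructor links parser and document; no set_parser/initialize.
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Pages are created from the document by PDFPage, not by the document.
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            for lt_obj in layout:
                if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                    yield lt_obj.get_text()
```

It could then be used as: for text in iter_text_blocks(r"C:\Users\My files\Python\myfile.pdf"): print(text). Opening the file in a with block also removes the need for the explicit fp.close().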

R/exams: Unicode characters in *.Rnw question files are not properly displayed: é displayed as <U+00E9> in final PDF

I am struggling to produce an exam sheet in French using exams2nops. There are accents in the text provided in the intro and title arguments of this function and also in the Rnw question files. The former are displayed correctly in the resulting PDF, but the latter are not: for example, é from an Rnw file is displayed as <U+00E9>.
The call to exams2nops looks like this:
exams2nops(file = myexam, n = N.students, dir = '.',
           name = paste0('exam-', exam.date),
           title = "Examen écrit",
           course = course.id,
           institution = "",
           logo = paste(exams.dir, 'input/logo.jpg', sep = '/'),
           date = exam.date,
           replacement = TRUE,
           intro = intro,
           blank = round(length(myexam)/4),
           duplex = TRUE, pages = NULL,
           usepackage = NULL,
           language = "fr",
           encoding = "UTF-8",
           startid = 1,
           points = c(1), showpoints = TRUE,
           samepage = TRUE,
           twocolumn = FALSE,
           reglength = 9,
           header = NULL)
Note that "Examen écrit" is correctly displayed in the final PDF, the problem is with the accent in the Rnw files. The function call yields no error.
The *.tex files generated by exams2nops already have the problem. For example, the sentence 'Quarante patients ont été inscrits' in the original Rnw file becomes 'Quarante patients ont <U+00E9>t<U+00E9> inscrits' in the tex file.
I use exams_2.4-0 with R 4.2.2 with TeXShop 4.70 on OSX 11.6.
I checked that the Rnw files are UTF-8 encoded, for example:
$ file -I question1.Rnw
question1.Rnw: text/x-tex; charset=utf-8
It seems they are utf-8-encoded, indeed. These files were translated with deepl or google translate, then edited in emacs.
I tried setting the encoding parameter of exams2nops to latin-1. It did not help.
Any idea?
The problem disappeared after setting the R locale properly. This is a recurrent problem with R installs on OSX. The symptom is the following set of warnings:
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
at start up. This thread explains how to fix it: Installing R on Mac - Warning messages: Setting LC_CTYPE failed, using "C".
I'm collecting a few further comments here in addition to the existing answer:
The only encoding (beyond ASCII) supported by R/exams, starting from version 2.4-0, is UTF-8. Support for other encodings like latin1 etc. has been discontinued.
As only UTF-8 is supported the encoding does not have to be specified in R/exams function calls anymore (as still might be advised in older tutorials).
To leverage this support of UTF-8, R has to be configured with a suitable locale. A "C" locale (see the answer by @vdet) is not sufficient.
When using R/LaTeX (Rnw) exercises all issues with encodings can also be avoided entirely by using LaTeX commands for special characters, e.g., {\'e}t{\'e} instead of été. The latter is of course more convenient but the former can be more robust, especially when working with teams of instructors living on different operating systems with different locale settings.
When using LaTeX commands instead of special characters in R strings (as opposed to the exercise files), then remember that the backslash has to be escaped. For example, the argument title = "Examen écrit" becomes title = "Examen {\\'e}crit".

Ruby on Rails - How to convert to images some elements from a word document

Context
On our platform we allow users to upload Word documents. Those documents are stored in Google Drive and then downloaded again to our platform in HTML format to create a section where the users can interact with that content.
Rails 5.0.7
Ruby 2.5.7p206
selenium-webdriver 3.142.7 (latest stable version compatible with our ruby and rails versions)
Problem
Some of the documents contain charts or graphics that are not processed correctly, so the result at the end of the whole process is wrong.
We have been trying to fix this problem at the point where we receive the Word document, before sending it to Google Drive.
I'm looking for a simple way to export an entire chart and/or table as an image. If anyone knows of a way to do this, the advice would be much appreciated.
Edit 1: Adding some screenshots:
This screenshot is from the original word doc:
And this is how it looks in our systems:
Here are the approaches I have tried that haven't worked for me so far.
Approach 1
Use nokogiri to read the document and find the nodes that contain the charts (we've found that they are called drawing), then use Selenium to navigate through the file and take a screenshot of that particular section.
The problem we found with this approach is that the versions of our gems are not compatible with the latest versions of Selenium and its web drivers (Chrome or Firefox), so it is not possible to perform this action.
Another problem, which seems to be security related, is that Selenium is not able to browse to local files and open them.
options = Selenium::WebDriver::Firefox::Options.new(binary: '/usr/bin/firefox', headless: true)
driver = Selenium::WebDriver.for :firefox, options: options
path = "#{Rails.root}/doc_file.docx"
driver.navigate.to("file://#{path}")
# Here occurs the first issue, it is not able to navigate to the file
puts "Title: #{driver.title}"
puts "URL: #{driver.current_url}"
# Below is the code that I am trying to use to replace the images with the modified images
drawing_elements = driver.find_elements(:css, 'w|drawing')
modified_paragraphs = []
drawing_elements.each do |drawing_element|
  paragraph_element = drawing_element.find_element(:xpath, '..')
  paragraph_element.screenshot.save('paragraph.png')
  modified_paragraph = File.read('paragraph.png')
  modified_paragraphs << modified_paragraph
end
driver.quit
file = File.open(File.join(Rails.root, 'doc_file.docx'))
doc = Nokogiri::XML(file)
drawing_elements = doc.css('w|drawing')
drawing_elements.each_with_index do |drawing_element, i|
  paragraph_element = drawing_element.parent
  paragraph_element.replace(modified_paragraphs[i])
end
new_doc_file = File.write('modified_doc.docx', doc.to_xml)
s3_client.put_object(bucket: bucket, key: @document_path, body: new_doc_file)
File.delete('doc_file.docx')
Approach 2
Use nokogiri to get the drawing elements and then try to convert them directly to an image using rmagick or mini_magick.
This only works if the drawing element actually contains an image, in which case it is converted correctly. The problem arises when the drawing element contains not an image but other elements such as graphicData, pic, blipFill, blip. The code then has to loop through the element and rebuild it, but at that point the element appears to be malformed and cannot be rebuilt.
Another issue with this approach arises when it finds elements that seem to make up an SVG file. It also has to loop through all the elements and try to rebuild the SVG, but as above, the element appears to be malformed.
response = s3_client.get_object(bucket: bucket, key: @document_path)
docx = response.body.read
Zip::File.open_buffer(docx) do |zip|
  doc = zip.find_entry("word/document.xml")
  doc_xml = doc.get_input_stream.read
  doc = Nokogiri::XML(doc_xml)
  drawing_elements = doc.xpath("//w:drawing")
  drawing_elements.each do |drawing_element|
    node = get_chil_by_name(drawing_element, "graphic")
    if node.xpath("//a:graphicData/a:pic/a:blipFill/a:blip").any?
      img_data = node.xpath("//a:graphicData/a:pic/a:blipFill/a:blip").first.attributes["r:embed"].value
      img = Magick::Image.from_blob(img_data).first
      img.write("node.jpeg")
      node.replace("<img src='#{img.to_blob}'/>")
    elsif node.xpath("//a:graphicData/a:svg").any?
      svg_data = node.xpath("//a:graphicData/a:svg").to_s
      Prawn::Document.generate("node.pdf") do |pdf|
        pdf.svg svg_data, at: [0, pdf.cursor], width: pdf.bounds.width
      end
    else
      puts "unsupported format"
    end
  end
  # update the file in S3
  s3.put_object(bucket: bucket, key: @document_path, body: doc)
end
Approach 3
Convert the elements, starting from their parent nodes, to a PDF file and then to an image.
This runs into basically the same issue as approach 2: it has to loop through all the elements and try to rebuild them, and we haven't found a way to do that.

Generate files with one input to multiple outputs

I'm trying to create a code generator that takes a JSON file as input and generates multiple classes in multiple files.
My question is: is it possible to create multiple files from one input using build from dart-lang?
Yes, it is possible. There are currently many tools available on pub.dev that do code generation. For creating a simple custom code generator, check out the package code_builder provided by the core Dart team.
You can use dart_style as well to format the output of the code_builder results.
Here is a simple example of the package in use (from the package's example):
import 'package:code_builder/code_builder.dart';
import 'package:dart_style/dart_style.dart';
final _dartfmt = DartFormatter();
// The string of the generated code for AnimalClass
String animalClass() {
  final animal = Class((b) => b
    ..name = 'Animal'
    ..extend = refer('Organism')
    ..methods.add(Method.returnsVoid((b) => b
      ..name = 'eat'
      ..body = refer('print').call([literalString('Yum!')]).code)));
  return _dartfmt.format('${animal.accept(DartEmitter())}');
}
In this example you can use the dart:io API to create a File and write the output from animalClass() (from the example) to the file:
final animalDart = File('animal.dart');
// write the new file to the disk
animalDart.createSync();
// write the contents of the class to the file
animalDart.writeAsStringSync(animalClass());
You can use the File API to read a .json from the path, then use jsonDecode on the contents of the file to access the contents of the JSON config.

difficulty using saxon in java code for .sch to .xsl conversion

I'm trying to do Schematron validation using Saxon.
First, I want to compile the .sch file into an .xsl file. Then I want to validate an .xml file with the .xsl file produced in the first step.
I found the command-line usage of Saxon shown below, and I used it successfully.
But I need to perform these steps from Java code.
I tried some code like the below, but I could not figure out how to pass the .sch file (edefter_yevmiye.sch) and iso_svrl_for_xslt2.xsl into the code.
I searched the internet but did not find enough information.
Is there sample Java code for converting .sch to .xsl, or could you guide me please?
My java code
// Compiling .sch to .xsl
net.sf.saxon.s9api.Processor processor1 = new net.sf.saxon.s9api.Processor(false);
net.sf.saxon.s9api.XsltCompiler xsltCompiler1 = processor1.newXsltCompiler();
xsltCompiler1.setXsltLanguageVersion("2.0");
xsltCompiler1.setSchemaAware(true);
net.sf.saxon.s9api.XsltExecutable xsltExecutable1 = xsltCompiler1.compile(new StreamSource(new FileInputStream(new File("File1.xsl"))));
net.sf.saxon.s9api.XsltTransformer xsltTransformer1 = xsltExecutable1.load();
xsltTransformer1.setSource(new StreamSource(new FileInputStream(new File("File2.sch"))));
// Validation
net.sf.saxon.s9api.Processor processor2 = new net.sf.saxon.s9api.Processor(false);
net.sf.saxon.s9api.XsltCompiler xsltCompiler2 = processor2.newXsltCompiler();
xsltCompiler2.setXsltLanguageVersion("2.0");
xsltCompiler2.setSchemaAware(true);
net.sf.saxon.s9api.XsltExecutable xsltExecutable2 = xsltCompiler2.compile(new StreamSource(new FileInputStream(new File("File1.xslt"))));
net.sf.saxon.s9api.XsltTransformer xsltTransformer2 = xsltExecutable2.load();
xsltTransformer2.setSource(new StreamSource(new FileInputStream(new File("src.xml"))));
net.sf.saxon.s9api.Destination dest2 = new Serializer(System.out);
xsltTransformer2.setDestination(dest2);
xsltTransformer1.setDestination(xsltTransformer2);
xsltTransformer1.transform();
Command line usage
Compiling:
java -jar saxon9he.jar -o:output.xsl -s:some.sch iso_svrl_for_xslt2.xsl
Validation:
java -jar saxon9he.jar -o:warnings.xml -s:some.xml output.xsl
You're using your second transformation as the destination for the first, but that means that the output of the first transformation is used as the source document for the second, whereas you want to use it, I think, as the stylesheet for the second transformation.
The simplest way to do this is probably to set an XdmDestination for the first transformation, and then with this destination object, do destination.getXdmNode().asSource() to get the input to the compile() method for the second transformation.

How do I disable assertions for the entire project?

I have some code in Delphi that has Assert statements all over it. I know that there is a compiler directive {$C-}, but there are too many units to add it to. Is there a way to have it done by the compiler command line or somewhere in the dpr file?
You can pass -$C- on the command line as well, or configure it in 'Project->Options->Compiler' from the IDE (which stores the setting in the .dproj file).
There's a list of command line switches and options available by typing dcc32 from the command line. It can be redirected to a text file using command redirection (as in dcc32 > dccCommands.txt), which produces the following output with XE5's version of dcc32:
Embarcadero Delphi for Win32 compiler version 26.0
Copyright (c) 1983,2013 Embarcadero Technologies, Inc.
Syntax: dcc32 [options] filename [options]
-A<unit>=<alias> = Set unit alias
-B = Build all units
-CC = Console target
-CG = GUI target
-D<syms> = Define conditionals
-E<path> = EXE/DLL output directory
-F<offset> = Find error
-GD = Detailed map file
-GP = Map file with publics
-GS = Map file with segments
-H = Output hint messages
-I<paths> = Include directories
-J = Generate .obj file
-JPHNE = Generate C++ .obj file, .hpp file, in namespace, export all
-JL = Generate package .lib, .bpi, and all .hpp files for C++
-K<addr> = Set image base addr
-LE<path> = package .bpl output directory
-LN<path> = package .dcp output directory
-LU<package> = Use package
-M = Make modified units
-NU<path> = unit .dcu output directory
-NH<path> = unit .hpp output directory
-NO<path> = unit .obj output directory
-NB<path> = unit .bpi output directory
-NX<path> = unit .xml output directory
-NS<namespaces> = Namespace search path
-O<paths> = Object directories
-P = look for 8.3 file names also
-Q = Quiet compile
-R<paths> = Resource directories
-TX<ext> = Output name extension
-U<paths> = Unit directories
-V = Debug information in EXE
-VR = Generate remote debug (RSM)
-VT = Debug information in TDS
-VN = TDS symbols in namespace
-W[+|-|^][warn_id] = Output warning messages
-Z = Output 'never build' DCPs
-$<dir> = Compiler directive
--help = Show this help screen
--version = Show name and version
--codepage:<cp> = specify source file encoding
--default-namespace:<namespace> = set namespace
--depends = output unit dependency information
--doc = output XML documentation
--drc = output resource string .drc file
--no-config = do not load default dcc32.cfg file
--description:<string> = set executable description
--inline:{on|off|auto} = function inlining control
--legacy-ifend = allow legacy $IFEND directive
--zero-based-strings[+|-] = strings are indexed starting at 0
--peflags:<flags> = set extra PE Header flags field
--peoptflags:<flags> = set extra PE Header optional flags field
--peosversion:<major>.<minor> = set OS Version fields in PE Header (default: 5.0)
--pesubsysversion:<major>.<minor> = set Subsystem Version fields in PE Header (default: 5.0)
--peuserversion:<major>.<minor> = set User Version fields in PE Header (default: 0.0)
Compiler switches: -$<letter><state> (defaults are shown below)
A8 Aligned record fields
B- Full boolean Evaluation
C+ Evaluate assertions at runtime
D+ Debug information
G+ Use imported data references
H+ Use long strings by default
I+ I/O checking
J- Writeable structured consts
L+ Local debug symbols
M- Runtime type info
O+ Optimization
P+ Open string params
Q- Integer overflow checking
R- Range checking
T- Typed # operator
U- Pentium(tm)-safe divide
V+ Strict var-strings
W- Generate stack frames
X+ Extended syntax
Y+ Symbol reference info
Z1 Minimum size of enum types
Stack size: -$M<minStackSize[,maxStackSize]> (default 16384,1048576)
