Could someone share how to delete a paragraph from a text box?

I am currently working on a project to manipulate docx files with the Apache POI project. I have used the API to remove text from a run inside a text box, but I cannot figure out how to remove a paragraph inside a text box. I assume that I need to use the CTP class to obtain the paragraph object to remove. Any examples or suggestions would be greatly appreciated.

In Replace text in text box of docx by using Apache POI I have shown how to replace text in Word text-box contents. The approach is to get a list of XML text-run elements via the XPath .//*/w:txbxContent/w:p/w:r, using an XmlCursor that selects that path from /word/document.xml.
The same can of course be done with the path .//*/w:txbxContent/w:p, which gets the text paragraphs in the text-box contents. Having those low-level paragraph XML objects, we can convert them into XWPFParagraphs to get the plain text out of them. Then, if the plain text matches some criterion, we can simply remove the paragraph's XML.
Code:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import java.util.List;
import java.util.ArrayList;
public class WordRemoveParagraphInTextBox {

    public static void main(String[] args) throws Exception {

        XWPFDocument document = new XWPFDocument(new FileInputStream("WordRemoveParagraphInTextBox.docx"));

        for (XWPFParagraph paragraph : document.getParagraphs()) {
            XmlCursor cursor = paragraph.getCTP().newCursor();
            cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:txbxContent/w:p");

            List<XmlObject> ctpsintxtbx = new ArrayList<XmlObject>();
            while (cursor.hasNextSelection()) {
                cursor.toNextSelection();
                XmlObject obj = cursor.getObject();
                ctpsintxtbx.add(obj);
            }

            for (XmlObject obj : ctpsintxtbx) {
                CTP ctp = CTP.Factory.parse(obj.xmlText());
                //CTP ctp = CTP.Factory.parse(obj.newInputStream());
                XWPFParagraph bufferparagraph = new XWPFParagraph(ctp, document);
                String text = bufferparagraph.getText();
                if (text != null && text.contains("remove")) {
                    obj.newCursor().removeXml();
                }
            }
        }

        FileOutputStream out = new FileOutputStream("WordRemoveParagraphInTextBoxNew.docx");
        document.write(out);
        out.close();
        document.close();
    }
}
Result: in the output file WordRemoveParagraphInTextBoxNew.docx, the text-box paragraphs whose text contains "remove" are gone.


Can we do image processing with Palantir Foundry?

I'm exploring the Palantir Foundry platform and it seems to have a ton of options for rectangular or structured data. Does anyone have experience working with unstructured big data on the Foundry platform? How can we use Foundry for image analysis?
Although most examples are given using tabular data, in reality a lot of use cases use Foundry for both unstructured and semi-structured data processing.
You should think of a dataset as a container of files with an API to access and process those files.
Using the file-level API you can get access to the files in the dataset and process them as you like. If these files are images, you can extract information from them and use it however you need.
A common use case is to have PDFs as files in a dataset, extract information from each PDF, and store it as tabular information so you can do both structured and unstructured search over it.
Here is an example of file-level access that extracts text from PDFs:
import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import com.palantir.util.syntacticpath.Paths;
import com.google.common.collect.AbstractIterator;
import com.palantir.spark.binarystream.data.PortableFile;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.UUID;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public final class ExtractPDFText {

    private static String pdf_source_files_rid = "SOME RID";
    private static String dataProxyPath = "/foundry-data-proxy/api/dataproxy/datasets/";
    private static String datasetViewPath = "/views/master/";

    @Compute
    public void compute(
            @Input("/Base/project_name/treasury_pdf_docs") FoundryInput pdfFiles,
            @Output("/Base/project_name/clean/pdf_text_extracted") FoundryOutput output) throws IOException {

        Dataset<PortableFile> filesDataset = pdfFiles.asFiles().getFileSystem().filesAsDataset();

        Dataset<String> mappedDataset = filesDataset.flatMap((FlatMapFunction<PortableFile, String>) portableFile ->
                portableFile.convertToIterator(inputStream -> {
                    String pdfFileName = portableFile.getLogicalPath().getFileName().toString();
                    return new PDFIterator(inputStream, pdfFileName);
                }), Encoders.STRING());

        Dataset<Row> dataset = filesDataset
                .sparkSession()
                .read()
                .option("inferSchema", "false")
                .json(mappedDataset);

        output.getDataFrameWriter(dataset).write();
    }

    private static final class PDFIterator extends AbstractIterator<String> {

        private InputStream inputStream;
        private String pdfFileName;
        private boolean done;

        PDFIterator(InputStream inputStream, String pdfFileName) throws IOException {
            this.inputStream = inputStream;
            this.pdfFileName = pdfFileName;
            this.done = false;
        }

        @Override
        protected String computeNext() {
            if (done) {
                return endOfData();
            }

            try {
                String objectId = pdfFileName;
                String appUrl = dataProxyPath.concat(pdf_source_files_rid).concat(datasetViewPath).concat(pdfFileName);
                PDDocument document = PDDocument.load(inputStream);
                PDFTextStripper pdfStripper = new PDFTextStripper();
                String text = pdfStripper.getText(document);
                String strippedText = text.replace("\"", "'").replace("\\", "").replace("“", "'").replace("”", "'").replace("\n", "").replace("\r", "");
                done = true;
                return "{\"id\": \"" + String.valueOf(UUID.randomUUID()) + "\", \"file_name\": \"" + pdfFileName + "\", \"app_url\": \"" + appUrl + "\", \"object_id\": \"" + objectId + "\", \"text\": \"" + strippedText + "\"}\n";
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }
}
Indeed, you can do image analysis on Foundry, since you have access to files and can use arbitrary libraries (for example Pillow or skimage for Python). This can also be done at scale, since the work can be parallelised.
A simple Python snippet that stitches two pictures together should get you started:
from transforms.api import transform, Input, Output
from PIL import Image


@transform(
    output=Output("/processed/stitched_images"),
    raw=Input("/raw/images"),
    image_meta=Input("/processed/image_meta")
)
def my_compute_function(raw, image_meta, output, ctx):
    image_meta = image_meta.dataframe()

    def stitch_images(clone):
        left = clone["left_file_name"]
        right = clone["right_file_name"]
        image_name = clone["image_name"]
        with raw.filesystem().open(left, mode="rb") as left_file:
            with raw.filesystem().open(right, mode="rb") as right_file:
                with output.filesystem().open(image_name, 'wb') as out_file:
                    left_image = Image.open(left_file)
                    right_image = Image.open(right_file)
                    (width, height) = left_image.size
                    result_width = width * 2
                    result_height = height
                    result = Image.new('RGB', (result_width, result_height))
                    result.paste(im=left_image, box=(0, 0))
                    # the right image starts at x = width so the two images sit side by side
                    result.paste(im=right_image, box=(width, 0))
                    result.save(out_file, format='jpeg', quality=90)

    image_meta.rdd.foreach(stitch_images)
The image_meta dataset is just a dataset that has 2 file names per row. To extract filenames from a dataset of raw files you can use something like:
@transform(
    output=Output("/processed/image_meta"),
    raw=Input("/raw/images"),
)
def my_compute_function(raw, output, ctx):
    file_names = [(file_status.path, 1) for file_status in raw.filesystem().ls(glob="*.jpg")]
    # create and write a Spark dataframe based on the array
    # (the column names below are illustrative)
    df = ctx.spark_session.createDataFrame(file_names, ["file_name", "count"])
    output.write_dataframe(df)
As others have mentioned, Palantir Foundry's focus is on tabular data, and it doesn't currently provide GPU or other tensor-processing-unit access. So doing anything intense like an FFT or deep learning would be ill-advised at best, if not downright impossible at worst.
That being said, you can upload image files into dataset nodes for read/write access. You could also store their binary content as a blob-type column in a DataFrame in order to keep files in a given record field. Given that a multitude of Python image-processing and matrix-math libraries are available on the platform, and that it's also possible to upload library packages manually through the Code Repo app, it is conceivable to run simple manipulations at a fairly large scale, as long as they aren't overly complex or memory intensive.
Note that Foundry seems to have no GPU support at the moment, so if you are thinking about running deep-learning-based image processing, it will be quite slow on CPUs.

Can anyone help me get a value from an XML document?

I've executed my code with Selenium (Java), and I need the value of the Identifier element (the one on the 2nd line, not the 4th) from an XML document that is opened in the Microsoft Edge browser.
My XML:
<Application Type="ABCD">
  <Identifier>18753</Identifier>
  <SalesChannel SalesChannelType="PQRS" SalesChannelSegment="XYZ">
    <Identifier>AB1234</Identifier>
Can anyone help me with the code to get the value (18753) inside the Identifier element on the 2nd line?
Note: I have code that works fine in Chrome and Firefox, but I'm not able to get it to work in MS Edge:
Assert.assertTrue(driver.getPageSource().contains("Identifier"));
String xml = driver.getPageSource();
String appID = xml.split("<Identifier>")[0].split("</Identifier>")[1];
I think there's a mistake in your split function. The right code should be like this:
String appID = xml.split("</Identifier>")[0].split("<Identifier>")[1];
Sample code:
import org.junit.Assert;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.edge.EdgeDriver;
import org.openqa.selenium.edge.EdgeOptions;

public class Edgeauto {
    public static void main(String[] args) {
        System.setProperty("webdriver.edge.driver", "your_path_of_webdriver\\msedgedriver.exe");
        EdgeOptions edgeOptions = new EdgeOptions();
        WebDriver driver = new EdgeDriver(edgeOptions);
        driver.get("http://xxx.xml");

        Assert.assertTrue(driver.getPageSource().contains("Identifier"));
        String xml = driver.getPageSource();
        String appID = xml.split("</Identifier>")[0].split("<Identifier>")[1];
        System.out.println(appID);
    }
}
Result: 18753 is printed to the console.

Vaadin TextField that defaults to the numeric keyboard on iOS

Do you know of a way to set up a TextField so that the numeric keyboard is shown instead of the regular one (cf. a type="number" attribute on the input element)? Users find it annoying to always have to switch to the numeric keyboard for certain fields (these have to be filled out several hundred times per day!). Most related posts are about restricting the input to numbers, which is not the problem here.
Thanks,
William
For Vaadin:
Simply use the NumberField component. Quoting the documentation:
Number Field: Mobile browser shows dedicated input controls. Decrease and increase buttons for the value can be shown optionally.
NumberField dollarField = new NumberField("Dollars");
See the documentation example.
For Native iOS:
Simply set the keyboard type to NumberPad:
self.someTextField.keyboardType = UIKeyboardType.NumberPad
See the documentation for all the keyboard types.
There is a closed GitHub issue about this, but I'm not sure how the slotting mentioned there is supposed to work now. If this is already possible without the workaround below, please feel free to let me know.
As far as I can tell, adding the attribute type="number" to the <vaadin-text-field> does not work, because this attribute should be on the actual <input> element within.
There is a workaround to do this: https://github.com/Artur-/vaadin-examples/blob/master/example-textfield-type/src/main/java/org/vaadin/artur/MainView.java#L42
TextField textField = new TextField("Number Input");
textField.getElement().getNode().runWhenAttached(ui -> {
    ui.getPage().executeJavaScript("$0.focusElement.type=$1", textField, "number");
});
In lieu of built-in components / add-ons, I found a lightweight solution in one of the samples from the archetype-application-example GitHub project.
In summary (only using relevant parts):
NumberTypeField.java:
package com.example;

import com.vaadin.data.util.converter.StringToIntegerConverter;
import com.vaadin.ui.TextField;

/**
 * A field for entering numbers. On touch devices, a numeric keyboard is shown
 * instead of the normal one.
 */
public class NumberTypeField extends TextField {

    public NumberTypeField() {
        // Mark the field as numeric.
        // This affects the virtual keyboard shown on mobile devices.
        AttributeExtension ae = new AttributeExtension();
        ae.extend(this);
        ae.setAttribute("type", "number");
    }

    public NumberTypeField(String caption) {
        this();
        setCaption(caption);
    }
}
AttributeExtension.java:
package com.example;

import com.vaadin.annotations.JavaScript;
import com.vaadin.server.AbstractJavaScriptExtension;
import com.vaadin.ui.TextField;

/**
 * A JavaScript extension for adding arbitrary HTML attributes for components.
 */
@JavaScript("attribute_extension_connector.js")
public class AttributeExtension extends AbstractJavaScriptExtension {

    private static final long serialVersionUID = 1L;

    public void extend(TextField target) {
        super.extend(target);
    }

    @Override
    protected AttributeExtensionState getState() {
        return (AttributeExtensionState) super.getState();
    }

    public void setAttribute(String attribute, String value) {
        getState().attributes.put(attribute, value);
    }
}
AttributeExtensionState.java:
package com.example;

import com.vaadin.shared.JavaScriptExtensionState;
import java.util.HashMap;

/**
 * Shared state class for {@link AttributeExtension} communication from server
 * to client.
 */
public class AttributeExtensionState extends JavaScriptExtensionState {

    private static final long serialVersionUID = 1L;

    public HashMap<String, String> attributes = new HashMap<String, String>();
}
attribute_extension_connector.js (put in same source folder, e.g., com.example):
window.com_example_AttributeExtension = function() {
    this.onStateChange = function() {
        var element = this.getElement(this.getParentId());
        if (element) {
            var attributes = this.getState().attributes;
            for (var attr in attributes) {
                if (attributes.hasOwnProperty(attr)) {
                    element.setAttribute(attr, attributes[attr]);
                }
            }
        }
    }
}
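With those three classes and the connector script in place, the field can be used like any other Vaadin 7/8-style TextField. A minimal usage sketch follows; the DemoUI class and the field caption are purely illustrative:
import com.vaadin.server.VaadinRequest;
import com.vaadin.ui.UI;
import com.vaadin.ui.VerticalLayout;

// Illustrative UI only; NumberTypeField is the class defined above.
public class DemoUI extends UI {

    @Override
    protected void init(VaadinRequest request) {
        VerticalLayout layout = new VerticalLayout();
        // On touch devices this field should bring up the numeric keyboard,
        // because the extension sets type="number" on the underlying input element.
        NumberTypeField amountField = new NumberTypeField("Amount");
        layout.addComponent(amountField);
        setContent(layout);
    }
}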

How to add a "UNH" header to a UNEdifactInterchange41 object in Smooks

I have to create an MSCONS export of energy values. I put together a bit of code from some examples I found, but now I'm stuck. MSCONS needs a UNB and a UNH header.
I can add the UNB header to the UNEdifactInterchange41 object, but I can't find a method to attach the UNH header.
Here's my code so far:
import org.milyn.SmooksException;
import org.milyn.edi.unedifact.d16b.D16BInterchangeFactory;
import org.milyn.edi.unedifact.d16b.MSCONS.*;
import org.milyn.smooks.edi.unedifact.model.r41.*;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.io.StringWriter;
import org.milyn.smooks.edi.unedifact.model.r41.types.MessageIdentifier;
import org.milyn.smooks.edi.unedifact.model.r41.types.Party;
import org.milyn.smooks.edi.unedifact.model.r41.types.SyntaxIdentifier;

public class EDI {

    public static void main(String[] args) throws IOException, SAXException, SmooksException {
        D16BInterchangeFactory factory = D16BInterchangeFactory.getInstance();
        UNEdifactInterchange41 edi = new UNEdifactInterchange41();
        Mscons mscons = new Mscons();

        /* UNB */
        UNB41 unb = new UNB41();
        unb.setSender(null);
        Party sender = new Party();
        sender.setInternalId(getSenderInternalId());
        sender.setCodeQualifier(getSenderCodeQualifier());
        sender.setId(getSenderId());
        SyntaxIdentifier si = new SyntaxIdentifier();
        si.setVersionNum("3");
        si.setId("UNOC");
        unb.setSyntaxIdentifier(si);
        unb.setSender(sender);
        edi.setInterchangeHeader(unb);

        /* UNH */
        UNH41 unh = new UNH41();
        MessageIdentifier mi = new MessageIdentifier();
        mi.setTypeSubFunctionId("MSCONS");
        mi.setControllingAgencyCode("UN");
        mi.setAssociationAssignedCode("2.2h");
        String refno = createRefNo();
        unh.setMessageIdentifier(mi);

        /* How to attach UNH? */
    }
}
Sounds like you got it almost right: the UNH has to be attached to the message, and the message to the interchange, not the other way round. The MessageIdentifier is only ever attached to the UNH, which you already do with unh.setMessageIdentifier(mi).
There is a full example here if you need one:
https://github.com/ClaudePlos/VOrders/blob/master/src/main/java/pl/vo/integration/edifact/EdifactExportPricat.java
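If it helps, here is a rough sketch of how the remaining wiring could look, continuing from the code in the question. Treat it as an assumption-based sketch rather than verified API usage: it assumes the r41 model provides a UNEdifactMessage41 wrapper with setMessageHeader/setMessage, that the interchange takes a list of those wrappers via setMessages, and that the generated factory exposes toUNEdifact(interchange, writer); check all of these against your Smooks version.
/* Sketch only -- UNEdifactMessage41, setMessages(...) and factory.toUNEdifact(...) are
   assumptions based on typical Smooks r41 usage; verify against your Smooks API. */
UNEdifactMessage41 message = new UNEdifactMessage41();
message.setMessageHeader(unh);           // attach the UNH to the message wrapper
message.setMessage(mscons);              // the MSCONS payload built above

java.util.List<UNEdifactMessage41> messages = new java.util.ArrayList<>();
messages.add(message);
edi.setMessages(messages);               // attach the message(s) to the interchange (the UNB is already set)

StringWriter writer = new StringWriter(); // StringWriter is already imported above
factory.toUNEdifact(edi, writer);         // serialize the whole interchange
System.out.println(writer.toString());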

CSV files often use a tab delimiter; how can the Univocity Parsers CSV parser be configured to allow a tab delimiter?

CSV files often use a tab delimiter; how can Univocity Parsers be configured so that the following can use a tab delimiter?
CsvParserSettings parserSettings = new CsvParserSettings();
Parsing tab-delimited .csv files is sometimes required, and although Univocity Parsers has a TSV reader, having more than one settings instance creates coding obstacles.
The code and stack trace are below.
Any help would be greatly appreciated.
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.common.processor.*;
import com.univocity.parsers.csv.*;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.lang.IllegalStateException;
import java.lang.String;
import java.util.List;

public class UnivocityParsers {

    public Reader getReader(String relativePath) {
        try {
            return new InputStreamReader(this.getClass().getResourceAsStream(relativePath), "Windows-1252");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Unable to read input", e);
        }
    }

    public void columnSelection() {
        RowListProcessor rowProcessor = new RowListProcessor();
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setRowProcessor(rowProcessor);
        parserSettings.setHeaderExtractionEnabled(true);
        parserSettings.setLineSeparatorDetectionEnabled(true);
        parserSettings.setSkipEmptyLines(true);
        parserSettings.getFormat().setDelimiter('\t');

        // Here we select only the columns "AUTHOR" and "ISBN".
        // The parser just skips the other fields.
        parserSettings.selectFields("AUTHOR", "ISBN");

        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(getReader("list4.csv"));

        List<String[]> rows = rowProcessor.getRows();
        String[] strings = rows.get(0);
        System.out.print(strings[0]);
    }

    public static void main(String arg[]) {
        UnivocityParsers univocityParsers = new UnivocityParsers();
        univocityParsers.columnSelection();
    }
}
Stack trace:
Exception in thread "JavaFX Application Thread" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>(InputStreamReader.java:97)
at parse.Controller.getReader(Controller.java:34)
at parse.Controller.columnSelection(Controller.java:107)
... 56 more
Here is the file being parsed:
"REVIEW_DATE" "AUTHOR" "ISBN" "DISCOUNTED_PRICE"
"1985/01/21" "Douglas Adams" 345391802 5.95
"1990/01/12" "Douglas Hofstadter" 465026567 9.95
"1998/07/15" "Timothy ""The Parser"" Campbell" 968411304 18.99
"1999/12/03" "Richard Friedman" 60630353 5.95
"2001/09/19" "Karen Armstrong" 345384563 9.95
"2002/06/23" "David Jones" 198504691 9.95
"2002/06/23" "Julian Jaynes" 618057072 12.5
"2003/09/30" "Scott Adams" 740721909 4.95
"2004/10/04" "Benjamin Radcliff" 804818088 4.95
"2004/10/04" "Randel Helms" 879755725 4.5
The problem comes from the getReader method. It is not finding the file in your classpath.
This line is producing a null:
this.getClass().getResourceAsStream(relativePath)
Maybe you should use this (note the leading slash on the file name):
parser.parse(getReader("/list4.csv"));
Also note that the TSV parser is a different implementation. TSV is not just CSV with a tab delimiter (it's fine if it works in your case). Just keep in mind that trying to read a TSV using a CSV parser is a bad idea, as characters such as '\n' or '\t' may be escaped as literal sequences of '\' and 'n'. When a CSV parser reads this, you will get the two characters ('\' + 'n') instead of the newline character ('\n').
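If the input really is TSV rather than CSV that merely uses tabs, a minimal sketch with the dedicated TSV parser could look like the following. It reuses the getReader method above; the resource name /list4.tsv is illustrative, and you would need import com.univocity.parsers.tsv.*;.
// Sketch: the dedicated TSV parser reads tab-separated values natively,
// including escape sequences such as '\t' and '\n' that a CSV parser would misread.
TsvParserSettings tsvSettings = new TsvParserSettings();
tsvSettings.setHeaderExtractionEnabled(true);
tsvSettings.setLineSeparatorDetectionEnabled(true);
tsvSettings.selectFields("AUTHOR", "ISBN");

TsvParser tsvParser = new TsvParser(tsvSettings);
java.util.List<String[]> rows = tsvParser.parseAll(getReader("/list4.tsv")); // note the leading slash
System.out.println(rows.get(0)[0]);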
