I know that when we implement a ParDo transform, we pick up individual elements from our data (basically separated by "\n"). But what if I have an element that occupies two lines in my file? Can I apply my own condition to pick elements accordingly, or is it always necessary to have an element on a single line?
Reading of text files is controlled by TextIO, not by ParDo - I suppose that's what you meant. Indeed, right now TextIO splits files into one element per line; however, there is work in progress on changing that. You can follow the work at https://issues.apache.org/jira/browse/BEAM-2802.
It would be useful for that work if you described your file format in more detail, to make sure it is in scope.
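Until that lands, one possible workaround (a minimal sketch using the Python SDK; the file pattern and the blank-line record separator are made up for illustration) is to bypass TextIO's line splitting entirely: match the files, read each one whole, and split on your own delimiter.

    import apache_beam as beam
    from apache_beam.io import fileio

    def split_records(readable_file, delimiter='\n\n'):
        # Hypothetical rule: records are separated by a blank line,
        # so a single record may span several physical lines.
        contents = readable_file.read_utf8()
        for record in contents.split(delimiter):
            if record.strip():
                yield record

    with beam.Pipeline() as p:
        records = (
            p
            | fileio.MatchFiles('gs://my-bucket/input/*.txt')  # made-up path
            | fileio.ReadMatches()
            | beam.FlatMap(split_records)
        )

The downside is that each file is read into memory in one piece, so you lose TextIO's parallel splitting within a file.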
I have a PDF with tabular data that runs over 50+ pages, and I want to extract this table into an Excel file using Automation Anywhere (I am using the Community version of AA 11.3). I watched videos on the PDF integration command but haven't had any success applying it to tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging... and the main reason for that is the values that contain multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also face challenges with Automation Anywhere, since it does not really provide the right tools for such a thing, and you may need to resort to scripting (VBScripts) or Metabots.
Solution 1
This one tries to use pure text extraction and regular expressions. Mostly standard functionality, nothing too "dirty".
First you need to see what the exported data looks like. You can export to either Plain or Structured text.
The Plain one is not useful at all as the data is all over the place, without any clear pattern.
The Structured one is much better as the data structure resembles the data from the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First you need to ensure that the Structured data contains only the table, nothing else. You can probably use the Before-After string command for that.
Then you need to check whether you can reliably identify the character width of every column. You can try this for yourself if you copy the text into Excel, use Text to Columns with the Fixed Width option, and play around with the sliders.
Then you need to find a way to reliably identify each row and prepare it for the Split command in AA. For that you need a delimiter. But since each data row can actually consist of multiple text rows, you need to create a delimiter of your own. I used the Replace function with the Regular Expression option and replaced a specific pattern with a delimiter (a pipe). See here.
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consist of several text rows, you will need to use Split again, this time with [ENTER] as the delimiter. Now you need to loop through each text line of a single data row, use the Substring function to extract data based on column width, and concatenate the pieces into a single value that you store somewhere else.
All in all, a painful process.
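If it helps to see the idea outside of AA (for example in a Metabot script), here is a rough Python sketch of the delimiter/split/substring steps; the anchor pattern and the column widths are invented, since they depend on your actual Structured export.

    import re

    structured = open('structured_export.txt').read()  # table text only

    # A data row ends with "<number>  $<number>" followed by a blank
    # row; replace that boundary with a pipe so that there is exactly
    # one delimiter per data row (the pattern is a guess, adjust it).
    delimited = re.sub(r'(\d+ {2,}\$[\d,.]+)\r?\n\r?\n', r'\1|', structured)

    widths = [(0, 20), (20, 45), (45, 70), (70, 80), (80, 95)]  # made up

    for data_row in delimited.split('|'):
        columns = [''] * len(widths)
        # A data row may span several text lines; substring each line
        # by column width and concatenate the pieces per column.
        for text_line in data_row.splitlines():
            for i, (start, end) in enumerate(widths):
                piece = text_line[start:end].strip()
                if piece:
                    columns[i] = (columns[i] + ' ' + piece).strip()
        print(columns)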
Solution 2
This may not be applicable, but it's worth a try: open the PDF in Microsoft Word. It will give you a warning; ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier, and you will be able to use Macros/VBA or even simple copy & paste. I tried it on a random PDF of my own and it works quite well.
I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the pdf to svg (via pdf2svg), and then parse the resulting svg to extract single character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, word boxes overlap significantly, and it can well happen that a character box is entirely contained in two word boxes. In this case, the mapping fails, because once I translate to svg I have no information on which character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround: identify the common ligatures and, if the counts don't match, split the corresponding bounding boxes into multiple pieces. But that cannot always work, because, for example, "ffi" is sometimes ligated into a single glyph, sometimes into two glyphs "ff" + "i", and sometimes into two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find an option to output the location of each single character. Converting to svg essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a pdf parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paid option, and replacing my whole infrastructure to handle just one limit case seems like overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
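If you would rather stay in Python than parse the XML output, the same per-glyph information is exposed by pdfminer.six's layout API; here is a minimal sketch (the input file name is a placeholder). Note that LTChar.get_text() can return more than one character for a single glyph, which is exactly the ligature case described above.

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

    # Walk the layout tree and print every glyph with its font and bbox.
    for page_layout in extract_pages('input.pdf'):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        # get_text() can return several characters when
                        # the glyph is a ligature ("fi", "ffi", ...).
                        print(obj.get_text(), obj.fontname, obj.bbox)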
When processing HTML with a template engine, we may get a large file that contains useless whitespace everywhere. Minimizing the whitespace the engine produces is important to accelerate the transform speed.
There are two kinds of whitespace that can be minimized. One is between tags; the other is in the gaps between attributes. The first one seems easy to remove (AbstractTextProcessor). The latter seems hard to process. I wrote a processor, but it seems too inefficient.
SpacesDialect
AttributesInnerWhitespacesProcessor, EmptyTextProcessor
Any alternative ideas?
Thymeleaf doesn't want to add this to its core; instead, it was added in the ecosystem. Check out https://github.com/connect-group/thymeleaf-extras
I want to align two ligands in PyMOL like one would do with protein structures, but I get an error message:
ExecutiveAlign: mobile selection must derive from one object only
I also copied the ligands into separate PDB files and renamed the HETATM entries to ATOM, but I still get this error. I am wondering why PyMOL has problems aligning those small molecules.
PS: Those ligands have similar structure, only different coordinates.
When you align with the align function, PyMOL seeds the structural alignment by doing a sequence alignment first, which does not work for small-molecule ligands that have no residue sequence.
You can use the pair_fit function, but you will have to specify the correspondence between atoms. This function takes two selections, one for each object, that must contain the same number of atoms.
If the ligands have the exact same chemical structure, you can pass the objects directly; if not, you will have to make the appropriate selections.
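For example, scripted through PyMOL's Python API (the object names and atom names here are hypothetical; list whichever matching atoms your ligands share, in corresponding order):

    from pymol import cmd

    # Both selections must contain the same number of atoms,
    # given in corresponding order.
    cmd.pair_fit('lig1 and name C1+C2+C3+N1',
                 'lig2 and name C1+C2+C3+N1')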
Did you do this through the GUI? That's a bug; the align function never works from the GUI. Try doing it from the command line:
align mol1, mol2
Using libclang, I have a cursor into an AST, which corresponds to the statement resulting from a macro expansion. I want to retrieve the original, unexpanded macro text.
I've looked for a libclang API to do this, and can't find one. Am I missing something?
Assuming such an API doesn't exist, I see a couple of ways to go about doing this, both based on using clang_getCursorExtent() to obtain the source range of the cursor - which is, presumably, the range of the original text.
The first idea is to use clang_getFileLocation() to obtain the filename and position of the range start and end, and to read the text directly from the file. If I've compiled from unsaved files then I need to deal with that, but my main concern with this approach is that it just doesn't seem right to go out to the filesystem when I'm sure clang holds all this information internally. There would also be implications if the AST has been loaded rather than generated, or if the source files have been modified since they were parsed.
The second approach is to call clang_tokenize() on the cursor extent. I tried doing this, and found that it fails to produce a token list for most of the cursors in the AST. Tracing into the code, it turns out that internally clang_tokenize() manipulates the supplied range and ends up concluding that it spans multiple files (presumably due to some effect of the macro expansion), and aborts. This seems incorrect to me, but I do feel that in any case I'm abusing clang_tokenize() trying to do this.
So, what's the best approach?
This is the only way I've found.
So you get the top-level cursor with clang_getTranslationUnitCursor(). Then you call clang_visitChildren(), with the visitor returning CXChildVisit_Continue so that only the immediate children are visited. Among the children, you see the usual cursor types for top-level declarations (like CXCursor_TypedefDecl, CXCursor_EnumDecl), but among them there's also CXCursor_MacroExpansion. Every single macro expansion appears to show up in a cursor with this type. You can then call clang_tokenize() on any of these cursors and it gives you the unexpanded macro text.
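For what it's worth, the same walk can be sketched with the Python bindings (clang.cindex); the file name is a placeholder, and note that the translation unit must be parsed with a detailed preprocessing record, or the macro cursors won't be present at all.

    import clang.cindex as ci

    index = ci.Index.create()
    # Macro expansion cursors only exist when the TU is parsed with
    # a detailed preprocessing record.
    tu = index.parse('input.c',
                     options=ci.TranslationUnit.PARSE_DETAILED_PROCESSING_RECORD)

    # Visit only the immediate children of the translation unit cursor.
    for cur in tu.cursor.get_children():
        if cur.kind == ci.CursorKind.MACRO_INSTANTIATION:
            # Tokenizing the expansion cursor yields the unexpanded text.
            text = ' '.join(tok.spelling for tok in cur.get_tokens())
            print(cur.location.line, text)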
I have no idea why macro expansions get stuck near the top of the AST instead of within elements where they get used, it makes things pretty awkward. Example:
enum someEnum {
    one = SOMEMACRO,  /* SOMEMACRO defined elsewhere */
    two,
    three
};
It'd be nice if the macro expansion cursor for SOMEMACRO were within the enum declaration instead of being a sibling to it.
(I realize this is ridiculously late but I'm hoping this gets libclang more exposure, maybe someone more experienced with it can provide more insight).