Tika and parsing data with row and col spans

I have been searching for this for the past 2 days, but it is difficult to find: any Google search that includes "col spans" returns a wide range of irrelevant documents.
The question: is it possible to use the Apache Tika parser to retrieve parsed data from different types of documents, with the col spans and row spans preserved, as XHTML? If yes, is there a tutorial or any document that can help me with that?

Unfortunately not, out of the box.
You would need to extend the base library used to parse spreadsheets to get this information into the Tika output.
An alternative would be to use EPPlus.
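
For readers who do obtain span information as XHTML by some other route, the following is a minimal sketch of how such a table could be expanded into a plain grid. It does not use Tika, and per the answer above Tika does not emit these attributes out of the box. The names SpanTableParser and expand_spans are hypothetical, and the sketch assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class SpanTableParser(HTMLParser):
    """Collect table cells together with their rowspan/colspan values."""

    def __init__(self):
        super().__init__()
        self.rows = []     # each row is a list of (text, rowspan, colspan)
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_spans(xhtml):
    """Return a rectangular grid where spanned cells repeat their text."""
    parser = SpanTableParser()
    parser.feed(xhtml)
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # skip slots filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


sample = (
    '<table><tr><td rowspan="2">A</td><td colspan="2">B</td></tr>'
    "<tr><td>C</td><td>D</td></tr></table>"
)
grid = expand_spans(sample)
```

Repeating the text in every covered slot is only one possible convention; leaving the extra slots empty would work equally well if the consumer needs to tell merged cells apart.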

Related

Google Sheets: Parsing users who are Program Managers

I have a spreadsheet with 50+ rows that is constantly being updated. I am trying to retrieve the users who are Program Managers (PGM) by parsing text, but I am having a hard time because the data is filled out by 20+ users and is not consistent.
I googled "google sheet parse text", but it only turns up functions such as =SEARCH, =LEN, and =LEFT, which I cannot use because my data is not consistent. Are there any other options, or am I out of luck and must parse my info manually? Thanks in advance.
Google Sheet Link Example

In C2, use:
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(B2:B,"PGM:.*")))

You may try:
=byrow(B2:B,lambda(z,if(z="",,ifna(regexextract(z,"(PGM:.*)(?:\n|$)")))))
Note that the result for row 6 varies slightly.
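
To see what the "PGM:.*" pattern in the formulas above captures, here is a small Python sketch of the same idea. The sample cell contents and the name extract_pgm are made up for illustration; the real sheet data is not shown in the question.

```python
import re

# Same pattern as in the formulas: "PGM:" and the rest of that line.
# "." does not match a newline, so the match stops at the end of the line.
PGM_PATTERN = re.compile(r"PGM:.*")


def extract_pgm(cell):
    """Return the 'PGM: ...' part of a cell, or '' when there is none."""
    match = PGM_PATTERN.search(cell or "")
    return match.group(0).strip() if match else ""


# Hypothetical cell contents, one string per row of column B.
cells = ["Owner: Ann\nPGM: Bob Lee\nQA: Cy", "no manager listed", ""]
results = [extract_pgm(c) for c in cells]
```

Rows without a "PGM:" entry yield an empty string, which mirrors what IFERROR and IFNA do in the two formulas.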

Find sector by share symbol in google sheets

Is there a way to print out the GICS sector name for a specific share/ETF symbol in google sheets using the GOOGLEFINANCE commands or any other way?
Many thanks

I used this site to find several scraping methods to get data from finviz.
https://decodingmarkets.com/scrape-stock-data-from-finviz/
Extending their logic, I was able to get the company name, and the combined sector/subsector codes
(I originally used the website's scraping techniques to get Dividend data that GoogleFinance formula lacks...)
This formula gets the company name using US ticker symbol in cell C3:
=SUBSTITUTE(INDEX(IMPORTHTML("http://finviz.com/quote.ashx?t="&C3,"table",6),2,1),"*","")
Through trial and error, I found that table 6 has name and sectors. I then referenced the 2nd row and 1st column to get the name.
I found that row 3, column 1 has the sector, subsector and country combined as one value. They use a pipe | delimiter for each break.
Using the SPLIT function, I was able to separate the segments:
=SPLIT(SUBSTITUTE(INDEX(IMPORTHTML("http://finviz.com/quote.ashx?t="&C3,"table",6),3,1),"*",""),"|",true,true)
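
The string handling in that last formula (drop the asterisks, then split on the pipe) can be sketched in Python as follows. The sample value is hypothetical and only imitates the "sector | subsector | country" shape described above; the function name split_sector_field is also made up, and the whitespace trimming is an addition of this sketch.

```python
def split_sector_field(raw):
    """Remove '*' characters, split on '|', and drop empty pieces."""
    cleaned = raw.replace("*", "")
    return [part.strip() for part in cleaned.split("|") if part.strip()]


# Hypothetical value for row 3, column 1 of the scraped table.
sample = "Technology | Consumer Electronics | USA"
parts = split_sector_field(sample)
```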

It's not available from Sheets.
Check out the official docs:
https://support.google.com/docs/answer/3093281?hl=en
It has a lot of options but, unfortunately, not that one.
If you think it would be useful, file a feature request at
https://developers.google.com/issue-tracker
As for any other way,
@GSee said it best here: https://stackoverflow.com/a/16525782/10445017

QUERY function not including text cells along with number cells in result

I'm currently trying to copy a list using the QUERY function in google-sheets.
The problem I'm now facing is that words and letters are not included in the result.
Example picture
I'm using the function "=QUERY(E2:F5;)", but the words are not included.
Is there any way to include these words, using the formula above as a guide?

In google-sheets, apply Format > Number > Plain text to your source range E2:F5 and your original formula will work.
=QUERY(E2:F5)
From Docs Editor Help - QUERY function
In case of mixed data types in a single column, the majority data type determines the data type of the column for query purposes. Minority data types are considered null values.
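
The quoted rule can be illustrated with a simplified Python model. This is not how Sheets is implemented; it only mimics the documented majority-type behaviour, and the handling of ties here is an assumption because the documentation quoted above does not cover them.

```python
from collections import Counter


def kind(value):
    """Classify a cell value roughly the way a typed column would."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "text"


def apply_majority_type(column):
    """Keep values of the majority type; replace the rest with None (null)."""
    counts = Counter(kind(v) for v in column if v is not None)
    if not counts:
        return list(column)
    # Assumption: on a tie, the type seen first wins.
    majority = counts.most_common(1)[0][0]
    return [v if v is not None and kind(v) == majority else None for v in column]


mixed = [10, 20, "abc", 30]
seen_by_query = apply_majority_type(mixed)
as_plain_text = apply_majority_type([str(v) for v in mixed])
```

In this model the mixed column loses its single text value, while the column converted to text keeps everything, which is why formatting the range as plain text makes the words appear.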

How to convert HTML formatting inside google sheets rows to their correct formatted equivalent

I have been looking for a solution to convert a database I have with HTML formatting in one of the columns to its "normal" text equivalency in google sheets. A lot of the solutions I've found dealt with writing programs to do this or using Excel, so they unfortunately didn't pertain well enough.
For example, in one of my columns I have:
Fast (<i> This character deals damage before non-<b>Fast</b> characters in combat.</i>)
But I would like to be able to have a somewhat streamlined solution to convert the above to:
Fast (This character deals damage before non-Fast characters in combat.)

AFAIK, with the Google Sheets API, you can only format text within a cell. As mentioned in the documentation, conditional formatting lets you format cells so that their appearance changes dynamically according to the value they contain, or to values in other cells.
You may want to request a new feature here.
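
If plain text is all that is needed, the conversion in the question amounts to removing the tags. Below is a minimal Python sketch, assuming only simple inline tags such as <i> and <b>; a regex like this is not a general HTML parser, and the name strip_tags is made up. The same tag pattern could probably also be tried inside Sheets with REGEXREPLACE, although that has not been verified against the asker's data.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # any single tag, e.g. <i> or </b>


def strip_tags(text):
    """Remove simple inline tags and tidy the leftover whitespace."""
    no_tags = TAG.sub("", text)
    unescaped = html.unescape(no_tags)           # turn &amp; etc. into characters
    collapsed = re.sub(r"\s+", " ", unescaped)   # collapse runs of whitespace
    return re.sub(r"\(\s+", "(", collapsed).strip()  # no space after "("


source = ("Fast (<i> This character deals damage before "
          "non-<b>Fast</b> characters in combat.</i>)")
plain = strip_tags(source)
```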

Reading Excel formulae using Ruby

I'm trying to use the Spreadsheet gem to parse XLS files that store information about school courses. These XLS files are automatically generated, so I cannot change the presentation of data.
Course schedules are saved as a list of characters, with dashes representing days in which the class does not meet. An example would be "3--33--", meaning the class meets during block 3 on days 1, 4, and 5 in the rotation. Excel parses some of these schedules as formulae, meaning that I need to read the formula itself from certain cells.
The problem is that when I try to read the data from a formula cell, using cell.data, the result is a string like \r\x00\x1F\x00\x00\x00\x00\x00\xD0\x84\xC0\x1EB\x00\x04. I'm assuming that this is Ruby's attempt to print the data as ASCII text. After some research, I have learned that Excel stores formulae in RPN format.
In short: I'm not sure how to go about reading a formula (the formula itself, not the formula's calculated value) from an Excel spreadsheet. I can't change the input Excel spreadsheet, and having a purely Ruby solution would be nice, since I'm planning on using this with Rails.

A different approach is:
convert it to csv using xls2csv: http://linux.die.net/man/1/xls2csv
read it using the ruby standard lib: http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html
I hope this can help you.
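
Once the cell text is available as a plain string (for example from the CSV export), decoding a schedule such as "3--33--" is straightforward. The sketch below is in Python rather than Ruby and only illustrates the format described in the question; parse_schedule is a hypothetical name, and the logic should port directly to Ruby.

```python
def parse_schedule(schedule):
    """Map each 1-based day in the rotation to its block, skipping dashes."""
    return {day: char for day, char in enumerate(schedule, start=1) if char != "-"}


# "3--33--": block 3 on days 1, 4, and 5 of the rotation.
meetings = parse_schedule("3--33--")
```

Blocks are kept as strings here, on the assumption that a block label is not guaranteed to be numeric.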
