Using Unified to parse Markdown with HTML inside - parsing

I'm trying to convert GitHub-flavored markdown (GFM) text to a Markdown AST Node. Because GFM allows for HTML tags like <img>, I also want those parsed as Markdown elements as well.
Right now, I'm using a combination of Unified + Turndown to do so. Essentially, I am parsing the Markdown, converting all of it to raw HTML using Unified and several plugins, then converting it to raw markdown using Turndown, then using Unified again to parse the new raw markdown into an AST node, and then finally parsing that node as I want.
This is what I have:
import unified from 'unified';
import markdown from 'remark-parse';
import gfm from 'remark-gfm';
import TurndownService from 'turndown';
import remark2rehype from 'remark-rehype';
import html from 'rehype-stringify';
import raw from 'rehype-raw';
import {parseBlocks} from './parser/internal';
import type {Root} from './markdown';
import type {ParsingOptions} from './types';
const {gfm: turndownGfm} = require('turndown-plugin-gfm');
export async function markdownToBlocks(body: string) {
const turndownService = new TurndownService().use(turndownGfm);
const rawHtml = await unified()
.use(markdown)
.use(gfm)
.use(remark2rehype, {allowDangerousHtml: true})
.use(raw)
.use(html)
.process(body);
const rawMarkdown = turndownService.turndown(String(rawHtml));
const root = unified().use(markdown).use(gfm).parse(rawMarkdown);
return parseBlocks(root as unknown as Root);
}
It does exactly what I want to and works perfectly, however I can't help but think there's a better solution that's more efficient than how I do it right now, since I'm parsing it several times over.
Any advice for how I can improve this with fewer parsing steps would be appreciated, thanks!

I also want those parsed as Markdown elements as well.
HTML inside Markdown is allowed, there is no need to parse it as Markdown, just pass it along as-is.
Otherwise don't re-invent the wheel: Turn the Markdown into HTML and load it as DOMDocument if you need an object model you can traverse.

Related

Google Sheets Import XML Function error: Imported Xml content can not be parsed

I am trying to scrape data from crypto-exchanges. When i use the Import XML function in google sheets is gives the "Imported Xml content can not be parsed." error. The function I am using is
=IMPORTXML("https://www.binance.us/en/trade/pro/ADA_BTC%22,%22//*%5B#id=%22%22__APP%22%22%5D/div/div/div%5B6%5D/div/div/div/div%5B2%5D/div%5B1%5D/div/div%5B1%5D)")
that is the price data i am trying to get into the sheet.
I have tried to look into different x paths but they all return Imported content is empty error. Not sure if i am using the right Xpath.
the issue is with a site that uses JavaScript. google sheets IMPORT formulae do not support scraping of JS elements:
To retrieve informations you need, you can parse the json provided by Binance at https://api1.binance.com/api/v3/ticker/24hr.
function binance(url = 'https://api1.binance.com/api/v3/ticker/24hr') {
var data = JSON.parse(UrlFetchApp.fetch(url).getContentText())
data.forEach(function(d) {if (d.symbol == 'ADABTC') console.log(d.lastPrice)})
}
The other way is to use websockets at wss://stream.binance.com:9443/ws with this stream
{"method":"SUBSCRIBE","params":["adabtc#kline_15m","adabtc#kline_5m","adabtc#kline_1m"],"id":1}
you will recieve a json as follows
{"e":"kline","E":1649207018439,"s":"ADABTC","k":{"t":1649206800000,"T":1649207699999,"s":"ADABTC","i":"15m","f":106692274,"L":106692475,"o":"0.00002555","c":"0.00002551","h":"0.00002556","l":"0.00002550","v":"89755.10000000","n":202,"x":false,"q":"2.29071565","V":"27199.10000000","Q":"0.69455387","B":"0"}}

Read data from XLSX provided as XSTRING

An Excel file (.xlsx) is uploaded on the frontend which is UI5 Fiori.
The file contents come to SAP ABAP backend via ODATA in XSTRING format.
I need to store that XSTRING into an internal table and then in a DDIC table. Eg: Suppose the Excel has 5 columns then I want to store that data of 5 columns in the corresponding columns in the DDIC table.
I have tried various Function Modules like:
SCMS_XSTRING_TO_BINARY
SCMS_BINARY_TO_STRING
and following Classes & methods:
cl_bcs_convert=>raw_to_string
cl_soap_xml_helper=>xstring_to_string
but none were able to convert the XSTRING to STRING.
Can you please suggest which function module or class/method can be used to solve the problem?
For most comfort, use abap2xlsx.
If you cannot or do not want to use that, you can alternatively parse the Excel file on your own. .xlsx files are basically .zip files with a different file ending. Use cl_abap_zip->load to open the xstring you receive and ->get to extract the individual files from the zip. Afterwards, use XML parsers like cl_ixml or transformations to parse the XML content of the files.
Note that Excel's XML is a complicated file format, with several files that work together to form the worksheets. Refer to Microsoft's File format reference for Word, Excel, and PowerPoint for details. It's non-trivial to interpret this, so you will usually be a lot happier with abap2xlsx.
abap2xlsx is the most powerful and feature-rich way of doing this, as said by Florian, it supports styles, charts, complex tables, however it may not be always available due to the system limitations, restrictions to install custom packages in system or whatever.
Here is the way how to accomplish this with pure standard without using custom frameworks.
Since Netweaver 7.02 SAP supports Open Microsoft formats natively and provides classes for handling them: CL_XLSX_DOCUMENT, CL_DOCX_DOCUMENT and CL_PPTX_DOCUMENT, abap2xlsx is built at these classes too, yes. So let's start a bit of reinventing the wheel.
XLSX file is an OpenXML archive of files, of which the most interesting: sheet1.xml and sharedStrings.xml. Let's build a sample based on MARC table fields
Now you want to transfer this table to internal table with the same structure. The steps would be:
Extract needed files from XLSX archive
Read worksheet structure from sheet1.xml
Read sheet values from sharedStrings.xml
Map them together and write the result to the internal table
Here is the sample class that handles the job, I used the cl_openxml_helper applet to load XLSX, but you can receive XSTRINGed XLSX in whatever way.
CLASS xlsx_reader DEFINITION.
PUBLIC SECTION.
TYPES: BEGIN OF ty_marc,
matnr TYPE char20,
werks TYPE char20,
disls TYPE char20,
ekgrp TYPE char20,
dismm TYPE char20,
END OF ty_marc,
tt_marc TYPE STANDARD TABLE OF ty_marc WITH EMPTY KEY.
METHODS: read RETURNING VALUE(tab) TYPE tt_marc,
extract_xml IMPORTING index TYPE i
xstring TYPE xstring
RETURNING VALUE(rv_xml_data) TYPE xstring.
ENDCLASS.
CLASS xlsx_reader IMPLEMENTATION.
METHOD read.
TYPES: BEGIN OF ty_row,
value TYPE string,
index TYPE abap_bool,
END OF ty_row,
BEGIN OF ty_worksheet,
row_id TYPE i,
row TYPE TABLE OF ty_row WITH EMPTY KEY,
END OF ty_worksheet,
BEGIN OF ty_si,
t TYPE string,
END OF ty_si.
DATA: data TYPE TABLE OF ty_si,
sheet TYPE TABLE OF ty_worksheet.
TRY.
DATA(xstring_xlsx) = cl_openxml_helper=>load_local_file( 'C:\marc.xlsx' ).
CATCH cx_openxml_not_found.
ENDTRY.
"Read the sheet XML
DATA(xml_sheet) = extract_xml( EXPORTING xstring = xstring_xlsx iv_xml_index = 2 ).
"Read the data XML
DATA(xml_data) = extract_xml( EXPORTING xstring = xstring_xlsx iv_xml_index = 3 ).
TRY.
* transforming structure into ABAP
CALL TRANSFORMATION zsheet
SOURCE XML xml_sheet
RESULT root = sheet.
* transforming data into ABAP
CALL TRANSFORMATION zxlsx_data
SOURCE XML xml_data
RESULT root = data.
CATCH cx_xslt_exception.
CATCH cx_st_match_element.
CATCH cx_st_ref_access.
ENDTRY.
* mapping structure and data
LOOP AT sheet ASSIGNING FIELD-SYMBOL(<fs_row>).
APPEND INITIAL LINE TO tab ASSIGNING FIELD-SYMBOL(<line>).
LOOP AT <fs_row>-row ASSIGNING FIELD-SYMBOL(<fs_cell>).
ASSIGN COMPONENT sy-tabix OF STRUCTURE <line> TO FIELD-SYMBOL(<fs_field>).
CHECK sy-subrc = 0.
<fs_field> = COND #( WHEN <fs_cell>-index = abap_false THEN <fs_cell>-value ELSE VALUE #( data[ <fs_cell>-value + 1 ]-t OPTIONAL ) ).
ENDLOOP.
ENDLOOP.
ENDMETHOD.
METHOD extract_xml.
TRY.
DATA(lo_package) = cl_xlsx_document=>load_document( iv_data = xstring ).
DATA(lo_parts) = lo_package->get_parts( ).
CHECK lo_parts IS BOUND AND lo_package IS BOUND.
DATA(lv_uri) = lo_parts->get_part( 2 )->get_parts( )->get_part( index )->get_uri( )->get_uri( ).
DATA(lo_xml_part) = lo_package->get_part_by_uri( cl_openxml_parturi=>create_from_partname( lv_uri ) ).
rv_xml_data = lo_xml_part->get_data( ).
CATCH cx_openxml_format cx_openxml_not_found.
ENDTRY.
ENDMETHOD.
ENDCLASS.
zsheet transformation:
<?sap.transform simple?>
<tt:transform xmlns:tt="http://www.sap.com/transformation-templates" template="main">
<tt:root name="root"/>
<tt:template name="main">
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac=
"http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" xmlns:xr3=
"http://schemas.microsoft.com/office/spreadsheetml/2016/revision3">
<tt:skip count="4"/>
<sheetData>
<tt:loop name="row" ref="root">
<row>
<tt:attribute name="r" value-ref="row_id"/>
<tt:loop name="cells" ref="$row.ROW">
<c>
<tt:cond><tt:attribute name="t" value-ref="index"/><tt:assign to-ref="index" val="C('X')"/></tt:cond>
<v><tt:value ref="value"/></v>
</c>
</tt:loop>
</row>
</tt:loop>
</sheetData>
<tt:skip count="2"/>
</worksheet>
</tt:template>
</tt:transform>
zxlsx_data transformation
<?sap.transform simple?>
<tt:transform xmlns:tt="http://www.sap.com/transformation-templates" template="main">
<tt:root name="ROOT"/>
<tt:template name="main">
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<tt:loop name="line" ref=".ROOT">
<si>
<t>
<tt:value ref="t"/>
</t>
</si>
</tt:loop>
</sst>
</tt:template>
</tt:transform>
Here is how to call it:
START-OF-SELECTION.
DATA(reader) = NEW xlsx_reader( ).
DATA(marc) = reader->read( ).
The code is pretty self-explanatory, but let's put a couple of notes:
File sheet1.xml contains a special attribute t in each cell which denotes either the value should be treated as a literal or a reference to sharedStrings.xml
I used two simple transformations but XSLT can be used as well, possibly allowing you to reduce all XML stuff to single transformation
I deliberately used generic char20 types to be able to handle headers. If you wanna preserve native types, then you cannot read table header (skip the first line in sheet LOOP), because you'll receive type violation and dump. If you receive table without headers, then it is fine to declare structure with native types
If you don't want to use transformations then sXML is your friend. You can parse XML with classes as well, but ST transformation are considerably faster
With some additional effort you can make this snippet dynamic and parse XLSX with any structure
You can read more about this approach in this doc.

Dart Markdown package, how to handle new lines

I am trying to make a WYSIWYG internal tool. And we decided to implement this feature with contentEditable. However, we save data to our databases in markdown. So I have to be able to parse from html to md and back. For html to md I use package html2md and for the other way around I use Markdown package.
The issue i've been having is that when you write to my editor text like
HEY
After many lines some text
It produces this in md
HEY
After many lines some text
Notably it uses 2 whitespace and 2 LF characters (or atleast i think so but i might be slightly wrong.) I solved this issue by parsing it like this
markdownToHtml(data.replaceAll('&', '&').replaceAll('<', '<').replaceAll('>', '>'), inlineSyntaxes: [TextSyntax(String.fromCharCodes([32,32,10,10]),sub: "<div><br></div>")],inlineOnly: true );
The inline only parameter was neccesary because without it the text syntax wasnt applied for some reason. However this inline only then bit me in the arse when I tried to implement parsing of unordered lists, which are parsed as blocks. So I need a way to correctly parse these empty lines without using inline only.
class EmptyLineBlockSyntax extends BlockSyntax{
RegExp get pattern => RegExp(r'^(?:[ \t][ \t]+)$');
const EmptyLineBlockSyntax();
Node parse(BlockParser parser) {
parser.encounteredBlankLine = true;
parser.advance();
return Element('p',[Element.empty('br')]);
}
}
return markdownToHtml(data.replaceAll('&', '&').replaceAll('<', '<').replaceAll('>', '>'), blockSyntaxes: [EmptyLineBlockSyntax()]);

How to retrieve an XMLTYPE data that contains special characters?

I want to retrieve XMLTYPE data from an Oracle table using cx_oracle.
the data looks like this:
<infos>
<Comment/>
<Observation>àéèç</Observation>
<Level>L3</Level>
<Duration/>
<Cause/>
<Depot> Haren </Depot>
<Resolution/>
</infos>
Here's my code:
#!/usr/bin/python
from __future__ import print_function
import cx_Oracle
# Connection to RTDIAG
try:
dsn_test = cx_Oracle.makedsn(host='xxxxx',port='1521',service_name='xxxxx')
con_test = cx_Oracle.connect(user='xxxx', password='xxxxx',dsn=xxxx)
except cx_Oracle.InterfaceError:
print ("Impossible to connect to the DB!")
print ("***exit script***")
quit()
ID_record = 1729
cursor = con_test.cursor()
query = """select a.content.getClobVal() from emb_log a where ID = :id and uncompleted_record=1
"""
cursor.execute(query,id=1729)
xml_retrieved = cursor.fetchone()[0].read() #string
print (xml_retrieved)
Here's what I get
<infos>
<Comment/>
<Observation>aeec</Observation>
<Level>L3</Level>
<Duration/>
<Cause/>
<Depot> Haren </Depot>
<Resolution/>
</infos>
The special characters contained within the XML child is not being retrieved proprely. They are converted in 'ascii like' characters.
Why and how to fetch the XML exactly the way it appears in the DB ?
Thank you.
Set your NLS environment. You will probably find it easiest to use
the
encoding
option when you connect.
for performance, you will want to fetch the CLOB via an OutputTypeHandler

Preprocessing Scala parser Reader input

I have a file containing a text representation of an object. I have written a combinator parser grammar that parses the text and returns the object. In the text, "#" is a comment delimiter: everything from that character to the end of the line is ignored. Blank lines are also ignored. I want to process text one line at a time, so that I can handle very large files.
I don't want to clutter up my parser grammar with generic comment and blank line logic. I'd like to remove these as a preprocessing step. Converting the file to an iterator over line I can do something like this:
Source.fromFile("file.txt").getLines.map(_.replaceAll("#.*", "").trim).filter(!_.isEmpty)
How can I pass the output of an expression like that into a combinator parser? I can't figure out how to create a Reader object out of a filtered expression like this. The Java FileReader interface doesn't work that way.
Is there a way to do this, or should I put my comment and blank line logic in the parser grammar? If the latter, is there some util.parsing package that already does this for me?
The simplest way to do this is to use the fromLines method on PagedSeq:
import scala.collection.immutable.PagedSeq
import scala.io.Source
import scala.util.parsing.input.PagedSeqReader
val lines = Source.fromFile("file.txt").getLines.map(
_.replaceAll("#.*", "").trim
).filterNot(_.isEmpty)
val reader = new PagedSeqReader(PagedSeq.fromLines(lines))
And now you've got a scala.util.parsing.input.Reader that you can plug into your parser. This is essentially what happens when you parse a java.io.Reader, anyway—it immediately gets wrapped in a PagedSeqReader.
Not the prettiest code you'll ever write, but you could go through a new Source as follows:
val SEP = System.getProperty("line.separator")
def lineMap(fileName : String, trans : String=>String) : Source = {
Source.fromIterable(
Source.fromFile(fileName).getLines.flatMap(
line => trans(line) + SEP
).toIterable
)
}
Explanation: flatMap will produce an iterator on characters, which you can turn into an Iterable, which you can use to build a new Source. You need the extra SEP because getLines removes it by default (using \n may not work as Source will not properly separate the lines).
If you want to apply filtering too, i.e. remove some of the lines, you could for instance try:
// whenever `trans` returns `None`, the line is dropped.
def lineMapFilter(fileName : String, trans : String=>Option[String]) : Source = {
Source.fromIterable(
Source.fromFile(fileName).getLines.flatMap(
line => trans(line).map(_ + SEP).getOrElse("")
).toIterable
)
}
As an example:
lineMapFilter("in.txt", line => if(line.isEmpty) None else Some(line.reverse))
...will remove empty lines and reverse non-empty ones.

Resources