Remark: How to parse HTML tags and their content in MDAST

I'm trying to parse a GitHub-flavoured Markdown file using Unified and remark-parse to generate an MDAST. Most of it parses correctly and easily, but I'm having trouble handling HTML tags and their content in the AST.
In the AST, HTML tags and their contents are represented as siblings, not as parent and child. For example, <sub>hi</sub> is parsed into:
[
  {
    "type": "paragraph",
    "children": [
      {
        "type": "html",
        "value": "<sub>"
      },
      {
        "type": "text",
        "value": "hi"
      },
      {
        "type": "html",
        "value": "</sub>"
      }
    ]
  }
]
Ideally, I would want it to be parsed like:
[
  {
    "type": "paragraph",
    "children": [
      {
        "type": "html",
        "value": "sub",
        "children": [
          {
            "type": "text",
            "value": "hi"
          }
        ]
      }
    ]
  }
]
so that I can access the tag type and its content. (Specifically, my goal is just to skip over the tags and their content, as they are not needed for my purposes.)
This is the configuration I am using currently:
import unified from 'unified';
import markdown from 'remark-parse';
import type {Block} from '@notionhq/client/build/src/api-types';
import {parseRoot} from './internal';
import gfm from 'remark-gfm';

export function parseBody(body: string): Block[] {
  const tokens = unified().use(markdown).use(gfm).parse(body);
  return parseRoot(tokens);
}
So, my question is: Is there a way of configuring Remark to do so / is there a Remark plugin to do this? If not, how would I go about creating a plugin that does so?
Thanks.

First: why the AST looks as it does, and why Remark most likely has no option to do it differently
The AST represents it that way because that is what the CommonMark specification specifies for raw inline HTML and for HTML blocks. Specifically, CommonMark specifies that HTML tags are passed through, not parsed.
For inline HTML, the spec supports inline HTML tags, which is not the same as supporting inline HTML. Tags are simply passed through as-is. There is no matching of opening and closing tags. The reasons for this are:
performance
parser complexity
HTML tags are only supported as a "use at your own risk" "last resort" option when Markdown doesn't have a feature you need.
For a small number of HTML tags, open and close tag matching is supported at the block level: pre, script, style, and textarea, the last of which was added only in v0.30 of the spec.
You can read the above linked parts of the spec, and search the discussions in the CommonMark forum to get more understanding of the whys, but to get right to the point, read:
This explanation within the spec for the choices made.
The Raw HTML section of this forum post (https://talk.commonmark.org/t/beyond-markdown/2787?u=vas) by the CommonMark spec author and maintainer, John MacFarlane (@jgm).
This forum question and also this one, and @jgm's answers.
Second: what you can do about it
Remark is "part of the unified collective", an infrastructure centered on the processing of ASTs (abstract syntax trees). From your question, it sounds like you already get this.
There is lots of help on unified's pages for how to write plugins:
https://github.com/unifiedjs/unified
https://unifiedjs.com/learn/guide/create-a-plugin/
But the best way to both learn how to do this and to get a quick jump on an implementation is to look at the many existing mdast-specific manipulators.
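For the goal stated in the question (skipping tags and their content), one option is a small pass over sibling nodes. The following is a minimal sketch, not an existing remark plugin: MdastNode is a simplified stand-in for the real mdast types, and stripHtml, OPEN_TAG, CLOSE_TAG and VOID_TAGS are hypothetical names. It assumes each inline tag arrives as its own html node, as in the AST shown in the question, and that opening tags are properly closed.

```typescript
// Simplified stand-in for mdast nodes (assumption; the real types live in @types/mdast).
type MdastNode = { type: string; value?: string; children?: MdastNode[] };

// Tags that never have a closing tag, so they must not open a skipped span.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

const OPEN_TAG = /^<([A-Za-z][A-Za-z0-9-]*)(?:\s[^>]*)?>$/;
const CLOSE_TAG = /^<\/([A-Za-z][A-Za-z0-9-]*)\s*>$/;

// Returns a copy of `nodes` without html nodes and without anything
// that sits between an opening tag and its matching closing tag.
function stripHtml(nodes: MdastNode[]): MdastNode[] {
  const kept: MdastNode[] = [];
  const openTags: string[] = []; // stack of currently open tag names

  for (const node of nodes) {
    if (node.type === "html") {
      const value = (node.value ?? "").trim();
      const opened = OPEN_TAG.exec(value)?.[1]?.toLowerCase();
      const closed = CLOSE_TAG.exec(value)?.[1]?.toLowerCase();
      if (opened && !value.endsWith("/>") && !VOID_TAGS.has(opened)) {
        openTags.push(opened);
      } else if (closed) {
        const index = openTags.lastIndexOf(closed);
        if (index !== -1) openTags.length = index; // pop through the matching opener
      }
      continue; // html nodes themselves are always dropped
    }
    if (openTags.length > 0) continue; // content inside a tag is skipped
    kept.push(node.children ? { ...node, children: stripHtml(node.children) } : node);
  }
  return kept;
}
```

In a setup like the one in the question, this could be applied to the root node's children before handing the tree on. An unclosed opening tag would swallow the remaining siblings, and attribute values containing > are not recognised, so real input may need a stricter tag matcher.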

Related

Unfurl links blocks template

I'm using the link_shared events to unfurl links in my workspace, trying to generate a template that is as close to Slack's unfurling template as possible, but I have several issues -
Blocks have very large spacing between them, causing my 3 blocks to take a lot of space
I'm unable to have an image inlined with the text for the title, unless I'm using context, but this is causing the text to be very small.
Taking Slack's example of how link unfurling should look and trying to mimic it with blocks should explain the differences. This is the blocks message, and here you can see the result as an image.
So my main question is - does Slack use some internal blocks formatting not available in the API, or is it possible to achieve the same result?
Thanks a lot!
{
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": ":pager: *Slack*"
      }
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*<https://slack.com/features|Features>*"
      }
    },
    {
      "type": "image",
      "title": {
        "type": "plain_text",
        "text": "Slack is where work flows. It's where the people you need, the information you share, and the tool you use come together to get things done.",
        "emoji": true
      },
      "image_url": "https://a.slack-edge.com/13f94ee/marketing/img/homepage/self-serve-campaign/unfurl/img-unfurl-ss-campaign.jpg",
      "alt_text": "Slack"
    }
  ]
}
That example is not using the Slack block unfurl - it's an example of how a generic link would be displayed using the page's meta tags to display some additional information, using the favicon image.
If you wanted to create something similar, you could use a markdown block and an image block (like this), but the file size would be displayed on a new line rather than after the text.
It took a bit of playing around, but I realized Slack is actually using message attachments (the legacy version of message formatting) in order to generate their link unfurls.
For example, if you want to unfurl a GitHub repository link, this is the payload you should send, and it'll generate an almost identical unfurling to what Slack is generating (a small Added by {app-name} will be added to the footer) -
unfurls["https://github.com/slackapi/bolt-js/"] = {
  author_name: "GitHub",
  author_icon: "https://a.slack-edge.com/80588/img/unfurl_icons/github.png",
  title: "GitHub - slackapi/bolt-js: A framework to build Slack apps using JavaScript",
  title_link: "https://github.com/slackapi/bolt-js/",
  text: "A framework to build Slack apps using JavaScript. Contribute to slackapi/bolt-js development by creating an account on GitHub.",
  image_url: "https://opengraph.githubassets.com/3e06f7eee96f05a53cd4905af3b296dfe333be7a902bb3e6a095770e87fd17fe/slackapi/bolt-js"
}

Is there a way to generate Contentful rich text on a web page?

Contentful has its own rich text format, in a JSON-style form:
{
  "nodeType": "document",
  "data": {},
  "content": [
    {
      "nodeType": "paragraph", // Can be paragraphs, images, lists, embedded entries
      "data": {},
      "content": [
        {
          "nodeType": "text",
          "value": "This text is ",
          "data": {},
          "marks": []
        },
        {
          "nodeType": "text",
          "value": "important",
          "data": {},
          "marks": [
            { "type": "bold" } // Can be bold, underline, italics
          ]
        }
      ]
    }
  ]
}
Is there any way to generate this type of rich text outside of Contentful?
I'd love to use other rich text editors, for example vue-quill-editor.
But they generate HTML as output, so there is no way for me to add the content to a Contentful database in a meaningful way.
Interested in ideas on this.
Contentful DevRel here.👋
If this editor is producing HTML it won't be compatible with Contentful's RichText field type.
If you want to use your own editor inside the Contentful UI you can always use the App Framework to extend the interface. The App Framework allows you to define custom UI in various locations such as an entry's field. If you want to use this HTML-based editor you could create an app using the App Framework and render it in a "long text" field to store your HTML.
But be aware: if you store HTML in Contentful, you're losing cross-platform content portability. Contentful is JSON-based to keep the content platform-agnostic. If you start storing HTML, you might have a hard time reusing your content on platforms that are not able to render HTML (e.g. iOS).
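For content that does not have to come out of an HTML-based editor, JSON in the shape shown in the question can also be assembled directly. This is a minimal sketch that assumes the shape in the question is accurate; richText, richParagraph and richDocument are hypothetical helper names, not part of any Contentful SDK, and the mark names are an assumption as well.

```typescript
// Mark names are an assumption based on the example in the question.
type MarkType = "bold" | "italic" | "underline";

type RichTextLeaf = {
  nodeType: "text";
  value: string;
  data: Record<string, unknown>;
  marks: { type: MarkType }[];
};
type RichParagraph = {
  nodeType: "paragraph";
  data: Record<string, unknown>;
  content: RichTextLeaf[];
};
type RichDocument = {
  nodeType: "document";
  data: Record<string, unknown>;
  content: RichParagraph[];
};

// A run of text with zero or more marks applied to it.
function richText(value: string, ...marks: MarkType[]): RichTextLeaf {
  return { nodeType: "text", value, data: {}, marks: marks.map((type) => ({ type })) };
}

// A paragraph made of text runs.
function richParagraph(...content: RichTextLeaf[]): RichParagraph {
  return { nodeType: "paragraph", data: {}, content };
}

// The top-level document node wrapping all paragraphs.
function richDocument(...content: RichParagraph[]): RichDocument {
  return { nodeType: "document", data: {}, content };
}
```

Calling richDocument(richParagraph(richText("This text is "), richText("important", "bold"))) reproduces the structure from the question. Mapping an editor's own output model onto these helpers would be a separate step that depends on the editor.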

System for data validation and class generation (Avro vs Json Schema vs OpenAPI)

We want a system that lets us define data schemas that we can use to validate our data and to generate code in specific languages. We found JSON Schema, which lets us do something like:
File "message.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "Message",
  "properties": {
    "name": {
      "type": "string"
    },
    "type": {
      "$ref": "type/message_type.schema.json"
    },
    "message_id": {
      "$ref": "type/uuid.schema.json"
    }
  },
  "required": ["name", "message_id"]
}
File "message_type.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "MessageType",
  "enum": ["Message", "Query"]
}
File "uuid_type.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "UUID",
  "type": "string",
  "pattern": "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
}
File "query.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "Query",
  "allOf": [{ "$ref": "type/message.schema.json" }],
  "required": ["type"]
}
Please ignore anything that doesn't make sense; the point is that we really enjoy this system because it lets us define types, refer to types that we create in other files, and even use them for something like type inheritance.
Then we want to use these files for code generation and validation. In Python we use a library called python_jsonschema_objects, which can parse these files and the files they reference recursively, so we can simply create a Python object with all the validation included.
But we also want to use them for Java/Kotlin, and the library we found, jsonschema2pojo, doesn't seem able to parse linked files; it expects everything to be in the same file.
This leads us to think that, for some reason, JSON Schema is unfortunately not that well supported or widely used.
So our question is whether a system like Avro or OpenAPI would be better supported and more widely used, and would be a better choice for this type of task.

Obj. C - how to deal with wrong written JSON

I have been given a badly written JSON file that can't be read by AFNetworking or any other JSON serialization library. I must stress that I can't get it changed on the server side, so I have to parse it as is.
Unfortunately it has some minor errors (I will paste a small part of it):
locations = [{
  "city": "Tokio",
  (...)
  "link": "http://somethig.com",
  "text": "Mon-Fr.",
}, {
  (... same repeated mistakes, but they're not regular)
}]
And to parse it correctly in Xcode I need to change it to the correct format, e.g.:
{
  "locations": [{
    "city": "Tokio",
    (...)
    "link": "http://somethig.com",
    "text": "Mon-Fr."
  }, {
    (...)
  }]
}
Do you have any idea how to deal with that?
If I have to write my own parser, please advise me how to. Any help will be appreciated. I download this JSON from an HTTP link.
It looks like the API is sending you a string of JavaScript code. In JavaScript, you would use JSON.stringify() to convert the object to valid JSON. Here's an example from an interactive node.js shell:
> locations = [{
... "city": "Tokio",
... "link": "http://somethig.com",
... "text": "Mon-Fr.",
... }]
[ { city: 'Tokio', link: 'http://somethig.com', text: 'Mon-Fr.' } ]
> JSON.stringify(locations)
'[{"city":"Tokio","link":"http://somethig.com","text":"Mon-Fr."}]'
If you really can't change this on the server side, you might try creating a hidden UIWebView, adding the JavaScript, calling JSON.stringify(), and extracting the result. However, this will use many more resources than are needed. It would be much better (in terms of computing power, memory, and your time) to have the API call JSON.stringify() and send you a valid JSON string.
You could also try porting some of json3.js to Objective-C or Swift. I would start with the section beginning // Public: `JSON.stringify`. if you attempt this.
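If the payload's only defects are the two visible in the question (a leading name = assignment and trailing commas), a narrow text repair before ordinary JSON parsing may be enough. The sketch below is shown in TypeScript for brevity and would need porting to Objective-C or Swift; repairJsLiteral is a hypothetical helper, not part of any library, and it assumes double-quoted keys and strings.

```typescript
// Rewrites `name = <literal>` into `{"name": <literal>}` and drops commas
// that directly precede a closing } or ], leaving string contents untouched.
function repairJsLiteral(source: string): string {
  const assignment = /^\s*([A-Za-z_$][\w$]*)\s*=\s*/.exec(source);
  const key = assignment?.[1];
  const body = (assignment ? source.slice(assignment[0].length) : source)
    .trim()
    .replace(/;$/, "");

  let out = "";
  let inString = false;
  for (let i = 0; i < body.length; i++) {
    const ch = body.charAt(i);
    if (inString) {
      out += ch;
      if (ch === "\\") {
        out += body.charAt(++i); // copy the escaped character verbatim
      } else if (ch === '"') {
        inString = false;
      }
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === ",") {
      // Look past whitespace to see whether this comma is a trailing one.
      let j = i + 1;
      while (j < body.length && /\s/.test(body.charAt(j))) j++;
      const next = body.charAt(j);
      if (next === "}" || next === "]") continue;
    }
    out += ch;
  }
  return key ? `{"${key}": ${out}}` : out;
}
```

The result can then go through a regular JSON parser. Single-quoted strings, unquoted keys and comments are not handled, so this only fits if the irregular mistakes mentioned in the question stay within these two kinds.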

swagger-codegen: how do I retrieve tag description?

In my swagger file I have defined a list of tags as follows:
"tags": [
  { "name": "TagA", "description": "DescriptionA" },
  { "name": "TagB", "description": "DescriptionB" }
]
When I generate client code using swagger-codegen (2.1.2-M1), all operations marked with a certain tag become methods in a class named after the tag, e.g. "class TagBApi". Is there any way to retrieve the tag description and output it as a comment in the class? I haven't seen any examples of this in the available .mustache files. Thanks.
There is no support for tag descriptions in the codegen. Please open an issue and it can be added.
