Stripped content and markup #67

@VirginiaBalseiro

We are getting the following unexpected output when parsing HTML:

Input:

<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8" />
    <title></title>
    <meta content="width=device-width, initial-scale=1" name="viewport" />
  </head>

  <body about="" prefix="rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# schema: http://schema.org/">
    <main>
      <article>
<div datatype="rdf:HTML" id="content" property="schema:description">
  <p>foo</p>
  <div rel="schema:hasPart" resource="#bar">
    <p property="schema:description" datatype="rdf:HTML"><span>bar</span></p>
  </div>
</div>
      </article>
    </main>
  </body>
</html>

Output:

<https://dokie.li/tmp/test.html#bar> <http://schema.org/description> "<span xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">bar</span>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<https://dokie.li/tmp/test.html> <http://schema.org/description> "\n  <p xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">foo</p>\n  <div rel=\"schema:hasPart\" resource=\"#bar\" xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">\n    \n  </div>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<https://dokie.li/tmp/test.html> <http://schema.org/hasPart> <https://dokie.li/tmp/test.html#bar> .

Expected (from http://rdf.greggkellogg.net/distiller):

<http://example.org/> <http://schema.org/description> "\n  <p>foo</p>\n  <div rel=\"schema:hasPart\" resource=\"#bar\">\n    <p property=\"schema:description\" datatype=\"rdf:HTML\"><span>bar</span></p>\n  </div>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<http://example.org/> <http://schema.org/hasPart> <http://example.org/#bar> .
<http://example.org/#bar> <http://schema.org/description> "<span>bar</span>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .

Note that the markup and content inside the inner div are missing from the document-level schema:description literal in our output (\n <p property=\"schema:description\" datatype=\"rdf:HTML\"><span>bar</span></p>\n )

Is this a bug in rdf-ext / rdfa-streaming-parser, or does the issue perhaps lie on our end somehow? It'd be great if you could reproduce / confirm.
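
To make the expectation concrete, here is a rough, stdlib-only Python sketch of the behaviour we expect for nested rdf:HTML literals: the outer literal keeps the full markup of its descendants, even when a descendant is itself captured as its own literal. This is not an RDFa processor and says nothing about how rdfa-streaming-parser is implemented. The names `HtmlLiteralCollector` and `collect_html_literals` are hypothetical, the `datatype` value is compared as the literal string "rdf:HTML" without resolving prefixes, and comments are ignored.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class HtmlLiteralCollector(HTMLParser):
    """Hypothetical helper: records the inner markup of each rdf:HTML element."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self._open = []      # one [depth, parts] entry per rdf:HTML element still open
        self.literals = []   # finished literals, innermost first

    def _emit(self, text):
        # Every open capture receives the text, so outer literals keep nested markup.
        for capture in self._open:
            capture[1].append(text)

    def handle_starttag(self, tag, attrs):
        self._emit(self.get_starttag_text())
        if tag in VOID_ELEMENTS:
            return
        for capture in self._open:
            capture[0] += 1
        # Assumption: plain string match, no prefix resolution.
        if dict(attrs).get("datatype") == "rdf:HTML":
            self._open.append([0, []])

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax: copy the tag text, nesting depth is unchanged.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        # Depth 0 means this end tag closes the innermost captured element.
        if self._open and self._open[-1][0] == 0:
            self.literals.append("".join(self._open.pop()[1]))
        self._emit(f"</{tag}>")
        for capture in self._open:
            capture[0] -= 1

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def collect_html_literals(html):
    """Return the inner markup of every datatype="rdf:HTML" element."""
    parser = HtmlLiteralCollector()
    parser.feed(html)
    parser.close()
    return parser.literals


SNIPPET = (
    '<div datatype="rdf:HTML" id="content" property="schema:description">\n'
    '  <p>foo</p>\n'
    '  <div rel="schema:hasPart" resource="#bar">\n'
    '    <p property="schema:description" datatype="rdf:HTML"><span>bar</span></p>\n'
    '  </div>\n'
    '</div>\n'
)

if __name__ == "__main__":
    for literal in collect_html_literals(SNIPPET):
        print(repr(literal))
```

Running it on the fragment from the input above yields two literals, and the outer one still contains the inner `<p>` element, which matches the distiller output rather than what we currently get.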

Originally posted by @csarven in #66

Labels: bug
