Archive for June, 2006

An Old Idea

I’ve been giving some thought to parsing microformats lately. A few threads seem to be converging…

The first is that it’s hard to parse microformats. You can hand-write a parser in a little bit of time that’s 80% right, but encoding all of the rules (all of the hcard rules, e.g.) is tricky. It’s reasonable to assume, therefore, that there are a lot of 80% parsers out there, like the one I wrote for my Ray Ozzie Clipboard example.

The second issue relates to hatom, which uses different class names for the same concept at different scopes. For example, the entry title is called “entry-title”, not “title”. I asked Ryan about this when I saw him at WWW2006. He told me that they vacillated on this decision but settled on “entry-title” because people can nest other microformats inside hatom, so it would be easier for parser writers if there were no colliding class names, even across different microformats. In fact, he suggested that they’d probably made a mistake with hcard, since its class names are so likely to collide with those of other microformats. OK, so in other words, “entry-title” is a hack around the problem of microformats being hard to parse, and we can expect more of these.

When I bumped into Brian at the same event, I commented that microformats really have a problem with nesting. He agreed. He said it put a burden on the parser writer to potentially have to understand all microformats in order to reliably parse web pages that contain them.

So,

  1. It’s a lot of trouble to write a parser
  2. Bad parsers will proliferate
  3. Microformats are evolving toward being easier to parse, not easier to create
  4. It’s not clear how you can nest microformats without knowing how parsers will behave
  5. Users are discouraged from inventing their own specialized microformats, presumably because of the risk of collisions and difficulty others will have in parsing them

My proposal is that we employ a very old solution to this problem: create proper, machine-readable schemas or grammars for each microformat.

The schema…

  1. is a formal specification of the microformat
  2. can be used to generate parsers (like yacc)
  3. can be used to dynamically parse new microformats
  4. is language-neutral

Here’s a fragment of a schema for hcard in a BNF-inspired syntax:

{vcard} ::= {fn} {n} [{org}] [{url}] [{email}] [{photo}] [{tel}]
{n} ::= {fn}
{tel} ::= ({tel-entry})
{tel-entry} ::= [{type}] {value}
{url} ::= a@href
{email} ::= a@href
{photo} ::= img@src | object@data
{fn} ::= body
{org} ::= body
{type} ::= body
{value} ::= body

Note that it has domain knowledge of HTML: “img@src” means pull the value out of the src attribute of an img tag, and “body” means pull the body of the tag. This syntax doesn’t encode all of the kinds of rules you’ll find in the hcard spec, but it probably could be extended to do so. (Note that a link could be added to the header of web pages pointing to the schema.)
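To make the "machine-readable" part concrete, here’s a minimal sketch in Python that reads the fragment above into a plain dictionary of rules. The function name parse_schema and the rule layout are made up for illustration. It also assumes that “[ … ]” marks an optional property (the usual BNF convention) and that “( … )” marks a list.

```python
import re

SCHEMA = """
{vcard} ::= {fn} {n} [{org}] [{url}] [{email}] [{photo}] [{tel}]
{n} ::= {fn}
{tel} ::= ({tel-entry})
{tel-entry} ::= [{type}] {value}
{url} ::= a@href
{email} ::= a@href
{photo} ::= img@src | object@data
{fn} ::= body
{org} ::= body
{type} ::= body
{value} ::= body
"""

def parse_schema(text):
    """Return {rule_name: rule}. A rule is either {"terminals": [...]}
    or {"children": [(name, kind), ...]} (hypothetical layout)."""
    rules = {}
    for line in text.strip().splitlines():
        head, rhs = (part.strip() for part in line.split("::="))
        name = head.strip("{}")
        if "{" not in rhs:
            # Terminal rule: "body", or alternatives like "img@src | object@data".
            rules[name] = {"terminals": [alt.strip() for alt in rhs.split("|")]}
            continue
        children = []
        # Each reference is "{x}", "[{x}]" (assumed optional) or "({x})" (list).
        for opener, child in re.findall(r"([\[(]?)\{([\w-]+)\}", rhs):
            kind = {"": "required", "[": "optional", "(": "list"}[opener]
            children.append((child, kind))
        rules[name] = {"children": children}
    return rules

rules = parse_schema(SCHEMA)
```

A parser generator, or a parser that interprets the schema on the fly, could then walk a structure like this instead of hard-coding the hcard rules.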

So in addition to making it trivial to generate or find correct parsers for microformats in any language or environment, how does this solve the nesting problem? First, the parser will only “find” data that matches the schema. So if you stick an hcard inside an hatom entry, the hatom parser won’t be looking for the “title” beneath the “author”, since that’s not in the schema. Second, if you wanted a rule such as using DOM depth to disambiguate two “title” properties, you could enforce it once at the parser-generator level, not at the level of every-parser-in-the-world. Third, it’s actually possible to use link tags to refer to every schema used inside the web page, making it feasible for the parser to understand all of the microformats contained in the page without any additional work.
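Here’s a rough sketch of the first point, again in Python. The SCHEMAS table is made up: it pretends hatom had used plain “title” and lists only a couple of properties per microformat. The extract function is hypothetical too, and it expects well-formed XHTML because it leans on xml.etree. The point is only that a schema-driven walk stops at the boundary of a nested microformat, so the two “title” properties never collide.

```python
import xml.etree.ElementTree as ET

# Made-up schemas: root class name -> property class names it may contain.
SCHEMAS = {
    "hentry": {"title", "author"},
    "vcard": {"fn", "title"},
}

ENTRY = """
<div class="hentry">
  <h1 class="title">An Old Idea</h1>
  <div class="author vcard">
    <span class="fn">Steve</span>
    <span class="title">Editor</span>
  </div>
</div>
"""

def extract(el, root_class):
    """Collect the properties of root_class found beneath el, without
    descending into elements that start a different microformat."""
    props = SCHEMAS[root_class]
    found = {}

    def walk(node):
        for child in node:
            names = child.get("class", "").split()
            nested = [c for c in names if c in SCHEMAS and c != root_class]
            for c in names:
                if c in props and c not in found:
                    if nested:
                        # The value is itself a microformat: parse it by its own schema.
                        found[c] = extract(child, nested[0])
                    else:
                        found[c] = "".join(child.itertext()).strip()
            if not nested:
                walk(child)

    walk(el)
    return found

entry = extract(ET.fromstring(ENTRY), "hentry")
```

Here the entry’s “title” comes from the h1, while the “title” inside the vcard ends up only under “author”.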

The other thing that’s interesting is that this specification actually implies a JSON-compatible data model. The “( … )” notation refers to a list, the terminals refer to values, and each of the labels (e.g., “fn”) refers to a key in a name/value pair list. So we’d expect to parse,

<a class="url fn" href="http://smackman.com">Steve</a>

to

{"vcard": {"fn": "Steve", "url": "http://smackman.com"}}

in JSON-syntax. (Don’t confuse JSON-syntax with JSON-data-model. The latter can be represented in (almost?) any programming language using built-in language constructs, while the former is a serialization format.)
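As a sanity check on that example, here’s a small Python sketch using the standard library’s html.parser. TERMINALS is a hand-copied subset of the schema fragment, and FlatExtractor and parse_hcard are hypothetical names. It handles only flat properties with “body” and “tag@attr” terminals, so no nesting, no lists, and no required/optional checks.

```python
from html.parser import HTMLParser

# Hand-written subset of the hcard schema fragment: property -> terminals.
TERMINALS = {
    "fn": ["body"],
    "org": ["body"],
    "url": ["a@href"],
    "email": ["a@href"],
    "photo": ["img@src", "object@data"],
}

class FlatExtractor(HTMLParser):
    def __init__(self, terminals):
        super().__init__()
        self.terminals = terminals
        self.found = {}
        self.stack = []  # (tag, [properties collecting this element's body text])

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        body_props = []
        for cls in (attrs.get("class") or "").split():
            for terminal in self.terminals.get(cls, []):
                if terminal == "body":
                    body_props.append(cls)
                    self.found.setdefault(cls, "")
                else:
                    want_tag, want_attr = terminal.split("@")
                    if tag == want_tag and attrs.get(want_attr) is not None:
                        self.found[cls] = attrs[want_attr]
        self.stack.append((tag, body_props))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (also drops unclosed tags like img).
        while self.stack:
            if self.stack.pop()[0] == tag:
                break

    def handle_data(self, text):
        for _, props in self.stack:
            for prop in props:
                self.found[prop] += text

def parse_hcard(html):
    extractor = FlatExtractor(TERMINALS)
    extractor.feed(html)
    extractor.close()
    return {"vcard": extractor.found}

card = parse_hcard('<a class="url fn" href="http://smackman.com">Steve</a>')
```

The result is an ordinary Python dict, which is the JSON-data-model side of the distinction above. Passing it to json.dumps would give you the JSON-syntax side.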

So the schema spec allows you to parse from HTML to a JSON-data-model. That means that, in contrast to yacc, there isn’t a need to have application-specific instructions in the spec. I’d also point out that the process of going in the opposite direction, from JSON-data-model to HTML, is exactly what microtemplates buy you.

That’s the gist of the idea… a lot more details to be worked out, of course.

