The Tokenizer and Lexer Story; Or, A Closer Look At XML Shenanigans

Just a few weeks ago I mentioned it is not possible to parse XML,
with regular expressions, but yesterday I had a fun time doing just that.

This is because well made Configuration Files do not nest the same elements,
it is OK to have siblings with the same name, but nesting is different.

Because the opening tag will not be able to recognize its closing tag,
when it has another instance of it self within it self.

So when you say open box, and then again open a new box inside the open box,
you now have two box closing tags, and they look the same to naive parsers like mine.

If an XML file is made with this in mind, and never nests the same thing,
within it self, then it will work fine.

My XML parser can be further reduced to just one line of code here,
but nesting elements or tags with the same name will not work.

The miniature program just grabs the first closing that it sees,
and when things are nested like a Russian doll, that will not be the right one.

---

I woke up this morning, and without thinking about my code,
and not wanting to test it…

I just said to myself, regular expressions, the technology I relied on,
cannot do that.

And I know this from attempts at parsing XML,
back when it was cool, and programmers randomly mention this.

But they may not know, the problem is only with nesting the same tags,
if it is a configuration file, it will work.

But if it is a document with formatting boxes, and they can be nested,
nope, it will be the wrong closing tag.

---

To properly parse an XML file, you need two unique technologies,
a tokenizer and a lexer.

The tokenizer crawls the XML document from the first character,
to the last, ignoring new lines it just sees a long line.

Tokenizer, extracts tokens,
and [you have to register patters or rules that it should recognize][0].

Xml continues being extremely self similar,
so there are only six patterns, that is a small number.

Because we are eating every letter of the file,
we have to process spaces and new lines, throwing them away.

And then there is the convenient self closing tag,
instead of creating a closing tag you use a forward slash in the opening tag.

Then there is the open tag,
and the close tag.

----- snip ----- (Sorry, 5,000 letter limit in summaries see catpea.com or visit [ Ссылка ] for source-code) ----- snip -----

...ual objects, that live in a nested tree.

Whenever an open tag is found, it is added to the tree,
to build the resulting document up.

Then added to a stack,
and set as the new parent for elements that follow.

We go as deep as possible though all the opening tags,
building a tree, until a deepest tag becomes a parent.

And because it is the deepest tag,
it gets closed right away.

And now its outer tag becomes the parent again,
as that nested tag is now correctly placed in the tree.

Here we may open a new tag,
if there was stuff below it, for example an image beneath a title.

But slowly, we start making our way back,
out the deepest parts of the tree, emptying our stack of parents.

And finally, having the lexer,
return the now fully inflated tree.

That a moment ago, was just our tokens,
informing us of opening and closing instructions.

[Here our tree, called the abstract syntax tree][d],
is [returned to whatever requested the parsing][e] of [our XML document][f].

[a]: [ Ссылка ]
[b]: [ Ссылка ]
[c]: [ Ссылка ]
[d]: [ Ссылка ]
[e]: [ Ссылка ]
[f]: [ Ссылка ]
[0]: [ Ссылка ]
[1]: [ Ссылка ]
[2]: [ Ссылка ]
[3]: [ Ссылка ]
[4]: [ Ссылка ]
[5]: [ Ссылка ]

--------------------------------------------------------------------------------
Audio and full text version is available advertisement free at: [ Ссылка ] or visit [ Ссылка ] for source-code

The Tokenizer and Lexer Story; Or, A Closer Look At XML Shenanigans
Friday • November 29th 2024 • 8:20:04 pm

Смотрите далее

Shang-Chi and the Legend of the Ten Rings in Minutes | Recap

SYUKURLAHH! Naura dan Rahsya berhasil keluar dari dalam batu #magic5 #shorts

Spending 1,000,000 in Vietnam 🇻🇳 - Capsule Hotel and Cafe

VIRAL ???PRANK OJOL 😱berujung enak"???#shortsfeed #funny #ojol

Hot indian girl reels | brishti samaddar hot reels | reel hub

Buranovskiye Babushki - Party For Everybody - Russia - Live - Grand Final - 2012 Eurovision

Кайрат Нуртас - Ол сен емес (Official video)

Building Mandalorian Warrior LEGO Star Wars minifigure #lego #starwars #legostarwars #fy #fyp #fypシ

С добрым зимним утром. Красивое пожелание доброго утра и хорошего дня. #открытка

Sevilay'ın Kurtarıcısı Nuh... | Siyah Kalp 2. Bölüm

176

ZELJKA JURISIC -Varalica | Video 2025 | Avaz Produkcija

Кругосветное путешествие вместе с Хрюшей - Москва - География для детей

When Teen Bullies Get Reality Check

И.Н.М.Т.(1-7) но когда ЛАМПЧКА в кадре

Новые клипы

Тренды Развлечения