dot_parser.py: Why does ID not have the `float_number` production rule? #436

loonatick-src · 2024-11-08T15:17:30Z

An ID is one of the following:

Any string of alphabetic ([a-zA-Z\200-\377]) characters, underscores ('_') or digits([0-9]), not beginning with a digit;

a numeral [-]?(.[0-9]⁺ | [0-9]⁺(.[0-9]*)? );

any double-quoted string ("...") possibly containing escaped quotes (")¹;

an HTML string (<...>).

An ID is just a string; the lack of quote characters in the first two forms is just for simplicity. There is no semantic difference between abc_2 and "abc_2", or between 2.34 and "2.34". Obviously, to use a keyword as an ID, it must be quoted.

I see that that the second rule (numeral) is omitted from ID dot_parser.py and there is instead a righthand ID that satisfies the complete definition. Is there a particular reason for this?

The text was updated successfully, but these errors were encountered:

ferdnyc · 2024-11-08T16:40:33Z

A lot of things in dot_parser are Very Legacy™, its parsing of identifiers being one of them.

the ID definition is actually based on identifier, which is defined earlier in the grammar as:

identifier = Word(
    pyparsing_unicode.BasicMultilingualPlane.alphanums + "_."
).setName("identifier")

The addition of _ and . allows it to support most floating-point values, though (a) it can't be a negative value, and (b) nothing enforces the fact that an alphanumeric string can't start with a digit.

I suspect it was defined that way partly because ID was also used for both sides of the port definition, and the part of port after the colon can't be a floating-point value. This part of the parser definition would break, if ID were fully defined to accept all legal graphviz syntax:

port = (
    Group(Group(colon + ID) + Group(colon + ID))
    | Group(Group(colon + ID))
).setName("port")
node_id = ID + Optional(port)

But you're correct that the parser definition excludes valid graphviz syntax, as a result.

Both pydot and graphviz will accept this:

graph G { 1.14 [color=red]; }

but pydot will balk at this perfectly valid graph, which graphviz also accepts:

graph G { -1.14 [color=red]; }

...The - at the start of -1.14 breaks dot_parser.py. So, that's yet another parser bug; fun!

ferdnyc · 2024-11-08T16:43:55Z

I suspect it was defined that way partly because ID was also used for both sides of the port definition, and the part of port after the colon can't be a floating-point value.

Actually, that's not true — the part immediately after the colon can be, the compass_pt that follows it can't. The fact that ID is used to parse compass_pt, when compass_pt can only ever truly be one of ~9 fixed strings, is definitely a legacy parser bug. It seems like whoever wrote that part of the grammar just punted on the port definition.

ferdnyc · 2024-11-08T16:52:55Z

Part of the issue is that dot_parser.py wasn't based on the graphviz grammar presented in the current documentation; it was based on the one from this PDF (which, despite its 2015 datestamp, I believe dates back about a decade earlier) — you'll notice it much more closely resembles the logic in dot_parser, despite not matching the actual grammar used by graphviz today.

https://graphviz.org/pdf/dotguide.pdf#page=34

...id is never even DEFINED in that grammar. At all! It's given a vague and unhelpful textual description: "An id is any alphanumeric string not beginning with a digit, but possibly including underscores; or a number; or any quoted string possibly containing escaped quotes."

(We recently ripped out dot_parser's support for things like this:

port → port-location [port-angle] | port-angle [port-location]
port-location → ’:’ id | ’:’ ’(’ id ’,’ id ’)’
port-angle → ’@’ id

Because, despite being documented in the grammar, port-angle and the tuple form of port-location were never actually implemented anywhere in graphviz, so they were just complicating the parser for no reason.)

Due to this issue pydot/pydot#436

ferdnyc · 2024-12-15T18:30:29Z

Once I actually get it right (since it makes somewhat more sweeping changes to the parser definition, intended to shore up our support for unquoting quoted strings when possible — like graphviz itself does — among other things), this will also be fixed by #420.

Also, @loonatick-src, if you're using a patched pydot (which it appears that you are), and assuming it's the version from loonatick-src/pydot@bb33fbe, I'd highly recommend also picking up commits 83c6b87 and aff9908 from my WIP #420 PR, or reimplementing them in your branch.

Those two commits:

Enforce, in the parser, that identifiers must start with an alpha character or an underscore.
Correct the parser definition of float_number. The OneOrMore in the existing definition turned out to be a problem when parsing dotted-number strings (like version numbers). Because this:
```
float_number = Combine(
    Optional(minus) + OneOrMore(Word(nums + "."))
)
```
...will attempt to treat -3.4.5 or 1.2.3 as floats, which they most certainly are not. The updated definition uses:
```
float_number = Combine(
    Optional("-")
    + (
        "." - Word(nums)
        | Word(nums) + Optional("." + Optional(Word(nums)))
    )
)
```
...which is much more correct, as it will only permit these forms: #, -#, .#, -.#, #., -#., #.#, -#.# (where # is any series of digits, regardless of length).

The - joiner for grammar elements, instead of +, is also something I've been trying to use where appropriate — it represents a requirement that the element on the left, when encountered, must be followed by the element on the right.

IOW, when Optional("-") + ("." + Word(nums)) encounters something like -.a, it parses the -, followed by ., and then because the next token (a) doesn't match Word(nums), it'll backtrack and start trying other grammar rules to find one that can successfully parse all of the tokens in -.a.

By instead using Optional("-") + "." - Word(nums), it says that a . (optionally preceded by a -) must be followed by a token matching Word(nums), or the result is a parse error. Makes the grammar less ambiguous, since there's no scenario in which something like .red or -.= would be valid input in the same location where a float is accepted.

(I have another commit that does the same thing with the = sign when parsing attributes.) In an attribute list, after an ID the = sign itself is optional (node1 [some_flag]; is a valid node statement), but if it's present it must be followed by another ID. And in attribute assignments, a non-optional = is also required to be followed by an ID. Which ensures the parser won't try to find some way to parse either of the inner lines of this graph as valid:

graph G {
  some_node [color=, style=box];
  rank=;
}

loonatick-src changed the title ~~dot_parser.py: Why does ID not have a float_number the float_number production rule?~~ dot_parser.py: Why does ID not have the float_number production rule? Nov 8, 2024

ferdnyc added bug parser labels Nov 8, 2024

loonatick-src added a commit to loonatick-src/graph-diff that referenced this issue Nov 8, 2024

fix: use my patch of pydot

c920ba9

Due to this issue pydot/pydot#436

ferdnyc linked a pull request Dec 6, 2024 that will close this issue

Enhance parser and name-lookup with better quoting #420

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

dot_parser.py: Why does ID not have the `float_number` production rule? #436

dot_parser.py: Why does ID not have the `float_number` production rule? #436

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dot_parser.py: Why does ID not have the float_number production rule? #436

dot_parser.py: Why does ID not have the float_number production rule? #436

Comments

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dot_parser.py: Why does ID not have the `float_number` production rule? #436

dot_parser.py: Why does ID not have the `float_number` production rule? #436