Enhance parser and name-lookup with better quoting #420

ferdnyc · 2024-09-20T18:18:36Z

This is Pydot 4.0 work for sure, since its changes are pretty sweeping

Stripping quotes during parsing

In #413 I mentioned that it would enhance round-tripping of our graphs through dot and back again if the parser unquoted input strings where possible. This PR starts the work of doing that.

Because not all strings can be unquoted, instead of enabling unquote_results in pyparsing.QuotedString(), a new parse action possibly_unquote is attached to the double_quoted_string parser element that handles making that decision. It makes sure that any quoted strings stay quoted that need to: double-quoted strings containing what would otherwise parse as an HTML-like string, or as an ID with port values, etc. If it will confuse the pydot code when sent in unquoted, it's left quoted. Otherwise, it's unquoted, which matches dot's own behavior in parsing.

In addition, the lookup functions get_node(), get_subgraph(), and get_edge() are enhanced to also try the quoted versions of their ID arguments, if passed unquoted strings.¹ So, we're better protected both coming and going.

The make_quoted utility function will no longer translate newlines and carriage returns into \n and \r — it's unnecessary, as dot quoted strings can contain the raw control characters without issue, and our parser can consume them as well. So, leaving them be will further enhance our ability to round-trip graphs through graphviz commands with better fidelity.

Fixes #413

Notes

For get_edge() in particular, that became extremely complex, because each endpoint has to be separately checked for a possible quoted form, and tested with or without a quoted form of the other endpoint.
So, for example, get_edge('abc', 'def') will try:
- In a directed graph,
 1. ('abc', 'def')
 2. ('"abc"', '"def"')
- In an undirected graph,
 1. ('abc', 'def')
 2. ('"abc"', '"def"')
 3. ('def', 'abc')
 4. ('"def"', '"abc"')
But when one side is already quoted, the quoted form of the other side still has to be tested as well. So, get_edge('<html>', 'def') will try:
- In a directed graph,
 1. ('<html>', 'def')
 2. ('<html>', '"def"')
- In an undirected graph,
 1. ('<html>', 'def')
 2. ('<html>', '"def"')
 3. ('def', '<html>')
 4. ('"def"', '<html>')
That code is ugly, but I dropped in a long-ass explanatory comment which will hopefully help.

Improved numeric parsing in `dot_parser`

Negative-value floats will be recognized as such — previously, we didn't allow a leading - sign.
Floats are included in the grammar for ID values, and are legal unquoted in all places where graphviz allows them.
The parser now correctly requires that unquoted identifiers start with a non-digit, meaning an alpha character or _. This was already true in the quoting logic, but now the parser does the right thing too.
The idea of a "right-hand ID" (that is, an ID after an equals sign, in an assignment) is eliminated — IDs are IDs are IDs. There was never supposed to be any difference in the grammar.

The upshot of all those changes is that, in main:

>>> import pydot.dot_parser as dp; gp = dp.graph_definition()
>>> # This was previously an error
>>> g = gp.parse_string(
...     'graph G { -5 [color=red, length=-4.5, width=5]; }')[0]
>>> # And this was previously...
>>> g = gp.parse_string(
...     'graph G { "-5" [color=red, length="-4.5", width=5]; }'
... )[0]
>>> print(g.to_string())
graph G {
"-5" [color=red, length="-4.5", width=5];
}

On this branch:

>>> import pydot.dot_parser as dp; gp = dp.graph_definition()
>>> g = gp.parse_string(
...     'graph G { -5 [color=red, length=-4.5, width=5]; }'
... )[0]
>>> print(g.to_string())
graph G {
-5 [color=red, length=-4.5, width=5];
}
>>> g = gp.parse_string(
...     'graph G { "-5" [color=red, length="-4.5", width=5]; }'
... )[0]
>>> print(g.to_string())
graph G {
-5 [color=red, length=-4.5, width=5];
}

Much more in line with graphviz.

Due to these changes, this PR now...

Fixes: #436

Move constants to `pydot.constants`

A new file constants.py is added to pydot, containing all of the CONSTANT_LIST_VARIABLES previously defined in pydot.core. (The actual reason for this move is, I also need the DOT_KEYWORDS list in dot_parser now.)

Because all of the constants in pydot.constants are currently imported into pydot.core, the __init__.py file's from pydot.core import * import will still pick them up, which means that things like pydot.NODE_ATTRIBUTES and pydot.OUTPUT_FORMATS will still be defined even after this change.

However, I'm tempted to change that (e.g. by having pydot.core use from pydot import constants, and refer to the constants as constants.CONSTANT_NAME), in order to get the constants lists out of the root pydot namespace. They were never really intended to be referenced from the pydot. level.

ferdnyc · 2024-09-20T18:26:55Z

Aargh, Well, that failure's for a completely different reason: I forgot you can't do this in Python 3.8:

with (
    open(some_file) as in1,
    open(some_file) as in2
):

(It can be done, but it can't be wrapped in parens and made multi-line until Python 3.10.)

dot_parser has always "cheated" our quoting logic by passing in values already quoted. That means, for all of the graphs we run through it in testing, it's not really testing our quoting logic at all. Remove the quotes from quoted values during parsing, so that our quoting logic is tested, and so that users who run graphs through `dot -Tdot` and parse them back in won't get data with all of the identifiers changed.

If TEST_ERROR_DIR is set when running the tests, and a failure is encountered in one of the file-driven tests, write out both the original file and the `.to_string()` for the parsed version.

ferdnyc · 2024-09-30T18:04:45Z

Oh! Turns out, the b53.dot failure was due to the missing . quoting from #419, since I started this branch before that was merged. As soon as I rebased this on master, the tests went back to all-green.

This still shouldn't be merged until we're sure we're ready to start development on Pydot 4.0.0, but it's no longer a WIP as it's now passing all tests, same as main. (All adjusted tests, since the expected results of some of the API tests has to be modified for the new get_foo() logic.)

ferdnyc · 2024-09-30T18:06:11Z

Oh, right, but I still can't do this in Python < 3.10.

with (
    open(some_file) as in1,
    open(some_file) as in2
):

github-actions · 2024-09-30T18:24:20Z

Coverage report

Click to see where and how coverage changed

File	Statements	Missing	Coverage	Coverage (new stmts)	Lines missing
src/pydot
constants.py
core.py					1007, 1038, 1068, 1134
dot_parser.py					421
utils.py					30, 94, 109, 123, 125, 127, 143
test
test_api.py
test_pydot.py					49, 54
Project Total

_{This report was generated by python-coverage-comment-action}

github-actions · 2024-10-06T18:17:30Z

This pull request has conflicts, please resolve those so that the changes can be evaluated.

github-actions · 2024-10-16T13:15:36Z

All conflicts have been resolved, thanks!

As long as the strings are quoted, both graphviz and our parser can handle embedded newlines, so there's no reason to escape them. Makes round-trips through graphviz commands less disruptive.

github-actions · 2024-11-19T18:00:39Z

This pull request has conflicts, please resolve those so that the changes can be evaluated.

ferdnyc · 2025-01-29T06:51:12Z

I think this might be done. Even if there's more I could do, it's big enough already that I'd rather save anything further for a separate PR.

Oh, I need to add tests for the whole make-sure-the-parser-doesn't-unquote-keyword-names thing I spotted, and fixed, above. Which I'd rather put in test_parser.py, so I think I'll wait until #452 is merged (which creates that file).

github-actions · 2025-01-29T08:39:04Z

This pull request has conflicts, please resolve those so that the changes can be evaluated.

github-actions · 2025-03-13T00:14:20Z

All conflicts have been resolved, thanks!

github-actions · 2025-04-05T11:04:39Z

This pull request has conflicts, please resolve those so that the changes can be evaluated.

github-actions · 2025-04-05T17:11:28Z

All conflicts have been resolved, thanks!

This way they can be shared between core and dot_parser.

ferdnyc · 2025-04-05T17:59:48Z

I think this branch is now good, but I'm not sure how reviewable it is anymore. I may have to re-build it on top of the current main branch HEAD, to make it comprehensible.

github-actions · 2025-04-13T17:41:28Z

This pull request has conflicts, please resolve those so that the changes can be evaluated.

Peiffap · 2025-05-29T18:30:35Z

Hi! I've been going through the blame for the fix to #396, and I've ended up on this draft PR. I'm looking at this stuff because in networkx/networkx#7648, someone found that long node names can break stuff. My understanding is that the changes in this PR should enable us to fix that issue as well (and that until they land, there's not much we can do), but I'm nowhere near qualified enough to say for sure. If you have the time, could you take a quick look at that?

lkk7 · 2025-05-30T07:05:34Z

@Peiffap I'm surprised to see this issue hanging around for a year, it was never reported here in pydot. Thanks for reporting.

I'm not sure how up-to-date this PR is. It could be outdated.
I'll personally only have time to look at the problem on Saturday/Sunday, but maybe @ferdnyc will see it more quickly

And I'm assuming you've tried with the newest pydot (v4), right?

Peiffap · 2025-05-30T15:56:03Z

Yes, I'm on v4.0.0.
No problem! Do you want me to create an issue for this? (Edit: #483.)

I'm not sure how up to date it is either - but at least GitHub thinks it can be cleanly merged, which is a nice start.
In the meantime, I've found a pretty minimal workaround/fix, in networkx/networkx#8079, but I'm not sure the "graph nodes are the last nodes in Q.get_nodes(), in their original order" hypothesis always holds. Perhaps one of you can take a look and let me know when you have the time!

lkk7 · 2025-05-30T16:53:10Z

Yes, you can create an issue, or I'll create it, whatever. Thanks!

ferdnyc marked this pull request as draft September 20, 2024 18:18

ferdnyc mentioned this pull request Sep 20, 2024

Bug in quoting node names with periods in v3.0.1 #418

Closed

ferdnyc added 5 commits September 30, 2024 07:29

get_foo(): if name not found, also try quoted

f632aa6

Adjust tests for id-lookup changes

a691ace

tests: Also output dot sources, on failure

5b68dee

If TEST_ERROR_DIR is set when running the tests, and a failure is encountered in one of the file-driven tests, write out both the original file and the `.to_string()` for the parsed version.

Cleaner quote-char handling

1037179

ferdnyc force-pushed the parser-suppress-quotes branch from 1382663 to 1037179 Compare September 30, 2024 11:29

ferdnyc marked this pull request as ready for review September 30, 2024 18:04

ferdnyc added 3 commits September 30, 2024 14:08

Python 3.[89]-compatible with statement

4d4fa87

Python 3.8-compatible type annotations

3e117b1

Ruff fixes

011f5ff

ferdnyc added 2 commits September 30, 2024 14:26

Shorter variable names

31b8e13

8000

Remove dead code (boost coverage)

c65b685

github-actions bot added the conflicts label Oct 6, 2024

Merge branch 'main' into parser-suppress-quotes

2abce23

github-actions bot removed the conflicts label Oct 16, 2024

ferdnyc added 2 commits October 16, 2024 09:26

Add type annotations to quoting utility functions

780fc94

8000
Don't escape CR/LF characters in to_string()

2247170

As long as the strings are quoted, both graphviz and our parser can handle embedded newlines, so there's no reason to escape them. Makes round-trips through graphviz commands less disruptive.

ferdnyc force-pushed the parser-suppress-quotes branch from d3fd169 to 2247170 Compare October 16, 2024 14:53

ferdnyc mentioned this pull request Oct 17, 2024

Coordinate of each nodes ? #425

Closed

github-actions bot added the conflicts label Nov 19, 2024

Merge branch 'main' into parser-suppress-quotes

12c2616

parser: Ensure we don't unquote keywords

a1f522f

ferdnyc force-pushed the parser-suppress-quotes branch from 955c43e to a1f522f Compare January 29, 2025 04:47

Ruff fixes

61a29b3

github-actions bot added the conflicts label Jan 29, 2025

ferdnyc mentioned this pull request Jan 29, 2025

tests: Also output dot sources, on failure #454

Merged

ferdnyc mentioned this pull request Mar 12, 2025

Bug: pydot fails to parse a legal dot file #460

Closed

Merge branch 'main' into parser-suppress-quotes

e716c93

github-actions bot removed the conflicts label Mar 13, 2025

github-actions bot added the conflicts label Apr 5, 2025

Merge branch 'main' into parser-suppress-quotes

6b477ca

github-actions bot removed the conflicts label Apr 5, 2025

ferdnyc added 5 commits April 5, 2025 13:19

Adjust test for new quoting rules

2636142

Move utility functions into pydot.utils

652b7c9

This way they can be shared between core and dot_parser.

Small fix to del_edge

5d4d624

Fix get_edge format combinatorics

aab8ec6

Ruff fixes

270cf95

github-actions bot added the conflicts label Apr 13, 2025

Peiffap mentioned this pull request May 30, 2025

Workaround for long node names in pydot_layout networkx/networkx#8078

Closed

lkk7 mentioned this pull request Jun 1, 2025

Long node names get mangled by create_dot #483

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enhance parser and name-lookup with better quoting #420

Enhance parser and name-lookup with better quoting #420

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Enhance parser and name-lookup with better quoting #420

Are you sure you want to change the base?

Enhance parser and name-lookup with better quoting #420

Uh oh!

Conversation

Uh oh!

Stripping quotes during parsing

Notes

Improved numeric parsing in dot_parser

Move constants to pydot.constants

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Coverage report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Improved numeric parsing in `dot_parser`

Move constants to `pydot.constants`