mongsvo/tyo: A utility for finding Typo-Bridges

This repository aims to accelerate the process of finding Typo-Bridges, to foster interest and to encourage their intentional inclusion in texts.

What's a Typo-Bridge

Let's start with an example:

tet oyur safd lobely tyos

The sentence above contains five Typo-Bridges. It could be interpreted as an attempt to write (at least) one of the following:

test our safe lovely toys

get your sad lonely typos

That is, assuming English, the five alphabetic sequences ("words") above all appear to be typos, and assuming an ANSI keyboard with QWERTY layout was used, they all could be interpreted in at least two different ways. For example:

test -> tet <- get :

If the author was a bit tired, they might have missed the s, intending to write test. However, since t is close to g on their keyboard, it's also possible they meant to write get.

our -> oyur <- your :

It could be that they swapped o and y, intending to write your, or accidentally pressed y while typing u, which is right next to it, intending to write our.

ANSI keyboard with QWERTY layout
Image by Denelson83 from Wikipedia, used under CC BY-SA 3.0 license.

We consider the following:

  • Keys neighboring the intended ones are somewhat likely to be pressed accidentally, either instead of or in addition to the intended ones.
  • Two consecutive elements of the intended word could have been entered in the wrong order.
  • Some keys intended to be pressed may have been left unpressed.
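
The three kinds of mistakes listed above can be sketched as a small generator of single-mistake typos. This is only an illustration, not tyo's actual implementation: the names `ROWS`, `neighbors`, `typos_of` and `bridges` are hypothetical, only the letter keys are modeled, and the adjacency rule (left and right keys plus the two keys diagonally above and below) is an assumption about the ANSI stagger.

```rust
use std::collections::BTreeSet;

// Letter rows of a QWERTY layout, top to bottom (assumed model of the device).
const ROWS: [&str; 3] = ["qwertyuiop", "asdfghjkl", "zxcvbnm"];

/// Letter keys assumed to touch `key`: left/right in the same row, plus the
/// two diagonal keys in the row above and in the row below.
fn neighbors(key: char) -> Vec<char> {
    let rows: Vec<Vec<char>> = ROWS.iter().map(|r| r.chars().collect()).collect();
    let mut out = Vec::new();
    for (r, row) in rows.iter().enumerate() {
        if let Some(c) = row.iter().position(|&k| k == key) {
            if c > 0 { out.push(row[c - 1]); }
            if c + 1 < row.len() { out.push(row[c + 1]); }
            if r > 0 {
                // Each row is shifted right relative to the one above it.
                for cc in [c, c + 1] {
                    if let Some(&k) = rows[r - 1].get(cc) { out.push(k); }
                }
            }
            if r + 1 < rows.len() {
                if c > 0 {
                    if let Some(&k) = rows[r + 1].get(c - 1) { out.push(k); }
                }
                if let Some(&k) = rows[r + 1].get(c) { out.push(k); }
            }
        }
    }
    out
}

/// All sequences exactly one mistake away from `word`.
fn typos_of(word: &str) -> BTreeSet<String> {
    let w: Vec<char> = word.chars().collect();
    let mut out = BTreeSet::new();
    for i in 0..w.len() {
        // A key left unpressed.
        let mut t = w.clone();
        t.remove(i);
        out.insert(t.iter().collect::<String>());
        // Two consecutive elements entered in the wrong order.
        if i + 1 < w.len() {
            let mut t = w.clone();
            t.swap(i, i + 1);
            out.insert(t.iter().collect());
        }
        for n in neighbors(w[i]) {
            // A neighboring key pressed instead of the intended one.
            let mut t = w.clone();
            t[i] = n;
            out.insert(t.iter().collect());
            // A neighboring key pressed in addition (just before or after).
            for pos in [i, i + 1] {
                let mut t = w.clone();
                t.insert(pos, n);
                out.insert(t.iter().collect());
            }
        }
    }
    out.remove(word); // swapping equal letters reproduces the word itself
    out.remove("");   // dropping the only letter of a one-letter word
    out
}

/// Candidate bridges between two words: typos reachable from both.
fn bridges(a: &str, b: &str) -> Vec<String> {
    typos_of(a).intersection(&typos_of(b)).cloned().collect()
}
```

Under these assumptions, `typos_of("test")` and `typos_of("get")` both contain `tet`, and `bridges("toy", "typo")` yields `tyo`. The sketch does not check candidates against a language, so some of its results may be real words (Pseudo-Bridges) rather than Typo-Bridges.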

Quite essy, isn't it? Some explanations (words) may be considered, by somebody, less likely than others. Some mistakes may be considered, by someone, more likely to happen than others. Typo-Bridges are about plausibility: they are subjective and sensitive to the circumstances contemplated by the observer. They sit between words, between explanations, keeping all of them plausible and roughly equally likely.

Installation

You can get started with:

git clone https://github.com/mongsvo/tyo.git

cd tyo

cargo build --release
# or, if Faiss is installed
cargo build --release --features neighbors

To compile with the optional "--features neighbors" you first have to install Faiss.

The resulting command-line tool, tyo, comes with a default device (QWERTY-ANSI) but no default language. tyo can work without any language installed (see Usage below), but if you want to scan a language for all possible bridges to a word, you must first install one. To do so, after compilation, you can use ./target/release/install_lang, along with the helper Python script at ./contrib/embeddings_filter.py to clean up your datasets before installing.

As an example, for English, you could use one of these datasets and clean it a bit by using this much smaller dataset:

wget https://nlp.stanford.edu/data/glove.42B.300d.zip

unzip glove.42B.300d.zip

wget https://raw.githubusercontent.com/dwyl/english-words/refs/heads/master/words_alpha.txt

Assuming you compiled with Faiss, you could then run:

./target/release/install_lang english glove.42B.300d.txt 300 words_alpha.txt

# 300 is the dimension of the embeddings
# Only words listed in words_alpha.txt are kept
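
To illustrate what such a filtering step amounts to, here is a minimal sketch. It assumes the GloVe text format (one word per line, followed by its space-separated vector components) and is not the actual code of install_lang or embeddings_filter.py; the function name `filter_embeddings` is hypothetical.

```rust
use std::collections::HashSet;

/// Keeps only the embedding lines whose word is in `allowed` and whose vector
/// has exactly `dim` numeric components. Malformed lines are skipped.
fn filter_embeddings(
    lines: &[&str],
    allowed: &HashSet<&str>,
    dim: usize,
) -> Vec<(String, Vec<f32>)> {
    let mut kept = Vec::new();
    for line in lines {
        let mut fields = line.split_whitespace();
        // The first field is the word, the remaining ones are the vector.
        let word = match fields.next() {
            Some(w) => w,
            None => continue, // empty line
        };
        if !allowed.contains(word) {
            continue;
        }
        let values: Result<Vec<f32>, _> = fields.map(|f| f.parse::<f32>()).collect();
        match values {
            Ok(v) if v.len() == dim => kept.push((word.to_string(), v)),
            _ => {} // wrong dimension or non-numeric component
        }
    }
    kept
}
```

The sketch takes lines that are already in memory to stay self-contained. For a file the size of glove.42B.300d.txt you would more likely stream it line by line, for example with `std::io::BufRead`.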

If you didn't compile with Faiss, no problem: the same command without "300 words_alpha.txt" should work. You may prefer to pass it only "english words_alpha.txt" instead, since without Faiss you don't need the extra data (vector representations) in "glove.42B.300d.txt".

./target/release/install_lang english words_alpha.txt

# No Faiss

Note: The datasets referenced above are external resources, and I am not responsible for the availability, accuracy, legality, or any other issues related to these external datasets. It is your responsibility to verify and review the terms of service, licenses, and any potential restrictions imposed by the original authors or providers of the datasets before downloading or using them.

Usage

Once compiled, you can find bridges between pairs of words like this:

./target/release/tyo

>>> toy typo
(typo, toy): tyo

# Loads the default language and device.
# Prints all Typo-Bridges found between them.

If you compiled with Faiss and installed a language, you could use it like this:

./target/release/tyo -l english -d qwerty_ansi

>>> 100: food sleep
(waking, baking): aking
(bed, bread): bred
...

# Loads English and the qwerty_ansi device (the default), finds the 100 semantically closest words to `food` and `sleep` and prints all Typo-Bridges between them.
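
The "semantically closest words" step can be pictured as a nearest-neighbor search over the installed word vectors. The sketch below is a brute-force stand-in for what Faiss does far more efficiently. Both the use of cosine similarity and the names `cosine` and `nearest` are assumptions of this example, not a description of how tyo calls Faiss.

```rust
/// Cosine similarity between two vectors (0.0 if either has zero length).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// The `k` words in `vocab` most similar to `query`, best first.
/// Returns an empty list if `query` has no vector in `vocab`.
fn nearest<'a>(query: &str, vocab: &[(&'a str, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let q = match vocab.iter().find(|(w, _)| *w == query) {
        Some((_, v)) => v,
        None => return Vec::new(),
    };
    let mut scored: Vec<(f32, &str)> = vocab
        .iter()
        .filter(|(w, _)| *w != query)
        .map(|(w, v)| (cosine(q, v), *w))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, w)| w).collect()
}
```

With the two neighbor lists in hand (one for `food`, one for `sleep`), every pair made of one word from each list would then be checked for Typo-Bridges.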

Without Faiss you could still do this:

./target/release/tyo -l english spanish

>>> l: friend
...
(friend, fired): fried
...
(friend, riendo): riend
...

# Loads English and Spanish with the default device and prints all bridges from "friend" to any word in English and Spanish.

You could also go the other way around, giving it a typo and letting it search for any words bridged by this typo in the loaded languages:

./target/release/tyo

>>> t: lobely
lobely -> lovely, lonely

# Loads the default language (English in this case) and device.
# Prints all words in the default language that are bridged (to some other word) by "lobely".
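
This reverse lookup can be sketched as a check of every word in the language against the typo. The code below is a hedged illustration rather than tyo's implementation: `key_pos`, `adjacent`, `one_mistake` and `words_bridged_by` are hypothetical names, and key positions are modeled in quarter-key units under the assumption that the home row sits a quarter key, and the bottom row three quarters of a key, to the right of the top row.

```rust
// (row letters, horizontal offset in quarter-key units): an assumed model.
const ROWS: [(&str, i32); 3] = [("qwertyuiop", 0), ("asdfghjkl", 1), ("zxcvbnm", 3)];

/// (row index, x position in quarter-key units) of a letter key.
fn key_pos(key: char) -> Option<(i32, i32)> {
    for (r, (row, offset)) in ROWS.iter().enumerate() {
        if let Some(c) = row.chars().position(|k| k == key) {
            return Some((r as i32, 4 * c as i32 + offset));
        }
    }
    None
}

/// True if the two keys touch: side by side, or overlapping in adjacent rows.
fn adjacent(a: char, b: char) -> bool {
    match (key_pos(a), key_pos(b)) {
        (Some((ra, xa)), Some((rb, xb))) => {
            let dx = (xa - xb).abs();
            (ra == rb && dx == 4) || ((ra - rb).abs() == 1 && dx < 4)
        }
        _ => false,
    }
}

/// True if `typed` is exactly one mistake away from `intended`.
fn one_mistake(intended: &str, typed: &str) -> bool {
    let a: Vec<char> = intended.chars().collect();
    let t: Vec<char> = typed.chars().collect();
    if a.len() == t.len() {
        let diff: Vec<usize> = (0..a.len()).filter(|&i| a[i] != t[i]).collect();
        return match diff.as_slice() {
            // A neighboring key pressed instead of the intended one.
            [i] => adjacent(a[*i], t[*i]),
            // Two consecutive elements entered in the wrong order.
            [i, j] => *j == *i + 1 && a[*i] == t[*j] && a[*j] == t[*i],
            _ => false,
        };
    }
    if t.len() + 1 == a.len() {
        // A key left unpressed.
        return (0..a.len()).any(|i| a[..i] == t[..i] && a[i + 1..] == t[i..]);
    }
    if t.len() == a.len() + 1 {
        // A neighboring key pressed in addition to an intended one.
        return (0..t.len()).any(|j| {
            let rest_matches = t[..j] == a[..j] && t[j + 1..] == a[j..];
            let near_prev = j > 0 && adjacent(t[j], a[j - 1]);
            let near_next = j < a.len() && adjacent(t[j], a[j]);
            rest_matches && (near_prev || near_next)
        });
    }
    false
}

/// Words of `lang` bridged by `typo`. Empty unless `typo` is outside `lang`
/// and at least two words explain it.
fn words_bridged_by<'a>(typo: &str, lang: &[&'a str]) -> Vec<&'a str> {
    let hits: Vec<&'a str> = lang.iter().copied().filter(|w| one_mistake(w, typo)).collect();
    let in_lang = lang.iter().any(|w| *w == typo);
    if hits.len() >= 2 && !in_lang { hits } else { Vec::new() }
}
```

Checking each word directly avoids generating every possible typo, at the cost of one pass over the whole word list per query.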

tyo tries to be inclusive, to inform you about typos people may consider imaginable. It's up to you to decide which to use.

For more information, run:

./target/release/tyo --help

Further Discussion

Definition

For a device D and a language L, a Typo-Bridge is a sequence T of elements, each of which could be produced by a performer using D, such that T is not in L and there are two other sequences A and B, both in L, for which T could be explained as an unsuccessful attempt to produce either A or B using D, with A and B estimated (by the observer) to be roughly equally likely to be the sequence the performer originally intended. That is, assuming D, the observer would perceive T as unintentional and would not be able to tell which of A or B was intended, yet would recognize both as plausible intentions.
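
One possible way to write this definition symbolically is given below. The notation is introduced here for illustration only and is not part of the original definition: Σ_D for the elements producible with D, E_D(X) for the sequences explainable as failed attempts to produce X, P_O for the observer's estimate, and the thresholds ε and δ for "roughly the same" and "likely".

```latex
% Assumed notation (not from the original text):
%   \Sigma_D          elements a performer can produce with device D
%   E_D(X)            sequences explainable as unsuccessful attempts at X using D
%   P_O(X \mid T, D)  observer O's estimate that X was intended, given T and D
%   \varepsilon       observer's tolerance for "roughly the same likelihood"
%   \delta            observer's threshold for "likely"
T \text{ is a Typo-Bridge for } (D, L)
\iff
T \in \Sigma_D^{*}
\;\wedge\; T \notin L
\;\wedge\; \exists\, A, B \in L,\ A \neq B :\;
\begin{cases}
  T \in E_D(A) \cap E_D(B), \\
  \bigl|\, P_O(A \mid T, D) - P_O(B \mid T, D) \,\bigr| \le \varepsilon, \\
  \min\bigl( P_O(A \mid T, D),\; P_O(B \mid T, D) \bigr) \ge \delta .
\end{cases}
```

Since P_O, ε and δ all belong to the observer, the same T may be a Typo-Bridge for one observer and not for another.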

Pseudo-Bridges

By the above definition, if T is in the language, it is not a Typo-Bridge. We may consider it a Pseudo-Bridge instead. In contexts where T, despite being in the language, would be unexpected to the observer, it may play a role similar to that of a real Typo-Bridge.

Alternative Devices & Languages

The name Typo-Bridge suggests typographical sequences, but Typo-Bridges may appear as other sequences as well, e.g. a phonological sequence with an assumed speech production system and a given language.

Note that for Typo-Bridges, errors are assumed to happen during production, not during reproduction or observation (for example, due to noise in the environment).

Long Bridges

The typos considered by tyo are one mistake away from A and one mistake away from B. Typo-Bridges may also be sequences whose paths to A and B are longer. However, supporting them would make tyo more complicated to code, and the results might be a bit too confusing for the typical human observer.

Deniability

Interestingly, beyond the artistic value of Typo-Bridges, they can be used to say potentially controversial things. A bit like dog whistling, they create a shield of plausible deniability, leaving the door open for you to escape. Speaking without spaking!

Room for Error

Using autocorrection may eliminate Typo-Bridges, as it converts all strings into words. Sometimes, however, it converts typos into the wrong words, potentially leading to Pseudo-Bridges. With the emergence of powerful language models, it might seem that we are losing the room for error. It could be, however, that we are simply shifting to another one. We are certainly losing the more traditional typos, but autocorrection and language models are still instruments and, as with any instrument used by an imperfect performer, some room for error is thankfully inevitable.

Contribute

Yes! The most straightforward way to contribute would be to enrich the repository with more devices. Also, if you find bugs, or typos you believe tyo should have shown in response to your query but didn't, please open an issue.

Future Work

Aside from optimization and adding devices, it would be nice to enhance tyo with a language model, so that tyo can fork your sentences while you write them, suggesting bridges based on the earlier words written in all branches.

Also, if there is interest, I may provide a SaaS to save you from downloading and installing dictionaries and models.

License

Copyright (C) Mongke Svoboda

Licensed under the GNU AGPLv3

Contact

You can reach me via email:

[a Typo-Bridge between "permission" and "emission" assuming ANSI keyboard with QWERTY layout] @ proton.me

I also have a PGP key:

FE728494E2A7341ECA2602E10FD24BC5F657507A
