#unicode

Here is a quick lesson in UTF-8.

UTF-8 encodes the entirety of Unicode in chunks of 1 to 4 bytes.

If you are given a random index within a string encoded in UTF-8, you can find the beginnings of the characters before and after that index.

If the byte at that index is an ASCII character (its high bit is 0), then it is a complete one-byte character on its own.
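
A minimal sketch of the boundary search described above, assuming the input is valid UTF-8. The function names `char_start` and `next_char_start` are made up for illustration. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so stepping over those bytes in either direction lands on the first byte of a character.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the character containing data[index]."""
    # Continuation bytes look like 0b10xxxxxx; step back until we leave them.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


def next_char_start(data: bytes, index: int) -> int:
    """Return the index where the character after the one at data[index] begins."""
    index += 1
    # Skip the remaining continuation bytes of the current character.
    while index < len(data) and (data[index] & 0xC0) == 0x80:
        index += 1
    return index


# One character each of 1, 2, 3 and 4 bytes: 10 bytes in total.
text = "aé✓😀".encode("utf-8")

print(char_start(text, 5), next_char_start(text, 5))
print(char_start(text, 8), next_char_start(text, 8))
```

Index 5 falls inside the three-byte check mark and index 8 inside the four-byte emoji; in both cases at most three steps are needed in either direction.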

#linux #bsd #unix

Hei #DigitalScholarlyEditions #TEIXML crowd!

🙋 I have a question for you: have you ever used the #Unicode symbol for insertion of a character or word into a text, U+2380 ⎀?

Or how would you encode the insertion mark in a text?

In the #OscarMamen pocket diaries, the insertion mark is used frequently and I would like to encode it somehow.

Suggestions? Examples?

I feel that I can no longer get any good results from googling: #Enshittification of the search engine, I guess...

Newly covered #Unicode code points in #iOS 26.

I have to admit I have not updated anything to 26 yet. At least on Mac I usually wait for #MacPorts issues to be cleared up, but this one might take me a while...

㇀㇁㇂㇃㇄㇅㇆㇇㇈㇉㇊㇋㇌㇍㇎㇏㇐㇑㇒㇓㇔㇕㇖㇗㇘㇙㇚㇛㇜㇝㇞㇟㇠㇡㇢㇣㇤㇥𞓐𞓑𞓒𞓓𞓔𞓕𞓖𞓗𞓘𞓙𞓚𞓛𞓜𞓝𞓞𞓟𞓠𞓡𞓢𞓣𞓤𞓥𞓦𞓧𞓨𞓩𞓪𞓫𞓮𞓯𞓬𞓭𞓰𞓱𞓲𞓳𞓴𞓵𞓶𞓷𞓸𞓹𠁣𠃛𠊎𠖄𠖫𠗻𠘆𠜖𠞩𠞭𠠃𠠝𠠫𠢕𠴭𠺅𠺣𠻞𡌴𡟓𡨞𡳞𡽜𢄧𢎙𢒉𢓜𢛟𢜳𢬳𢯭𢯾𢱤𢲴𢳪𢶀𢺴𢻷𢼌𢼛𢿞𣁳𣍐𣗺𣦼𣩈𣮈𣲩𣸤𣼎𤁢𤊶𤍒𤐙𤐰𤖯𤘅𤞚𤡯𤲍𤶃𤸁𤺅𤺪𤿎𥉔𥌚𥍉𥏘𥐵𥯟𥯥𥰔𥴊𥽕𦃓𦉎𦊓𦒨𦘅𦜆𧉅𧉟𧌄𧜞𧩣𧮙𧰵𧺤𧻴𧿳𨂿𨅔𨒇𨢑𩏠𩑾𩔵𩚨𩛩𩜄𩜇𩜰𩟗𩣳𩨑𩵱𩸙𩼧𪀋𪐞𪖐𪖶𪘒𪜶𪢼𪳕𪹚𫓩𫜼𫜽𫝏𫝘𫝙𫝞𫝺𫝻𫞭𫞼𫟂𫟊𫟧𫠄𫠛𫣆𫰡𬈜𬏛𬠖𬤐𬦰𬬺𬮤𮀎𮣳𮭦𮯴𰣻𰵝𰵞𰵧𰹬𰾫𱂐𱮒𱱿𱳪𲂎𲓖

Any guesses why the macOS Character Viewer lists guillemetleft and guillemetright under “Parentheses”? The guillemets share their Unicode general categories, Pi and Pf, with quotedblleft and quotedblright. The guillemets are missing from “Punctuation” and even from “Punctuation - All”. Are these really used as parens? Seems like a personal selection. #macos #unicode #categories

«»“”()[]
guillemetleft Pi
guillemetright Pf
quotedblleft Pi
quotedblright Pf
parenleft Ps
parenright Pe
bracketleft Ps
bracketright Pe
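
The categories listed above can be checked with Python's standard `unicodedata` module. This is only a quick sketch to reproduce the table; it says nothing about how the Character Viewer builds its own groupings.

```python
import unicodedata

# Map each character to its Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in "«»“”()[]"}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {ch} {cat}")
```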

Continued thread

IMO the reason most people don't know that there are official guidelines on what #Unicode codepoint sequences constitute a valid identifier is because languages largely don't bother to even discover that the standard exists, let alone implement it.

#Python is an exception to the rule: it has had UAX#31 support since Python 3.0¹²

C++ has switched over to this standard as of C++23 although I do not know all of the details. Fun fact: gcc and Clang are both perfectly happy to let you use a zero-width space in an identifier in earlier versions of C++.

¹ docs.python.org/3.0/reference/
² see PEP 3131 for historical details: peps.python.org/pep-3131/
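
As a rough illustration of the Python behaviour mentioned above, `str.isidentifier()` applies the language's identifier rules to a string. The sample names below are invented for the example.

```python
# Sample strings, chosen only for illustration.
samples = {
    "accented Latin": "naïve",
    "CJK": "变量",
    "leading digit": "2fast",
    "zero-width space inside": "foo\u200bbar",
}

# str.isidentifier() reports whether the string is a valid Python identifier.
results = {label: name.isidentifier() for label, name in samples.items()}

for label, ok in results.items():
    print(f"{label}: {ok}")
```

The first two are accepted; the leading digit and the zero-width space (U+200B) are rejected.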

Continued thread

Most people don't really know that the #Unicode Consortium publishes extremely well-defined guidelines on identifiers.

unicode.org/reports/tr31

The most familiar example is the sort who has a whole soapbox rant about how emoji are bad.

If that Hypothetical Guy makes reference to how some languages allow emoji in identifiers, they may not realize how much of their ass they are baring to the world, because in fact, the default guidelines don't allow them! The languages they are bitching about as a rule are not following the Unicode Consortium's guidance on identifiers, but some ad-hoc rules which are usually missing large swaths of conventional wisdom.

One of the cool things about UAX#31 though is that it allows you to create custom "profiles", changing up the rules a bit about what is or is not valid in an identifier to your own liking without entirely discarding the valuable wisdom of people who spend their professional lives thinking about these Hard Problems.

Anyway, the distinction between Recommended Scripts and Limited Use Scripts is along similar lines:

Recommended Scripts are the sort that UAX#31 thinks you probably should implement because they are "in widespread modern customary use".

Limited Use Scripts are ones that are less "encouraged" and which you might want to disallow as an implementer of the standard. It's not that they're disallowed, but they're not being *encouraged*.

For the sake of completeness, there are also Excluded Scripts which as the name suggests are recommended *against* because they are archaic/etc.
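
To make the "profile" idea concrete, here is a toy sketch, not an implementation of UAX#31 itself. It uses Python's `str.isidentifier()` as a stand-in for the default rules and lets a caller widen them with extra characters. The function `is_identifier` and its `extra_start` and `extra_continue` parameters are hypothetical names made up for this example.

```python
def is_identifier(name: str, extra_start: str = "", extra_continue: str = "") -> bool:
    """Python's default identifier rule, widened by a hypothetical profile."""
    if not name:
        return False
    # Swap profile-permitted characters for "_", which the default rule accepts
    # in any position, then let str.isidentifier() judge everything else.
    first = "_" if name[0] in extra_start else name[0]
    rest = "".join(
        "_" if (ch in extra_start or ch in extra_continue) else ch
        for ch in name[1:]
    )
    return (first + rest).isidentifier()


print(is_identifier("😀"))                          # default rules
print(is_identifier("😀", extra_start="😀"))        # profile that opts in to one emoji
print(is_identifier("kebab-case", extra_continue="-"))
```

Under the default rules the emoji is rejected; it is only accepted once a profile explicitly adds it.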

Continued thread

#Unicode 17.0.0 includes a bunch of script updates and some fun new symbols, including astronomical symbols for asteroids!

My favorite change is a fairly obscure tweak to Unicode Standard Annex #31, which defines guidelines and requirements for which codepoint sequences form valid identifiers (e.g. for programming languages)

The Bopomofo script (Zhùyīn fúhào) for phonetically representing Chinese was previously considered a "Recommended script" but has been recategorized as a "Limited use script" because it is generally used in educational contexts.

I just spent an excellent afternoon in the park chatting with a stranger about the Beri (the people whose writing system is the "camel script"), Unicode, Saharan languages, Hobyot, and that sort of thing.

Now I need to explain to him how to add a writing system, or characters, to Unicode.

@bortzmeyer

I just learned how to type unicode letters and dingbats in Linux!

Press Ctrl + Shift + U (all three keys at once), then release all three keys.

Then type the hexadecimal code point and press Enter.

en.wikipedia.org/wiki/List_of_

E.g.

Ctrl + Shift + U 2713 is a tick or check mark

Similarly, I can write ñ (n tilde) with:

Ctrl + Shift + U 00f1

See dingbats block for more check mark choices.
en.wikipedia.org/wiki/Dingbats

All of unicode here:
home.unicode.org/
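
The hexadecimal values typed above are Unicode code points, so the same lookup can be sketched in Python. The helper `from_hex` is a made-up name for this example.

```python
import unicodedata


def from_hex(code: str) -> str:
    """Turn a hexadecimal code point such as '2713' into its character."""
    return chr(int(code, 16))


for code in ("2713", "00f1"):
    ch = from_hex(code)
    print(f"U+{code.upper()} {ch} {unicodedata.name(ch)}")
```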
