[wp-trac] [WordPress Trac] #30130: Normalize characters with combining marks to precomposed characters

WordPress Trac noreply at wordpress.org
Tue Jan 21 05:09:23 UTC 2020


#30130: Normalize characters with combining marks to precomposed characters
------------------------------------+-----------------------------
 Reporter:  zodiac1978              |       Owner:  SergeyBiryukov
     Type:  enhancement             |      Status:  reviewing
 Priority:  normal                  |   Milestone:  5.4
Component:  Formatting              |     Version:
 Severity:  normal                  |  Resolution:
 Keywords:  dev-feedback has-patch  |     Focuses:
------------------------------------+-----------------------------

Comment (by a8bit):

 Replying to [comment:46 zodiac1978]:
 > Replying to [comment:45 a8bit]:
 >
 > That shows IMHO exactly why everything **should be** normalized to NFC.
 Because then we have a common ground. macOS is using NFD (decomposed
 characters) internally and that's why Safari does normalize files on
 upload. But Chrome/Firefox are not doing this. We could wait for the
 browsers to fix it or we can fix it in WordPress.
 >
 IMO it shows that everything **should be** normalized, just not
 necessarily to NFC. There is no way Apple is going to adopt NFC; NFC is
 described by Unicode as being for legacy systems. The future appears to
 be NFD.
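
 To make the difference between the two forms concrete, here is a minimal
 Python sketch (not part of the patch, and not WordPress code) that builds
 the same visible character "é" in NFC and in NFD using the standard
 `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

# The same visible character in both normalization forms.
composed = unicodedata.normalize("NFC", "e\u0301")   # one code point: U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # two code points: U+0065 U+0301

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False: a plain comparison fails
print(composed.encode("utf-8"))         # b'\xc3\xa9'
print(decomposed.encode("utf-8"))       # b'e\xcc\x81'

# Normalizing both sides to one common form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

 Whichever form is chosen, the comparison only works once both sides
 share it.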

 >
 > That's correct, because the filesystems themselves (HFS+ and APFS, for
 example) use NFD and not NFC.
 >

 This means that if all text in WordPress is normalized to NFC, any
 comparison against files on APFS whose names contain multi-byte
 characters is going to fail.

 I solved my problem today by writing a function that checks for the
 existence of files using both forms, doubling the file I/O operations in
 the process. Not exactly optimal.
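
 A check of that kind could look roughly like the sketch below. This is
 an illustration in Python, not the function I actually wrote; the name
 `exists_in_either_form` is hypothetical, and it assumes the only
 difference between the candidate paths is their normalization form.

```python
import os
import unicodedata

def exists_in_either_form(path):
    """Return True if path exists as given, in NFC, or in NFD.

    Hypothetical helper: it can cost up to three filesystem lookups per
    call, which is the overhead complained about above.
    """
    # A set removes duplicates, e.g. when the path is pure ASCII and all
    # three spellings are identical.
    candidates = {
        path,
        unicodedata.normalize("NFC", path),
        unicodedata.normalize("NFD", path),
    }
    return any(os.path.exists(candidate) for candidate in candidates)
```

 On a filesystem that stores names byte for byte, a file saved under the
 NFD spelling of "café.txt" is found by this helper even when the caller
 passes the NFC spelling.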

 > Windows doesn't force decomposition and I don't think you should do this
 and I can't find your source on MSDN if I google this text. Can you please
 share the link, so that I can check the source myself?

 It was quoted as a source on the Wikipedia page for precomposed
 characters: http://msdn.microsoft.com/en-us/library/aa911606.aspx

 > Agreed, but what would be the alternative? We could check and warn the
 user, as this is recommended by the document. But as the module with the
 needed function is optional that wouldn't be very reliable:

 The alternative would be NFD.

 > or we could normalize locale-specific, because the biggest problem seems
 to be that other languages may have a problem with normalization:

 That would be great if no one ever read a website outside of their own
 country.

 > I think there are not many cases where you will really need NFD text.
 The advantages of a working search, working proofreading, etc. are
 outweighing any possible edge cases where the NFD text is needed.

 They said that about 4-digit years ;)

 I could mention that search and sort become more flexible with NFD,
 because you can choose to do those things with or without the combining
 marks. I don't see how proofreading is improved with NFC.
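
 As a rough illustration of that flexibility, the Python sketch below
 decomposes text to NFD and drops the combining marks to get a
 mark-insensitive search. The function names `strip_marks` and `contains`
 are hypothetical. One caveat it makes visible: a naive substring test on
 NFD text would already match "cafe" inside "café", so this sketch does
 the strict comparison on NFC instead.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains(haystack, needle, ignore_marks=False):
    """Substring search that can either respect or ignore combining marks."""
    if ignore_marks:
        return strip_marks(needle) in strip_marks(haystack)
    # Strict mode compares in NFC so that a base letter followed by a
    # combining mark is not mistaken for the bare base letter.
    return (unicodedata.normalize("NFC", needle)
            in unicodedata.normalize("NFC", haystack))

print(contains("Z\u00fcrich caf\u00e9", "cafe", ignore_marks=True))  # True
print(contains("Z\u00fcrich caf\u00e9", "cafe"))                     # False
```

 Because the normalization happens at comparison time, this works
 regardless of which form the text was stored in.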

 > I am still recommending to get this patch in and then see what breaks
 (if something breaks).

 I hope it all goes well. I don't have any skin in this game; I was merely
 flagging one of the edge cases I actually hit today in case no one had
 thought of it. Apple not allowing NFC is going to cause issues for
 international macOS users when comparing source and destination data. It
 remains to be seen how big an issue that will be, but I accept it's
 likely to be quite small.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/30130#comment:47>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

