[wp-trac] [WordPress Trac] #39791: sanitize_file_name() optimizations

WordPress Trac noreply at wordpress.org
Sun Feb 5 23:46:20 UTC 2017


#39791: sanitize_file_name() optimizations
-------------------------+------------------------------
 Reporter:  mgutt        |       Owner:
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  Awaiting Review
Component:  Media        |     Version:  trunk
 Severity:  normal       |  Resolution:
 Keywords:               |     Focuses:
-------------------------+------------------------------
Changes (by SergeyBiryukov):

 * component:  General => Media


Old description:

> This changeset:
> https://core.trac.wordpress.org/changeset/29290
>
> added this line:
> {{{#!php
> $filename = str_replace( array( '%20', '+' ), '-', $filename );
> }}}
>
> But because of this changeset it can be removed as those chars aren't
> present anymore:
> https://core.trac.wordpress.org/changeset/35122
>

> '''Additional proposals'''
>
> 1.) After many years new special characters are added step-by-step to
> sanitize_file_name(). Now almost all characters of the reserved file
> system, reserved URI and unsafe URL characters lists are part of it,
> except of:
>
> reserved file system chars
> (https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words)
> {{{
> chr(0), ..., chr(32)
> }}}
>

> the reserved URI char (https://tools.ietf.org/html/rfc3986#section-2.2):
> {{{
> @
> }}}
>

> the unsafe URL char (https://www.ietf.org/rfc/rfc1738.txt):
> {{{
> ^
> }}}
>
> non-printing DEL:
> {{{
> chr(127)
> }}}
>
> Finally you should add all these chars to avoid future bug reports:
> {{{#!php
> $special_chars = array(
>         // file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
>         '<', '>', ':', '"', '/', '\\', '|', '?', '*',
>         // control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
>         // note: \t, \n and \r are chr(9), chr(10) and chr(13)
>         chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
>         chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
>         chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
>         chr(31),
>         // non-printing character <DEL>
>         chr(127),
>         // non-breaking space
>         chr(160),
>         // URI reserved https://tools.ietf.org/html/rfc3986#section-2.2
>         '#', '[', ']', '@', '!', '$', '&', "'", '(', ')', '+', ',', ';', '=',
>         // URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
>         '{', '}', '^', '~', '`'
> );
> }}}
>
> If you do that, do not forget to change this line:
> {{{#!php
> $filename = preg_replace( '/[\r\n\t -]+/', '-', $filename );
> }}}
>

> to that (because we replaced the other chars already):
> {{{#!php
> $filename = preg_replace( '/[ -]+/', '-', $filename );
> }}}
>
> and remove this line because we cover it already through chr(160):
> {{{#!php
> $filename = preg_replace( "#\x{00a0}#siu", ' ', $filename );
> }}}
>
> Source: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
>
> 2.) mb_strtolower() could be used to raise windows/unix interoperability
> (when downloading ftp backups or moving the host) because of their
> different behaviour in case-sensitivity.

New description:

 This changeset: [29290]

 added this line:
 {{{#!php
 $filename = str_replace( array( '%20', '+' ), '-', $filename );
 }}}

 Because of the following changeset, that line can be removed, as those
 characters are no longer present at that point: [35122]
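
 A minimal sketch of why the line becomes a no-op. It assumes, as the
 description implies, that '%' and '+' have already been stripped by the
 special-character pass. The sample filename is made up for illustration.

```php
<?php
// Sketch only: assumes '%' and '+' are removed before the str_replace() in question runs.
$filename = 'my%20summer+photo.jpg';

// Stand-in for the earlier special-character pass.
$filename = str_replace( array( '%', '+' ), '', $filename );
$before   = $filename;

// The line added in [29290]: nothing is left for it to match.
$filename = str_replace( array( '%20', '+' ), '-', $filename );

var_dump( $before === $filename ); // bool(true)
echo $filename, "\n";              // my20summerphoto.jpg
```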


 '''Additional proposals'''

 1.) Over the years, new special characters have been added step by step
 to sanitize_file_name(). By now it covers almost all characters from the
 reserved file system, reserved URI and unsafe URL character lists,
 except for:

 reserved file system chars
 (https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words)
 {{{
 chr(0), ..., chr(32)
 }}}


 the reserved URI char (https://tools.ietf.org/html/rfc3986#section-2.2):
 {{{
 @
 }}}


 the unsafe URL char (https://www.ietf.org/rfc/rfc1738.txt):
 {{{
 ^
 }}}

 non-printing DEL:
 {{{
 chr(127)
 }}}

 Ideally, all of these characters would be added to avoid future bug reports:
 {{{#!php
 $special_chars = array(
         // file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
         '<', '>', ':', '"', '/', '\\', '|', '?', '*',
         // control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
         // note: \t, \n and \r are chr(9), chr(10) and chr(13)
         chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
         chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
         chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
         chr(31),
         // non-printing character <DEL>
         chr(127),
         // non-breaking space
         chr(160),
         // URI reserved https://tools.ietf.org/html/rfc3986#section-2.2
         '#', '[', ']', '@', '!', '$', '&', "'", '(', ')', '+', ',', ';', '=',
         // URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
         '{', '}', '^', '~', '`'
 );
 }}}
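
 The control-character part of such a list could also be generated
 rather than typed out. The following is only a sketch: the variable name
 $special_chars_sketch and the sample filename are hypothetical, and
 chr(160) is left out on purpose, because in a UTF-8 string a lone 0xA0
 byte can be part of a multibyte character (for example "à" is 0xC3 0xA0).

```php
<?php
// Sketch only: $special_chars_sketch is a hypothetical name, not WordPress core code.
$control_chars   = array_map( 'chr', range( 0, 31 ) ); // chr(0) ... chr(31)
$control_chars[] = chr( 127 );                         // non-printing <DEL>

$special_chars_sketch = array_merge(
	array( '<', '>', ':', '"', '/', '\\', '|', '?', '*' ), // file system reserved
	$control_chars,
	array( '#', '[', ']', '@', '!', '$', '&', "'", '(', ')', '+', ',', ';', '=' ), // URI reserved
	array( '{', '}', '^', '~', '`' ) // URL unsafe
);

// Strip every listed character from a made-up filename.
$filename = "my\tfile@home^v2" . chr( 127 ) . '.txt';
$filename = str_replace( $special_chars_sketch, '', $filename );

echo $filename, "\n"; // myfilehomev2.txt
```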

 If you do that, do not forget to change this line:
 {{{#!php
 $filename = preg_replace( '/[\r\n\t -]+/', '-', $filename );
 }}}


 to this (because the other characters have already been removed):
 {{{#!php
 $filename = preg_replace( '/[ -]+/', '-', $filename );
 }}}
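
 A small check of that claim, as a sketch: it assumes \r, \n and \t have
 already been stripped by the special-character pass, which is the
 precondition stated above. The input string is invented for the example.

```php
<?php
// Sketch only: assumes \r, \n and \t were already stripped by the special-character pass.
$raw      = "my \t file\r\n--name.txt";
$stripped = str_replace( array( "\r", "\n", "\t" ), '', $raw );

$old = preg_replace( '/[\r\n\t -]+/', '-', $stripped ); // current pattern
$new = preg_replace( '/[ -]+/', '-', $stripped );       // proposed pattern

var_dump( $old === $new ); // bool(true)
echo $new, "\n";           // my-file-name.txt
```

 Without the earlier stripping step the two patterns are not equivalent,
 since the shorter one would leave tabs and line breaks in place.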

 and remove this line because we cover it already through chr(160):
 {{{#!php
 $filename = preg_replace( "#\x{00a0}#siu", ' ', $filename );
 }}}

 Source: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
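
 One caveat worth checking before removing that line, shown as a sketch
 and assuming filenames arrive as UTF-8: chr(160) is the single byte
 0xA0, while U+00A0 is encoded in UTF-8 as the two bytes 0xC2 0xA0. The
 hex dumps below compare what each approach leaves behind.

```php
<?php
// Sketch only: compares the /u regex with a byte-wise chr(160) replacement on UTF-8 input.
$nbsp_utf8 = "\xC2\xA0"; // U+00A0 encoded as UTF-8 (two bytes)
$name      = 'a' . $nbsp_utf8 . 'b';

// The /u pattern matches the whole code point and swaps it for a plain space.
$via_regex = preg_replace( "#\x{00a0}#siu", ' ', $name );

// str_replace() works on bytes, so only the 0xA0 byte is removed and 0xC2 stays behind.
$via_chr = str_replace( chr( 160 ), '', $name );

echo bin2hex( $via_regex ), "\n"; // 612062
echo bin2hex( $via_chr ), "\n";   // 61c262
```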

 2.) mb_strtolower() could be used to improve Windows/Unix interoperability
 (for example when downloading FTP backups or moving to another host),
 because the two systems handle case sensitivity differently.
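
 As a sketch of that idea, assuming the mbstring extension is loaded and
 the filename is UTF-8 (the sample name is made up):

```php
<?php
// Sketch only: assumes the mbstring extension is available and the name is UTF-8.
$uploaded = 'Holiday-ÄÖ.JPG';

// Lowercase the whole name, including non-ASCII letters.
$lowered = mb_strtolower( $uploaded, 'UTF-8' );

echo $lowered, "\n"; // holiday-äö.jpg
```

 Two uploads that differ only in case would then map to the same name,
 so they no longer collide when copied to a case-insensitive file system.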

--

--
Ticket URL: <https://core.trac.wordpress.org/ticket/39791#comment:1>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

