10268 – smarter file extension splitting

Status:

RESOLVED: FIXED

Priority:
Medium

Severity:
normal

Product:
Thunar

Component:

Renamer

Comments

Description Jeff Shipley 2013-07-26 23:52:15 CEST

Created attachment 5105 
smarter file extensions

Right now, Thunar splits filenames in a really simple (and often incorrect) way: it grabs the last "." in the filename and treats everything after it as the extension. This ignores compound extensions like .tar.gz (or anything else followed by .gz, .bz2, .z, etc.).
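To illustrate the behaviour described above, here is a minimal sketch of a last-dot split. The name naive_extension is hypothetical and this is not Thunar's actual code; it only shows why "archive.tar.gz" ends up with ".gz" as its extension.

```c
#include <string.h>

/* Hypothetical helper (not Thunar's function): returns a pointer to the
 * last '.' in name, or NULL when the name contains no dot at all. */
static const char *
naive_extension (const char *name)
{
  return strrchr (name, '.');
}
```

With this approach naive_extension ("archive.tar.gz") points at ".gz", so a rename would treat "archive.tar" as the base name.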

This problem is impossible to solve without being able to know the user's intentions for a filename.

Here's a patch that creates a new method for hopefully smarter file extension splitting (that will be incorrect a little less often than the current simple implementation) in thunar-util and then uses that method for simple rename and bulk rename.

Comment 1 Nick Schermer 2013-07-27 02:54:13 CEST

Created attachment 5106 
Function that acts like strrchr

I don't really like the idea of the regex (regexes are slow). It's also quite convenient if the function behaves like strrchr does (returning a pointer into the string).

I only patched the rename dialog to show how it works. It's probably also wise to check it with non-UTF-8 names, but I think it's safe, although it does check raw strings.
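A rough sketch of why a strrchr-like return value is convenient for callers: the pointer can be turned straight into the length of the part to preselect. Both function names are hypothetical, and the lookup is just a placeholder for whatever the real patch does.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder lookup with a strrchr-like contract: the result points
 * into name itself, or is NULL when nothing was found. */
static const char *
find_extension (const char *name)
{
  return strrchr (name, '.');
}

/* Number of bytes in front of the extension, i.e. the part a rename
 * dialog would preselect; the whole name when there is no extension. */
static size_t
basename_length (const char *name)
{
  const char *ext = find_extension (name);

  return ext != NULL ? (size_t) (ext - name) : strlen (name);
}
```

Note that this is a byte count. A text widget that counts characters would still need a conversion for non-ASCII names.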

Comment 2 Jeff Shipley 2013-07-27 07:56:12 CEST

Created attachment 5107 
Nick's patch with a few tweaks

I can see the point in making this faster, especially when it comes to bulk renaming files. I may try some tests to see how much faster it is.

It looks good. I tweaked a couple of things (test for extension being only ".", added some comments, applied it to bulk rename).

I've done a bit of testing with unicode characters. I'm going to come up with a good variety of filenames some time this weekend to test this.

Comment 3 Nick Schermer 2013-07-28 12:30:40 CEST

Pushed patch in 5e25c20 with some additional comments and remarks. Also fixed the selection in the properties dialog and new-file dialog.

Comment 4 Jeff Shipley 2013-07-29 01:41:51 CEST

Created attachment 5108 
Ignore dotfiles

Comment 5 Jeff Shipley 2013-07-29 01:44:37 CEST

I reset to master and built again and found a couple of problems.

For dot files with no extension (e.g. ".filename"), the entire filename gets treated as the extension. The simple renamer (where I did a lot of the testing) still selects the entire name. It looks like it does this because it doesn't change the selection if the offset is 0.

Patch for this is "Ignore dotfiles" (attachment 5108 ).
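The kind of guard this needs can be sketched as follows. This is an assumption about the shape of the fix, not the contents of attachment 5108, and extension_skip_dotfiles is a made-up name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: like a last-dot lookup, but reports "no extension"
 * for names where the only usable dot is the leading or trailing one. */
static const char *
extension_skip_dotfiles (const char *name)
{
  const char *dot = strrchr (name, '.');

  /* no dot, last dot is the first character (".filename"),
   * or last dot is the final character ("filename.") */
  if (dot == NULL || dot == name || dot[1] == '\0')
    return NULL;

  return dot;
}
```

A NULL result then means "select the whole name", which is what the dialog should do for ".filename".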

Another problem is that wide unicode characters in the secondary extension will not split properly when testing compression extensions, because the length is measured in bytes. For example, "filename.שּשּ.gz" will split to "filename.שּשּ" and ".gz" ('שּ' is three bytes wide).

If there's any expectation that wide unicode characters will be part of a file extension, this could be fixed by using g_utf8_pointer_to_offset() to calculate the extension length. This would be slower though, so it is probably fine as-is.
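To show the difference between the two measurements, here is a plain C stand-in for the character count that g_utf8_pointer_to_offset() would give. The helper name is hypothetical and it assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Count UTF-8 characters in the byte range [start, end) by skipping
 * continuation bytes (those of the form 10xxxxxx).
 * Assumes the range holds valid UTF-8. */
static size_t
utf8_chars_between (const char *start, const char *end)
{
  const char *p;
  size_t      count = 0;

  for (p = start; p < end; p++)
    if (((unsigned char) *p & 0xC0) != 0x80)
      count++;

  return count;
}
```

For the two-character extension in the example above, the byte length is 6 while the character count is 2, which is why a byte-based length limit rejects it.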

Here are some filenames I used to test:
Should match entire name:
.filename
.filename.
filename.

Should match 3 extensions:
.filename.templatefile.in.in
filename.something.in.in

Should match 2 extensions:
.filename.tar.gz
.filename.שּ.gz
.filename.שּשּ.gz
filename.asdfg.gz
.filename.templatefile.in

Should match 1 extension:
.filename.gz
filename.asdfgh.gz
filename.asd
filename.asdf
filename.asdfg
filename.asdfghij
.tar.gz
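
The expectations in this list can be reproduced with a sketch like the one below. It is not the pushed patch: the function names, the short suffix list, the 5-byte limit on the part before a compression suffix (inferred from "asdfg" vs "asdfgh" above) and the ".in" handling are all assumptions, and matching is case-sensitive to keep it short.

```c
#include <stddef.h>
#include <string.h>

/* Find the last '.' strictly before end, or NULL if there is none. */
static const char *
prev_dot (const char *name, const char *end)
{
  while (end > name)
    if (*--end == '.')
      return end;
  return NULL;
}

/* Hypothetical sketch: returns a pointer to the start of the (possibly
 * compound) extension inside name, or NULL if there is none. */
static const char *
smart_extension (const char *name)
{
  static const char *const compressed[] = { "gz", "bz2", "z" };
  const char *dot = strrchr (name, '.');
  const char *dot2;
  size_t      i;
  int         was_in;

  /* no dot, dotfile without extension, or trailing dot */
  if (dot == NULL || dot == name || dot[1] == '\0')
    return NULL;

  /* compression suffix: also take a short part in front of it (.tar.gz) */
  for (i = 0; i < sizeof compressed / sizeof compressed[0]; i++)
    if (strcmp (dot + 1, compressed[i]) == 0)
      {
        dot2 = prev_dot (name, dot);
        if (dot2 != NULL && dot2 != name
            && dot - dot2 - 1 >= 1 && dot - dot2 - 1 <= 5) /* bytes */
          dot = dot2;
        return dot;
      }

  /* ".in" templates: step back over further ".in" parts, then take
   * the extension in front of them, whatever its length */
  if (strcmp (dot + 1, "in") == 0)
    for (i = 0; i < 3; i++)
      {
        dot2 = prev_dot (name, dot);
        if (dot2 == NULL || dot2 == name || dot - dot2 < 2)
          break;
        was_in = (dot - dot2 == 3 && strncmp (dot2, ".in", 3) == 0);
        dot = dot2;
        if (!was_in)
          break;
      }

  return dot;
}
```

Because the limit is counted in bytes, this sketch matches only ".gz" for ".filename.שּשּ.gz", the same limitation described above.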

Comment 6 Nick Schermer 2013-07-29 10:20:03 CEST

I've pushed the patch to not match hidden names.

The fact that hidden chars are not matched is good, since all extensions are always unicode. So I think we're good here now.

Bug #10268

Reported by:
Jeff Shipley

Reported on: 2013-07-26
Last modified on: 2013-07-29

People

Assignee:
Jannis Pohlmann

CC List:
1 user

Version

Version:
unspecified

Target Milestone:
1.2.0

Attachments

smarter file extensions (7.81 KB, patch) 2013-07-26 23:52 CEST , Jeff Shipley	no flags
Function that acts like strrchr (3.48 KB, patch) 2013-07-27 02:54 CEST , Nick Schermer	no flags
Nick's patch with a few tweaks (4.71 KB, patch) 2013-07-27 07:56 CEST , Jeff Shipley	no flags
Ignore dotfiles (1.92 KB, patch) 2013-07-29 01:41 CEST , Jeff Shipley	no flags
