7110 – Sorting by name sometimes doesn't work correctly in the folder where some included folders have Cyrillic filenames

Status: RESOLVED FIXED
Priority: Medium
Severity: minor
Product: Thunar
Component: General

Comments

Description jufofu 2011-01-16 22:49:56 CET

Created attachment 3358 
the screenshot

Sometimes when I open a folder in Thunar (item arrangement set to "By Name"), some folders are placed into a "second series" of the arrangement.

For example, it goes 0-9, then A-Z, then А-Я (Cyrillic), and then... again 0-9, then A-Z, then А-Я! The folders in the first and the second series aren't the same (I have never seen this happen with files, only with folders), but I have no idea what determines which series a folder belongs to.

So, having both non-Cyrillic and Cyrillic folders in one folder may trigger this bug. Workaround: change the item arrangement to any other (e.g. "By Modification Date"), then change it back to "By Name".

Screenshot included: http://www4.picturepush.com/photo/a/4880597/img/4880597.png

Comment 1 Masato Hashimoto 2012-04-10 09:31:09 CEST

Created attachment 4311 
Screenshot in Japanese

This issue seems to occur every time on thunar-git, and jufofu's workaround doesn't help there.
I get this issue with Japanese file names.
Attached is a screenshot of Thunar and bash.

Comment 2 Masato Hashimoto 2012-04-10 09:37:33 CEST

Created attachment 4312 
test sample

Attached are the sample test files from comment #1.
The numbers following the Japanese characters in each file name are the Unicode code points of those characters.

Comment 3 Stephan Arts 2012-04-14 22:22:04 CEST

The problem seems to be caused by (the use of) glib.

g_utf8_get_char () returns '0' on the first character.

Comment 4 Stephan Arts 2012-04-14 22:30:35 CEST

To clarify: 'utf-8' ordering fails due to the problem described above.

Comment 5 Andrzej 2012-04-30 11:35:21 CEST

Created attachment 4375 
A fix.

Seems to work here.

I know nothing about Thunar internals so I can't guarantee that the patch is correct.

(thank you Hashimoto-san, greetings from Japan).

Comment 6 Stephan Arts 2012-04-30 15:22:51 CEST

This patch would sort the items as follows:

(test) John Doe.txt
あおい輝彦.3042-304a-3044-8f1d-5f66.txt
Alan Smithee.txt
一ノ瀬泰造.4e00-30ce-702c-6cf0-9020.txt
一条忠頼.4e00-6761-5fe0-983c.txt
一青窈.4e00-9752-7a88.txt
堀口雅也.5800-53e3-96c5-4e5f.txt
堀孝史.5800-5b5d-53f2.txt
朱謙之.6731-8b19-4e4b.txt

Did we run into another bug with the sorting algorithm?

Comment 7 Andrzej 2012-04-30 19:30:18 CEST

Created attachment 4377 
More fixes.

I've found two more bugs:
- comparison function should not return 0 ("equal"), even if we're using case insensitive sorting
- arguments of strcoll were truncated to a single character - strcoll doesn't like it and returns a different result than with a longer string.

The sorting order now closely follows behavior of strcoll, so if there are
any problems with it, they are likely coming from strcoll.
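The two fixes listed above can be illustrated with a small sketch. It is not the attached patch: `compare_names` is a hypothetical helper, and it assumes the caller has already selected the collation locale (for example with `setlocale (LC_COLLATE, "")`). It passes the complete strings to strcoll and falls back to a byte-wise comparison so that two different names are never reported as equal.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not Thunar's actual code: order two file names
 * by locale collation, never returning 0 for names that differ. */
int
compare_names (const char *a, const char *b)
{
  int result;

  assert (a != NULL && b != NULL);

  /* Collate the complete strings rather than single characters. */
  result = strcoll (a, b);

  /* strcoll() may report 0 for names that differ byte-wise; fall back
   * to strcmp() so that only identical names compare equal. */
  if (result == 0)
    result = strcmp (a, b);

  return result;
}
```

A comparator of this shape gives the sort a stable total order; the actual order of non-identical names is still whatever strcoll produces for the active locale.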

Comment 8 Andrzej 2012-04-30 19:55:21 CEST

Created attachment 4378 
More fixes.

Added one more bugfix - a check for filename length of otherwise identical utf-8 filenames.

Comment 9 Andrzej 2012-05-01 06:12:25 CEST

IMHO the code is ready to be used, I don't have anything else to add.

There are some remaining issues, which cannot be easily fixed:

1. 'A' < 'a' but 'ą' < 'Ą' - this is because the former comes from an ASCII code comparison, and the latter from strcoll.
Reported upstream: http://sourceware.org/bugzilla/show_bug.cgi?id=14039

  Possible solutions:
  - always use strcoll - gives a consistent ('a' < 'A' and 'ą' < 'Ą') ordering but is slower for ascii characters, especially in case insensitive mode.
  - just flip 'a-z' and 'A-Z' codes manually [1] (also gives 'a' < 'A' and 'ą' < 'Ą')
  - wait for http://sourceware.org/bugzilla/show_bug.cgi?id=14039 to be resolved (would give 'A' < 'a' and 'Ą' < 'ą' but that's very unlikely)

2. あ < a < あa < aa < あaa
Reported upstream: http://sourceware.org/bugzilla/show_bug.cgi?id=14038

  No solution (but hopefully this will be fixed upstream). If fixed, then the workaround in the patch (g_strconcat) will not be necessary, so we can then improve performance a bit by removing it.

Comment 10 Andrzej 2012-05-01 06:34:08 CEST

Created attachment 4379 
Swap ascii codes a-z and A-Z

This is a patch implementing the solution 1.2 from comment #9. It's likely much faster than solution 1.1.

It *changes* the sorting order of ascii characters to make it consistent with the order of non-ascii ones.
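A minimal sketch of the idea behind this kind of swap, assuming a plain byte-wise comparison of the two names. `swap_ascii_case` and `compare_ascii_swapped` are hypothetical names used for illustration only; this is not the attached patch.

```c
#include <assert.h>
#include <stddef.h>

/* Map 'a'-'z' to 'A'-'Z' and 'A'-'Z' to 'a'-'z'; leave every other
 * byte (including all non-ASCII UTF-8 bytes) unchanged. */
unsigned char
swap_ascii_case (unsigned char c)
{
  if (c >= 'a' && c <= 'z')
    return (unsigned char) (c - 'a' + 'A');
  if (c >= 'A' && c <= 'Z')
    return (unsigned char) (c - 'A' + 'a');
  return c;
}

/* Byte-wise comparison that looks at the swapped codes, so lowercase
 * ASCII letters order before uppercase ones. */
int
compare_ascii_swapped (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);

  for (;; a++, b++)
    {
      unsigned char ca = swap_ascii_case ((unsigned char) *a);
      unsigned char cb = swap_ascii_case ((unsigned char) *b);

      if (ca != cb)
        return ca < cb ? -1 : 1;
      if (ca == '\0')
        return 0;
    }
}
```

With this mapping 'a' sorts before 'A', which matches the 'ą' < 'Ą' order reported for strcoll in comment #9. A side effect is that every lowercase ASCII letter sorts before every uppercase one.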

Comment 11 Andrzej 2012-05-01 10:37:46 CEST

Got some feedback from glibc bugzilla

1. They recommend using strxfrm for converting the string so that it matches strcoll ordering during simple comparison.

   However, strxfrm itself is pretty heavy; if we wanted "proper" sorting we could simply switch to using strcoll on all strings. So my suggestion is to use the patch swapping 'a-z' for 'A-Z': maybe not the prettiest, but it does 90% of what strxfrm does at almost zero cost.

2. Weird ordering of Japanese characters and our workaround - apparently there are no Japanese language definitions in iso14651_t1_common file, which means they are ignored in the first pass and handled in the second one.

   They said that the "workaround" is indeed a correct way of using strcoll as there might be other ignored characters.

   There was no indication whether a Japanese definition will be added to the iso14651_t1_common file, but the bug was not closed, so I imagine that is still on the table.

My conclusion:
Current patches are doing as much as we can without sacrificing performance in ascii case (otherwise we could switch to strcoll completely). Other errors are mostly caused by limitations of strcoll in glibc (possibly will be resolved later).
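For reference, the strxfrm approach mentioned in point 1 can be sketched as follows. This is only an illustration under stated assumptions: `make_sort_key` is a hypothetical helper, the caller is assumed to have set the collation locale beforehand, and none of this is taken from the patches attached here. The key is computed once per name; comparing two keys with strcmp then gives the same ordering as calling strcoll on the original names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated strxfrm() key for
 * name, or NULL if the allocation fails. The caller frees the key. */
char *
make_sort_key (const char *name)
{
  size_t len;
  char *key;

  assert (name != NULL);

  /* With a size of 0, strxfrm() only reports the length it needs. */
  len = strxfrm (NULL, name, 0);

  key = malloc (len + 1);
  if (key != NULL)
    strxfrm (key, name, len + 1);

  return key;
}
```

The transformation cost is paid once per file name instead of once per comparison, which is the trade-off discussed above: the keys take extra memory, and building them is slower than a plain ASCII comparison.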

Comment 12 Andrzej 2012-05-01 15:16:11 CEST

Created attachment 4380 
sort using g_utf8_collate_key_for_filename()

After discussion on IRC we have decided to try the g_utf8_collate_key_for_filename() function. It doesn't support number sort (and there is no way to add it efficiently), but should do a better job at sorting, and can potentially be faster (sorting itself is done by a key comparison, cost of collation is unknown).

Comment 13 Andrzej 2012-05-01 15:27:56 CEST

Created attachment 4381 
plugged a memory leak in the previous patch

Comment 14 Masato Hashimoto 2012-05-02 07:23:52 CEST

(In reply to comment #13)
> Created attachment 4381 
> plugged a memory leak in the previous patch

Andrzej-san:

Sorry for late reply.
Your patch works fine!!
Thank you for your quick work in spite of the Golden Week :)

Comment 15 Algimantas Margevičius 2012-05-02 09:15:55 CEST

note: "ū" is in wrong place, it's between "j" and "k" but it should be in the end of alphabet (I looked at the Maori, Hawaiian, Marshallese, Lithuanian, Livonian, Latvian and Cornish alphabets; in all of them that letter is at the end, before "v" or "w").

Comment 16 Andrzej 2012-05-02 09:43:29 CEST

(In reply to comment #15)
> note: "ū" is in wrong place, it's between "j" and "k" but it should be in
> the end of alphabet

Which patch are you using, and what is your locale (LC_COLLATE)?

I've checked that with LC_COLLATE=POSIX "ū" is after "z"
I don't have Lithuanian locale installed so I can't check it here but different locales yield different results (e.g. with LC_COLLATE=pl_PL.UTF8 "ū" is between "u" and "v")

Note that with patch #13 sorting is done by glib (and ultimately by glibc), so if you see any errors they come either from an error in your system configuration or from a bug in these libraries (glibc).

Comment 17 Algimantas Margevičius 2012-05-02 09:51:20 CEST

(In reply to comment #16)
> (In reply to comment #15)
> > note: "ū" is in wrong place, it's between "j" and "k" but it should be in
> > the end of alphabet
> 
> Which patch are you using, and what is your locale (LC_COLLATE)?
> 
> I've checked that with LC_COLLATE=POSIX "ū" is after "z"
> I don't have Lithuanian locale installed so I can't check it here but
> different locales yield different results (e.g. with LC_COLLATE=pl_PL.UTF8
> "ū" is between "u" and "v")
> 
> Note that with patch #13 sorting is done by glib (and ultimately by glibc),
> so if you see any errors they come either from an error in your system
> configuration or from a bug in these libraries (glibc).

I'm using the patch from comment #13.
LC_COLLATE=C
On my system, the "LANG" variable holds the kind of value you are mentioning; in my case it is lt_LT.UTF-8

Comment 18 Andrzej 2012-05-02 10:18:07 CEST

Most likely there is no bug (if you use correct LC_COLLATE), or if there is, it is not in thunar.

Try this:
/close *all* thunar windows /
$ thunar -q
$ LC_COLLATE=lt_LT.UTF-8 thunar

If there are any problems tell me about it on #xfce (irc.freenode.net). Bugzilla is not a support forum.

Comment 19 bw.owlet 2012-08-14 04:27:06 CEST

The latest version of Thunar (1.4.0) incorrectly sorts the contents of folders that have Cyrillic letters in file and folder names.

Here is the contents of one folder "sorted" by name (ascending).
Looking from top to bottom I see file names starting with...
* cyrillic upper letters
* digits
* cyrillic lower letters
* again digits
* again cyrillic lower letters
* english lower letters
* and again cyrillic lower letters

Concrete example of another folder "sorted" by name:
> голубь.txt
> иволга.txt
> аист.txt
> орёл.txt
> сова.txt

Absolutely wrong order. The file whose name starts with 'A' is in the middle.

The Cyrillic alphabet is:

Аа Бб Вв Гг Дд Ее Ёё Жж Зз Ии Йй Кк Лл Мм Нн Оо Пп Рр Сс Тт Уу Фф Хх Цц Чч Шш Щщ Ъъ Ыы Ьь Ээ Юю Яя

Meanwhile, "ls -1" gives the right order:
> аист.txt
> голубь.txt
> иволга.txt
> орёл.txt
> сова.txt

PCManFM and other file managers give the right sort order.

So, Thunar DOES NOT sort the way "ls" does. Also, the sort order in Thunar CAN NOT be changed using LC_COLLATE. It ignores this variable and instead uses its own "mega-wise" algorithm.

What's the matter, guys?! Prior to version 1.4.0, everything was OK in Thunar.

Comment 20 xunhua.guo 2012-08-16 13:56:15 CEST

Same problem here, for Chinese filenames.

Comment 21 Nick Schermer 2012-10-03 12:00:58 CEST

Can people help here a bit with some test files?

Name the files the following way: $(name).$(expected_position).txt, so for example "аист.1.txt", "голубь.2.txt"

I'm talking about Cyrillic/non-Cyrillic here, and Chinese: everything that doesn't fit into [a-Z].
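One possible way to use files named like this is to check a listing automatically. The sketch below is an assumption about how such a check could look, not part of any patch on this bug: `expected_position` and `listing_matches_expectation` are hypothetical helpers that read the number between the last two dots and compare it with the position of the name in the listing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the expected position encoded in
 * "name.N.txt", or -1 if the name does not follow that pattern. */
long
expected_position (const char *filename)
{
  const char *ext, *num;
  char *end;
  long pos;

  assert (filename != NULL);

  /* The name must end in ".txt". */
  ext = strrchr (filename, '.');
  if (ext == NULL || strcmp (ext, ".txt") != 0)
    return -1;

  /* Walk back to the dot that precedes the position field. */
  num = ext;
  while (num > filename && num[-1] != '.')
    num--;
  if (num == filename || num == ext)
    return -1;

  /* The field must be a non-negative number ending right at ".txt". */
  pos = strtol (num, &end, 10);
  if (end != ext || pos < 0)
    return -1;

  return pos;
}

/* Return 1 if the i-th name in the listing (counting from 1) carries
 * the expected position i for every entry, 0 otherwise. */
int
listing_matches_expectation (const char *const *names, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    if (expected_position (names[i]) != (long) (i + 1))
      return 0;

  return 1;
}
```

Feeding the names in the order a file manager displays them then reports whether that order matches what the reporter expected for their locale.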

Comment 22 Nick Schermer 2012-10-03 12:37:57 CEST

And please mention the used LC_COLLATE.

Comment 23 bw.owlet 2012-10-03 14:31:08 CEST

My variants in Cyrillic:

Variant 1

> голубь.2.txt
> аист.1.txt

Variant 2

> вишня.4.txt
> груша.5.txt
> апельсин.2.txt
> банан.3.txt
> ананас.1.txt
> киви.6.txt
> лимон.7.txt
> яблоко.8.txt

Locale settings. Everything is set to English.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Comment 24 Andrzej 2012-10-03 22:40:27 CEST

Created attachment 4647 
test case

Sorting order as in nautilus. ls uses slightly different sort order (no numeric sort, non-alphanumeric characters). andrzejr/utf8_collate behaves mostly like nautilus except for the '#' sign.

Comment 25 Nick Schermer 2012-10-03 23:13:10 CEST

Created attachment 4648 
Updated patch

Slightly updated patch that first peeks at the case-folded name and, if those are equal, uses the case-hash. This saves some hashing and memory.

We could reduce the hashing to do this on the fly in the thunar_file_compare_by_name function, but that's too much imho.

The special case in nautilus for '.' and '#' might be useful. Dunno if there are locales that put other characters in front of '.'/'#'.

IMHO we should keep the hidden no-case option

Comment 26 Andrzej 2012-10-03 23:55:05 CEST

Works well for me. Good idea with the optimization. Lazy hashing could make sense when other methods of sorting are used (e.g. by modification time) and only if they don't fall back to compare by name. IMHO benefit not worth the complexity.

I have no preference for special characters ("#", "."). I don't know why nautilus is treating them differently.

I also feel leaving case-sensitive option for POSIX locale users is OK. We should probably change the default to case insensitive sort, to avoid confusion.

Comment 27 Nick Schermer 2012-10-04 08:56:57 CEST

The default is already case-insensitive in Thunar, so that doesn't need to change.

Nautilus sorts 'hidden' files after the other names, instead of showing them first. GTK+ doesn't and there are also bugs for that in the gnome bugtracker: https://bugzilla.gnome.org/show_bug.cgi?id=358812

The change obviously fixed sorting locales, but are there also situations where Thunar does a better job?

Comment 28 Nick Schermer 2012-10-04 18:30:39 CEST

Pushed the patch in 1fcb0e7. If there are sorting regressions, please open a new bug.

Comment 29 Nick Schermer 2012-10-12 11:57:04 CEST

*** Bug 9218 has been marked as a duplicate of this bug. ***

Comment 30 Nick Schermer 2012-10-30 08:56:20 CET

*** Bug 3724 has been marked as a duplicate of this bug. ***

Bug #7110

Reported by: jufofu

Reported on: 2011-01-16
Last modified on: 2015-01-26

Duplicates (2):

3724 thunar could respect LC_COLLATE for sort order (case sensitive)
9218 Thunar: incorrect sort order (at least with cyrillic letters in files and folders names)

People

Assignee: Stephan Arts
CC List: 11 users

Version

Version: 1.0.2
Target Milestone: 1.2.0

Attachments

the screenshot (265.27 KB, image/png) 2011-01-16 22:49 CET, jufofu
Screenshot in Japanese (220.12 KB, image/png) 2012-04-10 09:31 CEST, Masato Hashimoto
test sample (504 bytes, application/x-bzip2) 2012-04-10 09:37 CEST, Masato Hashimoto
A fix. (1.63 KB, patch) 2012-04-30 11:35 CEST, Andrzej
More fixes. (8.16 KB, patch) 2012-04-30 19:30 CEST, Andrzej
More fixes. (9.15 KB, patch) 2012-04-30 19:55 CEST, Andrzej
Swap ascii codes a-z and A-Z (1.14 KB, patch) 2012-05-01 06:34 CEST, Andrzej
sort using g_utf8_collate_key_for_filename() (8.63 KB, patch) 2012-05-01 15:16 CEST, Andrzej
plugged a memory leak in the previous patch (9.84 KB, patch) 2012-05-01 15:27 CEST, Andrzej
test case (1.67 KB, application/x-gzip) 2012-10-03 22:40 CEST, Andrzej
Updated patch (10.74 KB, patch) 2012-10-03 23:13 CEST, Nick Schermer
