Created attachment 3358 the screenshot Sometimes when I open a folder in Thunar (item arrangement set to "By Name") I see it put some folders into the "second series" of arrangement. For example. it goes 0-9, then A-Z, then А-Я (Cyrillic), and then... again 0-9, then A-Z, then А-Я! The folders (haven't ever seen it happen with files, only with folders) from the first and the second series aren't the same, but I have no idea about what defines which one does a folder belong to. So, having both non-Cyrillic and Cyrillic folders in one folder may lead to having this bug. Workaround: change item arrangement to any other (e. g. "By Modification Date"), then change it back to "By Name". Screenshot included: http://www4.picturepush.com/photo/a/4880597/img/4880597.png
Created attachment 4311 Screenshot in Japanese This issue seems to always occur and jufofu's workaround doesn't work on thunar-git. I get this issue in Japanese. Attached is screenshot of thunar and bash.
Created attachment 4312 test sample Attached is test sample files of Comment #1. Each numbers following Japanese characters in file name are unicode codepoint of the ja characters.
The problem seems to be caused by (the use of) glib. g_utf8_get_char () returns '0' on the first character.
To clarify: 'utf-8' ordering fails due to the problem described above.
Created attachment 4375 A fix. Seems to work here. I know nothing about Thunar internals so I can't guarantee that the patch is correct. (thank you Hashimoto-san, greetings from Japan).
This patch would sort the items as followed: (test) John Doe.txt あおい輝彦.3042-304a-3044-8f1d-5f66.txt Alan Smithee.txt 一ノ瀬泰造.4e00-30ce-702c-6cf0-9020.txt 一条忠頼.4e00-6761-5fe0-983c.txt 一青窈.4e00-9752-7a88.txt 堀口雅也.5800-53e3-96c5-4e5f.txt 堀孝史.5800-5b5d-53f2.txt 朱謙之.6731-8b19-4e4b.txt Did we run into another bug with the sorting algorithm?
Created attachment 4377 More fixes. I've found two more bugs: - comparison function should not return 0 ("equal"), even if we're using case insensitive sorting - arguments of strcoll were truncated to a single character - strcoll doesn't like it and returns a different result than with a longer string. The sorting order now closely follows behavior of strcoll, so if there are any problems with it, they are likely coming from strcoll.
Created attachment 4378 More fixes. Added one more bugfix - a check for filename length of otherwise identical utf-8 filenames.
IMHO the code is ready to be used, I don't have anything else to add. There are some remaining issues, which cannot be easily fixed: 1. 'A' < 'a' but 'ą' < 'Ą' - this is because former is coming from ascii code comparison, and the latter from strcoll. Reported upstream: http://sourceware.org/bugzilla/show_bug.cgi?id=14039 Possible solutions: - always use strcoll - gives a consistent ('a' < 'A' and 'ą' < 'Ą') ordering but is slower for ascii characters, especially in case insensitive mode. - just flip 'a-z' and 'A-Z' codes manually [1] (also gives 'a' < 'A' and 'ą' < 'Ą') - wait for http://sourceware.org/bugzilla/show_bug.cgi?id=14039 to be resolved (would give 'A' < 'a' and 'Ą' < 'ą' but that's very unlikely) 2. あ < a < あa < aa < あaa Reported upstream: http://sourceware.org/bugzilla/show_bug.cgi?id=14038 No solution (but hopefully this will be fixed upstream). If fixed, then the workaround in the patch (g_strconcat) will not be necessary, so we can then improve performance a bit by removing it.
Created attachment 4379 Swap ascii codes a-z and A-Z This is a patch implementing the solution 1.2 from comment #9. It's likely much faster than solution 1.1. It *changes* the sorting order of ascii characters to make it consistent with the order of non-ascii ones.
Got some feedback from glibc bugzilla 1. They recommend using strxfrm for converting the string so that it matches strcoll ordering during simple comparison. However, strxfrm itself is pretty heavy, if we wanted "proper" sorting we could simply switch to using strcoll on all strings. So, my suggestion is to use the patch swapping 'a-z' for 'A-Z' maybe not the prettiest but it does 90% of strxfrm at almost 0 cost. 2. Weird ordering of Japanese characters and our workaround - apparently there are no Japanese language definitions in iso14651_t1_common file, which means they are ignored in the first pass and handled in the second one. They said that the "workaround" is indeed a correct way of using strcoll as there might be other ignored characters. There was no indication whether Japanese definition will be added to the iso14651_t1_common file but the bug was not closed so I imagine that still on the table. My conclusion: Current patches are doing as much as we can without sacrificing performance in ascii case (otherwise we could switch to strcoll completely). Other errors are mostly caused by limitations of strcoll in glibc (possibly will be resolved later).
Created attachment 4380 sort using g_utf8_collate_key_for_filename() After discussion on IRC we have decided to try the g_utf8_collate_key_for_filename() function. It doesn't support number sort (and there is no way to add it efficiently), but should do a better job at sorting, and can potentially be faster (sorting itself is done by a key comparison, cost of collation is unknown).
Created attachment 4381 plugged a memory leak in the previous patch
(In reply to comment #13) > Created attachment 4381 > plugged a memory leak in the previous patch Andrzej-san: Sorry for late reply. Your patch works fine!! Thank you for your quick work in spite of the Golden Week :)
note: "ū" is in wrong place, it's between "j" and "k" but it should be in the end of alphabet(i looked to Maori, Hawaiian, Marshallese, Lithuanian, Livonian, Latvian and Cornish alphabets in all these alphabets that letter is in the end before "v" or "w").
(In reply to comment #15) > note: "ū" is in wrong place, it's between "j" and "k" but it should be in > the end of alphabet Which patch are you using, and what's is your locale (LC_COLLATE)? I've checked that with LC_COLLATE=POSIX "ū" is after "z" I don't have Lithuanian locale installed so I can't check it here but different locales yield different results (e.g. with LC_COLLATE=pl_PL.UTF8 "ū" is between "u" and "v") Note that with patch #13 sorting is done by glib (and ultimately by glibc), so if you see are any errors they come either from an error in your system configuration or from a bug in these libraries (glibc).
(In reply to comment #16) > (In reply to comment #15) > > note: "ū" is in wrong place, it's between "j" and "k" but it should be in > > the end of alphabet > > Which patch are you using, and what's is your locale (LC_COLLATE)? > > I've checked that with LC_COLLATE=POSIX "ū" is after "z" > I don't have Lithuanian locale installed so I can't check it here but > different locales yield different results (e.g. with LC_COLLATE=pl_PL.UTF8 > "ū" is between "u" and "v") > > Note that with patch #13 sorting is done by glib (and ultimately by glibc), > so if you see are any errors they come either from an error in your system > configuration or from a bug in these libraries (glibc). i'm using patch from comment #13. LC_COLLATE=C in my system, variable "LANG" has value which you mentioning, in my case lt_LT.UTF-8
Most likely there is no bug (if you use correct LC_COLLATE), or if there is, it is not in thunar. Try this: /close *all* thunar windows / $ thunar -q $ LC_COLLATE=lt_LT.UTF-8 thunar If there are any problems tell me about it on #xfce (irc.freenode.net). Bugzilla is not a support forum.
The latest version of Thunar (1.4.0) incorrectly sorts contents of folders with cyrillic letters in files and folders names. Here is the contents of one folder "sorted" by name (ascending). Looking from top to bottom I see file names starting with... * cyrillic upper letters * digits * cyrillic lower letters * again digits * again cyrillic lower letters * english lower letters * and again cyrillic lower letters Concrete example of another folder "sorted" by name: > голубь.txt > иволга.txt > аист.txt > орёл.txt > сова.txt Absoluletly wrong order. The file with name that starts with 'A' is in the middle. Cyrillic alpabet is: Аа Бб Вв Гг Дд Ее Ёё Жж Зз Ии Йй Кк Лл Мм Нн Оо Пп Рр Сс Тт Уу Фф Хх Цц Чч Шш Щщ ЬЬ Ыы ЪЪ Ээ Юю Яя Meanwhile, "ls -1" gives right order > аист.txt > голубь.txt > иволга.txt > орёл.txt > сова.txt PCManFM and other filemanagers give right sort order. So, Thunar DOES NOT sort with "ls". Also, sort order in Thunar CAN NOT be changed using LC_COLLATE. It ignores this variable, but instead uses its own "mega-wise" algorithm. What's the matter, guys?! Prior to version 1.4.0, everything was OK in Thunar.
Same problem here, for Chinese filenames.
Can people help here a bit with some test files? Name the files the following way: $(name).$(expected_position).txt, so for example "аист.1.txt", "голубь.2.txt" Talking about Cyrillic/non-Cyrillic here, Chinese. All that don't fit into [a-Z]
And please mention the used LC_COLLATE.
My variants in Cyrillic: Variant 1 > голубь.2.txt > аист.1.txt Variant 2 > вишня.4.txt > груша.5.txt > апельсин.2.txt > банан.3.txt > ананас.1.txt > киви.6.txt > лимон.7.txt > яблоко.8.txt Locale settings. All is English. $ locale LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL=
Created attachment 4647 test case Sorting order as in nautilus. ls uses slightly different sort order (no numeric sort, non-alphanumeric characters). andrzejr/utf8_collate behaves mostly like nautilus except for the '#' sign.
Created attachment 4648 Updated patch Slightly updated patch that peeks the case-folded name, if equal use the case-hash. Saves some hashing and memory. We could reduce the hashing to do this on the fly in the thunar_file_compare_by_name function, but that's too much imho. The special case in nautilus for '.' and '#' might be useful. Duno if there are locales that put other characters in front of '.'/'#' IMHO we should keep the hidden no-case option
Works well for me. Good idea with the optimization. Lazy hashing could make sense when other methods of sorting are used (e.g. by modification time) and only if they don't fall back to compare by name. IMHO benefit not worth the complexity. I have no preference for special characters ("#", "."). I don't know why nautilus is treating them differently. I also feel leaving case-sensitive option for POSIX locale users is OK. We should probably change the default to case insensitive sort, to avoid confusion.
The default is already case-insensitive in Thunar, so that doesn't need to change. Nautilus sorts 'hidden' files after the other names, instead of showing them first. GTK+ doesn't and there are also bugs for that in the gnome bugtracker: https://bugzilla.gnome.org/show_bug.cgi?id=358812 The change obviously fixed sorting locales, but are there also situations where Thunar does a better job?
Pushed patch in 1fcb0e7 if there are sorting regressions please open a new bug.
*** Bug 9218 has been marked as a duplicate of this bug. ***
*** Bug 3724 has been marked as a duplicate of this bug. ***