Answer by Kamil Maciorowski for How do I find files by last day in a month? (or how to copy the latest file per month)

Solution

With GNU toolset:

    find . -type f -exec sh -c 'LC_ALL=C stat --printf="%.Y|%y|%n\0" -- "$@"' find-sh {} + \
    | LC_ALL=C sort -zr  -t '|' -k 1,1 \
    | LC_ALL=C sort -zsu -t '|' -k 2.1,2.7 \
    | cut -d '|' -zf 3- \
    | tr '\0' '\n'

Adjust the invocation of find (up to but excluding -exec) to your needs.


Explanation

  1. For each file that gets to -exec, LC_ALL=C stat --printf='%.Y|%y|%n\0' is run. Its output consists of lines like

    1711542530.762649374|2024-03-27 13:28:50.762649374 +0100|./path to/something

    where the first |-separated field is the time of last data modification in seconds since the Epoch (with the fractional part); the second field is the same time in human-readable form. Each line is null-terminated, so newlines (if any) in the pathname should be safe. Only the first two | characters will matter later, and both certainly come from the format, so | (if any) in the pathname should also be safe (see the explanation of cut below).

    I used LC_ALL=C to make the format independent of your current locale. Note that LC_ALL=C find … would affect find and everything it runs, which in general may be unwanted. So instead of -exec stat … I used -exec sh -c …, and this way I was able to set LC_ALL=C only for stat.

  2. Then the first sort sorts lines according to the first |-separated field. Lines associated with files recently modified will end up first. Our format is so strict that the default way of sorting in the C locale will work.

  3. The second sort considers only the YYYY-MM (year-month) part of the second field (2024-03 in the example) and because of -u (--unique) it passes only one line per YYYY-MM. With -s (--stable) this is the line associated with the most recently modified file per YYYY-MM, because the first sort has already placed most recent files first.

  4. Then for each line cut prints |-separated fields from the 3rd one to the last one. In each line this is a pathname. Formally a pathname containing (one or more) | characters will form the 3rd, 4th and possibly later fields, but as the fields in the output will also be separated by |, the output will be the exact pathname anyway.

  5. Finally tr converts null bytes to newlines, just to make the output human-readable (but also potentially ambiguous).
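Steps 2 to 5 can be tried without touching any files. The following is only a sketch: the records function and the three records it prints are made up for illustration (they mimic the output of stat in step 1, with one pathname containing | on purpose), and GNU sort and cut are assumed.

```shell
# Hypothetical stand-in for step 1: three null-terminated records
# in the "%.Y|%y|%n" layout. Pathnames and timestamps are made up.
records() {
    printf '%s\0' \
        '1709290000.000000000|2024-03-01 10:46:40.000000000 +0000|./march-old' \
        '1711542530.762649374|2024-03-27 12:28:50.762649374 +0000|./path to/something' \
        '1707000000.000000000|2024-02-03 22:40:00.000000000 +0000|./feb|report.txt'
}

result=$(
    records \
    | LC_ALL=C sort -zr  -t '|' -k 1,1 \
    | LC_ALL=C sort -zsu -t '|' -k 2.1,2.7 \
    | cut -d '|' -zf 3- \
    | tr '\0' '\n'
)

# One pathname per YYYY-MM, the most recently modified one:
#   ./feb|report.txt
#   ./path to/something
printf '%s\n' "$result"
```

Note that ./march-old is dropped because ./path to/something is newer within 2024-03, and that the | inside ./feb|report.txt survives cut intact.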


Notes

  • If you want to process the result further, keep it in the form of null-terminated strings if possible. In other words: a tool expecting null-terminated strings (e.g. xargs -r0 …) and placed instead of tr is better than a tool expecting newline-terminated strings and placed after tr.

  • Linux timestamps are just numbers, without the notion of a timezone. Your stat will "translate" them to your current timezone. In particular, it will assign files to YYYY-MM according to your current timezone, and this can give different results in different timezones. E.g. a file modified (no matter where) around 2024-04-01 00:00:00 UTC will be assigned to 2024-04 if you are in India (the file was modified when India had already experienced a few hours of the new month), but to 2024-03 if you are in Mexico (the file was modified when Mexico still had a few hours of the old month to go).

  • You may wonder if we really need two sorts. At first glance we don't need %.Y from stat; sorting by the well-defined %y should be enough. Well, it's not. Consider these two lines:

    1698542400.000000000|2023-10-29 02:20:00.000000000 +0100|./newer
    1698540000.000000000|2023-10-29 02:40:00.000000000 +0200|./older

    This example is in the Europe/Warsaw timezone. The newer file is indeed newer than the older file; seconds since the Epoch show this, and the order matches what our first sort produces: newest first. But if I sorted by the second |-separated field and tried to achieve "newest first", then the files would appear the other way around. The truth is that 02:40 for the older file happened before my clocks were set from 03:00 back to 02:00 due to the end of Daylight Saving Time that year; 02:20 for the newer file happened after. There is no ambiguity, the strings +0200 and +0100 carry the information; but sort does not understand the format. This is why in the solution we first sort by seconds since the Epoch, then we use the second sort to pick the newest (and I mean really newest) file per YYYY-MM.

  • I think GNU date --reference can be used instead of stat to get the mtime of a file. I chose stat though.

  • If you are interested in the result for a certain YYYY-MM, place grep between find and the first sort. E.g. for 2024-02 it may be:

    LC_ALL=C grep -Zza '^[^|]*|2024-02'

    An empty result means there is no file modified that month. With such grep the entire solution should pass at most one null-terminated line to tr. More than one null-terminated line passed to tr means my solution is buggy.

    Your locale probably uses UTF-8, but in general pathnames may contain sequences of bytes that are invalid in UTF-8. I used LC_ALL=C grep -a, so grep should not complain.

  • find-sh is explained here: What is the second sh in sh -c 'some shell code' sh?
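To illustrate the note about keeping the result null-terminated, here is a sketch that copies the latest file per month by placing xargs -r0 where tr was. The function name latest_per_month, the scratch directory and the file names are my assumptions for the demo; cp -t and xargs -r0 are GNU options. Files with identical basenames would overwrite each other in the destination, which this sketch does not handle.

```shell
# Hypothetical helper: the pipeline from the Solution without the final tr,
# so the output stays a stream of null-terminated pathnames.
latest_per_month() {
    find "$1" -type f -exec sh -c '
        LC_ALL=C stat --printf="%.Y|%y|%n\0" -- "$@"
    ' find-sh {} + \
    | LC_ALL=C sort -zr  -t '|' -k 1,1 \
    | LC_ALL=C sort -zsu -t '|' -k 2.1,2.7 \
    | cut -d '|' -zf 3-
}

# Demo in a scratch directory with made-up files and mtimes.
# TZ is fixed only to make the demo reproducible.
export TZ=UTC
work=$(mktemp -d)
mkdir "$work/src" "$work/dest"
touch -d '2024-02-10 12:00:00' "$work/src/feb-early"
touch -d '2024-02-20 12:00:00' "$work/src/feb-late"
touch -d '2024-03-05 12:00:00' "$work/src/mar only"

# xargs reads null-terminated pathnames, so spaces and newlines are safe.
latest_per_month "$work/src" | xargs -r0 cp -p -t "$work/dest" --

copied=$(ls "$work/dest")
printf '%s\n' "$copied"
rm -rf "$work"
```

With these three files, only feb-late and "mar only" should end up in the destination.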
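The timezone note can be checked with GNU date, which formats a timestamp the same way stat would for a file's mtime. This is a sketch under assumptions: IST-5:30 and CST6 are POSIX-style TZ strings I picked to stand for India (UTC+05:30) and central Mexico (UTC-06:00), so the example does not depend on the timezone database being installed.

```shell
# 2024-04-01 00:00:00 UTC expressed as seconds since the Epoch.
ts=1711929600

# The same instant falls into different months depending on TZ.
month_india=$(TZ='IST-5:30' date -d "@$ts" +%Y-%m)   # UTC+05:30
month_mexico=$(TZ='CST6' date -d "@$ts" +%Y-%m)      # UTC-06:00

printf 'India:  %s\nMexico: %s\n' "$month_india" "$month_mexico"
```

If you want the assignment to YYYY-MM to be the same everywhere, one option is to run the whole solution with a fixed TZ (e.g. TZ=UTC).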
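The Daylight Saving Time pitfall from the note about two sorts can be reproduced with the two example records alone. A sketch (the records function is just a stand-in for the output of stat; newline-terminated records are used here for readability only):

```shell
# The two records from the Europe/Warsaw example, in arbitrary order.
records() {
    printf '%s\n' \
        '1698540000.000000000|2023-10-29 02:40:00.000000000 +0200|./older' \
        '1698542400.000000000|2023-10-29 02:20:00.000000000 +0100|./newer'
}

# "Newest first" by seconds since the Epoch (field 1).
by_epoch=$(records | LC_ALL=C sort -r -t '|' -k 1,1 | head -n 1 | cut -d '|' -f 3-)

# "Newest first" by the human-readable form (field 2):
# sort compares plain strings and ignores +0200 vs +0100.
by_text=$(records | LC_ALL=C sort -r -t '|' -k 2,2 | head -n 1 | cut -d '|' -f 3-)

printf 'by epoch: %s\nby text:  %s\n' "$by_epoch" "$by_text"
```

Sorting by the first field picks ./newer, while sorting by the second field picks ./older.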

