journal

Less well-known uses of curl

When it comes to making HTTP calls, I always use curl, as it is a ubiquitous tool for the job.

Today, I discovered that I can use curl for some other tasks.

copying files

curl supports the FILE protocol (file:/), so it is possible to “download” a local file:

$ curl file:/path/to/some/large/file -o /the/destination/file
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2389M  100 2389M    0     0   339M      0  0:00:07  0:00:07 --:--:--  369M

See http://askubuntu.com/questions/17275/progress-and-speed-with-cp
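The idea above can be tried safely on a throwaway file. This is a minimal sketch, assuming curl, mktemp and cmp are available; the scratch directory and the file names source.bin and copy.bin are made up for the demo.

```shell
# Hypothetical scratch directory and file names, used only for this demo.
workdir=$(mktemp -d)
src="$workdir/source.bin"
dst="$workdir/copy.bin"

# Create a 1 MiB sample file to copy.
head -c 1048576 /dev/zero > "$src"

# file:// URLs need an absolute path; -sS hides the progress meter
# but still reports errors.
curl -sS "file://$src" -o "$dst"

# cmp -s exits 0 only when both files are byte-for-byte identical.
if cmp -s "$src" "$dst"; then
  copy_status="identical"
else
  copy_status="different"
fi
echo "$copy_status"

rm -rf "$workdir"
```

Drop the -sS flags to get the progress meter back, which is the whole point when copying a large file.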

querying LDAP

Normally, when I need to query an LDAP server, ldapsearch is the de facto tool, though it may not be installed in some environments.

Nowadays, I tend to use a Docker image that ships ldapsearch for the job:

$ docker run --rm --name ldap -it --entrypoint bash emeraldsquad/ldapsearch:latest

Inside the container, I use ldapsearch for querying:

$ ldapsearch -x -LLL  \
  -h ldap-test-server.example.com \
  -p 8081 \
  -D 'uid=admin,ou=system' \
  -w "${LDAP_PASSWORD}" \
  -b 'ou=users,dc=example,dc=com'

Today, I learned that I can achieve the same task with curl:

$ curl -v \
    -u "uid=admin,ou=system:${LDAP_PASSWORD}" \
    "ldap://ldap-test-server.example.com:8081/ou=users,dc=example,dc=com??sub?(objectclass=*)"

This is really great, as curl often comes pre-installed in lots of environments whereas ldapsearch and Docker may not.

If you need a more sophisticated query, consider giving the LDAP URL Format a read. It explains the structure of the URL you can use with curl.
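To make the URL structure concrete, here is a small sketch that assembles an LDAP URL from its parts in the order defined by RFC 4516 (base DN, attributes, scope, filter). The helper name ldap_url is hypothetical, and it does no percent-encoding, so it only suits values without spaces or literal question marks.

```shell
# Build an LDAP URL of the form
#   ldap://host:port/base_dn?attributes?scope?filter
# Arguments: host port base_dn attributes scope filter
# An empty attribute list means "all user attributes";
# the scope is one of base, one or sub.
ldap_url() {
  printf 'ldap://%s:%s/%s?%s?%s?%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Rebuild the URL used in the curl example above.
url=$(ldap_url ldap-test-server.example.com 8081 \
  'ou=users,dc=example,dc=com' '' sub '(objectclass=*)')
echo "$url"

# A narrower query: only cn and mail, one level below the base DN.
narrow_url=$(ldap_url ldap-test-server.example.com 8081 \
  'ou=users,dc=example,dc=com' 'cn,mail' one '(uid=jdoe)')
echo "$narrow_url"
```

The resulting string can be passed to curl in place of the hand-written URL. The uid=jdoe filter is only an illustrative value.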

journal

Convert Git to Git LFS

Some Git repositories in the company contain mostly binary files (Word, Excel, PDF, etc.). As Git is not designed to track binary files efficiently, such a repository eventually ends up pretty large (over 2GB) and becomes a PITA to git clone.

To solve this effectively, we can convert a regular Git repository to Git LFS. This post aims to show you how to do it.

Prerequisites

  • Remote Git Server MUST support Git LFS (GitHub, GitLab and BitBucket all support LFS)
  • git (>=2.27.0) and git lfs (>=2.11.0) MUST be installed.
  • A Bash 5.0 shell. (For Windows, Git Bash may work but it is recommended to use WSL v2).

Steps

In this post, BitBucket is used as the remote Git server.

Clone a bare repository

In order to rewrite the entire history of a Git repository, it is required to clone the whole repository in its bare form.

IMPORTANT

Make sure you make a backup of the cloned directory.

$ git clone --mirror git@example.com:awesome/awesomely-heavy.git

Make a quick check of the current size of the repository (run this inside the cloned awesomely-heavy.git directory) so that we can verify it again once it’s done.

$ git count-objects -vH
# Just an example, your result may vary.
count: 0
size: 0 bytes
in-pack: 1053
packs: 1
size-pack: 615.10 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

Migrate binary files to be tracked by LFS (warning: rewrites history)

IMPORTANT
The following process will purposely rewrite your entire Git history. All commit IDs will change.

# Include more file extensions which you want to track by LFS.
$ git lfs migrate import --everything --include="*.docx,*.pdf"

migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (xxx/yyy), done.
# ...lots of refs..omitted
migrate: Updating refs: ..., done

You may wonder “How do I know which extensions to include?”. The following script can help you.

# The following script will output the file extensions existing in your Git repository.
# It was tested on macOS with the aforementioned `git` version.

$ git log --all --numstat \
    | grep '^-' \
    | cut -f3 \
    `# On your Linux (or WSL), you may want to use sed instead of gsed` \
    | gsed -r 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' \
    | sort -u `# Up until this point, all full-path of files committed will be printed` \
    | xargs -I{} sh -c 'x="{}";echo ${x##*.}' \
    | sort \
    | uniq \
    | awk '{ print "*."$1 }' \
    | paste -sd, -

# example output
*.docx,*.jar,*.xlsx,*.xltm
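The last few stages of that pipeline (path to extension to comma-separated patterns) can be tried in isolation. This is a sketch under stated assumptions: the sample paths are made up, and the helper name extensions_to_patterns is hypothetical. It reads paths line by line instead of going through xargs, so file names with spaces or quotes are handled as plain text.

```shell
# Hypothetical sample paths standing in for the output of the `sort -u` step.
paths='docs/report.docx
docs/old/report.docx
specs/budget.xlsx
manual.pdf
Makefile'

extensions_to_patterns() {
  # Read one path per line, keep only the text after the last dot,
  # de-duplicate, and join the result as comma-separated "*.ext" patterns.
  while IFS= read -r path; do
    name=${path##*/}                        # strip leading directories
    case $name in
      *.*) printf '%s\n' "${name##*.}" ;;   # skip names without a dot
    esac
  done | sort -u | awk '{ print "*." $1 }' | paste -sd, -
}

patterns=$(printf '%s\n' "$paths" | extensions_to_patterns)
echo "$patterns"
```

The resulting string has the shape expected by the --include option of git lfs migrate import. It is worth reviewing the list by hand first, since it will also contain text extensions you do not want in LFS.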

Clean up the Git repository

At this point, all of your history should have been rewritten as new commits. The old commits effectively become orphans. By default, Git still keeps them in the repository’s object store until an explicit prune is invoked.

# Trigger Git GC to run immediately to remove
# all orphan commits.
# This will effectively remove all binary files stored
# in your .git/objects.
$ git reflog expire --expire-unreachable=now --all
$ git gc --prune=now

Verify Changes

Now verify if your Git repository has become smaller.

# Print out the space usage of the current Git.
# Compared to the previous run, it should be wayyy smaller.
$ git count-objects -vH

count: 0
size: 0 bytes
in-pack: 1054
packs: 1
size-pack: 199.19 KiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
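If you prefer a number over eyeballing the two outputs, the size-pack line can be extracted and compared. This is a rough sketch: the helper name pack_kib is hypothetical, and the two sample outputs are made-up stand-ins that only approximate the figures shown in this post. Without the H flag, git count-objects -v reports size-pack in KiB.

```shell
# Extract the size-pack value (in KiB) from `git count-objects -v` output on stdin.
pack_kib() {
  awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Hypothetical sample outputs standing in for real runs
# before and after the migration.
before=$(printf 'count: 0\nin-pack: 1053\nsize-pack: 629862\n' | pack_kib)
after=$(printf 'count: 0\nin-pack: 1054\nsize-pack: 199\n' | pack_kib)

saved=$((before - after))
echo "pack shrank by ${saved} KiB"
```

In a real run, the values would come from the repository itself, e.g. before=$(git count-objects -v | pack_kib) right after cloning and the same command again after the cleanup step.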

If you wonder where all the files are now, the du command can tell you.

$ du -d 1 .
# in this example output, the files are moved from objects to lfs.
232K    ./objects
8.0K    ./info
 72K    ./hooks
  0B    ./refs
1.0G    ./lfs
1.0G    .

Push to remote repository

It’s time to push the new Git repository into your remote one.

IMPORTANT
It is critical that all refs are pushed into the remote repository.

# The `--mirror` option ensures all `refs` are pushed.
# @see https://git-scm.com/docs/git-push#Documentation/git-push.txt---mirror
$ git push --mirror --force

Contact your Git Hosting provider to run git GC

Successfully pushing the new Git LFS repository to the remote is not the end of the story.

As Git is a distributed version control system, your provider also holds a copy of your entire Git repository, including the old, now-unreachable objects. It’s important that your provider runs Git GC on their own infrastructure.

For BitBucket, you have to open a BitBucket Cloud Support ticket and request them to run git GC. GitLab automatically runs gc on each push. I cannot find any relevant information on GitHub (though GitHub Enterprise requires you to contact their Support).

journal, today-i-found

TIF – Powerful SSH #1

Recently, I discovered that SSH has some wonderful features and uses that I didn’t know about before.

Faster copying directories with rsync via SSH

When it comes to copying files back and forth to a remote server, I usually go for scp.

scp hello.txt remote_user@server.example.com:/tmp/

scp even supports copying a whole directory:

scp -r files/ remote_user@server.example.com:/tmp

Only recently did a colleague of mine, Alex, teach me that rsync can be faster than scp when it comes to syncing directories between a local machine and a remote server.

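rsync -a files remote_user@server.example.com:/tmp

(Without a trailing slash on files, rsync creates /tmp/files on the server, just like scp -r does; with files/ it would copy the directory’s contents straight into /tmp.)

The result is fascinating! It is much, much faster than scp when hundreds of files need to be synced. Even better, on later runs rsync only copies files that have changed.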

There are some more advanced use cases for rsync over SSH; for example, you can set up an rsync daemon on the remote server so that you can sync files/directories through a bastion host. See “Advanced Usage” in the rsync man page.

Check out code

I often need to log in to a server and run git clone there to test some code.

Cloning a repository via SSH on a remote server requires that server to have an SSH key pair registered with the Git host…

…unless we use ssh-agent.

Using SSH agent forwarding allows me to SSH into a remote server and run git clone there without actually transferring my private key to that server.

# First run the ssh-agent daemon in case you haven't.
# See https://unix.stackexchange.com/questions/351725/why-eval-the-output-of-ssh-agent for why we gotta use `eval`.
$ eval "$(ssh-agent -s)"

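# Add your identity key into ssh-agent.
# If your key lives somewhere else, simply specify the path to it.
# You can add multiple keys if you want.
# Note: -K here is the macOS option that also stores the passphrase in the Keychain
# (newer macOS releases spell it --apple-use-keychain); elsewhere, plain `ssh-add` is enough.
$ ssh-add -K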

# SSH into the remote server using Agent Forwarding option.
$ ssh -A remote_user@server.example.com

# On the remote server, perform git clone as usual
remote_user@server $ git clone git@github.com:myuser/myrepo.git
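If the clone fails with a permission error, a quick way to see whether the forwarded agent actually reached the server is to look at SSH_AUTH_SOCK, the environment variable ssh uses to locate the agent socket. A minimal sketch, with the helper name agent_status being hypothetical:

```shell
# Report whether an SSH agent socket is reachable from this shell.
agent_status() {
  if [ -z "${SSH_AUTH_SOCK:-}" ]; then
    # No socket path at all: no local agent, or forwarding was not enabled.
    echo "no agent"
  elif [ ! -S "$SSH_AUTH_SOCK" ]; then
    # A path is set but it is not a live socket (e.g. left over from an old session).
    echo "stale socket"
  else
    echo "agent available"
  fi
}

agent_status
```

When it reports that an agent is available, ssh-add -l lists the keys the server can use through the forwarded agent, which should match the keys added on your own machine.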

journal

Morning 28.9

Tweeted an article written by Digg Engineers about how they migrated one of their modules from Node.js to Golang. Their result was a success.

The article gave a very detailed analysis of why Node.js no longer met their needs. It also mentioned that the performance of the module improved a lot. However, they stated that there were no plans to migrate the rest to Go.

Then I happened to come across a profile named VietCoding, and through it a blogging platform full of Vietnamese developers with good articles, https://kipalog.com.

Big world!

journal

bookmarking + mind mapping

Recently, I’ve been using an interesting and useful free online service, GetPocket, which helps me bookmark articles, papers, webpages, etc. to read later. The service even extracts the content and displays it in an elegant format. It also allows me to tag, search, mark favorites, etc. In short, it’s cool.

I’ve been bookmarking a lot of good articles and links that I’m interested in and plan to read in my free time. Today I suddenly realized that I actually need a service similar to GetPocket that also allows the user to chain/connect bookmarks as a mind map.

As a hobby, for a certain topic/keyword, I start searching for materials about it. For each article I reach, I read the first few paragraphs and, if I encounter a link to another one, I immediately open the link. The routine is as follows:

1 – open an article
2 – read a few paragraphs until I encounter (interesting) links/keywords/concepts
3 – open them
4 – go to 1 and repeat
5 – if there is nothing new, go back to the previous articles

It looks like I’m digging through a tree of articles, or rather travelling across a map of materials. Then an idea popped into my mind: why not combine the two, bookmarking articles and mind mapping?

Every time I bookmark a link, I want to be able to tag it and attach the article to a mind map with some keywords and a short description. Then, when I come back to the articles, I can follow the graph and regain the knowledge. The application could render the articles and their graph to follow, just like a mind map.