I have been using Linux for the past 10 years and would in no way consider myself a power user. I can do a thing or two, I have some system administration skills and I am not afraid of the terminal. Running a few commands here and there is something every Linux enthusiast is comfortable doing. Having been a software developer for a couple of years, I am usually intrigued by anything that looks cryptic and feels like programming, and so the terminal plays a big part in my everyday existence.
Recently, I have been working a lot with Bash scripting. My Bash programming isn’t legendary, but with the help of Google and Stack Overflow I can easily look up the syntax for Bash statements and eventually get my intended task done. I recently had a task in the office that required me to work with a CSV file of more than 200MB. CSV is one of the simplest file formats you get to work with in most programming environments. It is the usual choice when you want to generate a report that can easily be opened in a spreadsheet application and also easily consumed by a machine.
Working with a CSV file in the hundreds of megabytes, I had to periodically inspect the data to see if everything was being generated correctly. Due to the file size, the spreadsheet application sometimes froze and required a restart. At some point, I had to reach out to a colleague to open the file using his Microsoft Excel. This sort of worked for a while, until he was unreachable and my horror began. Needless to say, at some point MS Excel struggled too. LibreOffice isn’t exactly the best tool to work with, especially when you compare it to the Microsoft Office suite. It was a nightmare opening the CSV file and inspecting the data.
Enter the terminal. While generating this file, I decided to periodically inspect it so that if an issue occurred, it would be caught early on. Generating the file took close to 3 days because the data needed to be fetched over the wire, with a million possibilities for failure. Inspecting the file required me to check its last N lines. As long as everything looked good, we were in business. After doing this for more than a day, it occurred to me that this process could be automated. I was thinking: do I need a for or while loop? That was the programmer in me seeking an answer. Then I remembered that when working on Node.js projects, you have scripts that watch a file for changes and then run custom actions any time the file changes.
Enter watch. watch is a utility that runs a command repeatedly at a fixed interval and displays its output. This solves the issue of repeatedly checking the contents of my file. The command I used is below:
watch -n 5 tail -n 5 large_file.csv
The -n parameter passed to watch sets the interval, in seconds, at which the command is rerun; here it is 5. tail is the Linux command for printing the last N lines of a file, and its own -n 5 means we are looking at the last 5 lines. Using this simple watch command, I was able to monitor the file and terminate my running script whenever I found an issue with the generated data.
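For anyone wondering what the for-or-while-loop version might have looked like, here is a rough sketch of roughly what watch saves you from writing. The function name poll_tail, its argument order and its defaults are all made up for illustration, and unlike watch it does not clear the screen between runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any standard tool): print the last
# lines of a file at a fixed interval, similar in spirit to
# `watch -n 5 tail -n 5 FILE`.
#
# Usage: poll_tail FILE [LINES] [INTERVAL_SECONDS] [ROUNDS]
#   ROUNDS=0 (the default) means loop until interrupted with Ctrl+C.
poll_tail() {
  local file=$1 lines=${2:-5} interval=${3:-5} rounds=${4:-0}
  local i=0
  while :; do
    # Timestamp header so successive snapshots can be told apart.
    printf -- '--- %s ---\n' "$(date '+%H:%M:%S')"
    tail -n "$lines" "$file"
    i=$((i + 1))
    # Stop once the requested number of rounds has been printed.
    if [ "$rounds" -gt 0 ] && [ "$i" -ge "$rounds" ]; then
      break
    fi
    sleep "$interval"
  done
}

# Example (assumed file name): poll_tail large_file.csv 5 5
```

The one-line watch command does the same job with less to get wrong, which is why I never ended up writing the loop.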
I finally finished building my really large file and ended up with a 280MB file. I was able to verify that we got the right amount of data and that the number of records captured matched that on the dashboard. One task left was to inspect the file for duplicate entries. Normally, someone would suggest I create a database table, import the file and write some SQL queries. Sounds reasonable, but why all the effort for a simple task when all I want is a one-off check for duplicate entries? I remembered that on Linux we have a sort command. If we can count the number of lines in a file from the terminal, surely there should be some Linux Power User Tomato sauce to handle this duplication task. And I was right: all we needed to do was pipe the output of the sort command into the uniq command. The code I used is below:
sort large_file.csv | uniq -d
The output of the command lists the duplicated lines in the file. The -d parameter passed to the uniq command tells it to print only the lines that are repeated, one copy of each; because uniq only compares adjacent lines, the file has to be sorted first. In my case that was just 3 lines out of 8 million lines of text. So all I had to do was fire up my speedy editor, locate the duplicates and resolve our duplication problem. I wish I could share the horror involved in getting this large file generated, but maybe that is best told in a different article. After this experience, I am thinking I should take some time out to perfect my Bash. It is really exciting using Linux and working at the terminal level, and I would encourage anyone to give it a go.
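As a small follow-up to the duplicate check, here is a minimal sketch of a few variations that count the duplicates and help locate them. The function names are hypothetical and chosen only for illustration. Note that the awk version keeps every distinct line in memory, so it is an assumption on my part that the file fits comfortably in RAM.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a one-off duplicate check on a text file.

# Count how many distinct lines occur more than once.
count_duplicated_lines() {
  sort "$1" | uniq -d | wc -l
}

# Show each repeated line prefixed with the number of times it occurs.
# uniq -c prints "COUNT LINE"; awk keeps rows whose count exceeds 1.
show_duplicate_counts() {
  sort "$1" | uniq -c | awk '$1 > 1'
}

# Print "LINE_NUMBER: LINE" for every repeat occurrence (second copy
# onward) without sorting, which makes the repeats easy to find in an
# editor. Holds all distinct lines in memory.
locate_repeats() {
  awk 'seen[$0]++ { print NR ": " $0 }' "$1"
}

# Example (assumed file name): locate_repeats large_file.csv
```

Having the line numbers up front makes the trip to the editor much shorter than searching for each duplicated line by hand.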