10 Command-Line Tools Every Data Scientist Should Know


Image by Author

 

Introduction

 
Although modern data science revolves around Jupyter notebooks, Pandas, and graphical dashboards, these tools don't always offer the level of control you might need. Command-line tools, by contrast, may not be as intuitive as you'd like, but they are powerful, lightweight, and much faster at the specific jobs they were designed for.

For this article, I've tried to strike a balance between utility, maturity, and power. You'll find some classics that are nearly unavoidable, alongside more modern additions that fill gaps or improve performance. You could even call this a 2025 edition of the essential CLI tools list. If you aren't familiar with CLI tools but want to learn, I've included a bonus section with resources in the conclusion, so scroll all the way down before you start adding these tools to your workflow.

 

1. curl

 
curl is my go-to for making HTTP requests such as GET, POST, or PUT; downloading files; and sending or receiving data over protocols such as HTTP or FTP. It's ideal for retrieving data from APIs or downloading datasets, and you can easily integrate it into data-ingestion pipelines to pull JSON, CSV, or other payloads. The best thing about curl is that it comes pre-installed on most Unix systems, so you can start using it immediately. However, its syntax (especially around headers, body payloads, and authentication) can be verbose and error-prone. When interacting with more complex APIs, you may prefer an easier-to-use wrapper or a Python library, but knowing curl is still an essential plus for quick testing and debugging.
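A few representative invocations for data work (the URLs, file names, and the `$API_TOKEN` variable here are placeholders, not real endpoints):

```shell
# Download a dataset, following redirects (-L) and naming the output file (-o).
curl -L -o dataset.csv "https://example.com/data/dataset.csv"

# Quietly (-s) fetch JSON from an API with an auth header.
curl -s -H "Authorization: Bearer $API_TOKEN" "https://example.com/api/v1/records"

# POST a JSON payload to an endpoint.
curl -s -X POST -H "Content-Type: application/json" \
     -d '{"query": "sales"}' "https://example.com/api/v1/search"
```

Piping the `-s` output straight into a JSON processor like jq is a common pattern for quick API exploration.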

 

2. jq

 
jq is a lightweight JSON processor that lets you query, filter, transform, and pretty-print JSON data. With JSON being a dominant format for APIs, logs, and data interchange, jq is indispensable for extracting and reshaping JSON in pipelines; it acts like "Pandas for JSON in the shell." Its biggest advantage is a concise language for dealing with complex JSON, but learning that syntax takes time, and extremely large JSON files may require extra care with memory management.
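To give a feel for that query language, here is a small sketch on inline sample data (the field names are made up for illustration):

```shell
# Extract one field from every element of a JSON array (-r prints raw strings).
echo '[{"name":"a","size":3},{"name":"b","size":5}]' | jq -r '.[].name'

# Filter elements and reshape them into new objects (-c prints compact JSON).
echo '[{"name":"a","size":3},{"name":"b","size":5}]' \
  | jq -c '[.[] | select(.size > 4) | {id: .name}]'
```

The first command prints `a` and `b` on separate lines; the second emits `[{"id":"b"}]`.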

 

3. csvkit

 
csvkit is a suite of CSV-centric command-line utilities for transforming, filtering, aggregating, joining, and exploring CSV files. You can select and reorder columns, subset rows, combine multiple files, convert between formats, and even run SQL-like queries against CSV data. csvkit understands CSV quoting semantics and headers, making it safer than generic text-processing utilities for this format. Being Python-based, its performance can lag on very large datasets, and some complex queries may be easier in Pandas or SQL. If you prioritize speed and memory efficiency, consider the csvtk toolkit.

 

4. awk / sed

 
Classic Unix tools like awk and sed remain irreplaceable for text manipulation (the GNU sed manual lives at https://www.gnu.org/software/sed/manual/sed.html). awk shines at pattern scanning, field-based transformations, and quick aggregations, while sed excels at text substitutions, deletions, and transformations. Both are fast and lightweight, making them perfect for quick pipeline work. However, their syntax can be non-intuitive; as the logic grows, readability suffers, and you may want to migrate to a scripting language. They also have limited expressiveness for nested or hierarchical data (e.g., nested JSON).
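A minimal sketch of each tool's sweet spot (the file names are placeholders):

```shell
# awk: sum the 2nd comma-separated column of a file (a quick aggregation).
awk -F',' '{ total += $2 } END { print total }' sales.csv

# awk: print the 1st column of rows where the 2nd column exceeds a threshold.
awk -F',' '$2 > 100 { print $1 }' sales.csv

# sed: substitute every occurrence of "NA" with "0", writing to a new file.
sed 's/NA/0/g' raw.csv > clean.csv
```

Note that neither tool understands CSV quoting, so quoted fields containing commas will confuse `-F','`; for messy CSV, reach for csvkit instead.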

 

5. parallel

 
GNU parallel speeds up workflows by running multiple processes concurrently. Many data tasks are "mappable" across chunks of data: say you have to apply the same transformation to hundreds of files; parallel can spread the work across CPU cores, speed up processing, and manage job control. You must, however, watch out for I/O bottlenecks and system load, and quoting/escaping can be tricky in complex pipelines. For cluster-scale or distributed workloads, consider resource-aware schedulers (e.g., Spark, Dask, Kubernetes).

 

6. ripgrep (rg)

 
ripgrep (rg) is a fast recursive search tool designed for speed and efficiency. It respects .gitignore by default and skips hidden and binary files, which makes it significantly faster than traditional grep for typical project searches. It's perfect for quick searches across codebases, log directories, or config files. Because it ignores certain paths by default, you may need to adjust flags to search everything, and it isn't always available out of the box on every platform.

 

7. datamash

 
datamash provides numeric, textual, and statistical operations (sum, mean, median, group-by, and so on) directly in the shell via stdin or files. It's lightweight and handy for quick aggregations without launching a heavier tool like Python or R, which makes it ideal for shell-based ETL or exploratory analysis. It isn't designed for very large datasets or complex analytics, though, where specialized tools perform better, and grouping on very high-cardinality keys can require substantial memory.

 

8. htop

 
htop is an interactive system monitor and process viewer that gives live, per-process insight into CPU, memory, and I/O usage. When running heavy pipelines or model training, htop is extremely useful for tracking resource consumption and spotting bottlenecks. It's more user-friendly than traditional top, but being interactive means it doesn't fit well into automated scripts. It may also be missing on minimal server setups, and it doesn't replace specialized performance tools (profilers, metrics dashboards).

 

9. git

 
git is a distributed version control system essential for tracking changes to code, scripts, and small data assets. For reproducibility, collaboration, branching experiments, and rollback, git is the standard, and it integrates with deployment pipelines, CI/CD tools, and notebooks. Its downside is that it isn't meant for versioning large binary files, for which Git LFS, DVC, or specialized systems are better suited. The branching and merging workflow also comes with a learning curve.
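The basic experiment-branching loop looks something like this (the project and branch names are placeholders, and `git init -b` needs git 2.28+):

```shell
# Initialize a repository with "main" as the default branch.
git init -b main project && cd project

# Keep large raw data out of version control, then record a first snapshot.
echo "raw/" > .gitignore
git add .gitignore
git commit -m "Initial commit"

# Branch off to try an experiment without touching main.
git switch -c experiment/feature-scaling
# ...edit and commit, then compare the experiment against main...
git diff main --stat
```

For notebooks specifically, consider pairing git with a diff-friendly tool, since raw `.ipynb` JSON diffs are hard to read.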

 

10. tmux / display screen

 
Terminal multiplexers like tmux and screen let you run multiple terminal sessions in a single window, detach and reattach sessions, and resume work after an SSH disconnect. They're essential if you need to run long experiments or pipelines remotely. While tmux is recommended thanks to its active development and flexibility, its config and keybindings can be challenging for newcomers, and minimal environments may not have it installed by default.
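The core detach/reattach cycle with tmux looks like this (`train.py` and the session name are placeholders):

```shell
# Start a named session in the background (-d) running a long job.
tmux new-session -d -s training 'python train.py'

# List running sessions to confirm it's alive.
tmux ls

# Reattach later, even from a new SSH connection; detach again with Ctrl-b d.
tmux attach -t training
```

If the SSH connection drops while you're attached, the session and the job inside it keep running; you just `tmux attach` again on reconnect.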

 

Wrapping Up

 
If you're just getting started, I'd recommend mastering the "core four": curl, jq, awk/sed, and git; these are used everywhere. Over time, you'll discover domain-specific CLIs like SQL clients, the DuckDB CLI, or Datasette to slot into your workflow. For further learning, check out the following resources:

  1. Data Science at the Command Line by Jeroen Janssens
  2. The Art of Command Line on GitHub
  3. Mark Pearl's Bash Cheatsheet
  4. Communities like the unix and command-line subreddits, which regularly surface useful tricks and new tools to grow your toolbox over time.

 
 

Kanwal Mehreen is a machine learning engineer and technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
