10 Pandas One-Liners for Quick Data Quality Checks
Image by Author | Segmind SSD-1B Model
Almost all data projects start with messy real-world data. Before you get into analysis or building models, you need to make sure your dataset is in good shape. Fortunately, pandas makes it super easy to spot and fix common issues like missing values, duplicates, or inconsistent formatting, all with just a line of code.
In this article, we'll go over 10 essential one-liners that help you identify common issues such as missing values, incorrect data types, out-of-range values, inconsistent entries, and duplicate records. Let's get started.
Sample DataFrame
Here's a small sample dataset simulating e-commerce transactions with common data quality issues such as missing values, inconsistent formatting, and potential outliers:
import pandas as pd
import numpy as np
# Sample e-commerce transaction data
data = {
    "TransactionID": [101, 102, 103, 104, 105],
    "CustomerName": ["Jane Rust", "june young", "Jane Rust", None, "JUNE YOUNG"],
    "Product": ["Laptop", "Phone", "Laptop", "Tablet", "Phone"],
    "Price": [1200, 800, 1200, -300, 850],  # Negative price indicates an issue
    "Quantity": [1, 2, None, 1, 1],  # Missing value
    "TransactionDate": ["2024-12-01", "2024/12/01", "01-12-2024", None, "2024-12-01"],
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
This is the DataFrame we'll be working with:
   TransactionID CustomerName Product  Price  Quantity TransactionDate
0            101    Jane Rust  Laptop   1200       1.0      2024-12-01
1            102   june young   Phone    800       2.0      2024/12/01
2            103    Jane Rust  Laptop   1200       NaN      01-12-2024
3            104         None  Tablet   -300       1.0            None
4            105   JUNE YOUNG   Phone    850       1.0      2024-12-01
Before going ahead, let's get some basic information on the DataFrame:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   TransactionID    5 non-null      int64
 1   CustomerName     4 non-null      object
 2   Product          5 non-null      object
 3   Price            5 non-null      int64
 4   Quantity         4 non-null      float64
 5   TransactionDate  4 non-null      object
dtypes: float64(1), int64(2), object(3)
memory usage: 368.0+ bytes
1. Check for Missing Values
This one-liner checks each column for missing values and sums them up.
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
Output:
Missing Values:
TransactionID      0
CustomerName       1
Product            0
Price              0
Quantity           1
TransactionDate    1
dtype: int64
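Once you know where the gaps are, you can decide how to treat them. Here's a minimal sketch of two common follow-ups, filling and dropping; the fill value of 0 and the choice of CustomerName as a critical field are illustrative assumptions, not rules:
# Option 1: fill missing quantities with a default value (0 is an illustrative choice)
df_filled = df.assign(Quantity=df["Quantity"].fillna(0))

# Option 2: drop rows that are missing critical fields such as CustomerName
df_dropped = df.dropna(subset=["CustomerName"])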
2. Identify Incorrect Data Types
Reviewing data types is important. For example, TransactionDate should be a datetime type, but in this example it's stored as a plain object (string) column.
print("Data Types:\n", df.dtypes)
Output:
Data Types:
TransactionID       int64
CustomerName       object
Product            object
Price               int64
Quantity          float64
TransactionDate    object
dtype: object
Running this quick check should help identify columns that need transformation.
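For non-date columns, astype handles most conversions. As a sketch, Quantity could be cast to pandas' nullable Int64 dtype so the missing value no longer forces the column to float; treating quantities as integers is an assumption about this data:
# Cast Quantity to a nullable integer dtype; NaN becomes <NA>
quantity_int = df["Quantity"].astype("Int64")
print(quantity_int)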
3. Convert Dates to a Consistent Format
This one-liner converts 'TransactionDate' to a consistent datetime format. Any unconvertible values (invalid formats) are replaced with NaT (Not a Time).
df["TransactionDate"] = pd.to_datetime(df["TransactionDate"], errors="coerce")
print(df["TransactionDate"])
Output:
0   2024-12-01
1          NaT
2          NaT
3          NaT
4   2024-12-01
Name: TransactionDate, dtype: datetime64[ns]
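Note that rows 1 and 2 held real dates in alternate formats, and coercion threw them away. One way to recover such values, sketched here on a standalone copy of the raw strings (raw_dates is introduced just for this example), is to re-parse with each known format and combine the results:
raw_dates = pd.Series(["2024-12-01", "2024/12/01", "01-12-2024", None, "2024-12-01"])

# Try each known format in turn, keeping the first successful parse
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
for fmt in ("%Y/%m/%d", "%d-%m-%Y"):
    parsed = parsed.fillna(pd.to_datetime(raw_dates, format=fmt, errors="coerce"))
print(parsed)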
4. Find Outliers in Numeric Columns
Finding outliers in numeric columns is another important check. However, you'll need some domain knowledge to identify potential outliers. Here, we filter the rows where 'Price' is less than 0, flagging negative values as potential outliers.
outliers = df[df["Price"] < 0]
print("Outliers:\n", outliers)
Output:
Outliers:
   TransactionID CustomerName Product  Price  Quantity TransactionDate
3            104         None  Tablet   -300       1.0             NaT
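When you don't have a hard domain rule like "prices can't be negative", a statistical screen can stand in for one. Here is a minimal sketch using the common 1.5*IQR rule (the multiplier is a general convention, not something specific to this dataset):
# Flag prices outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["Price"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[~df["Price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(iqr_outliers)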
5. Detect Duplicate Records
This checks for duplicate rows based on 'CustomerName' and 'Product', ignoring unique TransactionIDs. Duplicates might indicate repeated entries.
duplicates = df.duplicated(subset=["CustomerName", "Product"], keep=False)
print("Duplicate Records:\n", df[duplicates])
Output:
Duplicate Records:
   TransactionID CustomerName Product  Price  Quantity TransactionDate
0            101    Jane Rust  Laptop   1200       1.0      2024-12-01
2            103    Jane Rust  Laptop   1200       NaN             NaT
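If these rows turn out to be genuine repeats, drop_duplicates with the same subset removes all but one occurrence. Keeping the first record is an assumption about which entry is authoritative:
# Keep the first occurrence of each (CustomerName, Product) pair
deduped = df.drop_duplicates(subset=["CustomerName", "Product"], keep="first")
print(deduped)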
6. Standardize Text Data
This standardizes CustomerName by removing extra spaces and ensuring proper capitalization ("jane rust" → "Jane Rust").
df["CustomerName"] = df["CustomerName"].str.strip().str.title()
print(df["CustomerName"])
Output:
0     Jane Rust
1    June Young
2     Jane Rust
3          None
4    June Young
Name: CustomerName, dtype: object
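str.strip only trims leading and trailing whitespace. If names can also contain doubled internal spaces, a small extension with a regex replacement handles that too; this sketch assumes such entries may exist in your data (they don't in this sample):
# Collapse runs of internal whitespace before stripping and title-casing
df["CustomerName"] = (
    df["CustomerName"]
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
    .str.title()
)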
7. Validate Data Ranges
With numeric values, ensuring they lie within the expected range is necessary. Let's check whether all prices fall within a realistic range, say 0 to 5000. Rows with price values outside this range are flagged.
invalid_prices = df[~df["Price"].between(0, 5000)]
print("Invalid Prices:\n", invalid_prices)
Output:
Invalid Prices:
   TransactionID CustomerName Product  Price  Quantity TransactionDate
3            104         None  Tablet   -300       1.0             NaT
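Depending on your cleaning policy, you might correct out-of-range values rather than just flag them. Two hedged options, both standard pandas calls; which one is right depends on what a violation means in your data:
# Option 1: clamp prices into the valid range
clipped = df["Price"].clip(lower=0, upper=5000)

# Option 2: mark out-of-range prices as missing for later imputation
masked = df["Price"].where(df["Price"].between(0, 5000))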
8. Count Unique Values in a Column
Let's get an overview of how many times each product appears using the `value_counts()` method. This is useful for spotting typos or anomalies in categorical data.
unique_products = df["Product"].value_counts()
print("Unique Products:\n", unique_products)
Output:
Unique Products:
Product
Laptop    2
Phone     2
Tablet    1
Name: count, dtype: int64
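If the counts were to reveal near-duplicate categories, a small mapping can consolidate them. The map below is hypothetical, since this sample has no such typos:
# Hypothetical cleanup map for category typos (e.g. "laptop" or "Lap top")
fixes = {"laptop": "Laptop", "Lap top": "Laptop"}
df["Product"] = df["Product"].replace(fixes)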
9. Check for Inconsistent Formatting Across Columns
This detects inconsistently formatted entries in 'CustomerName'. The regex flags names containing runs of two or more uppercase letters, which don't match the expected title case format.
inconsistent_names = df["CustomerName"].str.contains(r"[A-Z]{2,}", na=False)
print("Inconsistent Formatting in Names:\n", df[inconsistent_names])
Output:
Inconsistent Formatting in Names:
Empty DataFrame
Columns: [TransactionID, CustomerName, Product, Price, Quantity, TransactionDate]
Index: []
Here there are no inconsistent entries in the 'CustomerName' column, as we've already formatted the names in title case. A stricter variant is sketched below.
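The uppercase-run pattern is deliberately loose; for instance, it would miss all-lowercase entries like the original "june young". This stricter sketch, under the assumption that valid names are space-separated title-case words, matches the full expected shape and flags everything else (missing names included):
# Flag any name that is not entirely title-case words separated by single spaces
pattern = r"([A-Z][a-z]+)( [A-Z][a-z]+)*"
not_title_case = ~df["CustomerName"].str.fullmatch(pattern, na=False)
print(df[not_title_case])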
10. Identify Rows with Multiple Issues
This identifies rows with more than one issue, such as missing values, negative prices, or invalid dates, for focused attention during cleaning.
# Count issues per row: nulls, negative prices, and invalid (NaT) dates
issues = df.isnull().sum(axis=1) + (df["Price"] < 0) + df["TransactionDate"].isnull()
problematic_rows = df[issues > 1]
print("Rows with Multiple Issues:\n", problematic_rows)
Output:
Rows with Multiple Issues:
   TransactionID CustomerName Product  Price  Quantity TransactionDate
1            102   June Young   Phone    800       2.0             NaT
2            103    Jane Rust  Laptop   1200       NaN             NaT
3            104         None  Tablet   -300       1.0             NaT
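To reuse these checks across projects, you could fold a few of them into one helper. A minimal sketch; the function name, parameters, and default range here are invented for illustration:
def quality_report(frame, numeric_col, valid_range=(0, 5000)):
    """Print a handful of the quality checks covered above."""
    low, high = valid_range
    print("Missing values:\n", frame.isnull().sum())
    print("Duplicate rows:", frame.duplicated().sum())
    print("Out of range:\n", frame[~frame[numeric_col].between(low, high)])

quality_report(df, "Price")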
Conclusion
Data cleaning doesn't have to be overwhelming. With these pandas one-liners in your toolkit, you can run some important data quality checks and better decide on your next steps in cleaning.
Whether it's handling missing values, catching outliers, or checking for the right data types, these quick checks will save you time and headaches.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.