10 Pandas One-Liners for Quick Data Quality Checks


Image by Author | Segmind SSD-1B Model

 

Almost all data projects start with messy real-world data. Before you get into analysis or building models, you need to make sure your dataset is in good shape. Fortunately, pandas makes it easy to spot and fix common issues like missing values, duplicates, or inconsistent formatting, all with just a line of code.

In this article, we'll explore 10 essential one-liners that help you identify common issues such as missing values, incorrect data types, out-of-range values, inconsistent entries, and duplicate records. Let's get started.

 

Sample DataFrame

 
Here's a small sample dataset simulating e-commerce transactions with common data quality issues such as missing values, inconsistent formatting, and potential outliers:

import pandas as pd
import numpy as np

# Sample e-commerce transaction data
data = {
    "TransactionID": [101, 102, 103, 104, 105],
    "CustomerName": ["Jane Rust", "june young", "Jane Rust", None, "JUNE YOUNG"],
    "Product": ["Laptop", "Phone", "Laptop", "Tablet", "Phone"],
    "Price": [1200, 800, 1200, -300, 850],  # Negative price indicates an issue
    "Quantity": [1, 2, None, 1, 1],  # Missing value
    "TransactionDate": ["2024-12-01", "2024/12/01", "01-12-2024", None, "2024-12-01"],
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

 

This is the dataframe we'll be working with:

   TransactionID CustomerName Product  Price  Quantity TransactionDate
0            101    Jane Rust  Laptop   1200       1.0      2024-12-01
1            102   june young   Phone    800       2.0      2024/12/01
2            103    Jane Rust  Laptop   1200       NaN      01-12-2024
3            104         None  Tablet   -300       1.0            None
4            105   JUNE YOUNG   Phone    850       1.0      2024-12-01

 

Before going ahead, let's get some basic information about the dataframe:

df.info()

 

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TransactionID    5 non-null      int64  
 1   CustomerName     4 non-null      object 
 2   Product          5 non-null      object 
 3   Price            5 non-null      int64  
 4   Quantity         4 non-null      float64
 5   TransactionDate  4 non-null      object 
dtypes: float64(1), int64(2), object(3)
memory usage: 368.0+ bytes

 

1. Check for Missing Values

 
This one-liner checks each column for missing values and sums them up.

missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

 

Output:

Missing Values:
TransactionID      0
CustomerName       1
Product            0
Price              0
Quantity           1
TransactionDate    1
dtype: int64
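On larger datasets it can help to see missing values as a share of each column rather than raw counts. A minimal sketch on a separate toy frame (not the transactions data above): since `isnull()` returns booleans, `mean()` gives the fraction missing directly.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, None, 3, None], "B": [1, 2, 3, 4]})

# isnull() yields booleans, so mean() is the fraction of missing values
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```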

 

2. Identify Incorrect Data Types

 
Reviewing data types is important. For example, TransactionDate should be a datetime type, but it is stored as a generic object in this example.

print("Data Types:\n", df.dtypes)

 

Output:

Data Types:
TransactionID        int64
CustomerName        object
Product             object
Price                int64
Quantity           float64
TransactionDate     object
dtype: object

 

Running this quick check should help identify columns that need transformation.

 

3. Convert Dates to a Consistent Format

 
This one-liner converts 'TransactionDate' to a consistent datetime format. Any unconvertible values, such as invalid formats, are replaced with NaT (Not a Time).

df["TransactionDate"] = pd.to_datetime(df["TransactionDate"], errors="coerce")
print(df["TransactionDate"])

 

Output:

0   2024-12-01
1          NaT
2          NaT
3          NaT
4   2024-12-01
Name: TransactionDate, dtype: datetime64[ns]
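Coercing to NaT loses the dates that were merely written in a different format. If you know the handful of formats that occur, one hedged approach (a sketch, not part of the original one-liner) is to parse the strict format first and then retry the leftovers with each alternative:

```python
import pandas as pd

dates = pd.Series(["2024-12-01", "2024/12/01", "01-12-2024", None])

# First pass: the expected ISO format; anything else becomes NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Retry the remaining NaT entries with other known formats
for fmt in ("%Y/%m/%d", "%d-%m-%Y"):
    parsed = parsed.fillna(pd.to_datetime(dates, format=fmt, errors="coerce"))

print(parsed)
```

With this two-pass approach, all three valid spellings resolve to the same date and only the genuinely missing entry stays NaT.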

 

4. Find Outliers in Numeric Columns

 
Finding outliers in numeric columns is another important check. However, you'll need some domain knowledge to identify potential outliers. Here, we filter the rows where the 'Price' is less than 0, flagging negative values as potential outliers.

outliers = df[df["Price"] < 0]
print("Outliers:\n", outliers)

 

Output:

Outliers:
    TransactionID CustomerName Product  Price  Quantity TransactionDate
3            104         None  Tablet   -300       1.0             NaT
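When there is no obvious rule like "price must be non-negative", a statistical cutoff can stand in for domain knowledge. A common choice is the IQR rule, sketched here on the same price values (this is an addition, not the article's one-liner):

```python
import pandas as pd

prices = pd.Series([1200, 800, 1200, -300, 850])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers)
```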

 

5. Detect Duplicate Records

 
This checks for duplicate rows based on 'CustomerName' and 'Product', ignoring unique TransactionIDs. Duplicates might indicate repeated entries.

duplicates = df.duplicated(subset=["CustomerName", "Product"], keep=False)
print("Duplicate Records:\n", df[duplicates])

 

Output:

Duplicate Records:
    TransactionID CustomerName Product  Price  Quantity TransactionDate
0            101    Jane Rust  Laptop   1200       1.0      2024-12-01
2            103    Jane Rust  Laptop   1200       NaN             NaT
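Once flagged, duplicates are often resolved by keeping the first occurrence. A small sketch on a toy frame (whether dropping is the right fix depends on whether the entries are true repeats):

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerName": ["Jane Rust", "june young", "Jane Rust"],
    "Product": ["Laptop", "Phone", "Laptop"],
    "Price": [1200, 800, 1200],
})

# Keep only the first row for each (CustomerName, Product) pair
deduped = toy.drop_duplicates(subset=["CustomerName", "Product"], keep="first")
print(deduped)
```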

 

6. Standardize Text Data

 
This standardizes CustomerName by removing extra spaces and ensuring proper capitalization (e.g., "jane rust" → "Jane Rust").

df["CustomerName"] = df["CustomerName"].str.strip().str.title()
print(df["CustomerName"])

 

Output:

0     Jane Rust
1    June Young
2     Jane Rust
3          None
4    June Young
Name: CustomerName, dtype: object

 

7. Validate Data Ranges

 
With numeric values, ensuring they lie within the expected range is necessary. Let's check whether all prices fall within a realistic range, say 0 to 5000. Rows with price values outside this range are flagged.

invalid_prices = df[~df["Price"].between(0, 5000)]
print("Invalid Prices:\n", invalid_prices)

 

Output:

Invalid Prices:
    TransactionID CustomerName Product  Price  Quantity TransactionDate
3            104         None  Tablet   -300       1.0             NaT
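After flagging out-of-range rows, one possible follow-up (an assumption about the cleaning policy, not a rule) is to convert impossible values to missing so they can be imputed or dropped later, rather than silently clipping them:

```python
import pandas as pd

prices = pd.Series([1200, 800, -300, 850])

# where() keeps values that satisfy the condition and replaces the rest with NaN
cleaned = prices.where(prices.between(0, 5000))
print(cleaned)
```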

 

8. Count Unique Values in a Column

 
Let's get an overview of how many times each product appears using the `value_counts()` method. This is useful for spotting typos or anomalies in categorical data.

unique_products = df["Product"].value_counts()
print("Unique Products:\n", unique_products)

 

Output:

Unique Products:
Product
Laptop    2
Phone     2
Tablet    1
Name: count, dtype: int64
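The same method can report relative frequencies instead of raw counts, which makes skew easier to spot across datasets of different sizes:

```python
import pandas as pd

products = pd.Series(["Laptop", "Phone", "Laptop", "Tablet", "Phone"])

# normalize=True returns each category's share of the total
shares = products.value_counts(normalize=True)
print(shares)
```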

 

9. Check for Inconsistent Formatting Across Columns

 
This detects inconsistently formatted entries in 'CustomerName'. The regex flags names that may not match the expected title case format.

inconsistent_names = df["CustomerName"].str.contains(r"[A-Z]{2,}", na=False)
print("Inconsistent Formatting in Names:\n", df[inconsistent_names])

 

Output:

Inconsistent Formatting in Names:
Empty DataFrame
Columns: [TransactionID, CustomerName, Product, Price, Quantity, TransactionDate]
Index: []

 

Here there are no inconsistent entries in the 'CustomerName' column, as we've already formatted them in title case.
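The same pattern extends to other formatting problems. For instance, a sketch that also flags leading or trailing whitespace (run before the standardization step, since `str.strip()` removes the spaces this regex looks for):

```python
import pandas as pd

names = pd.Series(["Jane Rust", " June Young", "JANE RUST", None])

# Flag leading/trailing whitespace or runs of two or more capitals
flagged = names.str.contains(r"^\s|\s$|[A-Z]{2,}", na=False)
print(names[flagged])
```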

 

10. Identify Rows with Multiple Issues

 
This identifies rows with more than one issue, such as missing values, negative prices, or invalid dates, for focused attention during cleaning.

problematic_rows = df[(df.isnull().sum(axis=1) + (df["Price"] < 0)) > 1]
print("Rows with Multiple Issues:\n", problematic_rows)

 

Output:

Rows with Multiple Issues:
    TransactionID CustomerName Product  Price  Quantity TransactionDate
2            103    Jane Rust  Laptop   1200       NaN             NaT
3            104         None  Tablet   -300       1.0             NaT
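To avoid rerunning these checks one at a time, the per-column pieces can be bundled into a small helper. This is a sketch of one possible report, with a hypothetical `quality_report` name, not a pandas built-in:

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, missing count, and distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "unique": frame.nunique(),
    })

toy = pd.DataFrame({"A": [1, None, 3], "B": ["x", "x", "y"]})
report = quality_report(toy)
print(report)
```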

 

Conclusion

 
Data cleaning doesn't have to be overwhelming. With these pandas one-liners in your toolkit, you can run some important data quality checks, which can help you decide your next steps in data cleaning.

Whether it's handling missing values, catching outliers, or checking for the right data types, these quick checks will save you time and headaches.

 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


