EDA: understanding the process through the PACE framework.

Tools used in this project:
Python

In the world of data analysis, Exploratory Data Analysis (EDA) is a fundamental process that precedes any form of advanced modelling. In this article, we will get to the heart of EDA, explaining the concepts in plain terms and using the PACE framework (Plan, Analyze, Construct, Execute) to structure our approach.

The EDA process: an iterative, non-sequential cycle

The Exploratory Data Analysis (EDA) process is a critical pillar that guides analysts toward a deep understanding of the available data. This process is dynamic and flexible, characterized by an iterative and non-sequential nature. Let’s look in detail at what these terms mean and how they shape the application of EDA practices.

  • Iterative: a repetitive and reflective approach
    The term “iterative” refers to a process that is repeated. In EDA, this means that data exploration does not happen in a single pass but through continuous cycles of evaluation and review. During these cycles, any analysis conducted may reveal new insights or require adjustments. For example, you may find that a data transformation applied at an early stage needs to be revised in light of later analyses. This cyclical approach ensures that hidden anomalies or patterns within the dataset are not overlooked.
  • Non-sequential: the flexibility of the process
    By non-sequentiality, we mean the absence of a fixed order in the steps of the EDA process. Unlike a cooking recipe, where following the order of ingredients is crucial, in EDA the order of operations can vary depending on the dataset and the goals of the analysis. This means that an analyst can start with data validation, move on to data cleaning, and then return to validation after introducing new variables or performing transformations.
  • The importance of experience and logic
    The role of experience and logic is irreplaceable in the EDA process. Despite the existence of guidelines and best practices, each dataset is unique and presents specific challenges. The analyst must therefore rely on experience to decide which EDA practice to apply, when to apply it, and how many times to repeat it. In addition, logic is essential for correctly interpreting the results of each iterative cycle and deciding on the path forward.
    For example, a dataset with many variables may require several dimensionality-reduction steps before key relationships can be identified. In another case, if discrepancies are found in the data, it may be necessary to go back and review the data cleaning steps before proceeding.

In summary, EDA is not a linear process but an exploratory adventure that requires curiosity, flexibility, and constant critical thinking. Data analysts must embrace an open and adaptive approach, ensuring that each step, from initial discovery to final presentation, is executed rigorously and carefully to preserve the integrity and quality of the data being analyzed.

Example: a dataset on Norwegian coniferous forests.

Imagine we have a dataset that captures data on coniferous forests in Norway. We will use this example to illustrate the EDA process through specific steps, demonstrating how each contributes to transforming a set of raw data into valuable and reliable information. A minimal code sketch of these steps follows the list.

  1. Discovery: initial data analysis
    The discovery phase is the first contact with the dataset. For example, let us assume that the dataset consists of 200 rows and five columns, representing various aspects of coniferous forests. The goal is to get a general overview of the data’s size, structure, and content. We immediately detect a shortcoming: the dataset is too small for meaningful analysis and needs to be enriched with additional information.
  2. Joining: adding new data
    Recognizing the need to expand our dataset, we join new data sources. This could mean adding historical measurements, geospatial data, or information from related studies to enrich our original data set. This step is crucial to ensure that the volume and variety of data are sufficient for in-depth analysis.
  3. Validation: verification of the absence of errors
    After each joining step, it is imperative to validate the new data. This means checking for typing errors, inconsistently formatted data, or missing values that might have been introduced. Validation is a quality assurance step that helps prevent the propagation of errors during subsequent analysis.
  4. Structuring: organizing data to understand trends
    With a more robust dataset, we move on to structuring. Here, we organize the data into different periods or segment them by specific characteristics such as tree age or forest density. This helps us better visualize and understand trends and correlations within the data.
  5. Cleaning: searching for anomalies and deficiencies
    Cleaning is a critical step for finding and resolving problems such as outliers, missing data, and needed data conversions or transformations. For example, we may find that some tree growth measurements have been recorded in different units and must be standardized. Some values may also deviate significantly from the mean and require further investigation to determine whether they represent measurement errors or natural phenomena.
  6. Validation: post-cleaning verification
    Once the data have been cleaned, another round of validation is performed. This is a crucial step to ensure that the changes made have not introduced new errors and that the dataset is now consistent and accurate. We verify that each value matches its logical expectation and that data integrity is maintained.
  7. Presentation: clean data sharing
    Finally, we are ready for the presentation with a clean, well-structured dataset. This step could involve sharing data with colleagues or other interested parties for critical review or future collaborations. One could also create data visualizations or prepare a report summarizing the findings. Importantly, feedback received at this stage could reveal new opportunities for EDA, leading to further iterations of the process.
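To make the cycle above concrete, here is a minimal pandas sketch of the same steps. The plots, column names, and values are invented purely for illustration; a real analysis of Norwegian forest data would adapt each step to the actual columns.

```python
import pandas as pd
import numpy as np

# Hypothetical sample standing in for the Norwegian conifer dataset
# (plot IDs, column names, and values are invented for illustration).
forests = pd.DataFrame({
    "plot_id": [1, 2, 3, 4, 5, 6],
    "tree_age_years": [35, 80, 120, 45, 60, 95],
    "height_m": [12.0, 24.5, np.nan, 15.2, 18.7, 26.1],
    "density_trees_ha": [900, 600, 450, 850, 700, 520],
    "region": ["Trøndelag", "Innlandet", "Innlandet", "Nordland", "Trøndelag", "Nordland"],
})

# 1. Discovery: size, structure, and a first look at the content
print(forests.shape)
print(forests.dtypes)
print(forests.head())

# 2. Joining: enrich the data with another (invented) source, e.g. growth measurements
history = pd.DataFrame({
    "plot_id": [1, 2, 3, 4, 5, 6],
    "growth_cm_per_year": [30, 22, 18, 26, 24, 2800],  # one suspiciously large value
})
forests = forests.merge(history, on="plot_id", how="left")

# 3. Validation: check for missing values and duplicated plots
print(forests.isna().sum())
print(forests.duplicated(subset="plot_id").sum())

# 4. Structuring: segment by characteristics such as tree age
forests["age_class"] = pd.cut(
    forests["tree_age_years"], bins=[0, 50, 100, 200],
    labels=["young", "mature", "old"],
)
print(forests.groupby("age_class", observed=True)["density_trees_ha"].mean())

# 5. Cleaning: standardize units (cm/year -> m/year) and flag outliers
forests["growth_m_per_year"] = forests["growth_cm_per_year"] / 100
z_scores = (
    (forests["growth_m_per_year"] - forests["growth_m_per_year"].mean())
    / forests["growth_m_per_year"].std()
)
print(forests[z_scores.abs() > 1.5])  # candidates for a closer look

# 6. Post-cleaning validation would repeat the checks above;
# 7. Presentation: share the cleaned table with colleagues or stakeholders
forests.to_csv("forests_clean.csv", index=False)
```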

In summary, this example demonstrates how the EDA process is fluid and adaptive. It is not a linear path but a cycle of practices that are repeated and adapted according to the needs and discoveries made along the way. Each stage contributes to transforming raw data into valuable insights and informed decisions. In our case, the analysis of Norway’s coniferous forests becomes a methodical exercise and an exploratory journey that requires curiosity and precision at every step.


The importance of EDA in ethical machine learning

With the advent of increasingly sophisticated Artificial Intelligence (AI) and Machine Learning (ML) systems, exploratory data analysis (EDA) assumes a key role not only in improving the quality of predictions but also in ensuring that ethics guide the development and use of such technologies. We explore the fundamental ethical principles of AI and ML and how EDA fits into this context.

The ethical principles of AI and Machine Learning
AI and ML are playing a growing role in decisions that affect individuals, companies, and governments. This increase in responsibility raises significant ethical and regulatory issues. The Institute for Ethical AI & Machine Learning has established eight principles for the responsible development of ML systems:

  1. Human augmentation: design systems that consider the impact of incorrect predictions and, where possible, incorporate human review processes.
  2. Bias assessment: develop processes to understand, document, and monitor bias during systems development and use.
  3. Explainability by justification: create tools and processes that improve the transparency and explainability of ML systems.
  4. Reproducible operations: implement infrastructure that allows a reasonable level of reproducibility in the operations of ML systems.
  5. Displacement strategy: identify and document relevant information so that organizational change processes can mitigate the impact of job automation on workers.
  6. Practical accuracy: ensure that accuracy and cost metrics are aligned with domain-specific applications.
  7. Trust by privacy: protect and manage data responsibly, involving stakeholders who may interact directly or indirectly with the system.
  8. Data risk awareness: develop and improve processes and infrastructure to ensure the security of data and models during the development of ML systems.

The role of EDA in human augmentation and bias assessment
EDA is essential in two fundamental principles:

  • Human augmentation: EDA is a means by which humans remain actively involved in AI and ML systems, exercising critical oversight. By performing in-depth EDA, data scientists can identify and correct errors, biases, and imbalances before they become part of an algorithm.
  • Bias assessment: EDA allows analysts to identify biases in the data that, without human intervention, could be easily incorporated and reproduced in ML models. By performing systematic EDA processes, data scientists become aware of the biases and imbalances present and can act to mitigate them.

In summary, EDA is not only a data preprocessing practice but also an ethical foundation in AI and ML. A systematic EDA methodology is vital to ensure that decisions made by automated systems are fair, transparent and free from unintended bias. In this way, EDA is confirmed as a crucial tool for empowering data analysts in building an ethical and informed technological future.


Fundamental principles of the EDA process

The EDA process is intrinsically linked to continuous improvement and bias reduction, two pillars that support the validity and reliability of data analyses.

  • Continuous improvement:
    EDA does not stop at the first examination of the data. It is a cycle of constant improvement where each iteration refines the understanding and quality of the data. Each in-depth analysis can reveal new opportunities for data cleansing, transformation and enrichment, leading to new insights.
  • Bias reduction:
    One of the main goals of EDA is to identify and reduce bias in the data. Through techniques such as comparing distributions, subgroup analysis, and hypothesis testing, analysts can discover and correct bias before it propagates into models (see the sketch below). This process improves the quality of the analysis and helps ensure that machine learning models built on the data are fairer and less biased.
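As a rough illustration of what such a bias check can look like, the sketch below compares a numeric outcome between two subgroups using summary statistics and a two-sample t-test. The group names and values are invented; a real assessment would use the domain-relevant variables of the dataset at hand.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: a numeric outcome measured for two subgroups
# (group names and values are invented for illustration).
df = pd.DataFrame({
    "group":   ["A"] * 6 + ["B"] * 6,
    "outcome": [72, 75, 71, 74, 73, 76, 64, 66, 63, 65, 67, 62],
})

# Subgroup analysis: compare summary statistics across groups
print(df.groupby("group")["outcome"].describe())

# Distribution comparison: a simple two-sample t-test as a first check.
# A large, systematic gap between groups may point to bias in how the data
# were collected or labelled, and deserves closer investigation.
a = df.loc[df["group"] == "A", "outcome"]
b = df.loc[df["group"] == "B", "outcome"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```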

In summary, the fundamental principles of EDA emphasize the importance of a meticulous and iterative approach to data analysis, which goes beyond mere data cleaning and organization to embrace a proactive commitment to the integrity and fairness of analytical results.


EDA: Best Practices

Exploratory Data Analysis (EDA) is an essential component of the data science process. It is not just about understanding the data but also about preparing it so that subsequent modelling steps rest on a solid foundation. Here are some of the best practices to follow during EDA:

  1. Elimination of duplicates
    Duplicates in a dataset can distort the analysis, leading to incorrect conclusions. Eliminating duplicates is critical to maintaining the statistical integrity of the data. This step simplifies the analysis and reduces the risk of overestimating the importance of specific observations.
  2. Importance of the iterative process
    EDA is not a one-time process. It is an iterative cycle where each stage may reveal new insights requiring returning to earlier stages. An iterative approach ensures that every aspect of the data has been adequately explored and validated.
  3. Presentation of a clean dataset
    After identifying and correcting errors, missing values, and anomalies, it is crucial to present the dataset in a format that is easy to understand and use. A clean dataset improves the quality of the analysis and reduces the risk of errors in subsequent steps, such as predictive modelling or statistical analysis.
  4. Adding new calculated columns
    Often, raw data do not directly provide the measures needed for the analysis. In these cases, new calculated columns, such as indices or ratios, are created to provide additional insight and better support the analysis (see the sketch after this list).
  5. Consequences of negligence in data cleaning
    Neglecting data cleaning can lead to misinterpretations, inaccurate predictive models, and poor decisions. Data quality is as essential as the quality of the algorithm used; neglecting it can significantly impact the results’ reliability.

Following these best practices improves the accuracy of analyses. It strengthens the ethics of working with data, ensuring that decisions based on these data are as fair and unbiased as possible. In addition, a well-executed EDA facilitates the communication of results, making the data understandable even to those not experts in the field and helping build trust with stakeholders.


EDA: an example with Python

A notebook demonstrating exploratory data analysis (EDA) on the dataset below can be accessed via the following link.

Python Workspace

Data dictionary:
This activity uses a dataset called Unicorn_Companies.csv.
This is a list of private companies with a valuation of more than $1 billion as of March 2022. The data include the country where each company was founded, its current valuation, funding, industry, select investors, the year it was founded, and the date it reached a $1 billion valuation.

The dataset contains:
1,074 rows – each row is a different company
10 columns

Column name (type) – description:

  • Company (str) – Company name
  • Valuation (str) – Company valuation in billions of dollars (B)
  • Date Joined (datetime) – The date when the company reached a billion-dollar valuation
  • Industry (str) – Business sector
  • City (str) – The city where the company was founded
  • Country/Region (str) – The country where the company was founded
  • Continent (str) – The continent where the company was founded
  • Year Founded (int) – The year the company was founded
  • Funding (str) – Total amount raised across all financing rounds, in billions (B) or millions (M) of dollars
  • Select Investors (str) – The top 4 investment firms or individual investors (some companies have fewer than 4)
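As a starting point, a first pass over this dataset in pandas might look like the sketch below. It assumes the CSV file is available in the working directory and shows generic discovery and validation checks rather than the full notebook linked above.

```python
import pandas as pd

# Assumes Unicorn_Companies.csv is available in the working directory.
companies = pd.read_csv("Unicorn_Companies.csv")

# Discovery: size, structure, and a first look at the content
print(companies.shape)   # expected: (1074, 10)
print(companies.dtypes)
print(companies.head())

# Validation and cleaning checks: missing values and duplicated companies
print(companies.isna().sum())
print(companies.duplicated(subset="Company").sum())

# "Date Joined" is more useful as a datetime than as plain text
companies["Date Joined"] = pd.to_datetime(companies["Date Joined"])

# Structuring: a simple segmentation, e.g. the number of unicorns per continent
print(companies["Continent"].value_counts())
```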

Conclusion: the importance of EDA

As we explored the Exploratory Data Analysis (EDA) process, we discovered that it is not simply a preliminary task but a crucial step in its own right. EDA emerges as an iterative, nonlinear cycle that requires an analytical mind and an ethical approach to navigate the intricacies of the data.

We have seen how EDA plays a vital role in machine learning, serving as a bulwark against bias and a tool for human augmentation, ensuring that machines work for us and not against us. We also pointed out that EDA is a continuously improving process that requires constant evaluation of bias to ensure the integrity and fairness of machine learning models.

Reflecting on best practices, we stressed the importance of eliminating duplicates, embracing the iterative cycle, presenting clean data, adding new calculated columns, and, most importantly, recognizing the severe consequences that can result from negligence in data cleaning.

In conclusion, EDA is not just a technical step but a moral and professional commitment that data analysts must take seriously. It represents the bridge between raw data and deep insight, between superficial knowledge and informed decisions. As data analysts, our job is to walk this bridge with care and dedication, ensuring that each step is taken with an awareness of its importance and with respect for the truth the data strive to reveal.
