Task 1: Dealing with the Data
You identify the following important documents that, if used for context, you believe will help people understand what’s happening now:
1. 2022: Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People (PDF)
2. 2024: National Institute of Standards and Technology (NIST) Artificial Intelligence Risk Management Framework (PDF)
Your boss, the SVP of Technology, green-lighted this project to drive the adoption of AI throughout the enterprise. It will be a nice showpiece for the upcoming conference and the big AI initiative announcement the CEO is planning.
Task 1: Review the two PDFs and decide how best to chunk up the data with a single strategy to optimally answer the variety of questions you expect to receive from people. Hint: Create a list of potential questions that people are likely to ask!
Question | Potential Theme | Definition |
---|---|---|
What measures are recommended to ensure AI systems are safe for public use? | Safe and Effective Systems | This is a core principle in the AI Bill of Rights, emphasizing the need for AI systems to be safe and effective for their intended use. |
How can we evaluate the effectiveness of AI systems in real-world applications? | Safe and Effective Systems | |
What safeguards are proposed to prevent AI from perpetuating biases? | Algorithmic Discrimination Protections | Both documents stress the importance of preventing bias and discrimination in AI systems. |
How can we detect and mitigate algorithmic discrimination in AI systems? | Algorithmic Discrimination Protections | |
What guidelines are provided for protecting individual privacy in AI systems? | Data Privacy | This is a crucial aspect covered in the AI Bill of Rights, focusing on protecting individual privacy in AI systems. |
How should companies handle personal data when developing AI applications? | Data Privacy | |
What level of transparency is required when deploying AI systems? | Notice and Explanation | This theme relates to the principle of providing clear information about when and how AI systems are being used, as outlined in the AI Bill of Rights. |
How should organizations communicate to users that they're interacting with an AI? | Notice and Explanation | |
In what situations should human alternatives to AI systems be mandatory? | Human Alternatives | The AI Bill of Rights emphasizes the importance of providing alternatives to AI systems when appropriate. |
How can organizations balance AI automation with human oversight? | Human Alternatives | |
What are the key steps in assessing and mitigating risks associated with AI systems? | Risk Management | This is a central theme in the NIST AI Risk Management Framework, focusing on identifying and mitigating risks associated with AI systems. |
How often should AI risk assessments be conducted? | Risk Management | |
What governance structures are recommended for overseeing AI development and deployment? | Governance | Both documents discuss the importance of proper governance structures for AI systems. |
Who should be responsible for ensuring AI systems comply with ethical guidelines? | Governance | |
How can organizations build public trust in their AI systems? | Trustworthiness | This is an overarching theme in both documents, emphasizing the need for AI systems to be reliable, fair, and transparent. |
What metrics can be used to measure the trustworthiness of an AI application? | Trustworthiness | |
N/A | Unclassified | If a chunk doesn't match any predefined theme, it is added to the "Unclassified" category. |
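
The theme table above, including its "Unclassified" fallback, could be applied to chunks roughly as follows. This is a minimal sketch, not the method actually used: the write-up does not say how chunks are matched to themes, so the keyword lists in `THEME_KEYWORDS` and the function name `classify_chunk` are hypothetical, and simple substring matching is an assumption chosen for brevity.

```python
# Hypothetical keyword lists: the source does not specify how chunks are
# matched to themes, so these are illustrative assumptions only.
THEME_KEYWORDS = {
    "Safe and Effective Systems": ["safe", "effective", "testing"],
    "Algorithmic Discrimination Protections": ["bias", "discrimination", "equity"],
    "Data Privacy": ["privacy", "personal data", "consent"],
    "Notice and Explanation": ["notice", "explanation", "transparen"],
    "Human Alternatives": ["human alternative", "opt out", "human oversight"],
    "Risk Management": ["risk"],
    "Governance": ["governance", "accountab"],
    "Trustworthiness": ["trust"],
}


def classify_chunk(text):
    """Return every theme whose keywords appear in the chunk.

    A chunk may match several themes; a chunk that matches none
    falls back to the "Unclassified" category from the table.
    """
    lowered = text.lower()
    themes = [
        theme
        for theme, keywords in THEME_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return themes or ["Unclassified"]


if __name__ == "__main__":
    print(classify_chunk("Automated systems should not collect personal data without consent."))
    print(classify_chunk("The weather is nice today."))
```

Keyword matching is cheap but brittle. An embedding-based or LLM-based classifier would likely assign themes more reliably, at higher cost.
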
✅ Deliverables:
1. Describe the default chunking strategy that you will use.
The default chunking strategy combines size-based splitting with thematic categorization. It uses RecursiveCharacterTextSplitter with a chunk size of 1000 characters and an overlap of 200 characters, then assigns each chunk to one of the predefined themes.
2. Articulate a chunking strategy that you would also like to test out.
The alternative is a pure size-based chunking strategy without thematic categorization: the text is split into fixed-size chunks, with no attempt to categorize them by theme.
3. Describe how and why you made these decisions.
The default strategy was chosen for its simplicity and efficiency:
- Size-based splitting (1000 characters) ensures manageable chunk sizes for processing and embedding.
- The 200-character overlap helps maintain context between chunks.
- Thematic categorization allows for organized retrieval based on specific topics of interest.
This approach balances processing efficiency with maintaining semantic coherence within chunks.
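
To make the size and overlap numbers concrete, here is a simplified stand-in for the splitting step. It is not the RecursiveCharacterTextSplitter implementation: that splitter also tries to break on separators such as paragraph and line boundaries before falling back to raw character counts, whereas this sketch only slides a fixed window. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Slide a fixed-size window over the text.

    Each new chunk starts (chunk_size - overlap) characters after the
    previous one, so consecutive chunks share `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2600))
    pieces = split_with_overlap(sample)
    print(len(pieces), [len(p) for p in pieces])
```

With the defaults, the window advances 800 characters at a time, so the last 200 characters of one chunk are repeated at the start of the next.
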
The alternative pure size-based strategy:
- Ensures consistent chunk sizes, which can be beneficial for processing and embedding.
- Is simpler to implement and doesn't rely on predefined themes.
- May split semantic units, potentially affecting the coherence of individual chunks.
- Could be more comprehensive, including all parts of the document regardless of theme.
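
One rough way to check the concern about split semantic units is to count how many fixed-size chunks end in the middle of a sentence. The sketch below assumes a naive definition of a sentence boundary (a chunk ending in `.`, `!`, or `?`), and both function names are hypothetical helpers rather than part of the original pipeline.

```python
def fixed_size_chunks(text, chunk_size):
    """Split text into consecutive chunks of chunk_size characters, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def count_mid_sentence_breaks(chunks):
    """Count chunks that end without sentence-final punctuation.

    The last chunk is skipped because it ends where the document ends.
    The punctuation check is a naive heuristic, not real sentence detection.
    """
    return sum(
        1
        for chunk in chunks[:-1]
        if not chunk.rstrip().endswith((".", "!", "?"))
    )


if __name__ == "__main__":
    text = "Risk is assessed. Bias is measured. Privacy is protected."
    for size in (18, 20):
        chunks = fixed_size_chunks(text, size)
        print(size, len(chunks), count_mid_sentence_breaks(chunks))
```

Running the same count on both strategies' chunks would give a simple, if crude, number to compare when testing them against each other.
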