deliverables/Task1.md · svb01/sbaiiinfo at main

Task 1: Dealing with the Data

You identify the following important documents that, if used for context, you believe will help people understand what’s happening now: 1. 2022: Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People (PDF) 2. 2024: National Institute of Standards and Technology (NIST) Artificial Intelligent Risk Management Framework (PDF)

Your boss, the SVP of Technology, green-lighted this project to drive the adoption of AI throughout the enterprise. It will be a nice showpiece for the upcoming conference and the big AI initiative announcement the CEO is planning.

Task 1: Review the two PDFs and decide how best to chunk up the data with a single strategy to optimally answer the variety of questions you expect to receive from people. Hint: Create a list of potential questions that people are likely to ask!

Question	Potential Theme	Definition
What measures are recommended to ensure AI systems are safe for public use?	Safe and Effective Systems	This is a core principle in the AI Bill of Rights, emphasizing the need for AI systems to be safe and effective for their intended use.
How can we evaluate the effectiveness of AI systems in real-world applications?	Safe and Effective Systems
What safeguards are proposed to prevent AI from perpetuating biases?	Algorithmic Discrimination Protections	Both documents stress the importance of preventing bias and discrimination in AI systems
How can we detect and mitigate algorithmic discrimination in AI systems?	Algorithmic Discrimination Protections
What guidelines are provided for protecting individual privacy in AI systems?	Data Privacy	This is a crucial aspect covered in the AI Bill of Rights, focusing on protecting individual privacy in AI systems
How should companies handle personal data when developing AI applications?	Data Privacy
What level of transparency is required when deploying AI systems?	Notice and Explanation	This theme relates to the principle of providing clear information about when and how AI systems are being used, as outlined in the AI Bill of Rights.
How should organizations communicate to users that they're interacting with an AI?	Notice and Explanation
In what situations should human alternatives to AI systems be mandatory?	Human Alternatives	The AI Bill of Rights emphasizes the importance of providing alternatives to AI systems when appropriate.
How can organizations balance AI automation with human oversight?	Human Alternatives
What are the key steps in assessing and mitigating risks associated with AI systems?	Risk Management	This is a central theme in the NIST AI Risk Management Framework, focusing on identifying and mitigating risks associated with AI systems.
How often should AI risk assessments be conducted?	Risk Management
What governance structures are recommended for overseeing AI development and deployment?	Governance	Both documents discuss the importance of proper governance structures for AI systems
Who should be responsible for ensuring AI systems comply with ethical guidelines?	Governance
How can organizations build public trust in their AI systems?	Trustworthiness	This is an overarching theme in both documents, emphasizing the need for AI systems to be reliable, fair, and transparent.
What metrics can be used to measure the trustworthiness of an AI application?	Trustworthiness
N/A	Unclassified	If a chunk doesn't match any predefined theme, it's added to the "Unclassified" category

✅ Deliverables:

1. Describe the default chunking strategy that you will use.

The default chunking strategy used is a combination of size-based splitting and thematic categorization. This strategy uses RecursiveCharacterTextSplitter with a chunk size of 1000 characters and an overlap of 200 characters. It then categorizes these chunks based on predefined themes.

2. Articulate a chunking strategy that you would also like to test out.

A pure size-based chunking strategy without thematic categorization. This would involve splitting the text into fixed-size chunks without attempting to categorize them based on themes.

3. Describe how and why you made these decisions

The default strategy was chosen for its simplicity and efficiency:

Size-based splitting (1000 characters) ensures manageable chunk sizes for processing and embedding.
The 200-character overlap helps maintain context between chunks.
Thematic categorization allows for organized retrieval based on specific topics of interest.

This approach balances processing efficiency with maintaining semantic coherence within chunks.

The alternative pure size-based strategy:

Ensures consistent chunk sizes, which can be beneficial for processing and embedding.
Is simpler to implement and doesn't rely on predefined themes.
May split semantic units, potentially affecting the coherence of individual chunks.'
Could be more comprehensive, including all parts of the document regardless of theme.