Sujal Bhat
task1
8744923
**Task 1: Dealing with the Data**
You identify the following important documents that, if used for context, you believe will help people understand what’s happening now:
1. 2022: Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People (PDF)
2. 2024: National Institute of Standards and Technology (NIST) Artificial Intelligent Risk Management Framework (PDF)
Your boss, the SVP of Technology, green-lighted this project to drive the adoption of AI throughout the enterprise. It will be a nice showpiece for the upcoming conference and the big AI initiative announcement the CEO is planning.
**Task 1: Review the two PDFs and decide how best to chunk up the data with a single strategy to optimally answer the variety of questions you expect to receive from people.
Hint: Create a list of potential questions that people are likely to ask!**
| Question | Potential Theme | Definition |
|----------|-----------------|------------|
| What measures are recommended to ensure AI systems are safe for public use? | Safe and Effective Systems | This is a core principle in the AI Bill of Rights, emphasizing the need for AI systems to be safe and effective for their intended use. |
| How can we evaluate the effectiveness of AI systems in real-world applications? | Safe and Effective Systems | |
| What safeguards are proposed to prevent AI from perpetuating biases? | Algorithmic Discrimination Protections | Both documents stress the importance of preventing bias and discrimination in AI systems |
| How can we detect and mitigate algorithmic discrimination in AI systems? | Algorithmic Discrimination Protections | |
| What guidelines are provided for protecting individual privacy in AI systems? | Data Privacy | This is a crucial aspect covered in the AI Bill of Rights, focusing on protecting individual privacy in AI systems |
| How should companies handle personal data when developing AI applications? | Data Privacy | |
| What level of transparency is required when deploying AI systems? | Notice and Explanation | This theme relates to the principle of providing clear information about when and how AI systems are being used, as outlined in the AI Bill of Rights. |
| How should organizations communicate to users that they're interacting with an AI? | Notice and Explanation | |
| In what situations should human alternatives to AI systems be mandatory? | Human Alternatives | The AI Bill of Rights emphasizes the importance of providing alternatives to AI systems when appropriate. |
| How can organizations balance AI automation with human oversight? | Human Alternatives | |
| What are the key steps in assessing and mitigating risks associated with AI systems? | Risk Management | This is a central theme in the NIST AI Risk Management Framework, focusing on identifying and mitigating risks associated with AI systems. |
| How often should AI risk assessments be conducted? | Risk Management | |
| What governance structures are recommended for overseeing AI development and deployment? | Governance | Both documents discuss the importance of proper governance structures for AI systems |
| Who should be responsible for ensuring AI systems comply with ethical guidelines? | Governance | |
| How can organizations build public trust in their AI systems? | Trustworthiness | This is an overarching theme in both documents, emphasizing the need for AI systems to be reliable, fair, and transparent. |
| What metrics can be used to measure the trustworthiness of an AI application? | Trustworthiness | |
| N/A | Unclassified | If a chunk doesn't match any predefined theme, it's added to the "Unclassified" category |
✅ Deliverables:
**1. Describe the default chunking strategy that you will use.**
The default chunking strategy used is a combination of size-based splitting and thematic categorization.
This strategy uses RecursiveCharacterTextSplitter with a chunk size of 1000 characters and an overlap of 200 characters. It then categorizes these chunks based on predefined themes.
**2. Articulate a chunking strategy that you would also like to test out.**
A pure size-based chunking strategy without thematic categorization. This would involve splitting the text into fixed-size chunks without attempting to categorize them based on themes.
**3. Describe how and why you made these decisions**
The default strategy was chosen for its simplicity and efficiency:
* Size-based splitting (1000 characters) ensures manageable chunk sizes for processing and embedding.
* The 200-character overlap helps maintain context between chunks.
* Thematic categorization allows for organized retrieval based on specific topics of interest.
This approach balances processing efficiency with maintaining semantic coherence within chunks.
The alternative pure size-based strategy:
* Ensures consistent chunk sizes, which can be beneficial for processing and embedding.
* Is simpler to implement and doesn't rely on predefined themes.
* May split semantic units, potentially affecting the coherence of individual chunks.'
* Could be more comprehensive, including all parts of the document regardless of theme.