Explain the data pipeline for the last AI project you worked on. What were the top challenges in getting data, and how did you resolve them?

Asked at Google in the past year

1k views

Technical

Artificial Intelligence

vparam · Answer 1 · 2024-05-24T13:29:24+0000

1. Understand the question and clarify

Can this be any type of AI project? - Yes
Is there a specific area of the AI project to focus on, such as training the model, or can I cover the entire project at a high level? - Your choice

2. Product description

This is a consumer app that assesses the severity of acne from a selfie image and provides a prognosis and a treatment recommendation and/or a referral to a specialist.

3. Key attributes

A) Train a model to rate the severity of acne through supervised learning

B) Provide a severity rating after analyzing the selfie

C) Provide a recommendation for treatment

D) Provide a daily prognosis if daily images are uploaded

4. Goal

Report the severity of acne when a selfie is uploaded

5. Prioritized attributes

A - High impact, high cost to develop the model

B - High impact, medium cost if A is developed

C - High impact, low cost if A and B are in place

6. Design

a. Data pipeline:

i. Data collection

Collected clinical pictures of acne, with the patients' approval, for training the model.

Labeled the data by distributing the pictures to 15 physicians with 5-15 years of experience, who tagged each picture with a severity rating.

ii. Data processing and cleaning

Created Python scripts to normalize the picture quality and the physician ratings.
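One way the rating normalization could look is sketched below. The answer does not say how the ratings were normalized, so this is an assumption: each physician's scores are z-scored to remove personal strictness or leniency, and the per-image consensus is the median across physicians. The function names and sample data are hypothetical.

```python
from statistics import mean, median, pstdev


def normalize_ratings(ratings):
    """Z-score each physician's ratings (assumed approach, not from the answer).

    ratings: dict mapping physician id -> {image id: raw severity score}.
    """
    normalized = {}
    for physician, scores in ratings.items():
        mu = mean(scores.values())
        sigma = pstdev(scores.values())
        # A physician who gave every image the same score carries no
        # ranking signal, so all of their scores map to 0.0.
        normalized[physician] = {
            image: (score - mu) / sigma if sigma > 0 else 0.0
            for image, score in scores.items()
        }
    return normalized


def consensus(normalized):
    """Median of the normalized scores for each image."""
    per_image = {}
    for scores in normalized.values():
        for image, z in scores.items():
            per_image.setdefault(image, []).append(z)
    return {image: median(zs) for image, zs in per_image.items()}


# Hypothetical data: dr_b is stricter than dr_a but ranks images the same way.
ratings = {
    "dr_a": {"img1": 1, "img2": 2, "img3": 3},
    "dr_b": {"img1": 3, "img2": 4, "img3": 5},
}
labels = consensus(normalize_ratings(ratings))
```

After normalization the two physicians agree exactly, even though their raw scores differ by two points on every image. The median is used here because it is less affected by a single outlying rater than the mean.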

iii. Storage and management

Stored the pictures in secure blob storage with restricted access.

iv. Model configuration and training

Applied a multi-model approach: first training a classification CNN model, then applying facial landmark detection, and finally applying a one-eye OpenCV model.
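The staged structure described above can be sketched as a chain of functions in which any stage may reject an image. This is only an illustration of the control flow: the three stage functions are hypothetical placeholders for the CNN classifier, the landmark detector, and the OpenCV model, and the dictionary keys are assumed names, not the project's actual data format.

```python
from typing import Callable, Optional

Sample = dict
# A stage returns the (possibly annotated) sample, or None to reject it.
Stage = Callable[[Sample], Optional[Sample]]


def classify_stage(sample: Sample) -> Optional[Sample]:
    # Placeholder for the CNN classifier: needs a score to continue.
    if sample.get("cnn_score") is None:
        return None
    return {**sample, "severity_class": round(sample["cnn_score"])}


def landmark_stage(sample: Sample) -> Optional[Sample]:
    # Placeholder for facial landmark detection.
    return sample if sample.get("face_found") else None


def opencv_stage(sample: Sample) -> Optional[Sample]:
    # Placeholder for the final OpenCV-based check.
    return sample if sample.get("passes_cv_check") else None


def run_pipeline(sample: Sample, stages: list):
    """Run stages in order; return (result, name of rejecting stage)."""
    for stage in stages:
        result = stage(sample)
        if result is None:
            return None, stage.__name__
        sample = result
    return sample, None


STAGES = [classify_stage, landmark_stage, opencv_stage]
good = {"cnn_score": 2.4, "face_found": True, "passes_cv_check": True}
no_face = {"cnn_score": 2.4, "face_found": False, "passes_cv_check": True}

accepted, _ = run_pipeline(good, STAGES)
rejected, rejected_by = run_pipeline(no_face, STAGES)
```

Returning the name of the rejecting stage makes it easy to measure how many images each stage filters out, which matters when judging the cost of a staged design.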

v. Fine-tuning

Applied supervised fine-tuning (SFT) to fine-tune the model.

b. Image labeler app - to label images so that training can continue on customer-uploaded images

c. Deployed through Azure containers and Kubernetes Service, making the service available to the mobile app through a web API

7. Challenges

a. The training data was limited: fewer than 5,000 images

b. Image quality was poor due to bad environments, and the human labels were noisy

8. How we resolved them

a. Worked with the data scientist to augment the CNN model with spatial sensitivity

Converted the classification model to a regression model
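A minimal sketch of why a regression target can help with noisy ordinal labels, under assumptions the answer does not state: a hypothetical 0-4 severity scale, the mean physician grade as the training target, and rounding to map a prediction back to a grade.

```python
from statistics import mean

GRADE_RANGE = (0, 4)  # assumed scale: 0 = clear skin, 4 = very severe


def regression_target(physician_grades):
    """Continuous target: keeps disagreement between raters as a fraction."""
    return mean(physician_grades)


def to_grade(prediction):
    """Map a continuous prediction back to a grade on the assumed scale."""
    low, high = GRADE_RANGE
    return max(low, min(high, round(prediction)))


def zero_one_loss(predicted_grade, true_grade):
    """Classification-style loss: every wrong grade costs the same."""
    return 0 if predicted_grade == true_grade else 1


def squared_error(prediction, target):
    """Regression-style loss: cost grows with distance from the target."""
    return (prediction - target) ** 2


target = regression_target([2, 3, 3])  # raters disagree between grades 2 and 3
near_miss = squared_error(2, 3)
far_miss = squared_error(1, 3)
```

With a classification loss, predicting grade 1 or grade 2 for a true grade 3 is equally wrong. With squared error the far miss costs four times as much as the near miss, so the model is pushed toward the right neighborhood even when individual labels are noisy.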

9. Trade-offs

a. Augmenting increased the noise in some instances

b. The multi-model approach impacted the performance of the model, as images were filtered through stages

10. Summary

Used AI to provide a sensitive medical prognosis of acne: an uploaded selfie image is classified and given a severity rating, so that young adults can decide whether an OTC treatment or a physician visit is needed. Applied a multi-model approach to achieve a high-accuracy model.

Jyotsna Kadimi · Answer 2 · 2024-01-19T18:30:02+0000

In my last AI project, we developed a business matching platform using machine learning for an open innovation services company. The goal was to provide corporates, startup ecosystems, investors, and startups personalized suggestions on who to connect with.

Data Pipeline Overview:

Data Collection: We gathered data from multiple sources, including Crunchbase and Pitchbook, and scraped public news, articles, and reports on companies, people, funding, etc. We also used Salesforce CRM, where Innovation Advisors recorded their own perspective and information on the scouted tech startups and corporate innovation challenges. These entries required proper tagging and labeling. Integrating these diverse data sources into a cohesive dataset was our primary challenge.
Data Processing and Cleaning: Using Python scripts for preprocessing, we addressed missing values, normalized data, and extracted features. Ensuring data quality was vital for the model's accuracy.
Data Storage and Management: The processed data was stored on a cloud-based platform integrated with Firebase. We used OAuth to restrict access to authenticated users, ensuring data confidentiality and security. We also set up regular backups and data redundancy strategies to prevent data loss.
Data Analysis and Modeling: We adopted a supervised learning approach to develop our recommendation engine. This involved training our models on historical data, where the outcomes of previous successful business matches were used to predict future connections. We combined content-based and collaborative filtering techniques to analyze company profiles, startup pitches, and user interactions. This approach was key in addressing the cold start problem and enhancing the relevance of our suggestions.
Deployment and Monitoring: The model was deployed on a scalable cloud infrastructure designed to accommodate increasing users and data points. Continuous monitoring, with both automated tools and manual audits by our data science team, ensured ongoing accuracy and performance.
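The combination of content-based and collaborative filtering described in the modeling step can be sketched roughly as follows. This is an illustrative toy, not the platform's actual model: the tag-based Jaccard similarity, the peer co-occurrence score, the 0.5 weight, and all names and data are assumptions. It shows one common way a content-based score can cover the cold start case, when a user has no match history.

```python
def jaccard(a, b):
    """Content-based similarity between two sets of profile tags."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def collaborative_score(user, candidate, matches):
    """Share of peers (users with a common past partner) matched to candidate.

    matches: dict mapping user -> set of partners from past successful matches.
    Returns None when there is no usable history (cold start).
    """
    mine = matches.get(user, set())
    if not mine:
        return None
    peers = [u for u, partners in matches.items() if u != user and partners & mine]
    if not peers:
        return None
    return sum(candidate in matches[u] for u in peers) / len(peers)


def hybrid_score(user, candidate, profiles, matches, weight=0.5):
    """Blend both signals; fall back to content only on cold start."""
    content = jaccard(profiles[user], profiles[candidate])
    collab = collaborative_score(user, candidate, matches)
    if collab is None:
        return content
    return weight * content + (1 - weight) * collab


# Hypothetical profiles and match history.
profiles = {
    "corp_a": {"ai", "health"},
    "corp_b": {"ai", "fintech"},
    "new_corp": {"health", "imaging"},
    "startup_x": {"ai", "health", "imaging"},
    "startup_y": {"fintech"},
}
matches = {"corp_a": {"startup_x"}, "corp_b": {"startup_x", "startup_y"}}

cold_start = hybrid_score("new_corp", "startup_x", profiles, matches)
blended = hybrid_score("corp_a", "startup_y", profiles, matches)
```

Here `new_corp` has no history, so its score for `startup_x` comes from tag overlap alone. For `corp_a`, the tags say `startup_y` is unrelated, but a peer with a shared partner matched with it, so the blended score is above zero.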

Top Challenges and Resolutions:

Data Integration: For integrating data from various sources like Crunchbase, Pitchbook, and Salesforce CRM, we utilized Google Cloud's ELT capabilities. This approach allowed us to efficiently manage large volumes of data, ensuring seamless integration and processing within our cloud-based infrastructure.
Real-Time Data Updates: Ensuring timely updates of our recommendations was a priority. We achieved this through efficient data processing methods, allowing for prompt updates of the platform's recommendations without necessarily implementing a real-time stream-processing framework.
User Feedback Incorporation: We initially overlooked the integration of user feedback into the model. Subsequently, we introduced a mechanism for users to star rate the relevance of our suggestions and write feedback, which was crucial for the iterative improvement of our recommendation algorithms.
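To make the data integration challenge more concrete, here is a small sketch of merging company records from several sources. The answer says the team used Google Cloud's ELT capabilities; the pure-Python version below is only an illustration of the underlying idea, and the name-normalization rule, the field names, and the source priority order are all assumptions.

```python
import re


def company_key(name):
    """Normalize a company name so that spelling variants share one key."""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    name = re.sub(r"\b(inc|llc|ltd|corp)\b", "", name)
    return " ".join(name.split())


def merge_sources(sources):
    """Merge records by normalized name.

    sources: list of (source_name, records), highest priority first.
    The first non-missing value seen for a field wins.
    """
    merged = {}
    for source_name, records in sources:
        for record in records:
            target = merged.setdefault(company_key(record["name"]), {"sources": []})
            target["sources"].append(source_name)
            for field, value in record.items():
                if value is not None and field not in target:
                    target[field] = value
    return merged


# Hypothetical records for the same company from two sources.
crm = [{"name": "Acme, Inc.", "notes": "met at demo day", "funding": None}]
crunchbase = [{"name": "ACME Inc", "funding": 5_000_000}]

merged = merge_sources([("crm", crm), ("crunchbase", crunchbase)])
```

The CRM record is listed first, so its fields take priority, while the funding figure it lacks is filled in from the second source. Real entity resolution usually needs more than name matching (domains, identifiers, fuzzy comparison), so this should be read as a starting point only.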

Reflection:

This project highlighted the importance of agile data management and flexible architecture, especially in dynamic business environments. Close collaboration with the Innovation Advisors provided invaluable domain insights, enhancing our data labeling and feature engineering efforts. This experience deepened my understanding of developing secure, user-centric AI solutions that leverage supervised learning to deliver real-time, impactful results.