Anyone who has pawed at LlamaIndex has surely bumped into its data loaders! One such loader, YoutubeTranscriptReader from LlamaHub (with youtube_transcript_api as its dependency), provides an easy interface for fetching the text transcripts of YouTube videos, which we can then index and query.
from llama_hub.youtube_transcript import YoutubeTranscriptReader
loader = YoutubeTranscriptReader()
documents = loader.load_data(ytlinks=['https://www.youtube.com/watch?v=bSHp7WVpPgc'])
Sounds simple, right? Then you decide to create a VectorStoreIndex from the documents created above.
But if you don’t have an OpenAI API key set, you would see something like:
ValueError: No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment
variable or openai.api_key prior to initialization.....
Ah, that’s expected, as LlamaIndex defaults to OpenAI. Then you would probably think: let me use another LLM, say Vertex:
from llama_index.llms.vertex import Vertex
llm = Vertex(model="text-bison", temperature=0, additional_kwargs={})
So, we create a new ServiceContext and pass it to VectorStoreIndex.from_documents:
from llama_index.llms.vertex import Vertex
from llama_index import ServiceContext
llm = Vertex(model="text-bison", temperature=0, additional_kwargs={})
service_context = ServiceContext.from_defaults(
llm=llm,
chunk_size=800,
chunk_overlap=20)
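A quick aside on what chunk_size and chunk_overlap control: each document is split into chunks before embedding, and consecutive chunks share a small overlap so text at a boundary isn’t cut off from its context. Here is a naive character-level sketch of the idea (LlamaIndex’s actual splitter is token-based and sentence-aware, so this is only an illustration):

```python
def chunk(text, chunk_size=800, chunk_overlap=20):
    # Advance by chunk_size - chunk_overlap so consecutive
    # chunks share chunk_overlap characters at the boundary.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk("x" * 2000)
print(len(chunks))                        # 3
print(chunks[0][-20:] == chunks[1][:20])  # True: the 20-char overlap
```

A 2000-character text with these settings yields three chunks, each starting 780 characters after the previous one.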
It should work fine, right? Nope, we notice:
ValueError: No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/llama_index/embeddings/utils.py in resolve_embed_model(embed_model)
48 validate_openai_api_key(embed_model.api_key)
49 except ValueError as e:
---> 50 raise ValueError(
51 "\n******\n"
52 "Could not load OpenAI embedding model. "
That’s because we aren’t setting an embed model, and it’s defaulting to OpenAI. We should either configure another embed_model in the service context or provide an OpenAI key!
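If you would rather stick with the OpenAI defaults, the quick fix is simply to export the key before building the index (the value below is a placeholder, not a real key):

```python
import os

# Placeholder; use your own key from https://platform.openai.com/account/api-keys
os.environ["OPENAI_API_KEY"] = "sk-..."
```

But since we are deliberately going OpenAI-free here, let’s configure a local embed model instead.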
So, here we go:
service_context = ServiceContext.from_defaults(
llm=llm,
chunk_size=800,
chunk_overlap=20,
embed_model="local:BAAI/bge-base-en-v1.5")
from llama_index import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
Voilà, it works! Now we can query the document:
engine = index.as_query_engine()
response = engine.query("Summarize the video")
print(response)
Would output:
Hemanth HM, a Google developer expert,
shares his journey into coding and his passion for open-source software.
He explains the concept of open-source using the example of traditional
Indian recipes and emphasizes the importance of free and open code.
Hemanth also discusses how he manages to balance his full-time
work with his contributions to open-source projects,
comparing it to a daily routine like taking a shower.
For those aspiring to become Google developer experts,
Hemanth suggests actively contributing to the community,
experimenting with cutting-edge technologies,
and sharing knowledge through blogs, articles, tweets, and GIFs.
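Why does the embed_model matter so much? The index stores a vector for every chunk, and a query is embedded with the same model and matched against those vectors, typically by cosine similarity. A toy sketch with made-up 2-d vectors (real bge-base-en-v1.5 embeddings are 768-dimensional):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up vectors standing in for chunk embeddings.
chunks = {"open-source recipes": [0.9, 0.1], "daily shower routine": [0.2, 0.8]}
query = [0.85, 0.2]  # pretend this embeds "Summarize the open-source part"
best = max(chunks, key=lambda c: cosine(query, chunks[c]))
print(best)  # open-source recipes
```

The same mechanism runs inside VectorStoreIndex, which is why the query and the documents must be embedded by one and the same model.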
embed_model can be any Hugging Face embedding model! The crux here is to pair an embed_model with the llm of your choice in the ServiceContext. Hope this was useful, happy hacking!