Microsoft's Project Oxford machine learning Speaker and Video APIs available

Kareem Anderson

Development of Microsoft’s Project Oxford appears to be picking up steam as the company has just made its Speaker Recognition and Video APIs available in a public preview. Microsoft’s Project Oxford is the company’s specialized program looking to harness the future of Artificial Intelligence. In a much broader sense, Project Oxford represents what Microsoft believes the future of personal computing may evolve into as software for visual, auditory, or vocal inputs advances.
As for what the Speaker Recognition APIs hold, developers can look forward to stronger authentication of users by way of their voices. The API itself is not a replacement for existing authentication methods, but will serve as an enhancement to current implementations. Because each voice has unique characteristics that can be used for identification, Microsoft is looking to help developers build technology around it.

“Our goal with Speaker Recognition is to help developers build intelligent authentication mechanisms capable of balancing between convenience and fraud. Achieving such balance is no easy feat. Ideally, to establish identity, three pieces of information are needed:”

  • Something you know (password or PIN).
  • Something you have (a secure keypad, mobile device or credit card).
  • Something you are (biometrics such as voice, fingerprint, face).

Microsoft’s Speaker Recognition APIs also make use of two state-of-the-art algorithms that help recognize voices from audio streams. The new components are called Speaker Verification and Speaker Identification.

  • Speaker Verification can automatically verify and authenticate users from their voice or speech. It is tightly related to authentication scenarios and is often associated with a pass phrase. Hence, we opt for a text-dependent approach, which means speakers need to choose a specific pass phrase to use during both the enrollment and verification phases.
  • Speaker Identification can automatically identify the person speaking in an audio file given a group of prospective speakers. The input audio is compared against the provided group of speakers, and if a match is found, the speaker’s identity is returned. It is text-independent, which means that there are no restrictions on what the speaker says during the enrollment and recognition phases.
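
To make the difference between the two components concrete, here is a minimal Python sketch of one-to-one verification versus one-to-many identification. It is not Microsoft's algorithm or the Project Oxford API: the voiceprint vectors, the function names, the cosine similarity measure, and the 0.8 threshold are all assumptions chosen purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(enrolled, sample, claimed_id, threshold=0.8):
    """Verification (1:1): does the sample match the one claimed speaker?"""
    return cosine(enrolled[claimed_id], sample) >= threshold


def identify(enrolled, sample, candidates, threshold=0.8):
    """Identification (1:N): which candidate, if any, best matches the sample?"""
    best_id, best_score = None, threshold
    for speaker_id in candidates:
        score = cosine(enrolled[speaker_id], sample)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id  # None when no candidate reaches the threshold


# Hypothetical enrolled voiceprints (made-up numbers, not real audio features).
enrolled = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]

print(verify(enrolled, sample, "alice"))
print(identify(enrolled, sample, ["alice", "bob"]))
```

Verification answers a yes-or-no question about a single claimed identity, which is why it pairs naturally with a pass phrase, while identification searches a supplied group and may return no one at all.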

While Microsoft has arguably missed the mobile computing craze, the company has said it wants to be prepared for what comes after smartphones and tablets. Many in the tech industry are now discussing the prevalence of voice-enabled and predictive digital assistants in the not-too-distant future.
As more companies begin to look beyond the mobile revolution, some see Artificial Intelligence as part of the next wave of personal computing. Microsoft’s public preview of its Speaker and Video APIs is another bet the company is making on getting developers involved in its vision for the future.