While it is possible to listen to recordings made by Amazon Echo devices (Alexa), it has been impossible to download them. Well, now it is possible.
Introduction
I have 2 Amazon Echo devices (one was given to my daughter, who goes to college in another country). Amazon has recently confirmed that the voice recordings produced by customers of the Amazon Alexa smart assistant are held forever unless users manually remove them. After looking into this, I tried to find a way to download my data. Filing a formal request with Amazon led to an email "approving" my request; however, none of my recordings were included in the data. After inquiring with customer service, I was told that users can only listen to or delete their recordings and that there is no option to download them. In other words, if you use an Amazon Alexa device, Amazon holds all your recording files but you can't get them. Well, now you can, with the Python script we developed that does exactly that.
Background
If you have an Alexa device, you can just go to https://alexa.amazon.com/ and then press Settings, then History.
You will then be able to view any interaction you had with Alexa. That includes unsuccessful ones, such as when Alexa didn't understand you, or when she recorded a private conversation not meant for her (and that happens from time to time). These entries can be expanded, and in most cases they contain a small "Play" icon which you can press to hear the conversation. There is no option to download these recordings. You can, however, delete them.
I wasn't interested in deleting them because I think it's quite cute to be able to listen to all kinds of conversations that Alexa heard and documented (almost everything). What bothered me was my inability to download these recordings. The Python script we developed does that job, and it gives each recording a logical file name based on the date, the time and the conversation title.
Here is what it looks like.
Preparing the Python Environment
Preparing for the rest of the process requires installing Python and several libraries.
Downloading and Installing Python
- Use the following link to download.
- After installing it, add the path to the installation location to the PATH environment variable.
The default location will be:
- C:\Users\<Your user name>\AppData\Local\Programs\Python\Python37-32
You can add this entry to PATH using the following command:
setx path "%path%;c:\Users\<YOUR USER NAME>\AppData\Local\Programs\Python\Python37-32"
- Open Command Line (CMD) and enter the following line:
python -m pip install selenium pygithub requests webdriver_manager
You may get a warning that the Python Scripts folder is not on PATH. To ensure our script will run smoothly, add the following entry:
setx path "%path%;c:\Users\<YOUR USER NAME>\AppData\Local\Programs\Python\Python37-32\Scripts"
The pip command above installs the following packages:
- Selenium - used for automation in general
- PyGithub - used for interfacing with the GitHub API
- Requests - used for HTTP communication
- webdriver_manager - Python Webdriver Manager. Used to access various web browsers.
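Before running the script, it can help to confirm that the four packages are importable. The following is a minimal sketch, not part of alexa.py: the check_modules helper is a hypothetical name, and the only library fact it relies on is that PyGithub is imported under the module name github.

```python
import importlib.util

# Import names differ from the pip package name in one case:
# PyGithub is imported as "github".
REQUIRED_MODULES = ["selenium", "github", "requests", "webdriver_manager"]


def check_modules(names=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name}: {'OK' if found else 'MISSING'}")
```

If any entry is reported as MISSING, rerunning the pip command above from a new Command Prompt window is the first thing to try.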
The credentials.py File
We use a separate file where you can enter your Amazon credentials so the script can automatically log in to your account.
class Credentials:
    email = '*****'
    password = '******'
Running the Script
Type:
python alexa.py
How It Works
The script works as follows:
Logging in to Alexa
The following function is used to log in to your Alexa history through your Amazon account:
def amazon_login(driver, date_from, date_to):
    driver.implicitly_wait(5)
    logger.info("GET https://alexa.amazon.com/spa/index.html")
    driver.get('https://alexa.amazon.com/spa/index.html')
    sleep(4)
    url = driver.current_url
    # First login page: email and password on the same form.
    if 'signin' in url:
        logger.info("Got login page: logging in...")
        check_field = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, 'ap_email')))
        email_field = driver.find_element_by_id('ap_email')
        email_field.clear()
        email_field.send_keys(Credentials.email)
        check_field = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, 'ap_password')))
        password_field = driver.find_element_by_id('ap_password')
        password_field.clear()
        password_field.send_keys(Credentials.password)
        check_field = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, 'signInSubmit')))
        submit = driver.find_element_by_id('signInSubmit')
        submit.click()
    driver.get('https://www.amazon.com/hz/mycd/myx#/home/alexaPrivacy/'
               'activityHistory&all')
    sleep(4)
    # Amazon may ask for the credentials again before showing the history.
    if 'signin' in driver.current_url:
        logger.info("Got confirmation login page: logging in...")
        try:
            check_field = WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.ID, 'ap_email')))
            email_field = driver.find_element_by_id('ap_email')
            email_field.clear()
            email_field.send_keys(Credentials.email)
            check_field = WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.ID, 'continue')))
            submit = driver.find_element_by_id('continue')
            submit.click()
            sleep(1)
        except Exception:
            # The email step is optional; continue with the password field.
            pass
        check_field = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, 'ap_password')))
        password_field = driver.find_element_by_id('ap_password')
        password_field.clear()
        password_field.send_keys(Credentials.password)
        check_field = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, 'signInSubmit')))
        submit = driver.find_element_by_id('signInSubmit')
        submit.click()
        sleep(3)
        logger.info("GET https://www.amazon.com/hz/mycd/myx#/home/alexaPrivacy/"
                    "activityHistory&all")
        driver.get('https://www.amazon.com/hz/mycd/myx#/home/alexaPrivacy/'
                   'activityHistory&all')
    # Open the date-range dropdown and pick "Custom" or "All".
    check = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located(
            (By.CLASS_NAME, "a-dropdown-prompt")))
    history = driver.find_elements_by_class_name('a-dropdown-prompt')
    history[0].click()
    check = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located(
            (By.CLASS_NAME, "a-dropdown-link")))
    all_hist = driver.find_elements_by_class_name('a-dropdown-link')
    for link in all_hist:
        if date_from and date_to:
            if 'Custom' in link.text:
                link.click()
                from_d = driver.find_element_by_id('startDateId')
                from_d.clear()
                # Use the requested range rather than hard-coded dates.
                from_d.send_keys(date_from)
                sleep(1)
                to_d = driver.find_element_by_id('endDateId')
                to_d.clear()
                to_d.send_keys(date_to)
                subm = driver.find_element_by_id('submit')
                subm.click()
        elif 'All' in link.text:
            link.click()
Enabling Downloads
The following function enables downloads:
def enable_downloads(driver, download_dir):
    driver.command_executor._commands["send_command"] = (
        "POST", '/session/$sessionId/chromium/send_command')
    params = {'cmd': 'Page.setDownloadBehavior',
              'params': {'behavior': 'allow', 'downloadPath': download_dir}}
    command_result = driver.execute("send_command", params)
Initializing the Driver
The following function initializes the Chrome driver.
def init_driver():
    logger.info("Starting chromedriver")
    chrome_options = Options()
    chrome_options.add_argument("user-data-dir=selenium")
    chrome_options.add_argument("start-maximized")
    chrome_options.add_argument("--disable-infobars")
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--remote-debugging-port=4444')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument("--mute-audio")
    path = os.path.dirname(os.path.realpath(__file__))
    if not os.path.isdir(os.path.join(path, 'audios')):
        os.mkdir(os.path.join(path, 'audios'))
    chrome_options.add_experimental_option("prefs", {
        "download.default_directory": os.path.join(path, 'audios'),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    })
    try:
        driver = webdriver.Chrome(
            executable_path=ChromeDriverManager().install(),
            options=chrome_options, service_log_path='NUL')
    except ValueError:
        logger.critical("Error opening Chrome. Chrome is not installed?")
        exit(1)
    driver.implicitly_wait(10)
    enable_downloads(driver, os.path.join(path, 'audios'))
    return driver
Downloading the Contents of a Page
For each page, we fetch all recordings and download them. Since there is no direct way to download these recordings, only to play them, this is where we hack a bit...
We basically extract an ID attribute which then becomes part of the download link.
The ID attribute looks approximately like this (it can vary):
audio-Vox:1.0/2019/10/27/21/1d2110cb8eb54f3cb6
In this example, 2019/10/27/21 is the date stamp, and the ID (with its audio- prefix replaced by id=) is appended to the link used for downloading this specific audio recording.
We also use additional info stored in the element with class summaryCss.
If there is no additional info, the recording is named 'Audio could not be understood'.
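To make the mapping from ID to date stamp and download link concrete, here is a small standalone sketch. The function name id_to_date_and_link is hypothetical; the regular expression idea and the playOption URL are taken from the parse_page function in this article, and the real IDs may vary from the sample.

```python
import re

# Base URL as used by parse_page in the article.
PLAY_URL = "https://www.amazon.com/hz/mycd/playOption?"


def id_to_date_and_link(audio_id):
    """Split an audio element ID into a date stamp and a download URL.

    Returns (date_stamp, url); date_stamp is None when the ID carries no date.
    """
    match = re.search(r"/(\d+/\d+/\d+/\d+)/", audio_id)
    date_stamp = match.group(1).replace("/", "-") if match else None
    # "audio-Vox:1.0/..." becomes the query string "id=Vox:1.0/..."
    url = PLAY_URL + audio_id.replace("audio-", "id=", 1)
    return date_stamp, url


sample = "audio-Vox:1.0/2019/10/27/21/1d2110cb8eb54f3cb6"
print(id_to_date_and_link(sample))
```

When the ID has no date part, the script falls back to the date text shown next to the recording, which is what the nested try/except blocks in parse_page handle.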
def parse_page(driver):
    links = []
    check = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
        (By.CLASS_NAME, "mainBox")))
    boxes = driver.find_elements_by_class_name('mainBox')
    for box in boxes:
        non_voice = box.find_elements_by_class_name('nonVoiceUtteranceMessage')
        if non_voice:
            logger.info('Non-voice file. Skipped.')
            continue
        non_text = box.find_elements_by_class_name('textInfo')
        if non_text:
            if 'No text stored' in non_text[0].text:
                logger.info("Non-voice file. Skipped.")
                continue
        check = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
            (By.TAG_NAME, "audio")))
        audio_el = box.find_elements_by_tag_name('audio')
        for audio in audio_el:
            try:
                attr = audio.get_attribute('id')
                get_name = box.find_elements_by_class_name('summaryCss')
                if not get_name:
                    get_name = 'Audio could not be understood'
                else:
                    get_name = get_name[0].text
                check = WebDriverWait(driver, 30).until(
                    EC.presence_of_element_located((By.CLASS_NAME, "subInfo")))
                subinfo = box.find_elements_by_class_name('subInfo')
                time = subinfo[0].text
                get_date = re.findall(r'\/(\d+\/\d+\/\d+\/\d+)\/', attr)
                try:
                    get_date = get_date[0].strip().replace('/', '-')
                except IndexError:
                    try:
                        get_date = re.findall(
                            r'On\s(.*?)\s(\d{1,2})\,\s(\d{4})', time)
                        month = get_date[0][0]
                        new = month[0].upper() + month[1:3].lower()
                        month = strptime(new,'%b').tm_mon
                        get_date = f"{get_date[0][2]}-{month}-{get_date[0][1]}"
                    except IndexError:
                        get_date = re.findall(r'(.*?)\sat', time)
                        day = get_date[0]
                        if 'Yesterday' in day:
                            day = datetime.now() - timedelta(days=1)
                            day = str(day.day)
                        elif 'Today' in day:
                            day = str(datetime.now().day)
                        day = day if len(day) == 2 else '0'+day
                        curr_month = str(datetime.now().month)
                        curr_month = curr_month if len(
                            curr_month) == 2 else '0'+curr_month
                        curr_year = datetime.now().year
                        get_date = f"{curr_year}-{curr_month}-{day}"
                find_p0 = time.find('at')
                find_p1 = time.find('on')
                get_time = time[find_p0+2:find_p1-1].replace(':', '-')
                device = time[find_p1:]
                get_name = get_name
                name = f"{get_date} {get_time} {get_name} {device}"
                name = re.sub(r'[^\w\-\(\) ]+', '', name)
                for link in links:
                    if name == link[1]:
                        name += ' (1)'
                        break
                dup = 1
                while dup <= 3:
                    for link in links:
                        if name == link[1]:
                            name = name.replace(f"({dup})", f"({dup+1})")
                    dup += 1
                print("_"*80)
                logger.info(f"Found: {attr}\n{name}")
                if not os.path.isfile(os.path.join('audios', name+'.wav')):
                    if not '/' in attr:
                        logger.info(
                            "ID attribute was not found. Playing the file.")
                        play_icon = box.find_elements_by_class_name(
                            'playButton')
                        get_onclick = play_icon[0].get_attribute('onclick')
                        driver.execute_script(get_onclick)
                        sleep(8)
                        get_source = box.find_elements_by_tag_name('source')
                        src = get_source[0].get_attribute('src')
                        if 'https' in src:
                            links.append([src, name])
                        else:
                            logger.critical(
                                "Link was not found after playing the file. "
                                "Item skipped.")
                    else:
                        if attr.replace('audio-', ''):
                            attr = attr.replace('audio-', 'id=')
                            links.append([
                                'https://www.amazon.com/hz/mycd/playOption?'+attr,
                                name])
                else:
                    logger.info(f"File exists; passing: {name}.wav")
            except Exception:
                logger.critical(traceback.format_exc())
                logger.critical("Item failed; passing")
                continue
    return links
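The file-naming step inside parse_page can also be looked at in isolation. The following is a minimal sketch and not the script's own code: make_file_name and the taken set are hypothetical names, and the sample values in the usage line are made up. Only the character filter is copied from the function above. The sketch numbers duplicates with a simple counter instead of the string replacement used in the script.

```python
import re


def make_file_name(date, time_of_day, title, device, taken):
    """Build a file-system-safe name that is unique within `taken`."""
    raw = f"{date} {time_of_day} {title} {device}"
    # Keep word characters, hyphens, parentheses and spaces only.
    base = re.sub(r"[^\w\-\(\) ]+", "", raw)
    name, counter = base, 0
    # Append " (1)", " (2)", ... until the name is not already used.
    while name in taken:
        counter += 1
        name = f"{base} ({counter})"
    taken.add(name)
    return name


taken = set()
print(make_file_name("2019-10-27-21", "9-15 PM", "what's up?", "on Echo", taken))
print(make_file_name("2019-10-27-21", "9-15 PM", "what's up?", "on Echo", taken))
```

Using a set for the names already taken keeps the duplicate check independent of how many recordings share the same title.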
Our Main Function
Our main function connects to the Amazon account using the Credentials class and goes to Alexa's history. It then creates a wide query that covers the entire history from day one (or, when the -f and -t options are given, only the requested date range). Finally, it emulates what a human would do to play each recording (which is allowed by Amazon), locates the audio file used for this playback, and downloads it.
def main():
    ap = ArgumentParser()
    ap.add_argument(
        "-f", "--date_from", required=False,
        help=("Seek starting from date MM/DD/YYYY.")
    )
    ap.add_argument(
        "-t", "--date_to", required=False,
        help=("Seek until date MM/DD/YYYY.")
    )
    args = vars(ap.parse_args())
    if args["date_from"] and not args["date_to"]:
        args["date_to"] = str(datetime.now().month) +'/'+ str(datetime.now(
            ).day) +'/'+ str(datetime.now().year)
    if args["date_to"] and not args["date_from"]:
        logger.critical("You haven't specified beginning date. Use -f option.")
        exit(1)
    sys_sleep = None
    sys_sleep = WindowsInhibitor()
    logger.info("System inhibited.")
    sys_sleep.inhibit()
    driver = init_driver()
    while True:
        try:
            amazon_login(driver, args["date_from"], args["date_to"])
            break
        except TimeoutException:
            logger.critical("Timeout exception. No internet connection? "
                            "Retrying...")
            sleep(10)
            continue
    failed_page_attempt = 0
    while True:
        logger.info("Parsing links...")
        driver.implicitly_wait(2)
        try:
            links = parse_page(driver)
            failed_page_attempt = 0
        except TimeoutException:
            logger.critical(traceback.format_exc())
            if failed_page_attempt <= 3:
                logger.critical("No Internet connection? Retrying...")
                logger.critical(f"Attempt #{failed_page_attempt}/3")
                sleep(5)
                failed_page_attempt += 1
                continue
            else:
                failed_page_attempt = 0
                logger.critical("Trying to re-render page...")
                driver.execute_script('getPreviousPageItems()')
                sleep(5)
                driver.execute_script('getNextPageItems()')
                continue
        logger.info(f"Total files to download: {len(links)}")
        for item in links:
            fetch(driver, item)
        failed_button_attempt = 0
        while True:
            try:
                check_btn = WebDriverWait(driver, 30).until(
                    EC.presence_of_element_located((By.ID, 'nextButton')))
                failed_button_attempt = 0
                break
            except TimeoutException:
                if failed_button_attempt <= 3:
                    logger.critical(
                        "Timeout exception: next button was not found. "
                        "No Internet connection? Waiting and retrying...")
                    logger.critical(f"Attempt #{failed_button_attempt}/3")
                    sleep(10)
                    failed_button_attempt += 1
                    continue
                else:
                    failed_button_attempt = 0
                    logger.critical("Trying to re-render page...")
                    driver.execute_script('getPreviousPageItems()')
                    sleep(5)
                    driver.execute_script('getNextPageItems()')
                    continue
        nextbtn = driver.find_element_by_id('nextButton').get_attribute('class')
        if 'navigationAvailable' in nextbtn:
            driver.implicitly_wait(10)
            while True:
                try:
                    logger.info("Next page...")
                    driver.find_element_by_id('nextButton').click()
                    break
                except:
                    logger.critical("Unable to click the next button. "
                                    "Waiting and retrying...")
                    sleep(10)
                    continue
            continue
        else:
            break
    driver.close()
    driver.quit()
    if args['date_from']:
        logger.info('All done. Press Enter to exit.')
        i = input()
    else:
        logger.info("All done. Exit.")
    logger.info("System uninhibited.")
    sys_sleep.uninhibit()
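The main function relies on a WindowsInhibitor class that is not listed in this article. As a rough idea of what such a class might do, here is a hypothetical stand-in named SleepInhibitor. It assumes the goal is to keep Windows from going to sleep during a long download, and it uses the Win32 SetThreadExecutionState function for that. On other platforms, it does nothing and returns False.

```python
import ctypes
import sys

# Flags defined for the Win32 SetThreadExecutionState function.
ES_CONTINUOUS = 0x80000000
ES_SYSTEM_REQUIRED = 0x00000001


class SleepInhibitor:
    """Hypothetical stand-in for the article's WindowsInhibitor class."""

    def __init__(self):
        self.supported = sys.platform == "win32"

    def inhibit(self):
        """Ask Windows to stay awake; return False on other platforms."""
        if not self.supported:
            return False
        ctypes.windll.kernel32.SetThreadExecutionState(
            ES_CONTINUOUS | ES_SYSTEM_REQUIRED)
        return True

    def uninhibit(self):
        """Clear the request so the system can sleep again."""
        if not self.supported:
            return False
        ctypes.windll.kernel32.SetThreadExecutionState(ES_CONTINUOUS)
        return True
```

Calling inhibit() before the download loop and uninhibit() at the very end mirrors the order used in main.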
Fetching a Selected Date Range
It is also possible to fetch recordings from a given date range.
To do so, use the -f and -t options, which specify the start date and the end date, e.g.:
python alexa.py -f 11/03/2019 -t 11/05/2019
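The option handling can be tried out separately from the browser automation. The sketch below is an assumption-laden variant, not the code in main: parse_date_range and its today parameter are hypothetical, it fills in a missing end date with strftime (which zero-pads the month and day), and it additionally rejects dates that are not in MM/DD/YYYY form.

```python
from argparse import ArgumentParser
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"


def parse_date_range(argv=None, today=None):
    """Return (date_from, date_to) as MM/DD/YYYY strings, or (None, None)."""
    ap = ArgumentParser()
    ap.add_argument("-f", "--date_from",
                    help="Seek starting from date MM/DD/YYYY.")
    ap.add_argument("-t", "--date_to", help="Seek until date MM/DD/YYYY.")
    args = ap.parse_args(argv)
    if args.date_to and not args.date_from:
        ap.error("You haven't specified a beginning date. Use the -f option.")
    if args.date_from and not args.date_to:
        # Default the end of the range to today.
        args.date_to = (today or datetime.now()).strftime(DATE_FORMAT)
    # Validate the format early; strptime raises ValueError on bad input.
    for value in (args.date_from, args.date_to):
        if value:
            datetime.strptime(value, DATE_FORMAT)
    return args.date_from, args.date_to


print(parse_date_range(["-f", "11/03/2019", "-t", "11/05/2019"]))
```

Passing only -f therefore fetches everything from that date up to the current day, which matches the behavior of main.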
Points of Interest
From time to time, Amazon might block the activity after identifying a mass download. In that case, our script just waits and then resumes. Here is the code that does that: we check whether the destination file is valid, and if it is not (its size is 0 bytes), we retry.
if os.path.isfile(os.path.join('audios', name+'.wav')):
    if os.stat(os.path.join('audios', name+'.wav')).st_size == 0:
        logger.info("File size is 0. Retrying.")
        sleep(3)
        continue
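The fragment above lives inside a loop in the script's fetch function, which is not listed here. The following self-contained sketch shows the same wait-and-retry idea under stated assumptions: download_with_retry is a hypothetical helper, the download argument stands in for whatever actually saves the file, and the attempt limit is an addition that the original fragment does not show.

```python
import os
from time import sleep


def download_with_retry(download, path, attempts=5, delay=3):
    """Call download(path) until the file at `path` exists and is non-empty.

    `download` is a hypothetical callable standing in for the script's fetch
    step. Returns True on success, False once `attempts` is exhausted.
    """
    for attempt in range(1, attempts + 1):
        download(path)
        if os.path.isfile(path) and os.stat(path).st_size > 0:
            return True
        print(f"Attempt {attempt}/{attempts}: file missing or empty. Retrying.")
        sleep(delay)
    return False
```

Capping the number of attempts avoids looping forever if Amazon keeps returning empty files for a particular recording.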
After all, if Amazon stores our personal recordings, why can't we?
History
- 7th November, 2019: Initial version