Use AzCopy to migrate files from AWS S3 to Azure Storage

In this post, I talk about my experience using the AzCopy tool to migrate files from Amazon S3 to Azure Storage.

At the risk of upsetting Jeff Bezos, I recently moved a few million PDF files from Amazon S3 to Azure Storage (Blob Storage, in fact). I kept it simple and opted to use Microsoft's AzCopy tool. It's a command-line tool that allows you to copy blobs or files from or to an Azure Storage account. AzCopy also integrates with the Azure Storage Explorer client application, but using a UI wasn't ideal with the number of files I had.

In this post, I'd like to show an authorization "gotcha" to keep in mind and a few things I learned that might help you.

Authorize AzCopy to access your Azure Storage account and Amazon S3

After you install AzCopy—the download is a .zip or .tar file, depending on your environment—you'll need to let AzCopy know that you're authorized to access the Amazon S3 and Azure Storage resources.

From the Azure side, you can elect to use Azure Active Directory (AD) or a SAS token. I ran azcopy login to log in to Azure with my credentials. If you want to run AzCopy inside a script or have more advanced use cases, it's a good idea to authorize a managed identity instead.
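If it helps, here's a minimal sketch of both options (the client ID is a placeholder for a user-assigned managed identity):

azcopy login
azcopy login --identity
azcopy login --identity --identity-client-id "<client-id>"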

With ownership access to the storage account, I thought that was all I needed. Not so fast!

You will also need your identity to have one of these Azure roles assigned on the storage account:

  • Storage Blob Data Reader (downloads only)
  • Storage Blob Data Contributor
  • Storage Blob Data Owner

Note: Even if you're an Owner on the storage account, you still need one of those role assignments.

You'll need to grab an Access Key ID and AWS Secret Access Key from Amazon Web Services. If you're not sure how to retrieve those, check out the AWS docs.

From there, it's as easy as setting a few environment variables (I'm using Windows):

  • set AWS_ACCESS_KEY_ID=<my_key>
  • set AWS_SECRET_ACCESS_KEY=<my_secret_key>
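On macOS or Linux, the equivalent is to export the same variables:

export AWS_ACCESS_KEY_ID=<my_key>
export AWS_SECRET_ACCESS_KEY=<my_secret_key>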

Copy an AWS bucket directory to Azure Storage

I needed to copy all the files from a public AWS directory with the pattern /my-bucket/dir/dir/dir/dir/ to a public Azure Storage container. To do that, I called azcopy like so:

azcopy "https://s3.amazonaws.com/my-bucket/dir/dir/dir/dir/*" "https://mystorageaccount.blob.core.windows.net/mycontainer" --recursive=true

This command allowed me to take anything under the directory while also keeping the file structure from the S3 bucket. I knew that it was all PDF files, but I could have also used the --include-pattern flag like this:

azcopy "https://s3.amazonaws.com/my-bucket/dir/dir/dir/dir/*" "https://mystorageaccount.blob.core.windows.net/mycontainer" --include-pattern "*.pdf" --recursive=true

There's a lot of flexibility here—you can specify multiple complete file names, use wildcard characters (I could have matched multiple file types here), and even filter by file modified date. I might need to be more selective in the future, so I was happy to see all the options at my disposal.
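For example, AzCopy accepts multiple patterns separated by semicolons, so a sketch that grabs a few file types in one pass might look like this:

azcopy copy "https://s3.amazonaws.com/my-bucket/dir/dir/dir/dir/*" "https://mystorageaccount.blob.core.windows.net/mycontainer" --include-pattern "*.pdf;*.docx;*.png" --recursive=true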

Resuming jobs

If you run AzCopy for a while, you might have to deal with a stopped job, whether from transfer failures or a system reboot. To start where you left off, you can run azcopy jobs list to get a list of your jobs in this format:

Job Id: <some-guid>
Start Time: <when-the-job-started>
Status: Cancelled | Completed | Failed
Command: copy "source" "destination" --any-flags

With the correct job ID in hand, I could run the following command to pick up where I left off:

azcopy jobs resume <job-id>

If you need to get to the bottom of any errors, you can change the log level (the default is INFO) and filter for jobs in a Failed state. AzCopy creates log and plan files for every job you run in the %USERPROFILE%\.azcopy directory on Windows.
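As a sketch, filtering for failed jobs and quieting a future run's logs looks something like this:

azcopy jobs list --with-status=Failed
azcopy copy "<source>" "<destination>" --log-level=ERROR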

After you finish, you can clean up all your plan and log files by executing azcopy jobs clean (or azcopy jobs rm <job-id> if you want to remove just one).

Performance optimization tips

Microsoft recommends that individual jobs contain no more than 10 million files. Jobs that transfer more than 50 million files can suffer from degraded performance because of the tracking overhead. I didn't need to worry about performance, but I still learned a few valuable things.

To speed things up, you can increase the number of concurrent requests by setting the AZCOPY_CONCURRENCY_VALUE environment variable. By default, AzCopy sets the value to 16 multiplied by the number of CPUs on your machine; if you have fewer than 5 CPUs, the value is set to 32. Because I have 12 CPUs, AzCopy set AZCOPY_CONCURRENCY_VALUE to 192.
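Overriding it is just another environment variable; the docs also describe an AUTO setting that lets AzCopy tune the value itself:

set AZCOPY_CONCURRENCY_VALUE=256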

If you'd like to confirm, you can look at the top of your job's log file.

2021/11/19 16:39:20 AzcopyVersion  10.13.0
2021/11/19 16:39:20 OS-Environment  windows
2021/11/19 16:39:20 OS-Architecture  amd64
2021/11/19 16:39:20 Log times are in UTC. Local time is 19 Nov 2021 10:39:20
2021/11/19 16:39:20 Job-Command copy https://mystorageaccount.blob.core.windows.net/my-container --recursive=true
2021/11/19 16:39:20 Number of CPUs: 12
2021/11/19 16:39:20 Max file buffer RAM 6.000 GB
2021/11/19 16:39:20 Max concurrent network operations: 192 (Based on number of CPUs. Set AZCOPY_CONCURRENCY_VALUE environment variable to override)
2021/11/19 16:39:20 Check CPU usage when dynamically tuning concurrency: true (Based on hard-coded default. Set AZCOPY_TUNE_TO_CPU environment variable to true or false override)
2021/11/19 16:39:20 Max concurrent transfer initiation routines: 64 (Based on hard-coded default. Set AZCOPY_CONCURRENT_FILES environment variable to override)
2021/11/19 16:39:20 Max enumeration routines: 16 (Based on hard-coded default. Set AZCOPY_CONCURRENT_SCAN environment variable to override)
2021/11/19 16:39:20 Parallelize getting file properties (file.Stat): false (Based on AZCOPY_PARALLEL_STAT_FILES environment variable)

You can tweak these values to see what works for you. Luckily, AzCopy allows you to run benchmark tests that will report a recommended concurrency value.
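Here's a rough sketch of a benchmark run against a placeholder container; --file-count and --size-per-file override the defaults, and AzCopy uploads auto-generated test data that it deletes by default when the run finishes:

azcopy bench "https://mystorageaccount.blob.core.windows.net/mycontainer" --file-count 500 --size-per-file 4M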

Wrap up

This was my first time using AzCopy for any serious work, and I had a good experience. It comes with a lot more flexibility than I imagined and even has features for limiting throughput and optimizing memory use.

To get started, check out Microsoft's AzCopy documentation, and let me know what you think of it!

