Doc management — EFS vs EBS
We had three months to decide whether to migrate from EBS to EFS or S3. This post covers why we moved and how.
Platform Layout:
We had a COTS product deployed on EBS to support file management. There were a couple of reasons for that choice: timelines were tight, and we did not yet have an estimate of usage or a future capacity assessment.
We were rolling out a global platform across 15+ countries with about 50,000 consistent daily users. Up to 80% of usage was uploads/downloads, files could be up to 10 GB, and we needed a performant system (<5 seconds SLA for uploads, downloads, or any reads). Note: the file system is exposed only through the application, with no direct access.
Challenges with EBS for Document Management:
A. Load distribution was uneven across the 3 front-end servers: one server always took 50%+ of the hits compared to the other 2.
We ran many experiments adjusting the load balancing algorithm on the front end (Round Robin, Least Connection, etc.). None of them solved it.
B. Throughput was low (<2 GB/second), and uploading or downloading bigger files took 10+ seconds, although listing files and reading metadata was faster.
Even though the whole system, both front ends and EBS, was deployed in one AZ, we could not figure out what was causing the slowness.
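To relate the throughput figure to the SLA, here is a back-of-the-envelope sketch. It assumes an ideal case where a single transfer gets the full sustained throughput with no protocol or application overhead, which real transfers will not achieve. The function names are illustrative only.

```python
def transfer_seconds(file_size_gb: float, throughput_gb_per_s: float) -> float:
    """Ideal transfer time: file size divided by sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return file_size_gb / throughput_gb_per_s


def required_throughput(file_size_gb: float, sla_seconds: float) -> float:
    """Minimum sustained throughput (GB/s) needed to finish within the SLA."""
    if sla_seconds <= 0:
        raise ValueError("SLA must be positive")
    return file_size_gb / sla_seconds


if __name__ == "__main__":
    # Largest file size (10 GB) and SLA (5 seconds) are taken from the text.
    print(transfer_seconds(10, 2))
    print(required_throughput(10, 5))
```

Under these assumptions, a 10 GB file needs at least 2 GB/second sustained to finish in 5 seconds, so anything below 2 GB/second cannot meet the SLA for the largest files even in the ideal case.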
C. Although we had alarms set up on DiskSpaceUtilization at 85% for EBS volume growth, adding storage became a weekly task, and adding further storage was becoming expensive.
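The threshold logic behind such an alarm can be sketched locally. This is not the CloudWatch alarm itself, only a minimal stand-in that applies the same 85% rule to a mounted volume using Python's standard library. The function names are hypothetical.

```python
import shutil

THRESHOLD_PERCENT = 85.0  # same threshold as the alarm described above


def used_percent(total_bytes: int, used_bytes: int) -> float:
    """Percentage of the volume that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def needs_more_storage(mount_path: str, threshold: float = THRESHOLD_PERCENT) -> bool:
    """True when the volume mounted at mount_path is at or above the threshold."""
    usage = shutil.disk_usage(mount_path)  # named tuple: total, used, free
    return used_percent(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    # "." stands in for the mount point of the document volume.
    print(needs_more_storage("."))
```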
D. We realized we would hit the 16 TB volume limit within a few months.
We had already reached 8 TB in 3 months, and implementing cleanup/purging of stale files did not help, because scaling an EBS volume down is a tedious, multi-step task.
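The projection behind "a few months" can be written out as simple arithmetic. This sketch assumes linear growth from an empty volume, which is a simplification; real growth during a multi-country rollout would likely accelerate and shorten the runway.

```python
def months_until_limit(current_tb: float, elapsed_months: float, limit_tb: float) -> float:
    """Months left before the volume limit, assuming linear growth from zero."""
    if current_tb <= 0 or elapsed_months <= 0:
        raise ValueError("current size and elapsed time must be positive")
    remaining_tb = max(limit_tb - current_tb, 0.0)
    # The growth rate is current_tb / elapsed_months, so the time left is
    # remaining_tb divided by that rate.
    return remaining_tb * elapsed_months / current_tb


if __name__ == "__main__":
    # 8 TB after 3 months, 16 TB limit (figures from the text).
    print(months_until_limit(8, 3, 16))
```

With the figures above, linear growth leaves roughly 3 more months before the limit is reached.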
E. We were expanding product usage to a global platform across 15+ countries and did not want to take a chance on parallelism/concurrency or on the repercussions of user growth.
We could not get beyond 2 to 3 concurrent users on critical functionality at peak times, with users uploading files of 10 MB or more most of the time.
F. We were stuck with a single-AZ solution (no HA) and were mandated to implement an expensive backup solution using Rubrik.
Finally, after thorough analysis, we narrowed it down to 2 options: migrate to either EFS or S3.
After further due diligence on use cases and the long-term strategy for the global vision, we identified EFS as the best option because, along with other benefits, it provides petabyte-scale elasticity and massively parallel access:
- EFS IA (Infrequent Access) storage class for cost efficiency
- Throughput mode for better performance
- NFS 4.1 support
- Spans multiple Availability Zones (AZs) by default, providing a high level of durability and availability
- Auto-scaling: EFS automatically grows and shrinks as files are added and removed
- Throughput scale: EFS supports 10+ GB per second and highly concurrent access
- We had a requirement to allow file modification in memory, which was not possible with S3 object storage
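For the NFS 4.1 point, the sketch below builds a mount command using the NFS client options that AWS recommends for EFS. The file system ID, region, and mount point are placeholders rather than values from this project, and the helper name is hypothetical.

```python
import shlex

# NFS client options recommended by AWS for mounting EFS over NFS 4.1.
EFS_NFS_OPTIONS = (
    "nfsvers=4.1",
    "rsize=1048576",
    "wsize=1048576",
    "hard",
    "timeo=600",
    "retrans=2",
    "noresvport",
)


def efs_mount_command(file_system_id: str, region: str, mount_point: str) -> list:
    """Return the argument list for mounting an EFS file system via its DNS name."""
    target = f"{file_system_id}.efs.{region}.amazonaws.com:/"
    return ["mount", "-t", "nfs4", "-o", ",".join(EFS_NFS_OPTIONS), target, mount_point]


if __name__ == "__main__":
    # Placeholder identifiers; substitute real values before running as root.
    print(shlex.join(efs_mount_command("fs-12345678", "us-east-1", "/mnt/efs")))
```

The command is only printed here, not executed, so it can be reviewed before being run on each front-end server.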
Plan: We worked under the constraint that "no rollback is possible, as falling back from EFS to EBS is cumbersome". The migration also had to be completed within 1 hour during the weekend change window to keep the change transparent to users.
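With no rollback available, it helps to confirm that the target is complete before cutting over. The post does not say how this was verified. The sketch below is one possible check, not the method used here: it compares relative paths and file sizes between the source and target trees. It does not compare file contents, and the function names are hypothetical.

```python
import os
import tempfile


def tree_index(root: str) -> dict:
    """Map each file's path (relative to root) to its size in bytes."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            index[os.path.relpath(full_path, root)] = os.path.getsize(full_path)
    return index


def sync_gaps(source_root: str, target_root: str) -> tuple:
    """Return (files missing on target, files whose sizes differ)."""
    source = tree_index(source_root)
    target = tree_index(target_root)
    missing = sorted(path for path in source if path not in target)
    size_mismatch = sorted(
        path for path in source if path in target and source[path] != target[path]
    )
    return missing, size_mismatch


if __name__ == "__main__":
    # Small demo with temporary directories standing in for EBS and EFS mounts.
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        with open(os.path.join(src, "report.pdf"), "wb") as handle:
            handle.write(b"example")
        print(sync_gaps(src, dst))
```

Two empty lists would indicate that every source file exists on the target with the same size.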
Bottleneck: We did not notice many issues, except that we could initially run only 240 rsync processes between EBS and EFS, which meant migrating all the files would take a few weeks. This appeared to be a limitation of the internal network. To mitigate the risk, we started the rsync process a couple of weeks ahead of the cutover.
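Running many rsync processes in parallel can be sketched as follows. This is an illustration, not the script used for the migration: it assumes the data can be split by top-level directory, that rsync is installed, and that the target root already exists. The worker count is left as a parameter, and the function names are hypothetical.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def rsync_commands(source_root: str, target_root: str) -> list:
    """Build one rsync command per top-level directory under source_root."""
    commands = []
    for entry in sorted(os.listdir(source_root)):
        source_dir = os.path.join(source_root, entry)
        if os.path.isdir(source_dir):
            target_dir = os.path.join(target_root, entry)
            # Trailing slashes copy the directory's contents into target_dir.
            commands.append(["rsync", "-a", source_dir + "/", target_dir + "/"])
    return commands


def run_parallel(commands: list, max_workers: int) -> list:
    """Run the commands concurrently and return their exit codes in order."""
    def run(command):
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Demo: only prints the commands for a temporary tree; nothing is copied.
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        os.makedirs(os.path.join(src, "country_a"))
        os.makedirs(os.path.join(src, "country_b"))
        for command in rsync_commands(src, dst):
            print(" ".join(command))
```

Files sitting directly in the source root are not covered by this split and would need a separate pass. Because rsync only transfers what has changed, repeated runs ahead of the cutover leave a smaller final delta for the change window.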
Results: We saw many improvements compared to EBS.
1. Load was balanced almost equally among the 3 front-end servers.
2. Throughput was high, and performance was <5 seconds in 90% of use cases.
3. The system became highly available (HA).
4. With elastic storage we no longer worried about adding storage, and the IA storage class brought a small cost saving.
5. We were able to reach 10 concurrent users; some product tuning on the query and database side was still pending.