We have a web-based application that requires heavy file read/write activity with high I/O. We are currently dealing with 10+ TB of data, which is kept on an NFS file server
built on EBS.
EBS is becoming an expensive option to run as file server storage.
While researching, we found that there are a couple of big companies using AWS S3 to run their operations.
We are looking for help from anyone who can help us run our production application on S3, or who can help us figure out a possible approach to achieve it.
Thank you,
Jawaid
S3 (like anything Amazon) is the most costly approach you can take.
Best you book a call with me or someone who works on this type of project daily.
As for EBS/NFS, it's all slow, glitchy tech. You'll only use this kind of tech if you love daily aggravation and have infinite free time to debug problems.
A better solution: use OVH data storage servers, then run Linux on them along with whatever distributed/network filesystem your workflows require.
Since you don't mention the type of workflow, there's no way to guess what a better answer than NFS might be.
Running a web-based application using S3 storage can be a bit tricky. You must keep in mind a few things before you proceed:
1. Getting data into and out of AWS S3 takes time. If you are moving data on a frequent basis, there is a good chance you can speed it up. Cutting down the time you spend uploading and downloading files can be remarkably valuable in indirect ways. For example, if your team saves 10 minutes every time you deploy a staging build, you are improving engineering productivity significantly. S3 is highly scalable, so in principle, with a big enough pipe or enough instances, you can get arbitrarily high throughput. A good example is S3DistCp, which uses many workers and instances. But almost always you are hit with one of two bottlenecks:
a) The size of the pipe between the source (typically a server on premises or EC2 instance) and S3.
b) The level of concurrency used for requests when uploading or downloading (including multipart uploads).
2. The first takeaway from this is that regions and connectivity matter. Obviously, if you’re moving data within AWS via an EC2 instance or between buckets, such as off of an EBS volume, you’re better off if your EC2 instance and S3 bucket are in the same region. More surprisingly, even when moving data within the same region, Oregon (a newer region) comes in faster than Virginia on some benchmarks. If your servers are in a major data centre but not in EC2, you might consider using Direct Connect ports to get significantly higher bandwidth (you pay per port). Alternatively, you can use S3 Transfer Acceleration to get data into AWS faster simply by changing your API endpoints. You must pay for that too: roughly the equivalent of 1-2 months of storage cost for the transfer in either direction. For distributing content quickly to users worldwide, remember you can use BitTorrent support, CloudFront, or another CDN with S3 as its origin.
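To give a concrete sense of the Transfer Acceleration option, here is a minimal boto3 sketch. It assumes acceleration has already been enabled on the bucket, and the bucket and file names are placeholders, not anything from your setup:

```python
import boto3
from botocore.config import Config

# Route requests through the accelerated endpoint instead of the regular one.
# Assumes Transfer Acceleration is already enabled on the bucket; the bucket
# and file names below are placeholders.
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))

s3.upload_file("local-file.dat", "my-bucket", "uploads/local-file.dat")
```

The only change from a normal upload is the client configuration; the rest of your code stays the same.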
3. Secondly, instance types matter. If you are using EC2 servers, some instance types have higher bandwidth network connectivity than others. You can see this if you sort by “Network Performance” on the excellent ec2instances.info list.
4. Thirdly, and critically if you are dealing with lots of items, concurrency matters. Each S3 operation is an API request with significant latency — tens to hundreds of milliseconds, which adds up to pretty much forever if you have millions of objects and try to work with them one at a time. So what determines your overall throughput in moving many objects is the concurrency level of the transfer: how many worker threads (connections) on one instance, and how many instances are used. Many common AWS S3 libraries (including the widely used s3cmd) do not by default make many connections at once to transfer data. Both s4cmd and AWS’ own aws-cli do make concurrent connections and are much faster for many files or large transfers (since multipart uploads allow parallelism). Another approach is EMR, using Hadoop to parallelize the problem. For multipart syncs or uploads on a higher-bandwidth network, a reasonable part size is 25–50 MB. It is possible to list objects much faster, too, if you traverse a folder or other prefix hierarchy in parallel. Finally, if you really have a ton of data to move in batches, just ship it.
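To make the concurrency point concrete, here is a rough boto3 sketch of a multipart upload with parallel connections. The 25 MB part size matches the guidance above; the concurrency level, bucket, and file names are illustrative placeholders you would tune for your own network:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Use multipart uploads with ~25 MB parts and several parallel connections.
# All values here are illustrative; tune them to your pipe and workload.
config = TransferConfig(
    multipart_threshold=25 * 1024 * 1024,  # switch to multipart above 25 MB
    multipart_chunksize=25 * 1024 * 1024,  # ~25 MB per part
    max_concurrency=10,                    # parallel connections per transfer
    use_threads=True,
)

s3 = boto3.client("s3")
s3.upload_file("big-file.bin", "my-bucket", "archive/big-file.bin", Config=config)
```

The same TransferConfig works for downloads via download_file, which is where most of the win comes from when you are moving many large objects.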
5. Remember, large data will probably expire; that is, the cost of paying Amazon to store it in its current form will eventually become higher than the expected value it offers your business. You might re-process or aggregate data from long ago, but it is unlikely you want raw unprocessed logs, builds, or archives forever.
At the time you are saving a piece of data, it may seem like you can just decide later. Most files are put in S3 by a regular process (a server, a data pipeline, a script, or even a repeated human process), but you’ve got to think through what’s going to happen to that data over time.
In our experience, most AWS S3 users do not consider lifecycle up front, which means mixing files that have short lifecycles with ones that have longer ones. By doing this you incur significant technical debt around data organization. Once you know how long each class of data needs to live, you will find managed lifecycles and AWS S3 object tagging are your friends. You will want to delete or archive based on object tags, so it is wise to tag your objects appropriately so that it is easier to apply lifecycle policies. Note that S3 tagging is limited to 10 tags per object, with tag keys capped at 128 Unicode characters. You will also want to consider compression schemes. For large data that is not already compressed, you almost certainly want to compress it: S3 bandwidth and cost constraints generally make compression worth it.
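As a sketch of what a tag-driven lifecycle rule might look like with boto3: the bucket name, tag key/value, and 30-day expiry below are hypothetical, but the structure is what S3 expects:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: expire objects tagged lifecycle=short after 30 days.
# Bucket name, tag key/value, and the 30-day window are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-short-lived-objects",
                "Filter": {"Tag": {"Key": "lifecycle", "Value": "short"}},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```

You can add more rules for longer-lived tags (for example, transitioning to Glacier instead of expiring) without touching the objects themselves.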
6. As with many engineering problems, prefer immutability when possible — design so objects are never modified, but only created and later deleted. However, sometimes mutability is necessary. If S3 is your sole copy of mutable log data, you should seriously consider some sort of backup, or locate the data in a bucket with versioning enabled. If all this seems like a headache and hard to document, that is a good sign no one on the team understands it. By the time you scale to terabytes or petabytes of data and dozens of engineers, it will be more painful to sort out.
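If you do go the versioning route, enabling it is a one-off call per bucket. A minimal boto3 sketch, with a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Keep prior versions when objects are overwritten or deleted.
# The bucket name is a placeholder.
s3.put_bucket_versioning(
    Bucket="my-logs-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)
```

Versioning pairs well with a lifecycle rule that eventually expires old versions, so the bucket does not grow without bound.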
7. Some data is completely non-sensitive and can be shared with any employee. For that data, the answer is easy: just put it into S3 without encryption or complex access policies. However, every business has sensitive data; it is just a matter of which data, and how sensitive it is. The compliance question can also be confusing. Ask yourself the following, and determine whether the answer to any of these questions is “yes”:
a. Does the data you are storing contain financial, PII, cardholder, or patient information?
b. Do you have PCI, HIPAA, SOX, or EU Safe Harbour compliance requirements?
c. Do you have customer data with restrictive agreements in place — for example, are you promising customers that their data is encrypted at rest and in transit? If the answer to any of these is yes, you may need to work with an expert on the relevant type of compliance and bring in services or consultants to help if necessary.
Minimally, you will probably want to store data with different needs in separate S3 buckets, regions, and/or AWS accounts, and set up documented processes around encryption and access control for that data.
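As a sketch of the encryption side, here is how an upload with server-side encryption under a KMS key might look in boto3. The bucket, object key, payload, and KMS key alias are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Store a sensitive object encrypted server-side with a KMS key.
# Bucket, object key, payload, and KMS key alias are placeholders.
s3.put_object(
    Bucket="my-sensitive-bucket",
    Key="customers/record.json",
    Body=b'{"example": "data"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",
)
```

You can also enforce this at the bucket level with default encryption or a bucket policy that rejects unencrypted puts, so individual callers cannot forget.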
8. Newcomers to S3 are always surprised to learn that latency on S3 operations depends on key names, since prefix similarities can become a bottleneck at more than about 100 requests per second. If you need high volumes of operations, it is essential to consider naming schemes with more variability at the beginning of the key names, such as alphanumeric or hex hash codes in the first 6 to 8 characters, to avoid internal “hot spots” within the S3 infrastructure.
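A minimal sketch of that kind of key naming, hashing the natural key and using the first few hex characters as a prefix; the helper name here is my own, not an S3 API:

```python
import hashlib

def hashed_key(natural_key: str, prefix_len: int = 8) -> str:
    """Prepend a short hex hash so keys spread across the S3 key space."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"

# Yields something like "<8 hex chars>/logs/2015/07/01/server42.gz"
print(hashed_key("logs/2015/07/01/server42.gz"))
```

The trade-off is that you lose natural prefix-based listing (for example, listing everything under logs/2015/), so only do this for keys you access by exact name or track in an index elsewhere.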
Besides, if you have any questions, give me a call: https://clarity.fm/joy-brotonath