ACS Data Users Group


Proposed changes to the ACS Summary File format

  • 1.  Proposed changes to the ACS Summary File format

    Posted 10-28-2020 09:01 AM

    The ACS Office at the Census Bureau is currently testing a new format for the ACS Summary File, which is a comma-delimited text file that contains all the Detailed Tables for the ACS.

    The proposed updates to the ACS Summary File are described on the Census Bureau's website.

    We are starting this new Discussion Thread so that ACS data users can post any comments or questions about the proposed changes. ACS Summary File users are also encouraged to participate in the webinar scheduled for this afternoon on this topic.



  • 2.  RE: Proposed changes to the ACS Summary File format

    Posted 10-28-2020 02:31 PM

    As a longtime ACS Summary File user, I think this is a huge and welcome change. Perhaps the best improvement is having column headers in the data files. This not only reduces the complexity of using the files, but also lessens the possibility of errors.

    One request: When the data is made available, it would be helpful for the files to be available as a bulk download, for people that want the entire dataset for all geography levels for the whole country. This could mean a single compressed file (or a set of them), or folders on a real FTP page, so they could be accessed like a file system (rather than a web page).

    Another suggestion: Perhaps the data files could have a column with the time frame of the data. Even though this would only be one value for the whole file (the year of the data), it would help when loading it into a database.

    Thank you for this change!



  • 3.  RE: Proposed changes to the ACS Summary File format

    Posted 10-28-2020 02:46 PM

    The FTP site includes a file that I think is the complete data file (acsdt5y2018.zip) --but it's listed as 11 gigabytes. After unzipping, that's a TON of data to sift through. I also appreciate having the column headers, and I like your suggestion to add the time period of the data. But for people (like me) who need data from many tables but only one or two states, it's going to be extremely difficult to have the files separated by tables rather than by states. I would not look forward to reading data for the entire country for every single table I need.
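
For readers in the same position, here is a minimal sketch of streaming a nationwide table file and keeping only the rows for a few states, so the full file never has to sit in memory. It assumes a comma-delimited file with a header row and a GEOID column named `GEO_ID` whose state FIPS code is the first two characters after "US" (for example `0500000US48201`); the column name and the helper `filter_states` are illustrative, not part of any published layout.

```python
import csv
import io


def filter_states(lines, state_fips, geoid_column="GEO_ID"):
    """Yield only the rows whose GEOID points at one of the given states.

    Assumes the state FIPS code is the first two characters after "US".
    """
    reader = csv.DictReader(lines)
    for row in reader:
        # Everything after "US" is the geography-specific part of the GEOID.
        _, _, area = row[geoid_column].partition("US")
        if area[:2] in state_fips:
            yield row


# Tiny in-memory stand-in for a nationwide table file (values are made up).
sample = io.StringIO(
    "GEO_ID,B01001_001E\n"
    "0400000US48,100\n"
    "0500000US48201,40\n"
    "0500000US06037,70\n"
)
kept = list(filter_states(sample, {"48", "35"}))
```

Geographies that are not nested within states (metro areas, for example) carry other codes after "US", so a real filter would probably also need to check the summary level before trusting the first two characters.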



  • 4.  RE: Proposed changes to the ACS Summary File format

    Posted 10-28-2020 02:49 PM

    Oh for sure, it shouldn't be available exclusively as a single compressed file! The way the files are currently available, there are compressed folders for all larger geographies and tracts and block groups, as well as folders with individual files for each state/table. Continuing to have these various options would be great.



  • 5.  RE: Proposed changes to the ACS Summary File format

    Posted 10-30-2020 10:02 AM

    I do understand that the proposal is for the ACS but will these changes potentially also apply to 2020 Census data products?



  • 6.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 11:00 AM

    As someone who uses SAS to build datasets from the raw ACS data files and perform subsequent data analysis, I would strongly advise against naming the fields/variables with an "E" or "M" at the end of the name. This would make it more difficult to use a range of variables in calculations, for example, when collapsing a table into broader categories like age groups, educational attainment, etc. So instead of fields/variables formatted like this:

    B01001_001E
    B01001_001M
    B01001_002E
    B01001_002M
    B01001_003E
    B01001_003M

    I would suggest a naming convention more like this:

    B01001_E001
    B01001_M001
    B01001_E002
    B01001_M002
    B01001_E003
    B01001_M003

    Just my .02
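
Until the naming is settled, users could apply the suggested convention themselves when reading the header row. The following is a rough sketch, assuming table IDs look like one letter, five digits, and an optional letter suffix; the pattern and the function name `to_prefix_style` are assumptions made for illustration.

```python
import re

# Assumed shape of a proposed-format name: table ID, underscore,
# item number, then E (estimate) or M (margin of error).
PROPOSED_STYLE = re.compile(r"^([A-Z]\d{5}[A-Z]*)_(\d+)([EM])$")


def to_prefix_style(name):
    """Move the trailing E/M in front of the item number.

    Names that do not match the assumed pattern are returned unchanged,
    so geography columns pass through untouched.
    """
    match = PROPOSED_STYLE.match(name)
    if match is None:
        return name
    table, item, kind = match.groups()
    return f"{table}_{kind}{item}"


header = ["GEO_ID", "B01001_001E", "B01001_001M", "B01001_002E"]
renamed = [to_prefix_style(name) for name in header]
```

Because the item number stays zero-padded and ends the name, the renamed columns sort and range the way the second list above does.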



  • 7.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 11:02 AM

    Yes, yes, YES to this. I'm also a "power" SAS user; this change would really impact named ranges in SAS programs.



  • 8.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 11:20 AM

    It'll be a bit confusing since some tables have letter suffixes (like B01001E_001). Someone could easily confuse B01001E_E001 with B01001_E001. Not to say it shouldn't be done, but something to consider. Also, it would kill continuity of column headers with previous years' data (which I assume will not be re-released in the new format).

    Also, I've been using this data since the beginning of ACS, and I still think "Error" and not "Estimate" every time I see that E. Am I the only one?



  • 9.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 11:41 AM

    You're not the only one!



  • 10.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 11:44 AM

    In reference to your statement,

    "Also, it would kill continuity of column headers with previous years' data (which I assume will not be re-released in the new format).",

    were the previous data ever released with the "E" or "M" appended to the end of the field/variable name? We use custom SAS programs to build the SAS datasets from the raw ACS data (not the CB provided SAS programs). I don't recall any of the previous data files having a header row with field/variable names. From the CB provided SAS programs, it appears the variables are named in the xxxe001 manner and not as xxx001e, but I could be mistaken.

    Regardless, it looks like any change that includes both the estimates and MOEs in the same data will necessitate naming the variables in such a way that it may "break" continuity with previous data releases, unless the end-user built the datasets to account for that.



  • 11.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 11:49 AM

    Good point; my bad. The E/M was made as a prefix to the data filenames and worksheet tab name in the data templates. So it will necessitate a change, as you say.



  • 12.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:01 PM

    NO!

    [quote userid="2265" url="~/discussion-forum/f/forum/629/proposed-changes-to-the-acs-summary-file-format/1504#1504"]were the previous data ever released with the "E" or "M" appended to the end of the field/variable name? [/quote]

    As for Bernie's comment -- Lots of us depend on these files to be machine-readable. Please let's not sacrifice any of that for a minor reduction in confusion for human readers.



  • 13.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:09 PM
    [quote userid="2265" url="~/discussion-forum/f/forum/629/proposed-changes-to-the-acs-summary-file-format/1504#1504"]From the CB provided SAS programs, it appears the variables are named in the xxxe001 manner and not as xxx001e, but I could be mistaken.[/quote]

    Yes, you're correct, e.g.

    /*SEX BY AGE (WHITE ALONE) */
    /*Universe: People who are White alone */

    B01001Ae1='Total:'
    B01001Ae2='Male:'
    B01001Ae3='Under 5 years'
    B01001Ae4='5 to 9 years'
    B01001Ae5='10 to 14 years'
    B01001Ae6='15 to 17 years'
    B01001Ae7='18 and 19 years'

    /*SEX BY AGE (WHITE ALONE) */
    /*Universe: People who are White alone */

    B01001Am1='Total:'
    B01001Am2='Male:'
    B01001Am3='Under 5 years'
    B01001Am4='5 to 9 years'
    B01001Am5='10 to 14 years'
    B01001Am6='15 to 17 years'



  • 14.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:15 PM

    Maybe even something like:

    B01001_E_001

    B01001_E_002

    B01001_E_003

    B01001_M_001

    B01001_M_002

    B01001_M_003

    B01001A_E_001

    B01001A_E_002

    ...would accommodate SAS users while still reducing confusion



  • 15.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:20 PM

    I do like the underscore to separate the table id from the table item and I do prefer the table item padded with zeros. I could go either way with the second underscore between the E/M and the table item.



  • 16.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:24 PM

    Sure. I merely added the second underscore as a possible mitigation for the confusion problem Bernie mentioned, one that would still be usable in SAS programs with only minor modification.



  • 17.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:23 PM

    We should also identify any name length restrictions of any software packages users are using to work with the data and variable names that may exceed these limits. For example, I believe old DBF files had a 10-character field name restriction.
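
A quick way to screen candidate names against such limits is sketched below. The 10-character figure is the DBF limit mentioned above; the function `names_over_limit` and the list of candidates (taken from the conventions floated in this thread) are only illustrative.

```python
DBF_FIELD_NAME_LIMIT = 10  # the old DBF limit mentioned in this post


def names_over_limit(names, limit=DBF_FIELD_NAME_LIMIT):
    """Return the names that are longer than the given limit."""
    return [name for name in names if len(name) > limit]


# One example of each naming style proposed so far in the thread.
candidates = ["B01001_E001", "B01001_001E", "B01001A_E_001", "B01001e1"]
too_long = names_over_limit(candidates)
```

Under these assumptions only the short legacy-style name fits within 10 characters; the other three would be truncated or rejected by software with that restriction.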



  • 18.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:30 PM

    The proposed new naming convention (e.g., B01001_001E) is consistent with the Census API, which my organization makes great use of. We use the summary file a lot as well, and the first step we do with the summary file is convert the field names into the API format, so that we're using one naming convention across our work. I think the new naming convention is a welcome change.



  • 19.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:41 PM

    We use the API as well, but we mostly use the Summary Files, and keeping the same nomenclature from year to year makes the most sense to me. That said, I think the table-based format for all tables at all geo levels will be a big improvement for us, since we currently process all of the summary files to create the data that we input into Social Explorer. The 255-character limit was not helpful. I assume the Geofiles will be the same and will be linked to the tables using LOGRECNO, as they are now.



  • 20.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:50 PM

    That is a good thing to know. Since I haven't really been using the Census API, yet, I didn't catch this. After using the Decennial and ACS data for so many years, I find it very odd that the Census API developers would add a character to the end of a variable name that would prevent it from being used in range calculations.



  • 21.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 03:01 PM

    I'm sure that having one scheme would make it easier for Census Bureau staff, and for users who might need to join API and summary file data.

    At the same time, I use the API and the summary files very differently, and I personally don't need the naming conventions to be consistent. The API is great when I just need a few tables for a single set of geographies, but not if I need many tables for many kinds of geographies (which is more often the case). I and many others have so much code that depends on the existing naming convention--specifically the ability to refer to ranges of variables by numbers, which will be much harder with the proposed framework. This is true for users of SAS, R, Stata, and probably other programs. I realize that we users can always convert the API-style names back to existing summary file-style names (B01001_001E --> B01001e1), but that extra work for users (which would be quite difficult for novices) seems to undermine one of the reasons for this change.
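
For what it's worth, the conversion mentioned above (B01001_001E --> B01001e1) might look roughly like this. It is a sketch that assumes API-style names always follow the table-underscore-item-letter pattern; the regular expression and the function name `to_legacy_name` are assumptions, not anything the Census Bureau publishes.

```python
import re

# Assumed API-style pattern: table ID, underscore, zero-padded item, E or M.
API_NAME = re.compile(r"^(?P<table>[A-Z]\d{5}[A-Z]*)_(?P<item>\d+)(?P<kind>[EM])$")


def to_legacy_name(api_name):
    """Convert an API-style name to the older style, e.g. B01001_001E -> B01001e1."""
    match = API_NAME.match(api_name)
    if match is None:
        raise ValueError(f"not an API-style variable name: {api_name!r}")
    # int() drops the zero padding; the E/M marker becomes lowercase.
    return f"{match['table']}{match['kind'].lower()}{int(match['item'])}"


legacy = [to_legacy_name(n) for n in ["B01001_001E", "B01001_001M", "B01001A_007E"]]
```

It is only a few lines, but every user would have to write and maintain something like it, which is the extra burden described above.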

    I'd love any information that would assuage my concerns, though--have you found advantages to using the API's naming convention rather than the summary file's naming convention?



  • 22.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:38 PM

    This seems like something we could adapt to fairly readily.

    I'd like to make a plea for structured metadata that is published in something other than a variety of XLSX files. These are things that application builders need to know, but which are perhaps taken for granted in data analysis use cases:

    • table name
    • table universe
    • column name
    • data type (int/float, or possibly count/median/etc)
    • parent/child relationships between columns (e.g. these children should sum to this parent)
    • geographies which are categorically excluded from a given table (basically Appendix B from this page on Data Suppression)
    • the character encoding used for text (only applies to geoheaders and metadata, but it's important)

    and some things which would be really nice to have

    • table new or changed since last release
    • clearer articulation of data suppressed on a per-geography level, currently just represented by blank values
    • which ACS question(s) are the source of the data for a given table
    • something which helps map when a table universe is a proper subset of another table, like table/column (I know not all universes are so straightforward)
    • A better explanation of the prefix part of geoheaders, specifically the "M4/M5" geographic variant used for CBSAs and CSAs, which map to specific delineation vintages, but not in a way which is made clear to data users.
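
To make the request concrete, here is one possible shape for such metadata, sketched with Python dataclasses and serialized to JSON. Every field name here (`data_type`, `parent`, and so on) is hypothetical; the table title, universe, and labels are the ones from the SAS snippet quoted earlier in the thread, and the parent/child links are written out only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Column:
    name: str
    label: str
    data_type: str                 # e.g. "count" or "median" (hypothetical vocabulary)
    parent: Optional[str] = None   # name of the column this one rolls up into


@dataclass
class Table:
    table_id: str
    title: str
    universe: str
    columns: List[Column] = field(default_factory=list)

    def children_of(self, parent_name):
        """Names of the columns that should sum to the given parent."""
        return [c.name for c in self.columns if c.parent == parent_name]


table = Table(
    table_id="B01001A",
    title="SEX BY AGE (WHITE ALONE)",
    universe="People who are White alone",
    columns=[
        Column("B01001A_001", "Total:", "count"),
        Column("B01001A_002", "Male:", "count", parent="B01001A_001"),
        Column("B01001A_003", "Under 5 years", "count", parent="B01001A_002"),
        Column("B01001A_004", "5 to 9 years", "count", parent="B01001A_002"),
    ],
)

# A machine-readable form that applications could consume directly.
as_json = json.dumps(asdict(table), indent=2)
```

With explicit parent links, an application could verify roll-ups or build collapsible table views without parsing label indentation out of a spreadsheet.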

    Sorry if this is just hijacking the thread...



  • 23.  RE: Proposed changes to the ACS Summary File format

    Posted 11-02-2020 12:43 PM

    I think it is hijacking the thread, since the changes to the format of the data files won't affect the metadata files, but I like a lot of these ideas, and I think they'd be very much worth discussing in a separate thread.



  • 24.  RE: Proposed changes to the ACS Summary File format

    Posted 11-04-2020 04:43 PM

    I am conflicted about some of the proposed changes. We import the data into our SQL database and we normally import only 4 areas (US, TX, NM, AR), and the proposed changes would require us to process an extraordinarily large number of records that we do not use. If there are over 500K geo entries (approx. 280K non-tract/block group), that would mean we would process an estimated 300M records to retrieve about 60M records (*see calculation comment below). For those that use the entire set this is not an issue, but for those of us that use 4 areas or fewer it does have an impact. Having the state level files is a great service that you provide and I absolutely understand the painstaking process to generate all the files, but it seems to me that that process would not fall on each of the Data Users that do not use the entire set.

    What I do like is the addition of the column headers for the files and the single GEO file. As for the GEO file, it would be nice to have the Land/Water area and LSAD code columns added. I also noticed in the example files that most of the columns in the geo file are no longer formatted entries. For example, summary levels are shown as 10, 50, 150 and not 010, 050, 150, etc. The same goes for all area identifiers; for example, counties are shown as 1, 3, 5 as opposed to 001, 003, 005.
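
If the codes do ship without leading zeros, users could restore the padding on load. A small sketch follows; the column names (`SUMLEVEL`, `STATE`, `COUNTY`) and the helper `restore_padding` are assumptions about the new geo file, while the widths are the familiar 3-digit summary level, 2-digit state, and 3-digit county codes.

```python
# Assumed column names mapped to their fixed code widths.
CODE_WIDTHS = {"SUMLEVEL": 3, "STATE": 2, "COUNTY": 3}


def restore_padding(row, widths=CODE_WIDTHS):
    """Return a copy of the row with code columns left-padded with zeros."""
    fixed = dict(row)
    for column, width in widths.items():
        value = fixed.get(column, "")
        if value:  # leave blank codes blank rather than turning them into zeros
            fixed[column] = value.zfill(width)
    return fixed


padded = restore_padding({"SUMLEVEL": "50", "STATE": "1", "COUNTY": "3", "NAME": "Example"})
```

This only works if the values are read as text; once a tool has parsed them as integers, the original width has to be supplied from documentation like the mapping above.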

    If you decide to move forward with a single geo file, is there any reason why you would not have LOGRECNO run across all states as opposed to being reset for every state? This way LOGRECNO could be the unique identifier for joining geo files with data files, as opposed to the GEOID, which is a 19-character, variable alphanumeric value. For us, joining on LOGRECNO is a much more efficient way to join tables.

    For those that use databases, the new file structure adds another layer of complexity because some of the data files now contain more than 1,100 columns, and in SQL Server the natural (non-sparse) column limit is 1,024. I'm not sure, but I believe Oracle has a limit of 1,000 columns per table. Just putting that out there for those that do import the data into a database.
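
One workaround is to split a wide file into several narrower tables that all carry the join key. The sketch below assumes the key column is called `GEO_ID` and uses the 1,024-column SQL Server figure from above as the default; `split_columns` is a hypothetical helper, not an existing tool.

```python
def split_columns(header, key="GEO_ID", max_columns=1024):
    """Split a header into column groups that each fit within max_columns.

    Every group starts with the key column so the pieces can be joined again.
    """
    data_columns = [c for c in header if c != key]
    per_group = max_columns - 1  # reserve one slot for the key
    return [
        [key] + data_columns[i:i + per_group]
        for i in range(0, len(data_columns), per_group)
    ]


# Stand-in header with 1,100 data columns (names are placeholders).
wide_header = ["GEO_ID"] + [f"COL{i:04d}" for i in range(1100)]
groups = split_columns(wide_header)
```

A refinement would be to keep each estimate and its margin of error in the same group, which this simple slicing does not guarantee.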

    As far as the column names, as others have mentioned, I would prefer:
    B01001_e001, B01001_m001, B01001_e002, B01001_m002 … or
    eB01001_001, mB01001_001, eB01001_002, mB01001_002 …

    *Record Calculation Estimate: Since each file varies in number of records, I took the number of non-tract/block group areas as the most common set of rows and multiplied it by the 1,100 tables being produced.
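
Spelled out with the approximate figures given in this post (the variable names are just labels for the sketch):

```python
non_tract_bg_areas = 280_000   # approx. non-tract/block group geo entries
tables = 1_100                 # tables being produced

# One record per area per table.
records_processed = non_tract_bg_areas * tables
```

That comes to 308 million, which is the "estimated 300M" above. The 60M figure is not re-derived here because its inputs are not stated.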



  • 25.  RE: Proposed changes to the ACS Summary File format

    Posted 11-05-2020 04:02 PM

    We likewise process the data files in SQL Server and the column limit of 1024 would be an issue for us as well. Currently, it looks like at least the following tables would be impacted:

    B24114
    B24115
    B24116
    B24121
    B24122
    B24123
    B24124
    B24125
    B24126